[jira] [Commented] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022775#comment-15022775 ] Huaxin Gao commented on SPARK-11788: I will change to timestampValue and dateValue. In the test case, I intentionally made the date and timestamp one year less than the values in the table because I am using $"B" > date && $"C" > timestamp > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp > > I have an MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue), which should surround timestamps/dates with quotes (it only does > so for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > The query should use quotes around the timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > The fallback is to filter client-side, which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
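The quoting fix described in SPARK-11788 can be sketched as follows. Spark's actual compileValue is Scala code in JDBCRDD.scala; the Python helper below is only an illustration of the idea, and its name and structure are hypothetical:

```python
from datetime import date, datetime


def compile_value(value):
    """Render a Python value as a SQL literal for a JDBC WHERE clause.

    Sketch of the fix: temporal values, like strings, must be wrapped in
    single quotes, or the database rejects the generated query.
    """
    if isinstance(value, str):
        # Escape embedded single quotes, then quote the literal.
        return "'" + value.replace("'", "''") + "'"
    if isinstance(value, (datetime, date)):
        # The fix: quote timestamp/date literals just like strings.
        return "'{}'".format(value)
    return str(value)


# Without the quotes the generated SQL would read
# TIMESTAMP_COLUMN >= 2015-01-01 00:00:00, which SQL Server rejects.
clause = "TIMESTAMP_COLUMN >= " + compile_value(datetime(2015, 1, 1))
```

The same dispatch-on-type shape applies to the Scala match in compileValue: the Timestamp and Date cases simply need the same quoting branch the String case already has.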
[jira] [Assigned] (SPARK-11836) Register a Python function creates a new SQLContext
[ https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Davies Liu reassigned SPARK-11836: -- Assignee: Davies Liu > Register a Python function creates a new SQLContext > --- > > Key: SPARK-11836 > URL: https://issues.apache.org/jira/browse/SPARK-11836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > > You can try it with {{sqlContext.registerFunction("stringLengthString", > lambda x: len)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022845#comment-15022845 ] Huaxin Gao edited comment on SPARK-11788 at 11/23/15 7:49 PM: -- Sorry, I just realized that actually I checked in the $"B" === date && $"C" === timestamp instead of the $"B" > date && $"C" > timestamp I will change. Never mind. It is $"B" > date && $"C" > timestamp I have multiple test case and I confused myself. was (Author: huaxing): Sorry, I just realized that actually I checked in the $"B" === date && $"C" === timestamp instead of the $"B" > date && $"C" > timestamp I will change. > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. 
> Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Issue Comment Deleted] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Huaxin Gao updated SPARK-11788: --- Comment: was deleted (was: Sorry, I just realized that actually I checked in the $"B" === date && $"C" === timestamp instead of the $"B" > date && $"C" > timestamp I will change. Never mind. It is $"B" > date && $"C" > timestamp I have multiple test case and I confused myself. ) > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley updated SPARK-11920: -- Assignee: Yanbo Liang Target Version/s: 1.6.0 > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > > ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in > examples and the user guide doc, but it is actually a classification dataset rather > than a regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. > The deeper cause is that LinearRegression with the "normal" solver cannot solve > this dataset correctly, possibly due to its ill-conditioning and unreasonable > labels. This issue has been reported as SPARK-11918. > It will confuse users if they run the example code but get an exception, so we > should make this change, which will clearly illustrate the usage of the > LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11886) R function name conflicts with base or stats package ones
[ https://issues.apache.org/jira/browse/SPARK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022787#comment-15022787 ] Felix Cheung commented on SPARK-11886: -- We could also create a generic function to redirect the more generic calls to the base:: or stats:: ones, similar to how [~sunrui] handles rank. SPARK-7499 might affect how we define method signatures too. > R function name conflicts with base or stats package ones > - > > Key: SPARK-11886 > URL: https://issues.apache.org/jira/browse/SPARK-11886 > Project: Spark > Issue Type: Bug > Components: SparkR >Affects Versions: 1.5.2 >Reporter: Felix Cheung >Priority: Minor > > See https://github.com/apache/spark/pull/9785 > Currently these are masked: > stats::cov > stats::filter > base::sample > base::table > [~shivaram] suggested: > " > If we have the same name but the param types completely don't match (and no room > for ...) then we override those functions (this is true for sample, > table, cov right now I guess), but we should try to limit the number of functions > where we do this. Also we should revisit some of these to see if we can avoid > it (for example, can table be renamed?) > " -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11234) What's cooking classification
[ https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022830#comment-15022830 ] Joseph K. Bradley commented on SPARK-11234: --- [~yinxusen] Thank you for working through this task! Here are some of my thoughts: {quote}1. Currently, multi-line per record JSON file is hard to handle, I have to load the data with JsonInputFormat in the json-pxf-ext package. {quote} * WIP, but no clear ETA [SPARK-7366] {quote}2. String indexer is easy to use. But it is hard to do beyond existing transformers. Like in the code, when I want to add all vectors that belong to the same id together, I have to write an aggregate function. {quote} * Does the SQLTransformer help? If you could pick any API to write this operation, what would be ideal for you? (I'm envisioning something analogous to a UDF for ML Pipelines, but that is almost provided by the SQLTransformer.) {quote}3. ParamGridBuilder accepts discrete parameter candidates, but I need to add some parameters with guess like Array(1.0, 0.1, 0.01). I don't know which parameter is suitable and how to fill in the array will get a better result. How about giving a range of real numbers so that the ParamGridBuilder can generate candidates for me like [0.0001, 1]? {quote} Do you mean it should automatically zoom in on regions which seem to get good results? I agree this can help in practice; I did something like this for a different ML library. {quote}4. The evaluator forces me to select a metric method. But sometimes I want to see all the evaluation results, say F1, precision-recall, AUC, etc. {quote} Do you want the metrics (a) for the sake of viewing performance at the end of a test? Or do you want the metrics (b) for model selection? If it's for (a) viewing at the end of a test, then model summaries are probably the way to go. Only LinearRegression and LogisticRegression have summaries currently, but we should add them for other models too. {quote}5. 
ML transformers will get stuck when facing Int types. It's strange that we have to transform all Int values to double values beforehand. I think a wise auto-casting would be helpful. {quote} I agree that too many Transformers are brittle when it comes to accepting multiple Numeric types. I made an umbrella here [SPARK-11107], but perhaps we can think of a way to make this change everywhere, rather than case by case. > What's cooking classification > - > > Key: SPARK-11234 > URL: https://issues.apache.org/jira/browse/SPARK-11234 > Project: Spark > Issue Type: Sub-task > Components: ML >Reporter: Xusen Yin > > I added the subtask to post the work on this dataset: > https://www.kaggle.com/c/whats-cooking -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
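The continuous-range request in point 3 of the SPARK-11234 discussion could look something like the sketch below: instead of hand-picking candidates like Array(1.0, 0.1, 0.01), a range-based grid builder would generate log-spaced candidates between two bounds. This is plain Python, not an actual ParamGridBuilder API; the helper name is hypothetical:

```python
import math


def log_spaced(low, high, num):
    """Generate `num` candidates spaced evenly on a log scale in [low, high].

    Sketch of what a range-based ParamGridBuilder could produce from
    bounds like (0.0001, 1) instead of a hand-written candidate array.
    """
    if num == 1:
        return [low]
    lo, hi = math.log10(low), math.log10(high)
    step = (hi - lo) / (num - 1)
    return [10 ** (lo + i * step) for i in range(num)]


# Five candidates spanning [0.0001, 1]: 1e-4, 1e-3, 1e-2, 1e-1, 1e0.
candidates = log_spaced(0.0001, 1.0, 5)
```

Zooming in on promising regions, as mentioned in the reply, would then amount to re-running the same generator with narrower bounds around the best candidate from the previous round.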
[jira] [Assigned] (SPARK-11836) Register a Python function creates a new SQLContext
[ https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11836: Assignee: Davies Liu (was: Apache Spark) > Register a Python function creates a new SQLContext > --- > > Key: SPARK-11836 > URL: https://issues.apache.org/jira/browse/SPARK-11836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > > You can try it with {{sqlContext.registerFunction("stringLengthString", > lambda x: len)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11836) Register a Python function creates a new SQLContext
[ https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022841#comment-15022841 ] Apache Spark commented on SPARK-11836: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/9914 > Register a Python function creates a new SQLContext > --- > > Key: SPARK-11836 > URL: https://issues.apache.org/jira/browse/SPARK-11836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > > You can try it with {{sqlContext.registerFunction("stringLengthString", > lambda x: len)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-3215) Add remote interface for SparkContext
[ https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-3215. --- Resolution: Won't Fix I think there isn't enough interest in getting this into Spark itself, so I'll just close the bug instead. > Add remote interface for SparkContext > - > > Key: SPARK-3215 > URL: https://issues.apache.org/jira/browse/SPARK-3215 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Marcelo Vanzin > Labels: hive > Attachments: RemoteSparkContext.pdf > > > A quick description of the issue: as part of running Hive jobs on top of > Spark, it's desirable to have a SparkContext that is running in the > background and listening for job requests for a particular user session. > Running multiple contexts in the same JVM is not a very good solution. Not > only does SparkContext currently have issues sharing the same JVM among multiple > instances, but that also turns the JVM running the contexts into a huge bottleneck > in the system. > So I'm proposing a solution where we have a SparkContext that is running in a > separate process, and listening for requests from the client application via > some RPC interface (most probably Akka). > I'll attach a document shortly with the current proposal. Let's use this bug > to discuss the proposal and any other suggestions. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11836) Register a Python function creates a new SQLContext
[ https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11836: Assignee: Apache Spark (was: Davies Liu) > Register a Python function creates a new SQLContext > --- > > Key: SPARK-11836 > URL: https://issues.apache.org/jira/browse/SPARK-11836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.0 >Reporter: Yin Huai >Assignee: Apache Spark >Priority: Critical > > You can try it with {{sqlContext.registerFunction("stringLengthString", > lambda x: len)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-7539) Perf tests for Python MLlib
[ https://issues.apache.org/jira/browse/SPARK-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley reassigned SPARK-7539: Assignee: Joseph K. Bradley (was: Xiangrui Meng) > Perf tests for Python MLlib > --- > > Key: SPARK-7539 > URL: https://issues.apache.org/jira/browse/SPARK-7539 > Project: Spark > Issue Type: Test > Components: MLlib, PySpark, Tests >Reporter: Joseph K. Bradley >Assignee: Joseph K. Bradley > > As new perf-tests are added to Scala, we should added equivalent ones in > Python. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException
[ https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022845#comment-15022845 ] Huaxin Gao commented on SPARK-11788: Sorry, I just realized that actually I checked in the $"B" === date && $"C" === timestamp instead of the $"B" > date && $"C" > timestamp I will change. > Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC > dataframes causes SQLServerException > > > Key: SPARK-11788 > URL: https://issues.apache.org/jira/browse/SPARK-11788 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Martin Tapp > > I have a MSSQL table that has a timestamp column and am reading it using > DataFrameReader.jdbc. Adding a where clause which compares a timestamp range > causes a SQLServerException. > The problem is in > https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264 > (compileValue) which should surround timestamps/dates with quotes (only does > it for strings). > Sample pseudo-code: > val beg = new java.sql.Timestamp(...) > val end = new java.sql.Timestamp(...) > val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" > < end) > Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0" > Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 > 00:00:00.0'" > Fallback is to filter client-side which is extremely inefficient as the whole > table needs to be downloaded to each Spark executor. > Thanks -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Joseph K. Bradley resolved SPARK-11920. --- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9905 [https://github.com/apache/spark/pull/9905] > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Assignee: Yanbo Liang >Priority: Minor > Fix For: 1.6.0 > > > ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in > examples and the user guide doc, but it is actually a classification dataset rather > than a regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. > The deeper cause is that LinearRegression with the "normal" solver cannot solve > this dataset correctly, possibly due to its ill-conditioning and unreasonable > labels. This issue has been reported as SPARK-11918. > It will confuse users if they run the example code but get an exception, so we > should make this change, which will clearly illustrate the usage of the > LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022634#comment-15022634 ] Sandy Ryza commented on SPARK-9999: --- [~nchammas] it's not clear that it makes sense to add a similar API for Python and R. The main point of the Dataset API, as I understand it, is to extend DataFrames to take advantage of Java / Scala's static typing systems. This means recovering compile-time type safety, integration with existing Java / Scala object frameworks, and Scala syntactic sugar like pattern matching. Python and R are dynamically typed, so they can't take advantage of these. > Dataset API on top of Catalyst/DataFrame > > > Key: SPARK-9999 > URL: https://issues.apache.org/jira/browse/SPARK-9999 > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result its execution is harder > to optimize in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. 
> - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] > The initial version of the Dataset API has been merged in Spark 1.6. However, > it will take a few more future releases to flush everything out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-6328) Python API for StreamingListener
[ https://issues.apache.org/jira/browse/SPARK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022690#comment-15022690 ] Daniel Jalova commented on SPARK-6328: -- @tdas Could you change the assignee to me please? > Python API for StreamingListener > > > Key: SPARK-6328 > URL: https://issues.apache.org/jira/browse/SPARK-6328 > Project: Spark > Issue Type: Improvement > Components: PySpark >Reporter: Yifan Wang >Assignee: Yifan Wang > Fix For: 1.6.0 > > > StreamingListener API is only available in Java/Scala. It will be useful to > make it available in Python so that Spark application written in python can > check the status of ongoing streaming computation. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11913) support typed aggregate for complex buffer schema
[ https://issues.apache.org/jira/browse/SPARK-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11913. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9898 [https://github.com/apache/spark/pull/9898] > support typed aggregate for complex buffer schema > - > > Key: SPARK-11913 > URL: https://issues.apache.org/jira/browse/SPARK-11913 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11894) Incorrect results are returned when using null
[ https://issues.apache.org/jira/browse/SPARK-11894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11894. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9904 [https://github.com/apache/spark/pull/9904] > Incorrect results are returned when using null > -- > > Key: SPARK-11894 > URL: https://issues.apache.org/jira/browse/SPARK-11894 > Project: Spark > Issue Type: Sub-task > Components: SQL >Affects Versions: 1.6.0 >Reporter: Xiao Li > Fix For: 1.6.0 > > > In DataSet APIs, the following two datasets are the same. > Seq((new java.lang.Integer(0), "1"), (new java.lang.Integer(22), > "2")).toDS() > Seq((null.asInstanceOf[java.lang.Integer], "1"), (new > java.lang.Integer(22), "2")).toDS() > Note: java.lang.Integer is Nullable. > It could generate an incorrect result. For example, > val ds1 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new > java.lang.Integer(22), "2")).toDS() > val ds2 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new > java.lang.Integer(22), "2")).toDS()//toDF("key", "value").as('df2) > val res1 = ds1.joinWith(ds2, lit(true)).collect() > The expected result should be > ((null,1),(null,1)) > ((22,2),(null,1)) > ((null,1),(22,2)) > ((22,2),(22,2)) > The actual result is > ((0,1),(0,1)) > ((22,2),(0,1)) > ((0,1),(22,2)) > ((22,2),(22,2)) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11921) fix `nullable` of encoder schema
[ https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Armbrust resolved SPARK-11921. -- Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9906 [https://github.com/apache/spark/pull/9906] > fix `nullable` of encoder schema > > > Key: SPARK-11921 > URL: https://issues.apache.org/jira/browse/SPARK-11921 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-7173) Support YARN node label expressions for the application master
[ https://issues.apache.org/jira/browse/SPARK-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Marcelo Vanzin resolved SPARK-7173. --- Resolution: Fixed Assignee: Saisai Shao Fix Version/s: 1.6.0 > Support YARN node label expressions for the application master > -- > > Key: SPARK-7173 > URL: https://issues.apache.org/jira/browse/SPARK-7173 > Project: Spark > Issue Type: Improvement > Components: YARN >Affects Versions: 1.3.1 >Reporter: Sandy Ryza >Assignee: Saisai Shao > Fix For: 1.6.0 > > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11837) spark_ec2.py breaks with python3 and m3 instances
[ https://issues.apache.org/jira/browse/SPARK-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen resolved SPARK-11837. Resolution: Fixed Fix Version/s: 1.6.0 Issue resolved by pull request 9797 [https://github.com/apache/spark/pull/9797] > spark_ec2.py breaks with python3 and m3 instances > - > > Key: SPARK-11837 > URL: https://issues.apache.org/jira/browse/SPARK-11837 > Project: Spark > Issue Type: Bug > Components: EC2 >Reporter: Mortada Mehyar >Priority: Minor > Fix For: 1.6.0 > > > The `spark_ec2.py` script breaks when launching an m3 instance with python3 > because `string.letters` is for python2 only. For python3 > `string.ascii_letters` should be used instead. > The PR for fixing this is here: https://github.com/apache/spark/pull/9797 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
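The Python 2/3 incompatibility behind SPARK-11837 comes down to a single attribute rename, sketched below. The `token` usage is hypothetical, in the same spirit as the script's random-identifier generation:

```python
import random
import string

# `string.letters` exists only on Python 2; on Python 3 accessing it
# raises AttributeError. `string.ascii_letters` works on both, so the
# fix is a one-word substitution.
letters = string.ascii_letters  # 'abc...xyzABC...XYZ'

# Hypothetical usage resembling the script's identifier generation:
token = "".join(random.choice(letters) for _ in range(8))
```

Because `string.ascii_letters` has existed since early Python 2 as well, the substitution keeps the script working on both interpreter lines rather than trading one breakage for another.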
[jira] [Commented] (SPARK-11836) Register a Python function creates a new SQLContext
[ https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022870#comment-15022870 ] Apache Spark commented on SPARK-11836: -- User 'davies' has created a pull request for this issue: https://github.com/apache/spark/pull/9915 > Register a Python function creates a new SQLContext > --- > > Key: SPARK-11836 > URL: https://issues.apache.org/jira/browse/SPARK-11836 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.3.0, 1.4.0, 1.5.0 >Reporter: Yin Huai >Assignee: Davies Liu >Priority: Critical > > You can try it with {{sqlContext.registerFunction("stringLengthString", > lambda x: len)}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11837) spark_ec2.py breaks with python3 and m3 instances
[ https://issues.apache.org/jira/browse/SPARK-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Josh Rosen updated SPARK-11837: --- Assignee: Mortada Mehyar > spark_ec2.py breaks with python3 and m3 instances > - > > Key: SPARK-11837 > URL: https://issues.apache.org/jira/browse/SPARK-11837 > Project: Spark > Issue Type: Bug > Components: EC2 >Reporter: Mortada Mehyar >Assignee: Mortada Mehyar >Priority: Minor > Fix For: 1.6.0 > > > The `spark_ec2.py` script breaks when launching an m3 instance with python3 > because `string.letters` is for python2 only. For python3 > `string.ascii_letters` should be used instead. > The PR for fixing this is here: https://github.com/apache/spark/pull/9797 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-9999) Dataset API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022735#comment-15022735 ] Nicholas Chammas edited comment on SPARK- at 11/23/15 8:06 PM: --- [~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will provide no performance advantages over DataFrames, and that they will just help in terms of catching type errors early? {quote} Python and R are dynamically typed so can't take advantage of these. {quote} I can't speak for R, but Python has supported type hints since 3.0. More recently, Python 3.5 introduced a [typing module|https://docs.python.org/3/library/typing.html#module-typing] to standardize how type hints are specified, which facilitates the use of static type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer a statically type checked API, but practically speaking it would have to be limited to Python 3+. I suppose people don't generally expect static type checking when they use Python, so perhaps it makes sense not to support Datasets in PySpark. was (Author: nchammas): [~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will provide no performance advantages over DataFrames, and that they will just help in terms of catching type errors early? {quote} Python and R are dynamically typed so can't take advantage of these. {quote} I can't speak for R, but Python as supported type hints since 3.0. More recently, Python 3.5 introduced a [typing module|https://docs.python.org/3/library/typing.html#module-typing] to standardize how type hints are specified, which facilitates the use of static type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer a statically type checked API, but practically speaking it would have to be limited to Python 3+. I suppose people don't generally expect static type checking when they use Python, so perhaps it makes sense not to support Datasets in PySpark. 
> Dataset API on top of Catalyst/DataFrame > > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. 
> - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] > The initial version of the Dataset API has been merged in Spark 1.6. However, > it will take
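Returning to the type-hints point raised in the comment above: a small illustration of what the `typing` module enables. The annotations below are ignored at runtime; a static checker such as mypy uses them to reject a call like `mean("abc")` before the code runs (this is a generic sketch, not a proposed PySpark API).

```python
from typing import List

def mean(xs: List[float]) -> float:
    # Annotations carry no runtime cost; they only feed static checkers.
    if not xs:
        raise ValueError("mean() of an empty sequence")
    return sum(xs) / len(xs)
```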
[jira] [Commented] (SPARK-11329) Expand Star when creating a struct
[ https://issues.apache.org/jira/browse/SPARK-11329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022876#comment-15022876 ] Maciej Bryński commented on SPARK-11329: [~yhuai] How can I execute this query: ``` SELECT max(struct(timestamp, *)) as mostRecentRecord GROUP BY key ``` without Spark SQL? I'd like to run this from PySpark. > Expand Star when creating a struct > -- > > Key: SPARK-11329 > URL: https://issues.apache.org/jira/browse/SPARK-11329 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Yin Huai >Assignee: Nong Li > Fix For: 1.6.0 > > > It is pretty common for customers to do regular extractions of update data > from an external datasource (e.g. mysql or postgres). While this is possible > today, the syntax is a little onerous. With some small improvements to the > analyzer I think we could make this much easier. > Goal: Allow users to execute the following two queries as well as their > dataframe equivalents > to find the most recent record for each key > {{SELECT max(struct(timestamp, *)) as mostRecentRecord GROUP BY key}} > to unnest the struct from above. > {{SELECT mostRecentRecord.* FROM data}} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
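Outside Spark SQL, the `max(struct(timestamp, *)) ... GROUP BY key` pattern is just an argmax by key. A plain-Python sketch of the same logic (the field names `key` and `timestamp` are assumptions matching the query above, not a Spark API):

```python
def most_recent_per_key(rows):
    # Keep, for each key, the row with the largest timestamp -- the same
    # result the query computes, since struct comparison orders by the
    # first field (timestamp) first.
    best = {}
    for row in rows:
        k = row["key"]
        if k not in best or row["timestamp"] > best[k]["timestamp"]:
            best[k] = row
    return best
```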
[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame
[ https://issues.apache.org/jira/browse/SPARK-?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022883#comment-15022883 ] Xiao Li commented on SPARK-: Agree. The major performance gain of Dataset should be from Catalyst Optimizer. > Dataset API on top of Catalyst/DataFrame > > > Key: SPARK- > URL: https://issues.apache.org/jira/browse/SPARK- > Project: Spark > Issue Type: Story > Components: SQL >Reporter: Reynold Xin >Assignee: Michael Armbrust > > The RDD API is very flexible, and as a result harder to optimize its > execution in some cases. The DataFrame API, on the other hand, is much easier > to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to > use UDFs, lack of strong types in Scala/Java). > The goal of Spark Datasets is to provide an API that allows users to easily > express transformations on domain objects, while also providing the > performance and robustness advantages of the Spark SQL execution engine. > h2. Requirements > - *Fast* - In most cases, the performance of Datasets should be equal to or > better than working with RDDs. Encoders should be as fast or faster than > Kryo and Java serialization, and unnecessary conversion should be avoided. > - *Typesafe* - Similar to RDDs, objects and functions that operate on those > objects should provide compile-time safety where possible. When converting > from data where the schema is not known at compile-time (for example data > read from an external source such as JSON), the conversion function should > fail-fast if there is a schema mismatch. > - *Support for a variety of object models* - Default encoders should be > provided for a variety of object models: primitive types, case classes, > tuples, POJOs, JavaBeans, etc. Ideally, objects that follow standard > conventions, such as Avro SpecificRecords, should also work out of the box. > - *Java Compatible* - Datasets should provide a single API that works in > both Scala and Java. 
Where possible, shared types like Array will be used in > the API. Where not possible, overloaded functions should be provided for > both languages. Scala concepts, such as ClassTags should not be required in > the user-facing API. > - *Interoperates with DataFrames* - Users should be able to seamlessly > transition between Datasets and DataFrames, without specifying conversion > boiler-plate. When names used in the input schema line-up with fields in the > given class, no extra mapping should be necessary. Libraries like MLlib > should not need to provide different interfaces for accepting DataFrames and > Datasets as input. > For a detailed outline of the complete proposed API: > [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files] > For an initial discussion of the design considerations in this API: [design > doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#] > The initial version of the Dataset API has been merged in Spark 1.6. However, > it will take a few more future releases to flush everything out. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification with an underdetermined system of equations), WLS fails. (was: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification with an underdetermined system of equations), WLS may fail. ) > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification with an > underdetermined system of equations), WLS fails. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11918) WLS can not resolve some kinds of equation
Yanbo Liang created SPARK-11918: --- Summary: WLS can not resolve some kinds of equation Key: SPARK-11918 URL: https://issues.apache.org/jira/browse/SPARK-11918 Project: Spark Issue Type: Bug Components: ML Reporter: Yanbo Liang Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the label of the dataset makes the problem very ill-conditioned (such as a 0-1 label used for classification), WLS may fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification with an underdetermined system of equations), WLS may fail. (was: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the label of the dataset makes the problem very ill-conditioned (such as a 0-1 label used for classification), WLS may fail. ) > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification with an > underdetermined system of equations), WLS may fail. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification and an underdetermined system of equations), WLS fails. The failure is caused by the underlying Cholesky decomposition. (was: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification with an underdetermined system of equations), WLS fails. ) > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification and an > underdetermined system of equations), WLS fails. The failure is caused by the underlying > Cholesky decomposition. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
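The failure mode described in this issue can be illustrated without Spark: Cholesky factorization requires a symmetric positive-definite matrix, and an ill-conditioned or underdetermined least-squares problem yields singular normal equations. A 2x2 stdlib sketch (illustrative only, not the MLlib code path):

```python
import math

def cholesky_2x2(a, b, d):
    # Factor the symmetric matrix [[a, b], [b, d]] as L * L^T.
    # For a singular (or indefinite) matrix a pivot becomes <= 0 and the
    # factorization fails -- analogous to LAPACK reporting an error for
    # the normal equations of an ill-conditioned regression.
    if a <= 0:
        raise ArithmeticError("matrix is not positive definite")
    l11 = math.sqrt(a)
    l21 = b / l11
    s = d - l21 * l21
    if s <= 0:
        raise ArithmeticError("matrix is not positive definite")
    return l11, l21, math.sqrt(s)
```

For example, [[4, 2], [2, 3]] factors cleanly, while the singular matrix [[1, 1], [1, 1]] is rejected.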
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification and an underdetermined system of equations), WLS fails. The failure is caused by the underlying Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception: {code} assertion failed: lapack.dpotrs returned 1. java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) {code} was: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification and an underdetermined system of equations), WLS fails. The failure is caused by the underlying Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
The following is the exception: {code} {code} > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification and an underdetermined > system of equations), WLS fails. The failure is caused by the underlying > Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification and an underdetermined system of equations), WLS fails. The failure is caused by the underlying Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception: {code} {code} was:Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0-1 label used for classification and an underdetermined system of equations), WLS fails. The failure is caused by the underlying Cholesky decomposition. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification and an underdetermined > system of equations), WLS fails. The failure is caused by the underlying > Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception: > {code} > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11919) graphx should be supported with java
[ https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11919. --- Resolution: Duplicate [~benedict jin] please search JIRA before opening one. This was easy to find. > graphx should be supported with java > > > Key: SPARK-11919 > URL: https://issues.apache.org/jira/browse/SPARK-11919 > Project: Spark > Issue Type: New Feature > Components: Examples, GraphX, Java API >Reporter: benedict jin > > Please add Java support for the GraphX component; a demo and Java API for > GraphX would be appreciated as soon as possible. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021787#comment-15021787 ] Sean Owen commented on SPARK-11918: --- [~yanboliang] yes, this is true in general of ill-conditioned problems. What are you proposing? To propagate the error from LAPACK in a different way? Check the condition number? Roughly speaking, this is the correct behavior, in that there's no real answer here. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification and an underdetermined > system of equations), WLS fails (but "l-bfgs" can train and produce a > model). The failure is caused by the underlying LAPACK library returning an error > value during Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11782) Master Web UI should link to correct Application UI in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-11782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021811#comment-15021811 ] Jean-Baptiste Onofré commented on SPARK-11782: -- Ah ok, understood. Let me try to reproduce and submit a fix. > Master Web UI should link to correct Application UI in cluster mode > --- > > Key: SPARK-11782 > URL: https://issues.apache.org/jira/browse/SPARK-11782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.1 >Reporter: Matthias Niehoff > > - Running a standalone cluster, with node1 as master > - Submit an application to cluster with deploy-mode=cluster > - Application driver is on node other than node1 (i.e. node3) > => master WebUI links to node1:4040 for Application Detail UI and not to > node3:4040 > As the master knows on which worker the driver is running, it should be > possible to show the correct link to the Application Detail UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11921) fix `nullable` of encoder schema
[ https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11921: Assignee: (was: Apache Spark) > fix `nullable` of encoder schema > > > Key: SPARK-11921 > URL: https://issues.apache.org/jira/browse/SPARK-11921 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11921) fix `nullable` of encoder schema
[ https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11921: Assignee: Apache Spark > fix `nullable` of encoder schema > > > Key: SPARK-11921 > URL: https://issues.apache.org/jira/browse/SPARK-11921 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11921) fix `nullable` of encoder schema
[ https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021833#comment-15021833 ] Apache Spark commented on SPARK-11921: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9906 > fix `nullable` of encoder schema > > > Key: SPARK-11921 > URL: https://issues.apache.org/jira/browse/SPARK-11921 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-6531) An Information Theoretic Feature Selection Framework
[ https://issues.apache.org/jira/browse/SPARK-6531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-6531. -- Resolution: Won't Fix > An Information Theoretic Feature Selection Framework > > > Key: SPARK-6531 > URL: https://issues.apache.org/jira/browse/SPARK-6531 > Project: Spark > Issue Type: New Feature > Components: MLlib >Affects Versions: 1.3.0 >Reporter: Sergio Ramírez > > **Information Theoretic Feature Selection Framework** > The present framework implements Feature Selection (FS) on Spark for its > application to Big Data problems. This package contains a generic > implementation of greedy Information Theoretic Feature Selection methods. The > implementation is based on the common theoretic framework presented in [1]. > Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are > provided. In addition, the framework can be extended with other criteria > provided by the user, as long as the process complies with the framework > proposed in [1]. > -- Main features: > * Support for sparse data (in progress). > * Pool optimization for high-dimensional data. > * Improved performance over the previous version. > This work has two associated contributions submitted to international > journals, which will be attached to this request as soon as they are accepted. > This software has been tested with two large real-world datasets: > - A dataset selected for the GECCO-2014 competition (Vancouver, July 13th, 2014), > which comes from the Protein Structure Prediction field > (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, > 631 attributes, 2 classes, 98% negative examples, and occupies, when > uncompressed, about 56GB of disk space. > - Epsilon dataset: > http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. > 400K instances and 2K attributes. > -- Brief benchmark results: > * 150 seconds per selected feature for a 65M-instance dataset with 631 attributes. 
> * For the epsilon dataset, we outperformed the results without FS for three > classifiers (from MLlib) using only 2.5% of the original features. > Design doc: > https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing > References > [1] Brown, G., Pocock, A., Zhao, M. J., & Luján, M. (2012). > "Conditional likelihood maximisation: a unifying framework for information > theoretic feature selection." > The Journal of Machine Learning Research, 13(1), 27-66. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
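The filters named in this issue (mRMR, InfoGain, JMI) all score features with sums of mutual-information terms, per the framework of [1]. A small stdlib sketch of the basic quantity for discrete variables (illustrative only, not the package's distributed implementation):

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    # I(X;Y) = sum over (x, y) of p(x, y) * log(p(x, y) / (p(x) * p(y))),
    # in nats. Greedy information-theoretic filters rank candidate
    # features using scores built from terms of exactly this form.
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log((c / n) / ((px[x] / n) * (py[y] / n)))
               for (x, y), c in pxy.items())
```

Identical binary variables give I = log 2 nats, while independent ones give 0.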
[jira] [Closed] (SPARK-11915) Fix flaky python test pyspark.sql.group
[ https://issues.apache.org/jira/browse/SPARK-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Liang-Chi Hsieh closed SPARK-11915. --- Resolution: Not A Problem > Fix flaky python test pyspark.sql.group > --- > > Key: SPARK-11915 > URL: https://issues.apache.org/jira/browse/SPARK-11915 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Reporter: Liang-Chi Hsieh > > The python test pyspark.sql.group will fail due to items' order in returned > array. We should sort the aggregation results to make the test stable. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
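The fix this issue suggests, sorting aggregation results before comparing, is a standard way to make order-insensitive assertions stable. A hedged sketch (the helper name is illustrative, not the actual pyspark.sql.group test code):

```python
def assert_same_rows(actual, expected):
    # Aggregation output order is nondeterministic across partitions, so a
    # stable test compares sorted copies instead of relying on ordering.
    assert sorted(actual) == sorted(expected), (sorted(actual), sorted(expected))
```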
[jira] [Created] (SPARK-11925) Add PySpark missing methods for ml.feature
Yanbo Liang created SPARK-11925: --- Summary: Add PySpark missing methods for ml.feature Key: SPARK-11925 URL: https://issues.apache.org/jira/browse/SPARK-11925 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add PySpark missing methods for ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11925: Assignee: Apache Spark > Add PySpark missing methods for ml.feature during Spark 1.6 QA > -- > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > Add PySpark missing methods for ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11925: Assignee: (was: Apache Spark) > Add PySpark missing methods for ml.feature during Spark 1.6 QA > -- > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add PySpark missing methods for ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021955#comment-15021955 ] Apache Spark commented on SPARK-11925: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/9908 > Add PySpark missing methods for ml.feature during Spark 1.6 QA > -- > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add PySpark missing methods for ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11604: Description: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. List the found issues: * Inconsistency: * Docs: ** ml.classification SPARK-11875 * Missing classes ** ml.feature *** QuantileDiscretizer SPARK-11922 *** ChiSqSelector SPARK-11923 * Missing methods/parameters ** ml.classification SPARK-11815 SPARK-11820 ** ml.feature SPARK-11925 was: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. 
* Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. List the found issues: * Inconsistency: * Docs: ** ml.classification SPARK-11875 * Missing classes ** ml.feature *** QuantileDiscretizer SPARK-11922 *** ChiSqSelector SPARK-11923 * Missing methods/parameters ** ml.classification SPARK-11815 SPARK-11820 ** ml.feature > ML 1.6 QA: API: Python API coverage > --- > > Key: SPARK-11604 > URL: https://issues.apache.org/jira/browse/SPARK-11604 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. 
> List the found issues: > * Inconsistency: > * Docs: > ** ml.classification SPARK-11875 > * Missing classes > ** ml.feature > *** QuantileDiscretizer SPARK-11922 > *** ChiSqSelector SPARK-11923 > * Missing methods/parameters > ** ml.classification SPARK-11815 SPARK-11820 > ** ml.feature SPARK-11925 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11924) SparkContext stop method does not close HiveContexts
Sylvain Lequeux created SPARK-11924: --- Summary: SparkContext stop method does not close HiveContexts Key: SPARK-11924 URL: https://issues.apache.org/jira/browse/SPARK-11924 Project: Spark Issue Type: Bug Affects Versions: 1.5.1 Reporter: Sylvain Lequeux In my unit tests, I create one HiveContext inside each test suite, creating the HiveContext in beforeAll and calling SparkContext stop in afterAll. If I have 2 test suites, it raises an exception at runtime: "Another instance of Derby may have already booted the database". It works with Spark 1.4.1 but does not work in 1.5.1, and it works fine using a standard SQLContext instead of a HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11925: Summary: Add PySpark missing methods for ml.feature during Spark 1.6 QA (was: Add PySpark missing methods for ml.feature during 1.6 QA) > Add PySpark missing methods for ml.feature during Spark 1.6 QA > -- > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add PySpark missing methods for ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11924) SparkContext stop method does not close HiveContexts
[ https://issues.apache.org/jira/browse/SPARK-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021953#comment-15021953 ] Sean Owen commented on SPARK-11924: --- I'm not sure this works in general outside tests, or is supposed to, but clearly the tests are able to test {{HiveContext}} somehow. Are you able to use {{TestHiveContext}} instead or adopt its way of dealing with this? > SparkContext stop method does not close HiveContexts > > > Key: SPARK-11924 > URL: https://issues.apache.org/jira/browse/SPARK-11924 > Project: Spark > Issue Type: Bug >Affects Versions: 1.5.1 >Reporter: Sylvain Lequeux > > In my unit tests, I create one HiveContext inside each test suite using a > HiveContext creation in beforeAll and a SparkContext stop call in afterAll. > If I have 2 test suites, then it raise an exception at runtime : > "Another instance of Derby may have already booted the database" > It works with Spark 1.4.1 but does not work in 1.5.1. > It works fine using standard SQLContext instead of HiveContext. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11925) Add PySpark missing methods for ml.feature during 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11925: Summary: Add PySpark missing methods for ml.feature during 1.6 QA (was: Add PySpark missing methods for ml.feature) > Add PySpark missing methods for ml.feature during 1.6 QA > > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add PySpark missing methods for ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA
[ https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11925: Description: Add PySpark missing methods and params for ml.feature * RegexTokenizer should support setting toLowercase. * MinMaxScalerModel should support output originalMin and originalMax. * PCAModel should support output pc. was:Add PySpark missing methods for ml.feature > Add PySpark missing methods for ml.feature during Spark 1.6 QA > -- > > Key: SPARK-11925 > URL: https://issues.apache.org/jira/browse/SPARK-11925 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > > Add PySpark missing methods and params for ml.feature > * RegexTokenizer should support setting toLowercase. > * MinMaxScalerModel should support output originalMin and originalMax. > * PCAModel should support output pc. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11782) Master Web UI should link to correct Application UI in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-11782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021730#comment-15021730 ] Matthias Niehoff commented on SPARK-11782: -- I submit the app with deploy-mode cluster, so the driver is started inside the cluster. It can then be on any node, not necessarily the node where spark-submit was executed. > Master Web UI should link to correct Application UI in cluster mode > --- > > Key: SPARK-11782 > URL: https://issues.apache.org/jira/browse/SPARK-11782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.1 >Reporter: Matthias Niehoff > > - Running a standalone cluster, with node1 as master > - Submit an application to cluster with deploy-mode=cluster > - Application driver is on node other than node1 (i.e. node3) > => master WebUI links to node1:4040 for Application Detail UI and not to > node3:4040 > As the master knows on which worker the driver is running, it should be > possible to show the correct link to the Application Detail UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729 ] Yanbo Liang commented on SPARK-11918: - Furthermore, I used the Breeze library to train the model with a local normal-equation method. {code} import sqlCtx.implicits._ import org.apache.spark.mllib.linalg.Vector import breeze.linalg.DenseMatrix import breeze.linalg._ val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF() val features = df.select(col("features")).map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray } val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect() val Xt = new DenseMatrix[Double](692, 100, features) val X = Xt.t val y = new DenseMatrix[Double](100, 1, labelArray) val XtXi = inv(Xt * X) val XtY = Xt * y val coefs = XtXi * XtY println(coefs.toString) {code} It also throws an exception. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0-1 label used for classification, where the equation > is underdetermined), WLS fails. The failure is caused by the underlying > Cholesky decomposition. > This issue is easy to reproduce: you can train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
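The failure mode reported above can be checked without Spark or Breeze: with only 100 examples and 692 features, X^T X has rank at most 100, so it cannot be positive definite and a Cholesky factorization must fail. Below is a minimal pure-Python sketch of that rank argument, using tiny illustrative shapes (2 examples, 3 features) rather than the actual dataset:

```python
import math

def cholesky(a, tol=1e-8):
    """Plain Cholesky factorization; raises ValueError when the matrix is
    not positive definite, mirroring the lapack failure reported above."""
    n = len(a)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= tol:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(d)
            else:
                L[i][j] = (a[i][j] - s) / L[j][j]
    return L

# Two examples, three features -> X^T X (3x3) has rank at most 2.
X = [[1.0, 2.0, 3.0],
     [4.0, 5.0, 6.0]]
XtX = [[sum(X[k][i] * X[k][j] for k in range(2)) for j in range(3)]
       for i in range(3)]

try:
    cholesky(XtX)
    singular = False
except ValueError:
    singular = True
print(singular)  # True: the normal-equation matrix cannot be factored
```

The same argument applies to the 100x692 libsvm sample: any underdetermined system makes the normal-equation ("normal" solver) path fail, which is why the comment's Breeze reproduction throws as well.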
[jira] [Created] (SPARK-11919) graphx should be supported with java
benedict jin created SPARK-11919: Summary: graphx should be supported with java Key: SPARK-11919 URL: https://issues.apache.org/jira/browse/SPARK-11919 Project: Spark Issue Type: Bug Components: Examples, GraphX, Java API Reporter: benedict jin Please add Java support for the GraphX component; a Java API and demo examples for GraphX would be much appreciated, as soon as possible :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
Yanbo Liang created SPARK-11920: --- Summary: ML LinearRegression should use correct dataset in examples and user guide doc Key: SPARK-11920 URL: https://issues.apache.org/jira/browse/SPARK-11920 Project: Spark Issue Type: Improvement Components: Documentation, ML Reporter: Yanbo Liang Priority: Minor ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. Another reason is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to its ill-conditioning and unsuitable labels; this has been reported in SPARK-11918. Making this change in the examples and user guides will more clearly illustrate the usage of the LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Resolved] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)
[ https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen resolved SPARK-11909. --- Resolution: Won't Fix > Spark Standalone's master URL accepts URLs without port (assuming default > 7077) > --- > > Key: SPARK-11909 > URL: https://issues.apache.org/jira/browse/SPARK-11909 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Affects Versions: 1.6.0 >Reporter: Jacek Laskowski >Priority: Trivial > > It's currently impossible to use {{spark://localhost}} URL for Spark > Standalone's master. With the feature supported, it'd be less to know to get > started with the mode (and hence improve user friendliness). > I think no-port master URL should be supported and assume the default port > {{7077}}. > {code} > org.apache.spark.SparkException: Invalid master URL: spark://localhost > at > org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088) > at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47) > at > org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48) > at > org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245) > at > scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33) > at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186) > at scala.collection.TraversableLike$class.map(TraversableLike.scala:245) > at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186) > at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48) > at > org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93) > at > org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144) > at 
org.apache.spark.SparkContext.(SparkContext.scala:530) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
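The requested behavior is straightforward to sketch. The helper below is a hypothetical Python analogue (the real change would live in Scala's Utils.extractHostPortFromSparkUrl, which currently rejects URLs without an explicit port); it accepts a spark://host URL and falls back to the default port 7077:

```python
from urllib.parse import urlparse

def extract_host_port(master_url, default_port=7077):
    """Parse spark://host[:port], assuming port 7077 when none is given.
    Illustrative sketch only; not the actual Spark implementation."""
    parsed = urlparse(master_url)
    if parsed.scheme != "spark" or not parsed.hostname:
        raise ValueError("Invalid master URL: " + master_url)
    return parsed.hostname, parsed.port or default_port

print(extract_host_port("spark://localhost"))       # ('localhost', 7077)
print(extract_host_port("spark://localhost:7078"))  # ('localhost', 7078)
```

With this fallback, `spark://localhost` would become a valid master URL instead of raising the SparkException shown in the stack trace above.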
[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights
[ https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021867#comment-15021867 ] Apache Spark commented on SPARK-11520: -- User 'Lewuathe' has created a pull request for this issue: https://github.com/apache/spark/pull/9907 > RegressionMetrics should support instance weights > - > > Key: SPARK-11520 > URL: https://issues.apache.org/jira/browse/SPARK-11520 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This will be important to improve LinearRegressionSummary, which currently > has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11520) RegressionMetrics should support instance weights
[ https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11520: Assignee: Apache Spark > RegressionMetrics should support instance weights > - > > Key: SPARK-11520 > URL: https://issues.apache.org/jira/browse/SPARK-11520 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley >Assignee: Apache Spark > > This will be important to improve LinearRegressionSummary, which currently > has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11922) Python API for ml.feature.QuantileDiscretizer
[ https://issues.apache.org/jira/browse/SPARK-11922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11922: Labels: starter (was: ) > Python API for ml.feature.QuantileDiscretizer > -- > > Key: SPARK-11922 > URL: https://issues.apache.org/jira/browse/SPARK-11922 > Project: Spark > Issue Type: Improvement > Components: ML, PySpark >Reporter: Yanbo Liang >Priority: Minor > Labels: starter > > Add Python API for ml.feature.QuantileDiscretizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11604: Description: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. List the found issues: * Inconsistency: * Docs: ** ml.classification SPARK-11875 * Missing classes ** ml.feature *** QuantileDiscretizer SPARK-11922 *** ChiSqSelector SPARK-11923 * Missing methods/parameters ** ml.classification SPARK-11815 SPARK-11820 ** ml.feature was: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. 
Please use a *separate* JIRA (linked below) for this list of to-do items. List the found issues: * Inconsistency: * Docs: ** ml.classification SPARK-11875 * Missing classes ** ml.classification SPARK-11815 SPARK-11820 ** ml.feature *** QuantileDiscretizer SPARK-11922 *** ChiSqSelector SPARK-11923 * Missing methods/parameters ** ml.feature > ML 1.6 QA: API: Python API coverage > --- > > Key: SPARK-11604 > URL: https://issues.apache.org/jira/browse/SPARK-11604 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. > List the found issues: > * Inconsistency: > * Docs: > ** ml.classification SPARK-11875 > * Missing classes > ** ml.feature > *** QuantileDiscretizer SPARK-11922 > *** ChiSqSelector SPARK-11923 > * Missing methods/parameters > ** ml.classification SPARK-11815 SPARK-11820 > ** ml.feature -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11926) unify GetStructField and GetInternalRowField
Wenchen Fan created SPARK-11926: --- Summary: unify GetStructField and GetInternalRowField Key: SPARK-11926 URL: https://issues.apache.org/jira/browse/SPARK-11926 Project: Spark Issue Type: Improvement Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11926) unify GetStructField and GetInternalRowField
[ https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021984#comment-15021984 ] Apache Spark commented on SPARK-11926: -- User 'cloud-fan' has created a pull request for this issue: https://github.com/apache/spark/pull/9909 > unify GetStructField and GetInternalRowField > > > Key: SPARK-11926 > URL: https://issues.apache.org/jira/browse/SPARK-11926 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11926) unify GetStructField and GetInternalRowField
[ https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11926: Assignee: Apache Spark > unify GetStructField and GetInternalRowField > > > Key: SPARK-11926 > URL: https://issues.apache.org/jira/browse/SPARK-11926 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan >Assignee: Apache Spark > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11926) unify GetStructField and GetInternalRowField
[ https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11926: Assignee: (was: Apache Spark) > unify GetStructField and GetInternalRowField > > > Key: SPARK-11926 > URL: https://issues.apache.org/jira/browse/SPARK-11926 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Wenchen Fan > -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-3947) Support Scala/Java UDAF
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Milad Bourhani updated SPARK-3947: -- Attachment: logs.zip > Support Scala/Java UDAF > --- > > Key: SPARK-3947 > URL: https://issues.apache.org/jira/browse/SPARK-3947 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Pei-Lun Lee >Assignee: Yin Huai > Fix For: 1.5.0 > > Attachments: logs.zip, spark-udaf-adapted-1.5.2.zip, spark-udaf.zip > > > Right now only Hive UDAFs are supported. It would be nice to have UDAF > similar to UDF through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-3947) Support Scala/Java UDAF
[ https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022051#comment-15022051 ] Milad Bourhani commented on SPARK-3947: --- Just for completeness, I'm attaching the [^logs.zip]. For the record, it looks as if the first time you run the clustered computation (right after you started {{./sbin/start-all.sh}}), the computation is OK, even though the race condition error shows up in the log. After that, it fails. So the attached logs contain exactly two executions: the first gives a correct answer, the second doesn't. To reproduce, run these commands on the unzipped project [^spark-udaf-adapted-1.5.2.zip]: {noformat} mvn clean install java -jar `ls target/uber*.jar` `ls target/uber*.jar` spark://master_host:7077 java -jar `ls target/uber*.jar` `ls target/uber*.jar` spark://master_host:7077 {noformat} where {{spark://master_host:7077}} is your master URL. > Support Scala/Java UDAF > --- > > Key: SPARK-3947 > URL: https://issues.apache.org/jira/browse/SPARK-3947 > Project: Spark > Issue Type: Sub-task > Components: SQL >Reporter: Pei-Lun Lee >Assignee: Yin Huai > Fix For: 1.5.0 > > Attachments: logs.zip, spark-udaf-adapted-1.5.2.zip, spark-udaf.zip > > > Right now only Hive UDAFs are supported. It would be nice to have UDAF > similar to UDF through SQLContext.registerFunction. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired
[ https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022081#comment-15022081 ] Jayson Minard commented on SPARK-939: - Ok, but how does this actually work? Any time we add Jackson 2.6.x to our classes then it crashes Spark when run on AWS EMR. Obviously our JAR is not isolated from Spark. > Allow user jars to take precedence over Spark jars, if desired > -- > > Key: SPARK-939 > URL: https://issues.apache.org/jira/browse/SPARK-939 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Patrick Wendell >Assignee: holdenk >Priority: Blocker > Labels: starter > Fix For: 1.0.0 > > > Sometimes a user may want to include their own version of a jar that spark > itself uses. For example, if their code requires a newer version of that jar > than Spark offers. It would be good to have an option to give the users > dependencies precedence over Spark. This options should be disabled by > default, since it could lead to some odd behavior (e.g. parts of Spark not > working). But I think we should have it. > From an implementation perspective, this would require modifying the way we > do class loading inside of an Executor. The default behavior of the > URLClassLoader is to delegate to it's parent first and, if that fails, to > find a class locally. We want to have the opposite behavior. This is > sometimes referred to as "parent-last" (as opposed to "parent-first") class > loading precedence. There is an example of how to do this here: > http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr > We should write a similar class which can encapsulate a URL classloader and > change the delegation order. Or if possible, maybe we could find a more > elegant way to do this. 
See relevant discussion on the user list here: > https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g > Also see the corresponding option in Hadoop: > https://issues.apache.org/jira/browse/MAPREDUCE-4521 > Some other relevant Hadoop JIRA's: > https://issues.apache.org/jira/browse/MAPREDUCE-1700 > https://issues.apache.org/jira/browse/MAPREDUCE-1938 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
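The delegation orders under discussion can be modeled without any JVM machinery. Below is a hedged Python sketch in which plain dictionaries stand in for the Spark and user-jar class paths (the names and version numbers are illustrative only); it shows how "parent-last" lookup lets a user-supplied Jackson 2.6 shadow the copy Spark ships, while the default "parent-first" order does the opposite:

```python
class Loader:
    """Toy class loader: a name->version table plus an optional parent."""

    def __init__(self, table, parent=None, parent_last=False):
        self.table = table
        self.parent = parent
        self.parent_last = parent_last

    def load(self, name):
        if self.parent_last:
            # "parent-last" / child-first: look locally, then fall back
            if name in self.table:
                return self.table[name]
            if self.parent is not None:
                return self.parent.load(name)
        else:
            # default URLClassLoader behavior: delegate to the parent first
            if self.parent is not None:
                try:
                    return self.parent.load(name)
                except KeyError:
                    pass
            if name in self.table:
                return self.table[name]
        raise KeyError(name)

spark = Loader({"jackson": "2.4"})  # versions Spark ships (illustrative)
user_default = Loader({"jackson": "2.6"}, parent=spark)
user_first = Loader({"jackson": "2.6"}, parent=spark, parent_last=True)

print(user_default.load("jackson"))  # 2.4: Spark's copy shadows the user's
print(user_first.load("jackson"))    # 2.6: the user jar takes precedence
```

This is the behavior the `spark.executor.userClassPathFirst`-style option toggles; as the comments above note, flipping the order trades one direction of conflict for the other, which is why shading common dependencies is also suggested.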
[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired
[ https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022084#comment-15022084 ] Jayson Minard commented on SPARK-939: - I see, this now reverses the problem: we can crash Spark, but Spark can't crash us. Maybe Spark could shade/rename the common libraries it uses that are likely to conflict (Jackson is one example). > Allow user jars to take precedence over Spark jars, if desired > -- > > Key: SPARK-939 > URL: https://issues.apache.org/jira/browse/SPARK-939 > Project: Spark > Issue Type: New Feature > Components: Spark Core >Reporter: Patrick Wendell >Assignee: holdenk >Priority: Blocker > Labels: starter > Fix For: 1.0.0 > > > Sometimes a user may want to include their own version of a jar that Spark > itself uses, for example if their code requires a newer version of that jar > than Spark offers. It would be good to have an option to give the user's > dependencies precedence over Spark's. This option should be disabled by > default, since it could lead to some odd behavior (e.g. parts of Spark not > working), but I think we should have it. > From an implementation perspective, this would require modifying the way we > do class loading inside of an Executor. The default behavior of the > URLClassLoader is to delegate to its parent first and, if that fails, to > find the class locally. We want the opposite behavior. This is > sometimes referred to as "parent-last" (as opposed to "parent-first") class > loading precedence. There is an example of how to do this here: > http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr > We should write a similar class which encapsulates a URLClassLoader and > reverses the delegation order. Or, if possible, maybe we could find a more > elegant way to do this. 
See relevant discussion on the user list here: > https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g > Also see the corresponding option in Hadoop: > https://issues.apache.org/jira/browse/MAPREDUCE-4521 > Some other relevant Hadoop JIRAs: > https://issues.apache.org/jira/browse/MAPREDUCE-1700 > https://issues.apache.org/jira/browse/MAPREDUCE-1938 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results
[ https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022053#comment-15022053 ] Milad Bourhani commented on SPARK-11885: I've attached the logs to SPARK-3947, just to keep all the attachments there; of course, feel free to move/copy them here :) > UDAF may nondeterministically generate wrong results > > > Key: SPARK-11885 > URL: https://issues.apache.org/jira/browse/SPARK-11885 > Project: Spark > Issue Type: Bug > Components: SQL >Affects Versions: 1.5.2 >Reporter: Yin Huai >Assignee: Yin Huai >Priority: Critical > > I could not reproduce it in the 1.6 branch (it can easily be reproduced in 1.5). > I think it is an issue in the 1.5 branch -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11927) configure log4j properties with spark-submit
Alex Kazantsev created SPARK-11927: -- Summary: configure log4j properties with spark-submit Key: SPARK-11927 URL: https://issues.apache.org/jira/browse/SPARK-11927 Project: Spark Issue Type: Question Components: Spark Submit Affects Versions: 1.5.1 Reporter: Alex Kazantsev Priority: Minor How can log4j properties be configured on a worker, per application, using the spark-submit script? Currently, setting --conf 'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:"log4j.properties"' together with --files log4j.properties does not work because, according to the worker logs, the specified log4j configuration is loaded before any files are downloaded from the driver. Is this a bug or a feature? Is it possible to reconfigure the log4j properties after the properties file has been downloaded from the driver? The application was submitted with the following script:
{noformat}
exec /opt/spark/current/bin/spark-submit \
  --name App \
  --master spark://master:17079 \
  --executor-memory 4G \
  --total-executor-cores 4 \
  --driver-java-options '-Dspark.ui.port=4056 -Dconfig.file=application.conf -Dlog4j.configuration=file:"./log4j.properties"' \
  --conf 'spark.executor.extraJavaOptions=-XX:+UseParallelGC -Duser.timezone=GMT -Dconfig.file=application.conf -Dlog4j.configuration=file:"log4j.properties"' \
  --files application.conf,log4j.properties \
  --class default.Main \
  App.jar $*
{noformat}
Worker logs: {noformat} log4j:ERROR Could not read configuration file from URL [file:log4j.properties]. 
java.io.FileNotFoundException: log4j.properties (No such file or directory) at java.io.FileInputStream.open(Native Method) at java.io.FileInputStream.(FileInputStream.java:146) at java.io.FileInputStream.(FileInputStream.java:101) at sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90) at sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188) at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557) at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526) at org.apache.log4j.LogManager.(LogManager.java:127) at org.apache.spark.Logging$class.initializeLogging(Logging.scala:122) at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:107) at org.apache.spark.Logging$class.log(Logging.scala:51) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.log(CoarseGrainedExecutorBackend.scala:136) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:147) at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:250) at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala) log4j:ERROR Ignoring configuration file [file:log4j.properties]. Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties 15/11/23 11:47:30 INFO CoarseGrainedExecutorBackend: Registered signal handlers for [TERM, HUP, INT] 15/11/23 11:47:30 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... 
using builtin-java classes where applicable 15/11/23 11:47:30 INFO SecurityManager: Changing view acls to: root 15/11/23 11:47:30 INFO SecurityManager: Changing modify acls to: root 15/11/23 11:47:30 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 15/11/23 11:47:31 INFO Slf4jLogger: Slf4jLogger started 15/11/23 11:47:31 INFO Remoting: Starting remoting 15/11/23 11:47:31 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://driverPropsFetcher@10.1.1.102:33953] 15/11/23 11:47:31 INFO Utils: Successfully started service 'driverPropsFetcher' on port 33953. 15/11/23 11:47:31 INFO SecurityManager: Changing view acls to: root 15/11/23 11:47:31 INFO SecurityManager: Changing modify acls to: root 15/11/23 11:47:31 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root) 15/11/23 11:47:31 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down remote daemon. 15/11/23 11:47:32 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon shut down; proceeding with flushing remote transports. 15/11/23 11:47:32 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut down. 15/11/23 11:47:32 INFO Slf4jLogger: Slf4jLogger started 15/11/23 11:47:32 INFO Remoting: Starting remoting 15/11/23 11:47:32 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@10.1.1.102:39111] 15/11/23 11:47:32 INFO Utils: Successfully started service
[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext
[ https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022061#comment-15022061 ] Dmytro Bielievtsov commented on SPARK-10872: Removing {{metastore_db/dbex.lck}} right before {{sc = SparkContext("local\[*]", "app2")}} precludes the error, but it's a dangerous workaround. Having something like {{HiveContext.stop()}} that releases the locks would be best. > Derby error (XSDB6) when creating new HiveContext after restarting > SparkContext > --- > > Key: SPARK-10872 > URL: https://issues.apache.org/jira/browse/SPARK-10872 > Project: Spark > Issue Type: Bug > Components: PySpark, SQL >Affects Versions: 1.4.0, 1.4.1, 1.5.0 >Reporter: Dmytro Bielievtsov > > Starting from spark 1.4.0 (works well on 1.3.1), the following code fails > with "XSDB6: Another instance of Derby may have already booted the database > ~/metastore_db": > {code:python} > from pyspark import SparkContext, HiveContext > sc = SparkContext("local[*]", "app1") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() > sc.stop() > sc = SparkContext("local[*]", "app2") > sql = HiveContext(sc) > sql.createDataFrame([[1]]).collect() # Py4J error > {code} > This is related to [#SPARK-9539], and I intend to restart spark context > several times for isolated jobs to prevent cache cluttering and GC errors. > Here's a larger part of the full error trace: > {noformat} > Failed to start database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. > org.datanucleus.exceptions.NucleusDataStoreException: Failed to start > database 'metastore_db' with class loader > org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see > the next exception for details. 
> at > org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516) > at > org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298) > at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) > at > sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57) > at > sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) > at java.lang.reflect.Constructor.newInstance(Constructor.java:526) > at > org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631) > at > org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301) > at > org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187) > at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333) > at > org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202) > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57) > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > at java.lang.reflect.Method.invoke(Method.java:606) > at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965) > at java.security.AccessController.doPrivileged(Native Method) > at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960) > at > javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166) > at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808) > at 
javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365) > at > org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394) > at > org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291) > at > org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258) > at > org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73) > at > org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57) > at > org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66) > at >
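The lock-file workaround mentioned in the comment above (removing Derby's lock file right before starting the next context) can be sketched as a small helper. The class name and the set of lock files ({{db.lck}}, {{dbex.lck}}) are assumptions for illustration; this remains a fragile workaround, not a substitute for a proper {{HiveContext.stop()}} that releases the locks:

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper: delete Derby's stale lock files under metastore_db so
// the embedded metastore can boot again after the previous context went away.
public class DerbyLockCleanup {
    // Returns the number of lock files actually removed.
    public static int removeLocks(Path metastoreDb) throws IOException {
        int removed = 0;
        for (String lock : new String[] {"db.lck", "dbex.lck"}) {
            if (Files.deleteIfExists(metastoreDb.resolve(lock))) {
                removed++;
            }
        }
        return removed;
    }
}
```

This is only safe when no other JVM still holds the database; deleting the lock out from under a live Derby instance can corrupt the metastore, which is why the comment calls it a dangerous workaround.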
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 classification label that leaves the equation underdetermined), WLS fails. The failure is caused by the underlying LAPACK routine returning an error value during the Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception: {code} assertion failed: lapack.dpotrs returned 1. java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) {code} was: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 classification label that leaves the equation underdetermined), WLS fails. The failure is caused by the underlying Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception: {code} assertion failed: lapack.dpotrs returned 1. java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) {code} It's caused by the underlying LAPACK routine returning an error value. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0/1 classification label that leaves the equation > underdetermined), WLS fails. The failure is caused by the underlying > LAPACK routine returning an error value during the Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Description: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 classification label that leaves the equation underdetermined), WLS fails (but "l-bfgs" can still train and produce a model). The failure is caused by the underlying LAPACK routine returning an error value during the Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception: {code} assertion failed: lapack.dpotrs returned 1. java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) {code} was: Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (such as a 0/1 classification label that leaves the equation underdetermined), WLS fails (but the "l-bfgs" solver can still train and produce a model). The failure is caused by the underlying LAPACK routine returning an error value during the Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception: {code} assertion failed: lapack.dpotrs returned 1. java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. at scala.Predef$.assert(Predef.scala:179) at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) {code} > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0/1 classification label that leaves the equation > underdetermined), WLS fails (but "l-bfgs" can still train and produce a > model). The failure is caused by the underlying LAPACK routine returning an > error value during the Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11919) graphx should be supported with java
[ https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benedict jin updated SPARK-11919: - Issue Type: New Feature (was: Bug) > graphx should be supported with java > > > Key: SPARK-11919 > URL: https://issues.apache.org/jira/browse/SPARK-11919 > Project: Spark > Issue Type: New Feature > Components: Examples, GraphX, Java API >Reporter: benedict jin > > Please add Java support for the GraphX component; I hope a demo and a Java API > for GraphX will appear as soon as possible :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021756#comment-15021756 ] Yanbo Liang commented on SPARK-11918: - So far I suspect this is not a bug in MLlib; rather, a very ill-conditioned problem may simply not be suitable for the "normal" equation method. If this assumption is right, I think we should document this issue. Looking forward to your comments [~mengxr]. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0/1 classification label that leaves the equation > underdetermined), WLS fails (but "l-bfgs" can still train and produce a > model). The failure is caused by the underlying LAPACK routine returning an > error value during the Cholesky decomposition. > This issue is easy to reproduce: train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
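The suspicion in the comment above can be checked in isolation: for an underdetermined system, the Gram matrix A^T A of the normal equations is singular, so a Cholesky factorization hits a non-positive pivot, which is exactly the condition that makes LAPACK's dpotrf/dpotrs report a nonzero return code. A from-scratch sketch, not MLlib's code (class and method names are illustrative):

```java
// Minimal Cholesky factorization that reports failure on a non-positive pivot,
// mimicking the error condition LAPACK signals for a singular normal equation.
public class CholeskyDemo {
    // Returns false if the symmetric matrix a is not positive definite.
    public static boolean cholesky(double[][] a) {
        int n = a.length;
        double[][] l = new double[n][n];
        for (int j = 0; j < n; j++) {
            double d = a[j][j];
            for (int k = 0; k < j; k++) {
                d -= l[j][k] * l[j][k];
            }
            if (d <= 1e-12) {
                return false; // non-positive pivot: dpotrf would return j+1 here
            }
            l[j][j] = Math.sqrt(d);
            for (int i = j + 1; i < n; i++) {
                double s = a[i][j];
                for (int k = 0; k < j; k++) {
                    s -= l[i][k] * l[j][k];
                }
                l[i][j] = s / l[j][j];
            }
        }
        return true;
    }
}
```

For a single observation with two features, A = [1 2], the Gram matrix A^T A = [[1, 2], [2, 4]] has rank one, so the factorization fails at the second pivot, while a well-conditioned positive definite matrix factors cleanly. That matches the behavior reported here: the failure is a property of the normal-equation approach on such data, not of the Cholesky code itself.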
[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11920: Description: ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. The deeper cause is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to the ill conditioning and unreasonable labels. This issue has been reported as SPARK-11918. So we should make this change in the examples and user guides so that they clearly illustrate the usage of the LinearRegression algorithm. was: ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. The deeper-level reason is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to the ill conditioning and unreasonable labels. This issue has been reported as SPARK-11918. So we should make this change in the examples and user guides so that they clearly illustrate the usage of the LinearRegression algorithm. > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Priority: Minor > > ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in > the examples and the user guide doc, but it is actually a classification dataset rather > than a regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. 
> The deeper cause is that LinearRegression with the "normal" solver cannot solve > this dataset correctly, possibly due to the ill conditioning and unreasonable > labels. This issue has been reported as SPARK-11918. > So we should make this change in the examples and user guides so that they clearly > illustrate the usage of the LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11920: Description: ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. The deeper-level reason is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to the ill conditioning and unreasonable labels. This issue has been reported as SPARK-11918. So we should make this change in the examples and user guides so that they clearly illustrate the usage of the LinearRegression algorithm. was: ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. Another reason is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to the ill conditioning and unreasonable labels. This issue has been reported as SPARK-11918. So we should make this change in the examples and user guides so that they clearly illustrate the usage of the LinearRegression algorithm. > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Priority: Minor > > ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in > the examples and the user guide doc, but it is actually a classification dataset rather > than a regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. 
> The deeper-level reason is that LinearRegression with the "normal" solver cannot > solve this dataset correctly, possibly due to the ill conditioning and > unreasonable labels. This issue has been reported as SPARK-11918. > So we should make this change in the examples and user guides so that they clearly > illustrate the usage of the LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11920: Assignee: (was: Apache Spark) > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Priority: Minor > > ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in > examples and user guide doc, but it's actually classification dataset rather > than regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. > The deeper causes is that LinearRegression with "normal" solver can not solve > this dataset correctly, may be due to the ill condition and unreasonable > label. This issue has been reported at SPARK-11918. > So we should make this change in examples and user guides, that can clearly > illustrate the usage of LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Assigned] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11920: Assignee: Apache Spark > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Assignee: Apache Spark >Priority: Minor > > ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in > examples and user guide doc, but it's actually classification dataset rather > than regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. > The deeper causes is that LinearRegression with "normal" solver can not solve > this dataset correctly, may be due to the ill condition and unreasonable > label. This issue has been reported at SPARK-11918. > So we should make this change in examples and user guides, that can clearly > illustrate the usage of LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11919) graphx should be supported with java
[ https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021807#comment-15021807 ] benedict jin commented on SPARK-11919: -- Thanks a lot, my bad. Please help me get this JIRA closed. > graphx should be supported with java > > > Key: SPARK-11919 > URL: https://issues.apache.org/jira/browse/SPARK-11919 > Project: Spark > Issue Type: New Feature > Components: Examples, GraphX, Java API >Reporter: benedict jin > > Please add Java support for the GraphX component; I hope a demo and a Java API > for GraphX will appear as soon as possible. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021806#comment-15021806 ] Apache Spark commented on SPARK-11920: -- User 'yanboliang' has created a pull request for this issue: https://github.com/apache/spark/pull/9905 > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Priority: Minor > > ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in > examples and user guide doc, but it's actually classification dataset rather > than regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. > The deeper causes is that LinearRegression with "normal" solver can not solve > this dataset correctly, may be due to the ill condition and unreasonable > label. This issue has been reported at SPARK-11918. > So we should make this change in examples and user guides, that can clearly > illustrate the usage of LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc
[ https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11920: Description: ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. The deeper cause is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to its ill conditioning and unreasonable labels. This issue has been reported in SPARK-11918. It will confuse users if they run the example code but get an exception, so we should make this change, which can clearly illustrate the usage of the LinearRegression algorithm. was: ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in examples and the user guide doc, but it is actually a classification dataset rather than a regression dataset. We should use data/mllib/sample_linear_regression_data.txt instead. The deeper cause is that LinearRegression with the "normal" solver cannot solve this dataset correctly, possibly due to its ill conditioning and unreasonable labels. This issue has been reported in SPARK-11918. So we should make this change in the examples and user guides, which can clearly illustrate the usage of the LinearRegression algorithm. > ML LinearRegression should use correct dataset in examples and user guide doc > - > > Key: SPARK-11920 > URL: https://issues.apache.org/jira/browse/SPARK-11920 > Project: Spark > Issue Type: Improvement > Components: Documentation, ML >Reporter: Yanbo Liang >Priority: Minor > > ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in > examples and the user guide doc, but it is actually a classification dataset rather > than a regression dataset. We should use > data/mllib/sample_linear_regression_data.txt instead. 
> The deeper cause is that LinearRegression with the "normal" solver cannot solve > this dataset correctly, possibly due to its ill conditioning and unreasonable > labels. This issue has been reported in SPARK-11918. > It will confuse users if they run the example code but get an exception, so we > should make this change, which can clearly illustrate the usage of the > LinearRegression algorithm. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11919) graphx should be supported with java
[ https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021807#comment-15021807 ] benedict jin edited comment on SPARK-11919 at 11/23/15 9:16 AM: Thanks a lot, my bad. This jira will be closed, right now. was (Author: benedict jin): Thanks a lot, my bad. Please help me let this jira to be closed. > graphx should be supported with java > > > Key: SPARK-11919 > URL: https://issues.apache.org/jira/browse/SPARK-11919 > Project: Spark > Issue Type: New Feature > Components: Examples, GraphX, Java API >Reporter: benedict jin > > Please make the graphx component to be supported with java, hope appear demo > and java api for graphx as soon as possible. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Closed] (SPARK-11919) graphx should be supported with java
[ https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benedict jin closed SPARK-11919. Closing... > graphx should be supported with java > > > Key: SPARK-11919 > URL: https://issues.apache.org/jira/browse/SPARK-11919 > Project: Spark > Issue Type: New Feature > Components: Examples, GraphX, Java API >Reporter: benedict jin > > Please make the graphx component to be supported with java, hope appear demo > and java api for graphx as soon as possible. :-) -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021830#comment-15021830 ] Yanbo Liang commented on SPARK-11918: - [~sowen] Thanks for your comments. I think you got part of my proposal at https://github.com/apache/spark/pull/9905. I also wonder whether we can give users a better hint when they hit the same condition. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0/1 label intended for classification while the > equation is underdetermined), WLS fails (while "l-bfgs" can still train the > model). The failure is caused by the underlying LAPACK library returning an > error value during Cholesky decomposition. > This issue is easy to reproduce: you can train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
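The dpotrs assertion above comes from solving the normal equations XᵀXβ = Xᵀy via Cholesky decomposition: sample_libsvm_data.txt has 100 rows and 692 features, so XᵀX has rank at most 100 and is not positive definite, and the factorization must fail. A minimal, self-contained Python sketch of this failure mode (a naive textbook Cholesky for illustration, not Spark's actual LAPACK call path):

```python
import math

def cholesky(a):
    """Naive Cholesky decomposition of a symmetric matrix.

    Raises ValueError when the matrix is not positive definite --
    the same condition that makes LAPACK's dpotrf/dpotrs report an error.
    """
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(l[i][k] * l[j][k] for k in range(j))
            if i == j:
                d = a[i][i] - s
                if d <= 0.0:  # zero/negative pivot: matrix is singular or indefinite
                    raise ValueError("matrix is not positive definite")
                l[i][j] = math.sqrt(d)
            else:
                l[i][j] = (a[i][j] - s) / l[j][j]
    return l

# Underdetermined system: one sample, two features. X is 1x2, so the
# 2x2 Gram matrix XtX has rank 1 and Cholesky cannot factor it.
X = [[1.0, 2.0]]
XtX = [[sum(row[i] * row[j] for row in X) for j in range(2)] for i in range(2)]
try:
    cholesky(XtX)
    failed = False
except ValueError:
    failed = True  # mirrors the "lapack.dpotrs returned 1" assertion
```

With more independent samples than features, XᵀX becomes positive definite and the same routine succeeds, which is why well-posed regression datasets do not trigger the assertion.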
[jira] [Created] (SPARK-11921) fix `nullable` of encoder schema
Wenchen Fan created SPARK-11921: --- Summary: fix `nullable` of encoder schema Key: SPARK-11921 URL: https://issues.apache.org/jira/browse/SPARK-11921 Project: Spark Issue Type: Sub-task Components: SQL Reporter: Wenchen Fan -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11326) Support for authentication and encryption in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021848#comment-15021848 ] Jacek Lewandowski edited comment on SPARK-11326 at 11/23/15 9:41 AM: - [~pwendell] - are you (DB) interested in reviewing this patch at all? was (Author: jlewandowski): [~pwendell] - are you interested in reviewing this patch at all? > Support for authentication and encryption in standalone mode > > > Key: SPARK-11326 > URL: https://issues.apache.org/jira/browse/SPARK-11326 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jacek Lewandowski > > h3.The idea > Currently, in standalone mode, all components, for all network connections > need to use the same secure token if they want to have any security ensured. > This ticket is intended to split the communication in standalone mode to make > it more like in Yarn mode - application internal communication and scheduler > communication. > Such refactoring will allow for the scheduler (master, workers) to use a > distinct secret, which will remain unknown for the users. Similarly, it will > allow for better security in applications, because each application will be > able to use a distinct secret as well. > By providing SASL authentication/encryption for connections between a client > (Client or AppClient) and Spark Master, it becomes possible introducing > pluggable authentication for standalone deployment mode. 
> h3.Improvements introduced by this patch > This patch introduces the following changes: > * The Spark driver or submission client does not have to use the same secret as > the workers use to communicate with Master > * Master is able to authenticate individual clients with the following rules: > ** When connecting to the master, the client needs to specify > {{spark.authenticate.secret}}, which is an authentication token for the user > specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default) > ** Master configuration may include additional > {{spark.authenticate.secrets.}} entries specifying the > authentication token for particular users, or > {{spark.authenticate.authenticatorClass}}, which specifies an implementation of > an external credentials provider (able to retrieve the authentication > token for a given user). > ** Workers authenticate with Master as the default user {{sparkSaslUser}}. > * The authorization rules are as follows: > ** A regular user is able to manage only his own application (the application > he submitted) > ** A regular user is not able to register or manage workers > ** The Spark default user {{sparkSaslUser}} can manage all applications > h3.User facing changes when running application > h4.General principles: > - conf: {{spark.authenticate.secret}} is *never sent* over the wire > - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire > - In all situations the env variable overrides the conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to > do this through env variable > - In work modes with multiple secrets we assume encrypted communication > between client and master, between driver and master, between master and > workers > > h4.Work modes and descriptions > h5.Client mode, single secret > h6.Configuration > - env: {{SPARK_AUTH_SECRET=secret}} or conf: > {{spark.authenticate.secret=secret}} > h6.Description > - The driver is running locally > - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: > {{spark.authenticate.secret}} > - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: > {{spark.authenticate.secret}} for connection to the master > - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it > will look for it in the worker configuration and it will find it there (its > presence is implied). > > h5.Client mode, multiple secrets > h6.Configuration > - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: > {{spark.app.authenticate.secret=secret}} > - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: > {{spark.submission.authenticate.secret=scheduler_secret}} > h6.Description > - The driver is running locally > - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: > {{spark.submission.authenticate.secret}} to connect to the master > - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor > conf: {{spark.submission.authenticate.secret}} > - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: > {{spark.app.authenticate.secret}} for communication with the executors > - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so > that the executors can use it to communicate with the driver > -
[jira] [Commented] (SPARK-11326) Support for authentication and encryption in standalone mode
[ https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021848#comment-15021848 ] Jacek Lewandowski commented on SPARK-11326: --- [~pwendell] - are you interested in reviewing this patch at all? > Support for authentication and encryption in standalone mode > > > Key: SPARK-11326 > URL: https://issues.apache.org/jira/browse/SPARK-11326 > Project: Spark > Issue Type: Improvement > Components: Spark Core >Reporter: Jacek Lewandowski > > h3.The idea > Currently, in standalone mode, all components, for all network connections > need to use the same secure token if they want to have any security ensured. > This ticket is intended to split the communication in standalone mode to make > it more like in Yarn mode - application internal communication and scheduler > communication. > Such refactoring will allow for the scheduler (master, workers) to use a > distinct secret, which will remain unknown for the users. Similarly, it will > allow for better security in applications, because each application will be > able to use a distinct secret as well. > By providing SASL authentication/encryption for connections between a client > (Client or AppClient) and Spark Master, it becomes possible introducing > pluggable authentication for standalone deployment mode. 
> h3.Improvements introduced by this patch > This patch introduces the following changes: > * Spark driver or submission client do not have to use the same secret as > workers use to communicate with Master > * Master is able to authenticate individual clients with the following rules: > ** When connecting to the master, the client needs to specify > {{spark.authenticate.secret}} which is an authentication token for the user > specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default) > ** Master configuration may include additional > {{spark.authenticate.secrets.}} entries for specifying > authentication token for particular users or > {{spark.authenticate.authenticatorClass}} which specify an implementation of > external credentials provider (which is able to retrieve the authentication > token for a given user). > ** Workers authenticate with Master as default user {{sparkSaslUser}}. > * The authorization rules are as follows: > ** A regular user is able to manage only his own application (the application > which he submitted) > ** A regular user is not able to register or manager workers > ** Spark default user {{sparkSaslUser}} can manage all the applications > h3.User facing changes when running application > h4.General principles: > - conf: {{spark.authenticate.secret}} is *never sent* over the wire > - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire > - In all situations env variable will overwrite conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to > do this through env variable > - In work modes with multiple secrets we assume encrypted communication > between client and master, between driver and master, between master and > workers > > h4.Work modes and descriptions > h5.Client mode, single secret > h6.Configuration > - env: {{SPARK_AUTH_SECRET=secret}} or conf: > {{spark.authenticate.secret=secret}} > h6.Description > - The driver is running locally > - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: > {{spark.authenticate.secret}} > - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: > {{spark.authenticate.secret}} for connection to the master > - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it > will look for it in the worker configuration and it will find it there (its > presence is implied). > > h5.Client mode, multiple secrets > h6.Configuration > - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: > {{spark.app.authenticate.secret=secret}} > - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: > {{spark.submission.authenticate.secret=scheduler_secret}} > h6.Description > - The driver is running locally > - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: > {{spark.submission.authenticate.secret}} to connect to the master > - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor > conf: {{spark.submission.authenticate.secret}} > - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: > {{spark.app.authenticate.secret}} for communication with the executors > - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so > that the executors can use it to communicate with the driver > - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it > will set it in env: {{SPARK_AUTH_SECRET}} which will be read by >
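The client-mode configurations described above can be summarized as a configuration fragment. Note that these keys are taken from this proposal's description: only {{spark.authenticate.secret}} and {{SPARK_AUTH_SECRET}} exist in released Spark; the submission/app-scoped keys belong to the proposed patch and are shown here only as a sketch of the two modes:

```properties
# Client mode, single secret: one token shared by scheduler and app traffic.
# Either the env variable or the conf key works; env wins if both are set.
#   export SPARK_AUTH_SECRET=secret
spark.authenticate.secret=secret

# Client mode, multiple secrets (proposed keys, not released Spark):
# the scheduler secret is used only to connect to the master, while the
# app secret protects driver <-> executor communication.
#   export SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret
#   export SPARK_APP_AUTH_SECRET=app_secret
spark.submission.authenticate.secret=scheduler_secret
spark.app.authenticate.secret=app_secret
```

In both modes the driver never forwards its own secret over the wire; executors obtain the app secret through the worker configuration or through {{spark.executorEnv.SPARK_AUTH_SECRET}}, as described above.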
[jira] [Assigned] (SPARK-11520) RegressionMetrics should support instance weights
[ https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Apache Spark reassigned SPARK-11520: Assignee: (was: Apache Spark) > RegressionMetrics should support instance weights > - > > Key: SPARK-11520 > URL: https://issues.apache.org/jira/browse/SPARK-11520 > Project: Spark > Issue Type: Improvement > Components: ML >Reporter: Joseph K. Bradley > > This will be important to improve LinearRegressionSummary, which currently > has a mix of weighted and unweighted metrics. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage
[ https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11604: Description: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. List the found issues: * Inconsistency: * Docs: ** ml.classification SPARK-11875 * Missing classes/methods/parameters ** ml.classification SPARK-11815 SPARK-11820 ** ml.feature SPARK-11922 was: For new public APIs added to MLlib, we need to check the generated HTML doc and compare the Scala & Python versions. We need to track: * Inconsistency: Do class/method/parameter names match? * Docs: Is the Python doc missing or just a stub? We want the Python doc to be as complete as the Scala doc. * API breaking changes: These should be very rare but are occasionally either necessary (intentional) or accidental. These must be recorded and added in the Migration Guide for this release. ** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, please note that as well. * Missing classes/methods/parameters: We should create to-do JIRAs for functionality missing from Python, to be added in the next release cycle. Please use a *separate* JIRA (linked below) for this list of to-do items. 
List the found issues: * Inconsistency: ** ml.classification SPARK-11815 SPARK-11820 * Docs: ** ml.classification SPARK-11875 > ML 1.6 QA: API: Python API coverage > --- > > Key: SPARK-11604 > URL: https://issues.apache.org/jira/browse/SPARK-11604 > Project: Spark > Issue Type: Sub-task > Components: Documentation, ML, MLlib, PySpark >Reporter: Joseph K. Bradley >Assignee: Yanbo Liang > > For new public APIs added to MLlib, we need to check the generated HTML doc > and compare the Scala & Python versions. We need to track: > * Inconsistency: Do class/method/parameter names match? > * Docs: Is the Python doc missing or just a stub? We want the Python doc to > be as complete as the Scala doc. > * API breaking changes: These should be very rare but are occasionally either > necessary (intentional) or accidental. These must be recorded and added in > the Migration Guide for this release. > ** Note: If the API change is for an Alpha/Experimental/DeveloperApi > component, please note that as well. > * Missing classes/methods/parameters: We should create to-do JIRAs for > functionality missing from Python, to be added in the next release cycle. > Please use a *separate* JIRA (linked below) for this list of to-do items. > List the found issues: > * Inconsistency: > * Docs: > ** ml.classification SPARK-11875 > * Missing classes/methods/parameters > ** ml.classification SPARK-11815 SPARK-11820 > ** ml.feature SPARK-11922 -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021720#comment-15021720 ] Yanbo Liang commented on SPARK-11918: - I used the same dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt) to train a LinearRegressionModel with R:::glm; it did not throw an exception, but the result is not trustworthy. The model's coefficients contain too many NA and NaN values, which is not reasonable. Please see the attached file for the R:::glm output. > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization methods for solving > Linear Regression (when #features < 4096). But if the dataset is very > ill-conditioned (such as a 0/1 label intended for classification while the > equation is underdetermined), WLS fails. The failure is caused by the underlying > Cholesky decomposition. > This issue is easy to reproduce: you can train a LinearRegressionModel with the > "normal" solver on the example > dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918: Attachment: R_GLM_output > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization method for solve > Linear Regression (when #feature < 4096). But if the dataset is very ill > condition (such as 0-1 based label used for classification and the equation > is underdetermined), the WLS failed. The failure is caused by the underneath > Cholesky Decomposition. > This issue is easy to reproduce, you can train a LinearRegressionModel by > "normal" solver with the example > dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. > at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729 ] Yanbo Liang edited comment on SPARK-11918 at 11/23/15 8:31 AM: --- Furthermore, I used the breeze library to train the model with the local normal-equation method. {code} import sqlCtx.implicits._ import org.apache.spark.mllib.linalg.Vector import breeze.linalg.DenseMatrix import breeze.linalg._ val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF() val features = df.select(col("features")).map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray } val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect() val Xt = new DenseMatrix[Double](692, 100, features) val X = Xt.t val y = new DenseMatrix[Double](100, 1, labelArray) val XtXi = inv(Xt * X) val XtY = Xt * y val coefs = XtXi * XtY println(coefs.toString) {code} It also throws an exception: {code} breeze.linalg.MatrixSingularException: at breeze.linalg.inv$$anon$1.apply(inv.scala:36) at breeze.linalg.inv$$anon$1.apply(inv.scala:19) at breeze.generic.UFunc$class.apply(UFunc.scala:48) at breeze.linalg.inv$.apply(inv.scala:17) {code} breeze.linalg.inv also calls the netlib LAPACK package, the same library Spark uses. Tracing the breeze code, we can see this exception is thrown here (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33), also caused by the underlying LAPACK error. was (Author: yanboliang): Further more, I use the breeze library to train the model by local normal equation method. 
{code} import sqlCtx.implicits._ import org.apache.spark.mllib.linalg.Vector import breeze.linalg.DenseMatrix import breeze.linalg._ val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF() val features = df.select(col("features")).map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray } val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect() val Xt = new DenseMatrix[Double](692, 100, features) val X = Xt.t val y = new DenseMatrix[Double](100, 1, labelArray) val XtXi = inv(Xt * X) val XtY = Xt * y val coefs = XtXi * XtY println(coefs.toString) {code} It also throw exception > WLS can not resolve some kinds of equation > -- > > Key: SPARK-11918 > URL: https://issues.apache.org/jira/browse/SPARK-11918 > Project: Spark > Issue Type: Bug > Components: ML >Reporter: Yanbo Liang > Attachments: R_GLM_output > > > Weighted Least Squares (WLS) is one of the optimization method for solve > Linear Regression (when #feature < 4096). But if the dataset is very ill > condition (such as 0-1 based label used for classification and the equation > is underdetermined), the WLS failed. The failure is caused by the underneath > Cholesky Decomposition. > This issue is easy to reproduce, you can train a LinearRegressionModel by > "normal" solver with the example > dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). > The following is the exception: > {code} > assertion failed: lapack.dpotrs returned 1. > java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1. 
> at scala.Predef$.assert(Predef.scala:179) > at > org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42) > at > org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180) > at > org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67) > at org.apache.spark.ml.Predictor.fit(Predictor.scala:90) > {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918:
Description:
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails. The failure is caused by the underlying Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}
It is caused by the underlying LAPACK library returning an error value.

was:
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails. The failure is caused by the underlying Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

> WLS can not resolve some kinds of equation
> ------------------------------------------
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails. The failure is caused by the underlying Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
> at scala.Predef$.assert(Predef.scala:179)
> at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
> at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
> It is caused by the underlying LAPACK library returning an error value.
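The dpotrs failure above is just Cholesky reporting that the Gram matrix X^T X is not positive definite, which is exactly what an underdetermined system produces (sample_libsvm_data.txt has 692 features but only 100 rows, so X^T X is rank-deficient). A minimal pure-Scala sketch of the failure mode, with hypothetical names and a textbook algorithm rather than Spark's actual CholeskyDecomposition code:

```scala
// Hypothetical sketch, not Spark's implementation: a naive Cholesky
// factorization that returns None when the input matrix is not positive
// definite -- the same condition that makes lapack.dpotrf/dpotrs report
// a nonzero info value.
object CholeskyDemo {
  def cholesky(a: Array[Array[Double]]): Option[Array[Array[Double]]] = {
    val n = a.length
    val l = Array.ofDim[Double](n, n)
    for (i <- 0 until n; j <- 0 to i) {
      val s = (0 until j).map(k => l(i)(k) * l(j)(k)).sum
      if (i == j) {
        val d = a(i)(i) - s
        if (d <= 1e-12) return None // singular / not positive definite
        l(i)(j) = math.sqrt(d)
      } else {
        l(i)(j) = (a(i)(j) - s) / l(j)(j)
      }
    }
    Some(l)
  }

  def main(args: Array[String]): Unit = {
    val spd = Array(Array(4.0, 2.0), Array(2.0, 3.0))      // positive definite: factorizes
    val singular = Array(Array(1.0, 1.0), Array(1.0, 1.0)) // rank 1, like X^T X of a rank-deficient X
    println(cholesky(spd).isDefined)       // true
    println(cholesky(singular).isDefined)  // false -- the analogue of the assertion failure
  }
}
```

Spark delegates the real work to LAPACK; the nonzero return value in the stack trace corresponds to the `None` branch here.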
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11919) graphx should be supported with java
[ https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] benedict jin updated SPARK-11919:
Description: Please add Java support for the GraphX component; a Java API and demo for GraphX would be appreciated as soon as possible. :-) (was: Please make the graphx component to be supported with java, hope appear demo and java api for graphx as soon as possible :-))

> graphx should be supported with java
> ------------------------------------
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
> Issue Type: New Feature
> Components: Examples, GraphX, Java API
> Reporter: benedict jin
>
> Please add Java support for the GraphX component; a Java API and demo for GraphX would be appreciated as soon as possible. :-)
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729 ] Yanbo Liang edited comment on SPARK-11918 at 11/23/15 8:44 AM:
Furthermore, I used the breeze library to train the model with a local normal-equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col
import breeze.linalg.DenseMatrix
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()
val features = df.select(col("features")).map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect()
val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t
val y = new DenseMatrix[Double](100, 1, labelArray)
val XtXi = inv(Xt * X)
val XtY = Xt * y
val coefs = XtXi * XtY
println(coefs.toString)
{code}
It also throws an exception:
{code}
breeze.linalg.MatrixSingularException:
at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.inv$.apply(inv.scala:17)
{code}
breeze.linalg.inv also calls the netlib LAPACK library, the same one Spark uses. Tracking through the breeze code, this exception is thrown here (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33), and it is likewise caused by the underlying LAPACK error.

was (Author: yanboliang):
Furthermore, I used the breeze library to train the model with a local normal-equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col
import breeze.linalg.DenseMatrix
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()
val features = df.select(col("features")).map { r => r.getAs[Vector](0) }.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r => r.getDouble(0) }.collect()
val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t
val y = new DenseMatrix[Double](100, 1, labelArray)
val XtXi = inv(Xt * X)
val XtY = Xt * y
val coefs = XtXi * XtY
println(coefs.toString)
{code}
It also throws an exception:
{code}
breeze.linalg.MatrixSingularException:
at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.inv$.apply(inv.scala:17)
{code}
breeze.linalg.inv also calls the netlib LAPACK package, the same library as Spark. Tracking through the breeze code, this exception is thrown here (https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33), which is also caused by the underlying LAPACK error.

> WLS can not resolve some kinds of equation
> ------------------------------------------
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails (though the "l-bfgs" solver can still train and produce a model). The failure is caused by the underlying LAPACK library returning an error value during the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
> at scala.Predef$.assert(Predef.scala:179)
> at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
> at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
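One conceivable workaround for the singular normal equations in the breeze reproduction above (my assumption, not something Spark 1.5's "normal" solver implements) is ridge regularization: adding a small λI term to X^T X restores positive definiteness so a direct solve succeeds. A self-contained pure-Scala sketch with hypothetical names, using plain Gaussian elimination instead of LAPACK:

```scala
// Hypothetical sketch: solve (A + lambda*I) x = b by Gaussian elimination
// with partial pivoting. With lambda = 0 a singular Gram matrix fails,
// exactly like the MatrixSingularException above; a small positive lambda
// makes the system solvable.
object RidgeDemo {
  def solveRidge(a: Array[Array[Double]], b: Array[Double], lambda: Double): Option[Array[Double]] = {
    val n = a.length
    // Build the augmented matrix [A + lambda*I | b].
    val m = Array.tabulate(n, n + 1) { (i, j) =>
      if (j == n) b(i)
      else a(i)(j) + (if (i == j) lambda else 0.0)
    }
    for (col <- 0 until n) {
      // Partial pivoting: bring the largest remaining entry into the pivot row.
      val p = (col until n).maxBy(r => math.abs(m(r)(col)))
      if (math.abs(m(p)(col)) < 1e-12) return None // still singular
      val t = m(col); m(col) = m(p); m(p) = t
      for (r <- col + 1 until n) {
        val f = m(r)(col) / m(col)(col)
        for (c <- col to n) m(r)(c) -= f * m(col)(c)
      }
    }
    // Back substitution.
    val x = new Array[Double](n)
    for (i <- n - 1 to 0 by -1) {
      x(i) = (m(i)(n) - (i + 1 until n).map(j => m(i)(j) * x(j)).sum) / m(i)(i)
    }
    Some(x)
  }

  def main(args: Array[String]): Unit = {
    val gram = Array(Array(1.0, 1.0), Array(1.0, 1.0)) // singular, like the rank-deficient X^T X
    val b = Array(1.0, 1.0)
    println(solveRidge(gram, b, 0.0).isDefined) // false: plain normal equations fail
    println(solveRidge(gram, b, 0.1).isDefined) // true: the ridge term restores solvability
  }
}
```

The trade-off is that the solution is biased toward zero, so this changes the model being fit, not just the numerics.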
[jira] [Created] (SPARK-11922) Python API for ml.feature.QuantileDiscretizer
Yanbo Liang created SPARK-11922: --- Summary: Python API for ml.feature.QuantileDiscretizer Key: SPARK-11922 URL: https://issues.apache.org/jira/browse/SPARK-11922 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API for ml.feature.QuantileDiscretizer. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Created] (SPARK-11923) Python API for ml.feature.ChiSqSelector
Yanbo Liang created SPARK-11923: --- Summary: Python API for ml.feature.ChiSqSelector Key: SPARK-11923 URL: https://issues.apache.org/jira/browse/SPARK-11923 Project: Spark Issue Type: Improvement Components: ML, PySpark Reporter: Yanbo Liang Priority: Minor Add Python API for ml.feature.ChiSqSelector. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Updated] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word
[ https://issues.apache.org/jira/browse/SPARK-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sean Owen updated SPARK-11916: -- Component/s: SQL > Expression TRIM/LTRIM/RTRIM to support specific trim word > - > > Key: SPARK-11916 > URL: https://issues.apache.org/jira/browse/SPARK-11916 > Project: Spark > Issue Type: Improvement > Components: SQL >Reporter: Adrian Wang >Priority: Minor > > supports expressions like `trim('xxxabcxxx', 'x')` -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
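The requested `trim('xxxabcxxx', 'x')` semantics can be sketched in plain Scala; these are hypothetical helpers to illustrate the intended behavior, not the actual Catalyst expressions:

```scala
// Hypothetical sketch of TRIM/LTRIM/RTRIM with a caller-supplied trim
// character, as requested in SPARK-11916 (the real Catalyst expressions
// may take a string trim word and differ in signature).
object TrimDemo {
  def ltrim(s: String, w: Char): String = s.dropWhile(_ == w)           // strip leading w
  def rtrim(s: String, w: Char): String = s.reverse.dropWhile(_ == w).reverse // strip trailing w
  def trim(s: String, w: Char): String = rtrim(ltrim(s, w), w)          // strip both ends

  def main(args: Array[String]): Unit = {
    println(trim("xxxabcxxx", 'x'))  // abc
    println(ltrim("xxxabcxxx", 'x')) // abcxxx
    println(rtrim("xxxabcxxx", 'x')) // xxxabc
  }
}
```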
[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation
[ https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yanbo Liang updated SPARK-11918:
Description:
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails (though the "l-bfgs" solver can still train and produce a model). The failure is caused by the underlying LAPACK library returning an error value during the Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

was:
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails. The failure is caused by the underlying LAPACK library returning an error value during the Cholesky decomposition. This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

> WLS can not resolve some kinds of equation
> ------------------------------------------
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
> Issue Type: Bug
> Components: ML
> Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving Linear Regression (when #features < 4096). But if the dataset is very ill-conditioned (for example, a 0/1 classification-style label combined with an underdetermined equation system), WLS fails (though the "l-bfgs" solver can still train and produce a model). The failure is caused by the underlying LAPACK library returning an error value during the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the "normal" solver on the example dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt). The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
> at scala.Predef$.assert(Predef.scala:179)
> at org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
> at org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
> at org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
> at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-11782) Master Web UI should link to correct Application UI in cluster mode
[ https://issues.apache.org/jira/browse/SPARK-11782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021793#comment-15021793 ] Sean Owen commented on SPARK-11782: --- Oh right I read right past that. I think that's the difference with what [~jbonofre] sees. > Master Web UI should link to correct Application UI in cluster mode > --- > > Key: SPARK-11782 > URL: https://issues.apache.org/jira/browse/SPARK-11782 > Project: Spark > Issue Type: Bug > Components: Web UI >Affects Versions: 1.4.1 >Reporter: Matthias Niehoff > > - Running a standalone cluster, with node1 as master > - Submit an application to cluster with deploy-mode=cluster > - Application driver is on node other than node1 (i.e. node3) > => master WebUI links to node1:4040 for Application Detail UI and not to > node3:4040 > As the master knows on which worker the driver is running, it should be > possible to show the correct link to the Application Detail UI -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org