[jira] [Commented] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException

2015-11-23 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022775#comment-15022775
 ] 

Huaxin Gao commented on SPARK-11788:


I will change them to timestampValue and dateValue.
In the test case, I intentionally made the date and timestamp 1 year earlier than 
the values in the table because I am using >:
$"B" > date && $"C" > timestamp



> Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC 
> dataframes causes SQLServerException
> 
>
> Key: SPARK-11788
> URL: https://issues.apache.org/jira/browse/SPARK-11788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Martin Tapp
>
> I have an MSSQL table that has a timestamp column and am reading it using 
> DataFrameReader.jdbc. Adding a where clause which compares a timestamp range 
> causes a SQLServerException.
> The problem is in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264
>  (compileValue), which should surround timestamps/dates with quotes but only 
> does so for strings.
> Sample pseudo-code:
> val beg = new java.sql.Timestamp(...)
> val end = new java.sql.Timestamp(...)
> val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" 
> < end)
> Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0"
> Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 
> 00:00:00.0'"
> Fallback is to filter client-side which is extremely inefficient as the whole 
> table needs to be downloaded to each Spark executor.
> Thanks
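Given the comment above about timestampValue and dateValue, a minimal sketch of the 
kind of pattern match compileValue would need; this is not the actual Spark source, 
and the string escaping is simplified:

{code}
import java.sql.{Date, Timestamp}

object FilterValueSketch {
  // Turn a filter value into a SQL literal. Strings are already quoted today;
  // adding analogous cases for Timestamp and Date yields e.g.
  // TIMESTAMP_COLUMN >= '2015-01-01 00:00:00.0' instead of an unquoted literal.
  def compileValue(value: Any): Any = value match {
    case stringValue: String       => s"'${stringValue.replace("'", "''")}'"
    case timestampValue: Timestamp => s"'$timestampValue'"
    case dateValue: Date           => s"'$dateValue'"
    case other                     => other
  }
}
{code}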






[jira] [Assigned] (SPARK-11836) Register a Python function creates a new SQLContext

2015-11-23 Thread Davies Liu (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Davies Liu reassigned SPARK-11836:
--

Assignee: Davies Liu

> Register a Python function creates a new SQLContext
> ---
>
> Key: SPARK-11836
> URL: https://issues.apache.org/jira/browse/SPARK-11836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
>
> You can try it with {{sqlContext.registerFunction("stringLengthString", 
> lambda x: len)}}






[jira] [Comment Edited] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException

2015-11-23 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022845#comment-15022845
 ] 

Huaxin Gao edited comment on SPARK-11788 at 11/23/15 7:49 PM:
--

Sorry, I just realized that I actually checked in $"B" === date && $"C" === 
timestamp instead of $"B" > date && $"C" > timestamp.
I will change it.
Never mind.  It is  $"B" > date && $"C" > timestamp.
I have multiple test cases and I confused myself. 


was (Author: huaxing):
Sorry, I just realized that actually I checked in the $"B" === date && $"C" === 
timestamp instead of the $"B" > date && $"C" > timestamp
I will change.

> Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC 
> dataframes causes SQLServerException
> 
>
> Key: SPARK-11788
> URL: https://issues.apache.org/jira/browse/SPARK-11788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Martin Tapp
>
> I have an MSSQL table that has a timestamp column and am reading it using 
> DataFrameReader.jdbc. Adding a where clause which compares a timestamp range 
> causes a SQLServerException.
> The problem is in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264
>  (compileValue), which should surround timestamps/dates with quotes but only 
> does so for strings.
> Sample pseudo-code:
> val beg = new java.sql.Timestamp(...)
> val end = new java.sql.Timestamp(...)
> val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" 
> < end)
> Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0"
> Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 
> 00:00:00.0'"
> Fallback is to filter client-side which is extremely inefficient as the whole 
> table needs to be downloaded to each Spark executor.
> Thanks






[jira] [Issue Comment Deleted] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException

2015-11-23 Thread Huaxin Gao (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Huaxin Gao updated SPARK-11788:
---
Comment: was deleted

(was: Sorry, I just realized that actually I checked in the $"B" === date && 
$"C" === timestamp instead of the $"B" > date && $"C" > timestamp
I will change.
Never mind.  It is  $"B" > date && $"C" > timestamp
I have multiple test case and I confused myself. )

> Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC 
> dataframes causes SQLServerException
> 
>
> Key: SPARK-11788
> URL: https://issues.apache.org/jira/browse/SPARK-11788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Martin Tapp
>
> I have an MSSQL table that has a timestamp column and am reading it using 
> DataFrameReader.jdbc. Adding a where clause which compares a timestamp range 
> causes a SQLServerException.
> The problem is in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264
>  (compileValue), which should surround timestamps/dates with quotes but only 
> does so for strings.
> Sample pseudo-code:
> val beg = new java.sql.Timestamp(...)
> val end = new java.sql.Timestamp(...)
> val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" 
> < end)
> Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0"
> Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 
> 00:00:00.0'"
> Fallback is to filter client-side which is extremely inefficient as the whole 
> table needs to be downloaded to each Spark executor.
> Thanks






[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley updated SPARK-11920:
--
Assignee: Yanbo Liang
Target Version/s: 1.6.0

> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
>
> ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in 
> examples and the user guide doc, but it's actually a classification dataset 
> rather than a regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper cause is that LinearRegression with the "normal" solver can not solve 
> this dataset correctly, maybe due to the ill conditioning and unreasonable 
> labels. This issue has been reported as SPARK-11918.
> It will confuse users if they run the example code but get an exception, so we 
> should make this change, which clearly illustrates the usage of the 
> LinearRegression algorithm.
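For reference, a sketch of what the corrected example would look like, assuming the 
libsvm data source in Spark 1.6 and a spark-shell sqlContext; the parameter values 
are illustrative only:

{code}
import org.apache.spark.ml.regression.LinearRegression

// Load a genuine regression dataset instead of the 0/1-labeled classification file.
val training = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

val lr = new LinearRegression()
  .setMaxIter(10)
  .setRegParam(0.3)
  .setElasticNetParam(0.8)

val lrModel = lr.fit(training)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
{code}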






[jira] [Commented] (SPARK-11886) R function name conflicts with base or stats package ones

2015-11-23 Thread Felix Cheung (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11886?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022787#comment-15022787
 ] 

Felix Cheung commented on SPARK-11886:
--

We could also create a generic function to redirect the more generic calls to 
the base:: or stats:: ones, similar to how [~sunrui] handles rank.

SPARK-7499 might affect how we define method signatures too.


> R function name conflicts with base or stats package ones
> -
>
> Key: SPARK-11886
> URL: https://issues.apache.org/jira/browse/SPARK-11886
> Project: Spark
>  Issue Type: Bug
>  Components: SparkR
>Affects Versions: 1.5.2
>Reporter: Felix Cheung
>Priority: Minor
>
> See https://github.com/apache/spark/pull/9785
> Currently these are masked:
> stats::cov
> stats::filter
> base::sample
> base::table
> [~shivaram] suggested:
> "
> If we have the same name but the param types completely don't match (and no room 
> for ...), then we override those functions (this is true for sample, 
> table, cov right now I guess), but we should try to limit the number of functions 
> where we do this. Also we should revisit some of these to see if we can avoid 
> it (for example, table could be renamed?).
> "






[jira] [Commented] (SPARK-11234) What's cooking classification

2015-11-23 Thread Joseph K. Bradley (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022830#comment-15022830
 ] 

Joseph K. Bradley commented on SPARK-11234:
---

[~yinxusen] Thank you for working through this task!  Here are some of my 
thoughts:

{quote}1. Currently, a multi-line-per-record JSON file is hard to handle; I have 
to load the data with JsonInputFormat from the json-pxf-ext package.
{quote}
* WIP, but no clear ETA [SPARK-7366]

{quote}2. String indexer is easy to use. But it is hard to go beyond the existing 
transformers. For example, in the code, when I want to add all vectors that belong 
to the same id together, I have to write an aggregate function.
{quote}
* Does the SQLTransformer help?  If you could pick any API to write this 
operation, what would be ideal for you?  (I'm envisioning something analogous 
to a UDF for ML Pipelines, but that is almost provided by the SQLTransformer.)
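For context, a small sketch of the SQLTransformer suggestion, assuming Spark 1.6 and 
a plain numeric value column; summing actual Vector columns would still require a 
custom aggregate, which is the gap being discussed:

{code}
import org.apache.spark.ml.feature.SQLTransformer

// A SQL step that can live inside an ML Pipeline; __THIS__ stands for the input DataFrame.
val aggregator = new SQLTransformer()
  .setStatement("SELECT id, SUM(value) AS total FROM __THIS__ GROUP BY id")

// aggregator.transform(df) would then yield one aggregated row per id.
{code}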

{quote}3. ParamGridBuilder accepts discrete parameter candidates, but I need to 
add some parameters by guess, like Array(1.0, 0.1, 0.01). I don't know which 
parameter value is suitable or how to fill in the array to get a better result. 
How about giving a range of real numbers, like [0.0001, 1], so that 
ParamGridBuilder can generate the candidates for me?
{quote}

Do you mean it should automatically zoom in on regions which seem to get good 
results?  I agree this can help in practice; I did something like this for a 
different ML library.
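For context, the current hand-enumerated API that the range-based suggestion above 
would generalize; a sketch assuming Spark ML 1.6, with an illustrative estimator and 
values:

{code}
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.tuning.ParamGridBuilder

val lr = new LogisticRegression()

// Today every candidate value must be listed explicitly; the request is for a
// builder that accepts a range such as [0.0001, 1] and generates candidates itself.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(1.0, 0.1, 0.01))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()
{code}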

{quote}4. The evaluator forces me to select a metric method. But sometimes I 
want to see all the evaluation results, say F1, precision-recall, AUC, etc.
{quote}

Do you want the metrics (a) for the sake of viewing performance at the end of a 
test?  Or do you want the metrics (b) for model selection?  If it's for (a) 
viewing at the end of a test, then model summaries are probably the way to go.  
Only LinearRegression and LogisticRegression have summaries currently, but we 
should add them for other models too.

{quote}5. ML transformers will get stuck when facing an Int type. It's 
strange that we have to transform all Int values to double values beforehand. 
I think a sensible auto-casting would be helpful.
{quote}

I agree that too many Transformers are brittle when it comes to accepting 
multiple Numeric types.  I had made an umbrella here [SPARK-11107], but perhaps 
we can think of a way to make this change everywhere, rather than case-by-case.

> What's cooking classification
> -
>
> Key: SPARK-11234
> URL: https://issues.apache.org/jira/browse/SPARK-11234
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: Xusen Yin
>
> I add the subtask to post the work on this dataset:  
> https://www.kaggle.com/c/whats-cooking






[jira] [Assigned] (SPARK-11836) Register a Python function creates a new SQLContext

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11836:


Assignee: Davies Liu  (was: Apache Spark)

> Register a Python function creates a new SQLContext
> ---
>
> Key: SPARK-11836
> URL: https://issues.apache.org/jira/browse/SPARK-11836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
>
> You can try it with {{sqlContext.registerFunction("stringLengthString", 
> lambda x: len)}}






[jira] [Commented] (SPARK-11836) Register a Python function creates a new SQLContext

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022841#comment-15022841
 ] 

Apache Spark commented on SPARK-11836:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9914

> Register a Python function creates a new SQLContext
> ---
>
> Key: SPARK-11836
> URL: https://issues.apache.org/jira/browse/SPARK-11836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
>
> You can try it with {{sqlContext.registerFunction("stringLengthString", 
> lambda x: len)}}






[jira] [Resolved] (SPARK-3215) Add remote interface for SparkContext

2015-11-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-3215.
---
Resolution: Won't Fix

I think there isn't enough interest in getting this into Spark itself, so I'll 
just close the bug instead.

> Add remote interface for SparkContext
> -
>
> Key: SPARK-3215
> URL: https://issues.apache.org/jira/browse/SPARK-3215
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Marcelo Vanzin
>  Labels: hive
> Attachments: RemoteSparkContext.pdf
>
>
> A quick description of the issue: as part of running Hive jobs on top of 
> Spark, it's desirable to have a SparkContext that is running in the 
> background and listening for job requests for a particular user session.
> Running multiple contexts in the same JVM is not a very good solution. Not 
> only does SparkContext currently have issues sharing the same JVM among multiple 
> instances, but it also turns the JVM running the contexts into a huge bottleneck 
> in the system.
> So I'm proposing a solution where we have a SparkContext that is running in a 
> separate process, and listening for requests from the client application via 
> some RPC interface (most probably Akka).
> I'll attach a document shortly with the current proposal. Let's use this bug 
> to discuss the proposal and any other suggestions.






[jira] [Assigned] (SPARK-11836) Register a Python function creates a new SQLContext

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11836:


Assignee: Apache Spark  (was: Davies Liu)

> Register a Python function creates a new SQLContext
> ---
>
> Key: SPARK-11836
> URL: https://issues.apache.org/jira/browse/SPARK-11836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Yin Huai
>Assignee: Apache Spark
>Priority: Critical
>
> You can try it with {{sqlContext.registerFunction("stringLengthString", 
> lambda x: len)}}






[jira] [Assigned] (SPARK-7539) Perf tests for Python MLlib

2015-11-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7539?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley reassigned SPARK-7539:


Assignee: Joseph K. Bradley  (was: Xiangrui Meng)

> Perf tests for Python MLlib
> ---
>
> Key: SPARK-7539
> URL: https://issues.apache.org/jira/browse/SPARK-7539
> Project: Spark
>  Issue Type: Test
>  Components: MLlib, PySpark, Tests
>Reporter: Joseph K. Bradley
>Assignee: Joseph K. Bradley
>
> As new perf-tests are added in Scala, we should add equivalent ones in 
> Python.






[jira] [Commented] (SPARK-11788) Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC dataframes causes SQLServerException

2015-11-23 Thread Huaxin Gao (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022845#comment-15022845
 ] 

Huaxin Gao commented on SPARK-11788:


Sorry, I just realized that I actually checked in $"B" === date && $"C" === 
timestamp instead of $"B" > date && $"C" > timestamp.
I will change it.

> Using java.sql.Timestamp and java.sql.Date in where clauses on JDBC 
> dataframes causes SQLServerException
> 
>
> Key: SPARK-11788
> URL: https://issues.apache.org/jira/browse/SPARK-11788
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Martin Tapp
>
> I have an MSSQL table that has a timestamp column and am reading it using 
> DataFrameReader.jdbc. Adding a where clause which compares a timestamp range 
> causes a SQLServerException.
> The problem is in 
> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JDBCRDD.scala#L264
>  (compileValue), which should surround timestamps/dates with quotes but only 
> does so for strings.
> Sample pseudo-code:
> val beg = new java.sql.Timestamp(...)
> val end = new java.sql.Timestamp(...)
> val filtered = jdbcdf.where($"TIMESTAMP_COLUMN" >= beg && $"TIMESTAMP_COLUMN" 
> < end)
> Generated SQL query: "TIMESTAMP_COLUMN >= 2015-01-01 00:00:00.0"
> Query should use quotes around timestamp: "TIMESTAMP_COLUMN >= '2015-01-01 
> 00:00:00.0'"
> Fallback is to filter client-side which is extremely inefficient as the whole 
> table needs to be downloaded to each Spark executor.
> Thanks






[jira] [Resolved] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Joseph K. Bradley (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Joseph K. Bradley resolved SPARK-11920.
---
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9905
[https://github.com/apache/spark/pull/9905]

> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Assignee: Yanbo Liang
>Priority: Minor
> Fix For: 1.6.0
>
>
> ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in 
> examples and the user guide doc, but it's actually a classification dataset 
> rather than a regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper cause is that LinearRegression with the "normal" solver can not solve 
> this dataset correctly, maybe due to the ill conditioning and unreasonable 
> labels. This issue has been reported as SPARK-11918.
> It will confuse users if they run the example code but get an exception, so we 
> should make this change, which clearly illustrates the usage of the 
> LinearRegression algorithm.






[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Sandy Ryza (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022634#comment-15022634
 ] 

Sandy Ryza commented on SPARK-9999:
---

[~nchammas] it's not clear that it makes sense to add a similar API for Python 
and R.  The main point of the Dataset API, as I understand it, is to extend 
DataFrames to take advantage of Java / Scala's static type systems.  This 
means recovering compile-time type safety, integration with existing Java / 
Scala object frameworks, and Scala syntactic sugar like pattern matching.  
Python and R are dynamically typed, so they can't take advantage of these.
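To make the compile-time point concrete, a minimal sketch of the typed Dataset API 
next to its untyped DataFrame equivalent, assuming a Spark 1.6 spark-shell with 
sqlContext available:

{code}
import sqlContext.implicits._

case class Person(name: String, age: Long)

val ds = Seq(Person("Ann", 34), Person("Bob", 15)).toDS()

// Typed: the closure sees a Person, so a misspelled field name fails at compile time.
val adultNames = ds.filter(p => p.age >= 18).map(p => p.name)

// Untyped DataFrame equivalent: a misspelled column name only fails at runtime.
val adultNamesDF = ds.toDF().filter($"age" >= 18).select($"name")
{code}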

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, but as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.






[jira] [Commented] (SPARK-6328) Python API for StreamingListener

2015-11-23 Thread Daniel Jalova (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-6328?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022690#comment-15022690
 ] 

Daniel Jalova commented on SPARK-6328:
--

@tdas Could you change the assignee to me please?

> Python API for StreamingListener
> 
>
> Key: SPARK-6328
> URL: https://issues.apache.org/jira/browse/SPARK-6328
> Project: Spark
>  Issue Type: Improvement
>  Components: PySpark
>Reporter: Yifan Wang
>Assignee: Yifan Wang
> Fix For: 1.6.0
>
>
> The StreamingListener API is only available in Java/Scala. It will be useful to 
> make it available in Python so that Spark applications written in Python can 
> check the status of an ongoing streaming computation. 






[jira] [Resolved] (SPARK-11913) support typed aggregate for complex buffer schema

2015-11-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11913?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11913.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9898
[https://github.com/apache/spark/pull/9898]

> support typed aggregate for complex buffer schema
> -
>
> Key: SPARK-11913
> URL: https://issues.apache.org/jira/browse/SPARK-11913
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-11894) Incorrect results are returned when using null

2015-11-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11894?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11894.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9904
[https://github.com/apache/spark/pull/9904]

> Incorrect results are returned when using null
> --
>
> Key: SPARK-11894
> URL: https://issues.apache.org/jira/browse/SPARK-11894
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Affects Versions: 1.6.0
>Reporter: Xiao Li
> Fix For: 1.6.0
>
>
> In the Dataset API, the following two datasets are the same. 
>   Seq((new java.lang.Integer(0), "1"), (new java.lang.Integer(22), 
> "2")).toDS()
>   Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> Note: java.lang.Integer is nullable. 
> It could generate an incorrect result. For example, 
> val ds1 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()
> val ds2 = Seq((null.asInstanceOf[java.lang.Integer], "1"), (new 
> java.lang.Integer(22), "2")).toDS()//toDF("key", "value").as('df2)
> val res1 = ds1.joinWith(ds2, lit(true)).collect()
> The expected result should be 
> ((null,1),(null,1))
> ((22,2),(null,1))
> ((null,1),(22,2))
> ((22,2),(22,2))
> The actual result is 
> ((0,1),(0,1))
> ((22,2),(0,1))
> ((0,1),(22,2))
> ((22,2),(22,2))






[jira] [Resolved] (SPARK-11921) fix `nullable` of encoder schema

2015-11-23 Thread Michael Armbrust (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Armbrust resolved SPARK-11921.
--
   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9906
[https://github.com/apache/spark/pull/9906]

> fix `nullable` of encoder schema
> 
>
> Key: SPARK-11921
> URL: https://issues.apache.org/jira/browse/SPARK-11921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-7173) Support YARN node label expressions for the application master

2015-11-23 Thread Marcelo Vanzin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-7173?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcelo Vanzin resolved SPARK-7173.
---
   Resolution: Fixed
 Assignee: Saisai Shao
Fix Version/s: 1.6.0

> Support YARN node label expressions for the application master
> --
>
> Key: SPARK-7173
> URL: https://issues.apache.org/jira/browse/SPARK-7173
> Project: Spark
>  Issue Type: Improvement
>  Components: YARN
>Affects Versions: 1.3.1
>Reporter: Sandy Ryza
>Assignee: Saisai Shao
> Fix For: 1.6.0
>
>







[jira] [Resolved] (SPARK-11837) spark_ec2.py breaks with python3 and m3 instances

2015-11-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen resolved SPARK-11837.

   Resolution: Fixed
Fix Version/s: 1.6.0

Issue resolved by pull request 9797
[https://github.com/apache/spark/pull/9797]

> spark_ec2.py breaks with python3 and m3 instances
> -
>
> Key: SPARK-11837
> URL: https://issues.apache.org/jira/browse/SPARK-11837
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Mortada Mehyar
>Priority: Minor
> Fix For: 1.6.0
>
>
> The `spark_ec2.py` script breaks when launching an m3 instance with python3 
> because `string.letters` is for python2 only. For python3 
> `string.ascii_letters` should be used instead. 
> The PR for fixing this is here: https://github.com/apache/spark/pull/9797 






[jira] [Commented] (SPARK-11836) Register a Python function creates a new SQLContext

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11836?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022870#comment-15022870
 ] 

Apache Spark commented on SPARK-11836:
--

User 'davies' has created a pull request for this issue:
https://github.com/apache/spark/pull/9915

> Register a Python function creates a new SQLContext
> ---
>
> Key: SPARK-11836
> URL: https://issues.apache.org/jira/browse/SPARK-11836
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.3.0, 1.4.0, 1.5.0
>Reporter: Yin Huai
>Assignee: Davies Liu
>Priority: Critical
>
> You can try it with {{sqlContext.registerFunction("stringLengthString", 
> lambda x: len)}}






[jira] [Updated] (SPARK-11837) spark_ec2.py breaks with python3 and m3 instances

2015-11-23 Thread Josh Rosen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11837?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Josh Rosen updated SPARK-11837:
---
Assignee: Mortada Mehyar

> spark_ec2.py breaks with python3 and m3 instances
> -
>
> Key: SPARK-11837
> URL: https://issues.apache.org/jira/browse/SPARK-11837
> Project: Spark
>  Issue Type: Bug
>  Components: EC2
>Reporter: Mortada Mehyar
>Assignee: Mortada Mehyar
>Priority: Minor
> Fix For: 1.6.0
>
>
> The `spark_ec2.py` script breaks when launching an m3 instance with python3 
> because `string.letters` is for python2 only. For python3 
> `string.ascii_letters` should be used instead. 
> The PR for fixing this is here: https://github.com/apache/spark/pull/9797 






[jira] [Comment Edited] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Nicholas Chammas (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022735#comment-15022735
 ] 

Nicholas Chammas edited comment on SPARK-9999 at 11/23/15 8:06 PM:
---

[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will 
provide no performance advantages over DataFrames, and that they will just help 
in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python has supported type hints since 3.0. More 
recently, Python 3.5 introduced a [typing 
module|https://docs.python.org/3/library/typing.html#module-typing] to 
standardize how type hints are specified, which facilitates the use of static 
type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer 
a statically type checked API, but practically speaking it would have to be 
limited to Python 3+.

I suppose people don't generally expect static type checking when they use 
Python, so perhaps it makes sense not to support Datasets in PySpark.


was (Author: nchammas):
[~sandyr] - Hmm, so are you saying that, generally speaking, Datasets will 
provide no performance advantages over DataFrames, and that they will just help 
in terms of catching type errors early?

{quote}
Python and R are dynamically typed so can't take advantage of these.
{quote}

I can't speak for R, but Python as supported type hints since 3.0. More 
recently, Python 3.5 introduced a [typing 
module|https://docs.python.org/3/library/typing.html#module-typing] to 
standardize how type hints are specified, which facilitates the use of static 
type checkers like [mypy|http://mypy-lang.org/]. PySpark could definitely offer 
a statically type checked API, but practically speaking it would have to be 
limited to Python 3+.

I suppose people don't generally expect static type checking when they use 
Python, so perhaps it makes sense not to support Datasets in PySpark.

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, but as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.

[jira] [Commented] (SPARK-11329) Expand Star when creating a struct

2015-11-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11329?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022876#comment-15022876
 ] 

Maciej BryƄski commented on SPARK-11329:


[~yhuai]
How can I execute this query:
```
SELECT max(struct(timestamp, *)) as mostRecentRecord GROUP BY key
```
without Spark SQL? I'd like to run this from pyspark.


> Expand Star when creating a struct
> --
>
> Key: SPARK-11329
> URL: https://issues.apache.org/jira/browse/SPARK-11329
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Yin Huai
>Assignee: Nong Li
> Fix For: 1.6.0
>
>
> It is pretty common for customers to do regular extractions of update data 
> from an external datasource (e.g. mysql or postgres). While this is possible 
> today, the syntax is a little onerous. With some small improvements to the 
> analyzer I think we could make this much easier.
> Goal: Allow users to execute the following two queries as well as their 
> dataframe equivalents
> to find the most recent record for each key
> {{SELECT max(struct(timestamp, *)) as mostRecentRecord GROUP BY key}}
> to unnest the struct from above.
> {{SELECT mostRecentRecord.* FROM data}}






[jira] [Commented] (SPARK-9999) Dataset API on top of Catalyst/DataFrame

2015-11-23 Thread Xiao Li (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-9999?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022883#comment-15022883
 ] 

Xiao Li commented on SPARK-9999:


Agree. The major performance gain of Datasets should come from the Catalyst optimizer. 

> Dataset API on top of Catalyst/DataFrame
> 
>
> Key: SPARK-9999
> URL: https://issues.apache.org/jira/browse/SPARK-9999
> Project: Spark
>  Issue Type: Story
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Michael Armbrust
>
> The RDD API is very flexible, but as a result its execution is harder to 
> optimize in some cases. The DataFrame API, on the other hand, is much easier 
> to optimize, but lacks some of the nice perks of the RDD API (e.g. harder to 
> use UDFs, lack of strong types in Scala/Java).
> The goal of Spark Datasets is to provide an API that allows users to easily 
> express transformations on domain objects, while also providing the 
> performance and robustness advantages of the Spark SQL execution engine.
> h2. Requirements
>  - *Fast* - In most cases, the performance of Datasets should be equal to or 
> better than working with RDDs.  Encoders should be as fast or faster than 
> Kryo and Java serialization, and unnecessary conversion should be avoided.
>  - *Typesafe* - Similar to RDDs, objects and functions that operate on those 
> objects should provide compile-time safety where possible.  When converting 
> from data where the schema is not known at compile-time (for example data 
> read from an external source such as JSON), the conversion function should 
> fail-fast if there is a schema mismatch.
>  - *Support for a variety of object models* - Default encoders should be 
> provided for a variety of object models: primitive types, case classes, 
> tuples, POJOs, JavaBeans, etc.  Ideally, objects that follow standard 
> conventions, such as Avro SpecificRecords, should also work out of the box.
>  - *Java Compatible* - Datasets should provide a single API that works in 
> both Scala and Java.  Where possible, shared types like Array will be used in 
> the API.  Where not possible, overloaded functions should be provided for 
> both languages.  Scala concepts, such as ClassTags should not be required in 
> the user-facing API.
>  - *Interoperates with DataFrames* - Users should be able to seamlessly 
> transition between Datasets and DataFrames, without specifying conversion 
> boiler-plate.  When names used in the input schema line-up with fields in the 
> given class, no extra mapping should be necessary.  Libraries like MLlib 
> should not need to provide different interfaces for accepting DataFrames and 
> Datasets as input.
> For a detailed outline of the complete proposed API: 
> [marmbrus/dataset-api|https://github.com/marmbrus/spark/pull/18/files]
> For an initial discussion of the design considerations in this API: [design 
> doc|https://docs.google.com/document/d/1ZVaDqOcLm2-NcS0TElmslHLsEIEwqzt0vBvzpLrV6Ik/edit#]
> The initial version of the Dataset API has been merged in Spark 1.6. However, 
> it will take a few more future releases to flush everything out.






[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: Weighted Least Squares (WLS) is one of the optimization method 
for solve Linear Regression (when #feature < 4096). But if the dataset is very 
ill condition (such as 0-1 based label used for classification and with 
underdetermined equation), the WLS failed.   (was: Weighted Least Squares (WLS) 
is one of the optimization method for solve Linear Regression (when #feature < 
4096). But if the dataset is very ill condition (such as 0-1 based label used 
for classification and with underdetermined equation), the WLS may failure. )

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0-1 label used for classification with an 
> underdetermined equation), WLS fails. 






[jira] [Created] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11918:
---

 Summary: WLS can not resolve some kinds of equation
 Key: SPARK-11918
 URL: https://issues.apache.org/jira/browse/SPARK-11918
 Project: Spark
  Issue Type: Bug
  Components: ML
Reporter: Yanbo Liang


Weighted Least Squares (WLS) is one of the optimization methods for solving Linear 
Regression (when #features < 4096). But if the label of the dataset is very 
ill-conditioned (such as a 0-1 label used for classification), WLS may fail. 






[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: Weighted Least Squares (WLS) is one of the optimization method 
for solve Linear Regression (when #feature < 4096). But if the dataset is very 
ill condition (such as 0-1 based label used for classification and with 
underdetermined equation), the WLS may failure.   (was: Weighted Least Squares 
(WLS) is one of the optimization method for solve Linear Regression (when 
#feature < 4096). But if the label of the dataset is very ill condition (such 
as 0-1 based label used for classification), the WLS may failure. )

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0-1 label used for classification with an 
> underdetermined equation), WLS may fail. 






[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: Weighted Least Squares (WLS) is one of the optimization method 
for solve Linear Regression (when #feature < 4096). But if the dataset is very 
ill condition (such as 0-1 based label used for classification and the equation 
is underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.  (was: Weighted Least Squares (WLS) is one of the 
optimization method for solve Linear Regression (when #feature < 4096). But if 
the dataset is very ill condition (such as 0-1 based label used for 
classification and with underdetermined equation), the WLS failed. )

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0-1 label used for classification and the equation 
> is underdetermined), WLS fails. The failure is caused by the underlying 
> Cholesky decomposition.






[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: 
Weighted Least Squares (WLS) is one of the optimization method for solve Linear 
Regression (when #feature < 4096). But if the dataset is very ill condition 
(such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.
This issue is easy to reproduce, you can train a LinearRegressionModel by 
"normal" solver with the example 
dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

  was:
Weighted Least Squares (WLS) is one of the optimization method for solve Linear 
Regression (when #feature < 4096). But if the dataset is very ill condition 
(such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.
This issue is easy to reproduce, you can train a LinearRegressionModel by 
"normal" solver with the example 
dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
{code}


> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0-1 label used for classification and the equation 
> is underdetermined), WLS fails. The failure is caused by the underlying 
> Cholesky decomposition.
> This issue is easy to reproduce: you can train a LinearRegressionModel with the 
> "normal" solver on the example 
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
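For convenience, a hedged sketch of the reproduction described above, assuming a 
Spark 1.6 spark-shell sqlContext and the libsvm data source:

{code}
import org.apache.spark.ml.regression.LinearRegression

// 0/1-labeled classification data: ill-conditioned for an exact least-squares solve.
val data = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_libsvm_data.txt")

// The "normal" solver goes through WeightedLeastSquares and the Cholesky path that
// triggers the lapack.dpotrs assertion above; the "l-bfgs" solver trains fine.
val lr = new LinearRegression().setSolver("normal")
val model = lr.fit(data)  // throws java.lang.AssertionError with the normal solver
{code}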






[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: 
Weighted Least Squares (WLS) is one of the optimization method for solve Linear 
Regression (when #feature < 4096). But if the dataset is very ill condition 
(such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.
This issue is easy to reproduce, you can train a LinearRegressionModel by 
"normal" solver with the example 
dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
{code}

  was:Weighted Least Squares (WLS) is one of the optimization method for solve 
Linear Regression (when #feature < 4096). But if the dataset is very ill 
condition (such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.


> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
>
> Weighted Least Squares (WLS) is one of the optimization methods for solving 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0-1 label used for classification and the equation 
> is underdetermined), WLS fails. The failure is caused by the underlying 
> Cholesky decomposition.
> This issue is easy to reproduce: you can train a LinearRegressionModel with the 
> "normal" solver on the example 
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
> The following is the exception:
> {code}
> {code}






[jira] [Resolved] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11919.
---
Resolution: Duplicate

[~benedict jin] please search JIRA before opening one. This was easy to find.

> graphx should be supported with java
> 
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples, GraphX, Java API
>Reporter: benedict jin
>
> Please make the GraphX component supported in Java; I hope a demo 
> and a Java API for GraphX appear as soon as possible. :-)






[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021787#comment-15021787
 ] 

Sean Owen commented on SPARK-11918:
---

[~yanboliang] yes this is true in general of ill-conditioned problems. What are 
you proposing? to propagate the error from lapack in a different way? check the 
condition number? it's roughly speaking the correct behavior in that there's no 
real answer here.

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed (But "l-bfgs" can train and get the 
> model). The failure is caused by the underneath lapack library return error 
> value when Cholesky decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11782) Master Web UI should link to correct Application UI in cluster mode

2015-11-23 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-11782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021811#comment-15021811
 ] 

Jean-Baptiste Onofré commented on SPARK-11782:
--

Ah ok, understood. Let me try to reproduce and submit a fix.

> Master Web UI should link to correct Application UI in cluster mode
> ---
>
> Key: SPARK-11782
> URL: https://issues.apache.org/jira/browse/SPARK-11782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.1
>Reporter: Matthias Niehoff
>
> - Running a standalone cluster, with node1 as master
> - Submit an application to cluster with deploy-mode=cluster
> - Application driver is on node other than node1 (i.e. node3)
> => master WebUI links to node1:4040 for Application Detail UI and not to 
> node3:4040
> As the master knows on which worker the driver is running, it should be 
> possible to show the correct link to the Application Detail UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11921) fix `nullable` of encoder schema

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11921:


Assignee: (was: Apache Spark)

> fix `nullable` of encoder schema
> 
>
> Key: SPARK-11921
> URL: https://issues.apache.org/jira/browse/SPARK-11921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11921) fix `nullable` of encoder schema

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11921:


Assignee: Apache Spark

> fix `nullable` of encoder schema
> 
>
> Key: SPARK-11921
> URL: https://issues.apache.org/jira/browse/SPARK-11921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11921) fix `nullable` of encoder schema

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021833#comment-15021833
 ] 

Apache Spark commented on SPARK-11921:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9906

> fix `nullable` of encoder schema
> 
>
> Key: SPARK-11921
> URL: https://issues.apache.org/jira/browse/SPARK-11921
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-6531) An Information Theoretic Feature Selection Framework

2015-11-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-6531?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-6531.
--
Resolution: Won't Fix

> An Information Theoretic Feature Selection Framework
> 
>
> Key: SPARK-6531
> URL: https://issues.apache.org/jira/browse/SPARK-6531
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Affects Versions: 1.3.0
>Reporter: Sergio RamĂ­rez
>
> **Information Theoretic Feature Selection Framework**
> The present framework implements Feature Selection (FS) on Spark for its 
> application on Big Data problems. This package contains a generic 
> implementation of greedy Information Theoretic Feature Selection methods. The 
> implementation is based on the common theoretic framework presented in [1]. 
> Implementations of mRMR, InfoGain, JMI and other commonly used FS filters are 
> provided. In addition, the framework can be extended with other criteria 
> provided by the user as long as the process complies with the framework 
> proposed in [1].
> -- Main features:
> * Support for sparse data (in progress).
> * Pool optimization for high-dimensional data.
> * Improved performance from previous version.
> This work has two associated contributions submitted to international journals, 
> which will be attached to this request as soon as they are accepted.
> This software has been tested on two large real-world datasets:
> - A dataset selected for the GECCO-2014 in Vancouver, July 13th, 2014 
> competition, which comes from the Protein Structure Prediction field 
> (http://cruncher.ncl.ac.uk/bdcomp/). The dataset has 32 million instances, 
> 631 attributes, 2 classes, 98% of negative examples and occupies, when 
> uncompressed, about 56GB of disk space.
> - Epsilon dataset: 
> http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#epsilon. 
> 400K instances and 2K attributes.
> -- Brief benchmark results:
> * 150 seconds per selected feature for a 65M dataset with 631 attributes. 
> * For the epsilon dataset, we outperformed the results without FS for three 
> classifiers (from MLlib) using only 2.5% of the original features.
> Design doc: 
> https://docs.google.com/document/d/1HOaPL_HJzTbL2tVdzbTjhr5wxVvPe9e-23S7rc2VcsY/edit?usp=sharing
> References
> [1] Brown, G., Pocock, A., Zhao, M. J., & LujĂĄn, M. (2012). 
> "Conditional likelihood maximisation: a unifying framework for information 
> theoretic feature selection." 
> The Journal of Machine Learning Research, 13(1), 27-66.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11915) Fix flaky python test pyspark.sql.group

2015-11-23 Thread Liang-Chi Hsieh (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Liang-Chi Hsieh closed SPARK-11915.
---
Resolution: Not A Problem

> Fix flaky python test pyspark.sql.group
> ---
>
> Key: SPARK-11915
> URL: https://issues.apache.org/jira/browse/SPARK-11915
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Reporter: Liang-Chi Hsieh
>
> The python test pyspark.sql.group will fail due to items' order in returned 
> array. We should sort the aggregation results to make the test stable.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11925) Add PySpark missing methods for ml.feature

2015-11-23 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11925:
---

 Summary: Add PySpark missing methods for ml.feature
 Key: SPARK-11925
 URL: https://issues.apache.org/jira/browse/SPARK-11925
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


Add PySpark missing methods for ml.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11925:


Assignee: Apache Spark

> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> Add PySpark missing methods for ml.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11925:


Assignee: (was: Apache Spark)

> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add PySpark missing methods for ml.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021955#comment-15021955
 ] 

Apache Spark commented on SPARK-11925:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9908

> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add PySpark missing methods for ml.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11604:

Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:

* Docs:
** ml.classification SPARK-11875

* Missing classes
** ml.feature 
*** QuantileDiscretizer SPARK-11922
*** ChiSqSelector SPARK-11923

* Missing methods/parameters
** ml.classification SPARK-11815 SPARK-11820
** ml.feature SPARK-11925

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:

* Docs:
** ml.classification SPARK-11875

* Missing classes
** ml.feature 
*** QuantileDiscretizer SPARK-11922
*** ChiSqSelector SPARK-11923

* Missing methods/parameters
** ml.classification SPARK-11815 SPARK-11820
** ml.feature 


> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.
> List the found issues:
> * Inconsistency:
> * Docs:
> ** ml.classification SPARK-11875
> * Missing classes
> ** ml.feature 
> *** QuantileDiscretizer SPARK-11922
> *** ChiSqSelector SPARK-11923
> * Missing methods/parameters
> ** ml.classification SPARK-11815 SPARK-11820
> ** ml.feature SPARK-11925



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11924) SparkContext stop method does not close HiveContexts

2015-11-23 Thread Sylvain Lequeux (JIRA)
Sylvain Lequeux created SPARK-11924:
---

 Summary: SparkContext stop method does not close HiveContexts
 Key: SPARK-11924
 URL: https://issues.apache.org/jira/browse/SPARK-11924
 Project: Spark
  Issue Type: Bug
Affects Versions: 1.5.1
Reporter: Sylvain Lequeux


In my unit tests, I create one HiveContext inside each test suite, creating the 
HiveContext in beforeAll and stopping the SparkContext in afterAll.

If I have 2 test suites, it raises an exception at runtime:
"Another instance of Derby may have already booted the database"

It works with Spark 1.4.1 but does not work in 1.5.1.
It works fine using standard SQLContext instead of HiveContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11925:

Summary: Add PySpark missing methods for ml.feature during Spark 1.6 QA  
(was: Add PySpark missing methods for ml.feature during 1.6 QA)

> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add PySpark missing methods for ml.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11924) SparkContext stop method does not close HiveContexts

2015-11-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11924?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021953#comment-15021953
 ] 

Sean Owen commented on SPARK-11924:
---

I'm not sure this works in general outside tests, or is supposed to, but 
clearly the tests are able to test {{HiveContext}} somehow. Are you able to use 
{{TestHiveContext}} instead or adopt its way of dealing with this?
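
For what it's worth, a rough sketch of that pattern (assuming the spark-hive test 
artifact, which provides {{TestHiveContext}}, is on the test classpath; untested 
here):
{code}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.test.TestHiveContext

// In beforeAll: one SparkContext plus a TestHiveContext backed by a scratch metastore.
val sc = new SparkContext(new SparkConf().setMaster("local[*]").setAppName("suite"))
val hive = new TestHiveContext(sc)

// ... run the suite's queries against `hive` ...

// In afterAll: reset the Hive test state and stop the context.
hive.reset()
sc.stop()
{code}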

> SparkContext stop method does not close HiveContexts
> 
>
> Key: SPARK-11924
> URL: https://issues.apache.org/jira/browse/SPARK-11924
> Project: Spark
>  Issue Type: Bug
>Affects Versions: 1.5.1
>Reporter: Sylvain Lequeux
>
> In my unit tests, I create one HiveContext inside each test suite using a 
> HiveContext creation in beforeAll and a SparkContext stop call in afterAll.
> If I have 2 test suites, then it raise an exception at runtime :
> "Another instance of Derby may have already booted the database"
> It works with Spark 1.4.1 but does not work in 1.5.1.
> It works fine using standard SQLContext instead of HiveContext.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11925) Add PySpark missing methods for ml.feature during 1.6 QA

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11925:

Summary: Add PySpark missing methods for ml.feature during 1.6 QA  (was: 
Add PySpark missing methods for ml.feature)

> Add PySpark missing methods for ml.feature during 1.6 QA
> 
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add PySpark missing methods for ml.feature



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11925) Add PySpark missing methods for ml.feature during Spark 1.6 QA

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11925:

Description: 
Add PySpark missing methods and params for ml.feature
* RegexTokenizer should support setting toLowercase.
* MinMaxScalerModel should support output originalMin and originalMax.
* PCAModel should support output pc.

  was:Add PySpark missing methods for ml.feature


> Add PySpark missing methods for ml.feature during Spark 1.6 QA
> --
>
> Key: SPARK-11925
> URL: https://issues.apache.org/jira/browse/SPARK-11925
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>
> Add PySpark missing methods and params for ml.feature
> * RegexTokenizer should support setting toLowercase.
> * MinMaxScalerModel should support output originalMin and originalMax.
> * PCAModel should support output pc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11782) Master Web UI should link to correct Application UI in cluster mode

2015-11-23 Thread Matthias Niehoff (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021730#comment-15021730
 ] 

Matthias Niehoff commented on SPARK-11782:
--

I submit the app with deploy-mode cluster, so the driver gets started inside the 
cluster. That can be any node, not necessarily the node where spark-submit was 
executed.

> Master Web UI should link to correct Application UI in cluster mode
> ---
>
> Key: SPARK-11782
> URL: https://issues.apache.org/jira/browse/SPARK-11782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.1
>Reporter: Matthias Niehoff
>
> - Running a standalone cluster, with node1 as master
> - Submit an application to cluster with deploy-mode=cluster
> - Application driver is on node other than node1 (i.e. node3)
> => master WebUI links to node1:4040 for Application Detail UI and not to 
> node3:4040
> As the master knows on which worker the driver is running, it should be 
> possible to show the correct link to the Application Detail UI



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729
 ] 

Yanbo Liang commented on SPARK-11918:
-

Furthermore, I used the breeze library to train the model with a local 
normal-equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils    // needed for loadLibSVMFile below
import org.apache.spark.sql.functions.col     // needed for col(...) below
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, 
"/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()


val features = df.select(col("features")).map { r =>
  r.getAs[Vector](0)
}.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r =>
  r.getDouble(0)
}.collect()

val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t

val y = new DenseMatrix[Double](100, 1, labelArray)

val XtXi = inv(Xt * X)
val XtY = Xt * y

val coefs = XtXi * XtY

println(coefs.toString)
{code}
It also throws an exception: with only 100 examples and 692 features, the 692x692 
matrix Xt * X has rank at most 100, so it is singular and the inverse does not 
exist.

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed. The failure is caused by the underneath 
> Cholesky Decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread benedict jin (JIRA)
benedict jin created SPARK-11919:


 Summary: graphx should be supported with java
 Key: SPARK-11919
 URL: https://issues.apache.org/jira/browse/SPARK-11919
 Project: Spark
  Issue Type: Bug
  Components: Examples, GraphX, Java API
Reporter: benedict jin


Please make the graphx component supported in Java; I hope a demo and a Java API 
for graphx appear as soon as possible :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11920:
---

 Summary: ML LinearRegression should use correct dataset in 
examples and user guide doc
 Key: SPARK-11920
 URL: https://issues.apache.org/jira/browse/SPARK-11920
 Project: Spark
  Issue Type: Improvement
  Components: Documentation, ML
Reporter: Yanbo Liang
Priority: Minor


ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in 
examples and the user guide doc, but it is actually a classification dataset 
rather than a regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
Another reason is that LinearRegression with the "normal" solver cannot solve this 
dataset correctly, possibly due to its ill-conditioning and unreasonable labels. 
This issue has been reported as SPARK-11918.
So we should make this change in the examples and user guides, so that they 
clearly illustrate the usage of the LinearRegression algorithm.
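
A rough sketch of what the corrected example could look like (assuming an existing 
SparkContext {{sc}} and the libsvm data source available in 1.6; untested here):
{code}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)
// Load the regression dataset instead of the classification one.
val training = sqlContext.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

// The "normal" solver (WLS) should handle this dataset without the Cholesky failure.
val lr = new LinearRegression().setSolver("normal")
val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
{code}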



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Resolved] (SPARK-11909) Spark Standalone's master URL accepts URLs without port (assuming default 7077)

2015-11-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11909?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen resolved SPARK-11909.
---
Resolution: Won't Fix

> Spark Standalone's master URL accepts URLs without port (assuming default 
> 7077)
> ---
>
> Key: SPARK-11909
> URL: https://issues.apache.org/jira/browse/SPARK-11909
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Affects Versions: 1.6.0
>Reporter: Jacek Laskowski
>Priority: Trivial
>
> It's currently impossible to use a {{spark://localhost}} URL for Spark 
> Standalone's master. With this feature supported, there would be less to know to 
> get started with the mode (and hence improved user friendliness).
> I think a no-port master URL should be supported, assuming the default port 
> {{7077}}.
> {code}
> org.apache.spark.SparkException: Invalid master URL: spark://localhost
>   at 
> org.apache.spark.util.Utils$.extractHostPortFromSparkUrl(Utils.scala:2088)
>   at org.apache.spark.rpc.RpcAddress$.fromSparkURL(RpcAddress.scala:47)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> org.apache.spark.deploy.client.AppClient$$anonfun$1.apply(AppClient.scala:48)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:245)
>   at 
> scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
>   at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:186)
>   at scala.collection.TraversableLike$class.map(TraversableLike.scala:245)
>   at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:186)
>   at org.apache.spark.deploy.client.AppClient.(AppClient.scala:48)
>   at 
> org.apache.spark.scheduler.cluster.SparkDeploySchedulerBackend.start(SparkDeploySchedulerBackend.scala:93)
>   at 
> org.apache.spark.scheduler.TaskSchedulerImpl.start(TaskSchedulerImpl.scala:144)
>   at org.apache.spark.SparkContext.(SparkContext.scala:530)
> {code}
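
For reference, the proposal amounts to roughly the following hedged sketch (not 
the actual Utils.extractHostPortFromSparkUrl code):
{code}
import java.net.URI

// Accept spark://host as well as spark://host:port, defaulting the port to 7077.
def hostPortFromSparkUrl(sparkUrl: String): (String, Int) = {
  val uri = new URI(sparkUrl)
  require(uri.getScheme == "spark" && uri.getHost != null, s"Invalid master URL: $sparkUrl")
  val port = if (uri.getPort == -1) 7077 else uri.getPort
  (uri.getHost, port)
}
{code}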



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11520) RegressionMetrics should support instance weights

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021867#comment-15021867
 ] 

Apache Spark commented on SPARK-11520:
--

User 'Lewuathe' has created a pull request for this issue:
https://github.com/apache/spark/pull/9907

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.
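
To make the intent concrete, a minimal sketch of an instance-weighted metric 
(illustration only, not the RegressionMetrics API, which currently takes 
unweighted (prediction, label) pairs):
{code}
import org.apache.spark.rdd.RDD

// Weighted MSE: sum(w * (prediction - label)^2) / sum(w).
def weightedMse(rows: RDD[(Double, Double, Double)]): Double = {
  val (weightedSse, weightSum) = rows
    .map { case (prediction, label, weight) =>
      (weight * math.pow(prediction - label, 2), weight)
    }
    .reduce { case ((s1, w1), (s2, w2)) => (s1 + s2, w1 + w2) }
  weightedSse / weightSum
}
{code}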



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11520) RegressionMetrics should support instance weights

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11520:


Assignee: Apache Spark

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>Assignee: Apache Spark
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11922) Python API for ml.feature.QuantileDiscretizer

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11922?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11922:

Labels: starter  (was: )

> Python API for ml.feature.QuantileDiscretizer
> --
>
> Key: SPARK-11922
> URL: https://issues.apache.org/jira/browse/SPARK-11922
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Yanbo Liang
>Priority: Minor
>  Labels: starter
>
> Add Python API for ml.feature.QuantileDiscretizer.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11604:

Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:

* Docs:
** ml.classification SPARK-11875

* Missing classes
** ml.feature 
*** QuantileDiscretizer SPARK-11922
*** ChiSqSelector SPARK-11923

* Missing methods/parameters
** ml.classification SPARK-11815 SPARK-11820
** ml.feature 

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:

* Docs:
** ml.classification SPARK-11875

* Missing classes
** ml.classification SPARK-11815 SPARK-11820
** ml.feature 
*** QuantileDiscretizer SPARK-11922
*** ChiSqSelector SPARK-11923

* Missing methods/parameters
** ml.feature 


> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.
> List the found issues:
> * Inconsistency:
> * Docs:
> ** ml.classification SPARK-11875
> * Missing classes
> ** ml.feature 
> *** QuantileDiscretizer SPARK-11922
> *** ChiSqSelector SPARK-11923
> * Missing methods/parameters
> ** ml.classification SPARK-11815 SPARK-11820
> ** ml.feature 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11926) unify GetStructField and GetInternalRowField

2015-11-23 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11926:
---

 Summary: unify GetStructField and GetInternalRowField
 Key: SPARK-11926
 URL: https://issues.apache.org/jira/browse/SPARK-11926
 Project: Spark
  Issue Type: Improvement
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11926) unify GetStructField and GetInternalRowField

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021984#comment-15021984
 ] 

Apache Spark commented on SPARK-11926:
--

User 'cloud-fan' has created a pull request for this issue:
https://github.com/apache/spark/pull/9909

> unify GetStructField and GetInternalRowField
> 
>
> Key: SPARK-11926
> URL: https://issues.apache.org/jira/browse/SPARK-11926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11926) unify GetStructField and GetInternalRowField

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11926:


Assignee: Apache Spark

> unify GetStructField and GetInternalRowField
> 
>
> Key: SPARK-11926
> URL: https://issues.apache.org/jira/browse/SPARK-11926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>Assignee: Apache Spark
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11926) unify GetStructField and GetInternalRowField

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11926?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11926:


Assignee: (was: Apache Spark)

> unify GetStructField and GetInternalRowField
> 
>
> Key: SPARK-11926
> URL: https://issues.apache.org/jira/browse/SPARK-11926
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Wenchen Fan
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-3947) Support Scala/Java UDAF

2015-11-23 Thread Milad Bourhani (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Milad Bourhani updated SPARK-3947:
--
Attachment: logs.zip

> Support Scala/Java UDAF
> ---
>
> Key: SPARK-3947
> URL: https://issues.apache.org/jira/browse/SPARK-3947
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Pei-Lun Lee
>Assignee: Yin Huai
> Fix For: 1.5.0
>
> Attachments: logs.zip, spark-udaf-adapted-1.5.2.zip, spark-udaf.zip
>
>
> Right now only Hive UDAFs are supported. It would be nice to have UDAF 
> similar to UDF through SQLContext.registerFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-3947) Support Scala/Java UDAF

2015-11-23 Thread Milad Bourhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-3947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022051#comment-15022051
 ] 

Milad Bourhani commented on SPARK-3947:
---

Just for completeness, I'm attaching the [^logs.zip]. For the record, it looks 
as if the first time you run the clustered computation (right after you started 
{{./sbin/start-all.sh}}), the computation is OK, even though the race condition 
error shows up in the log. After that, it fails. So the attached logs contain 
exactly two executions: the first gives a correct answer, the second doesn't.
To reproduce, run these commands on the unzipped project 
[^spark-udaf-adapted-1.5.2.zip]:
{noformat}
mvn clean install
java -jar `ls target/uber*.jar` `ls target/uber*.jar` spark://master_host:7077
java -jar `ls target/uber*.jar` `ls target/uber*.jar` spark://master_host:7077
{noformat}
where {{spark://master_host:7077}} is your master URL.

> Support Scala/Java UDAF
> ---
>
> Key: SPARK-3947
> URL: https://issues.apache.org/jira/browse/SPARK-3947
> Project: Spark
>  Issue Type: Sub-task
>  Components: SQL
>Reporter: Pei-Lun Lee
>Assignee: Yin Huai
> Fix For: 1.5.0
>
> Attachments: logs.zip, spark-udaf-adapted-1.5.2.zip, spark-udaf.zip
>
>
> Right now only Hive UDAFs are supported. It would be nice to have UDAF 
> similar to UDF through SQLContext.registerFunction.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired

2015-11-23 Thread Jayson Minard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022081#comment-15022081
 ] 

Jayson Minard commented on SPARK-939:
-

Ok, but how does this actually work? Any time we add Jackson 2.6.x to our classes, 
it crashes Spark when run on AWS EMR. Obviously our JAR is not isolated from 
Spark.

> Allow user jars to take precedence over Spark jars, if desired
> --
>
> Key: SPARK-939
> URL: https://issues.apache.org/jira/browse/SPARK-939
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: holdenk
>Priority: Blocker
>  Labels: starter
> Fix For: 1.0.0
>
>
> Sometimes a user may want to include their own version of a jar that spark 
> itself uses. For example, if their code requires a newer version of that jar 
> than Spark offers. It would be good to have an option to give the users 
> dependencies precedence over Spark. This options should be disabled by 
> default, since it could lead to some odd behavior (e.g. parts of Spark not 
> working). But I think we should have it.
> From an implementation perspective, this would require modifying the way we 
> do class loading inside of an Executor. The default behavior of the  
> URLClassLoader is to delegate to it's parent first and, if that fails, to 
> find a class locally. We want to have the opposite behavior. This is 
> sometimes referred to as "parent-last" (as opposed to "parent-first") class 
> loading precedence. There is an example of how to do this here:
> http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr
> We should write a similar class which can encapsulate a URL classloader and 
> change the delegation order. Or if possible, maybe we could find a more 
> elegant way to do this. See relevant discussion on the user list here:
> https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g
> Also see the corresponding option in Hadoop:
> https://issues.apache.org/jira/browse/MAPREDUCE-4521
> Some other relevant Hadoop JIRA's:
> https://issues.apache.org/jira/browse/MAPREDUCE-1700
> https://issues.apache.org/jira/browse/MAPREDUCE-1938



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-939) Allow user jars to take precedence over Spark jars, if desired

2015-11-23 Thread Jayson Minard (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-939?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022084#comment-15022084
 ] 

Jayson Minard commented on SPARK-939:
-

I see, this now reverses the problem.  We can crash Spark, but Spark can't 
crash us.   Maybe Spark can shade/rename common libraries it uses that are 
likely to conflict (Jackson is one example).
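
For reference, "parent-last" delegation as described in the quoted issue below 
boils down to roughly this (a sketch, not Spark's actual implementation):
{code}
import java.net.{URL, URLClassLoader}

// Child-first class loader: search the user-supplied URLs before delegating to the
// parent (Spark) loader, so user jars can shadow Spark's own copies of a library.
class ChildFirstClassLoader(urls: Array[URL], realParent: ClassLoader)
  extends URLClassLoader(urls, null) {  // null parent: super only sees bootstrap + urls

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    try {
      super.loadClass(name, resolve)            // user jars (and bootstrap) first
    } catch {
      case _: ClassNotFoundException =>
        realParent.loadClass(name)              // fall back to Spark's class loader
    }
  }
}
{code}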

> Allow user jars to take precedence over Spark jars, if desired
> --
>
> Key: SPARK-939
> URL: https://issues.apache.org/jira/browse/SPARK-939
> Project: Spark
>  Issue Type: New Feature
>  Components: Spark Core
>Reporter: Patrick Wendell
>Assignee: holdenk
>Priority: Blocker
>  Labels: starter
> Fix For: 1.0.0
>
>
> Sometimes a user may want to include their own version of a jar that spark 
> itself uses. For example, if their code requires a newer version of that jar 
> than Spark offers. It would be good to have an option to give the users 
> dependencies precedence over Spark. This options should be disabled by 
> default, since it could lead to some odd behavior (e.g. parts of Spark not 
> working). But I think we should have it.
> From an implementation perspective, this would require modifying the way we 
> do class loading inside of an Executor. The default behavior of the  
> URLClassLoader is to delegate to it's parent first and, if that fails, to 
> find a class locally. We want to have the opposite behavior. This is 
> sometimes referred to as "parent-last" (as opposed to "parent-first") class 
> loading precedence. There is an example of how to do this here:
> http://stackoverflow.com/questions/5445511/how-do-i-create-a-parent-last-child-first-classloader-in-java-or-how-to-overr
> We should write a similar class which can encapsulate a URL classloader and 
> change the delegation order. Or if possible, maybe we could find a more 
> elegant way to do this. See relevant discussion on the user list here:
> https://groups.google.com/forum/#!topic/spark-users/b278DW3e38g
> Also see the corresponding option in Hadoop:
> https://issues.apache.org/jira/browse/MAPREDUCE-4521
> Some other relevant Hadoop JIRA's:
> https://issues.apache.org/jira/browse/MAPREDUCE-1700
> https://issues.apache.org/jira/browse/MAPREDUCE-1938



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11885) UDAF may nondeterministically generate wrong results

2015-11-23 Thread Milad Bourhani (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11885?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022053#comment-15022053
 ] 

Milad Bourhani commented on SPARK-11885:


I've attached the logs on SPARK-3947, just to keep all the attachments there, 
of course feel free to move/copy them here :)

> UDAF may nondeterministically generate wrong results
> 
>
> Key: SPARK-11885
> URL: https://issues.apache.org/jira/browse/SPARK-11885
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 1.5.2
>Reporter: Yin Huai
>Assignee: Yin Huai
>Priority: Critical
>
> I could not reproduce it in 1.6 branch (it can be easily reproduced in 1.5). 
> I think it is an issue in 1.5 branch



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11927) configure log4j properties with spark-submit

2015-11-23 Thread Alex Kazantsev (JIRA)
Alex Kazantsev created SPARK-11927:
--

 Summary: configure log4j properties with spark-submit 
 Key: SPARK-11927
 URL: https://issues.apache.org/jira/browse/SPARK-11927
 Project: Spark
  Issue Type: Question
  Components: Spark Submit
Affects Versions: 1.5.1
Reporter: Alex Kazantsev
Priority: Minor


How do I properly configure log4j properties on a worker, per application, using 
the spark-submit script?
Currently setting --conf 
'spark.executor.extraJavaOptions=-Dlog4j.configuration=file:"log4j.properties"' 
and --files log4j.properties does not work, because according to the worker logs 
the specified log4j configuration is loaded before any files are downloaded from 
the driver. Is this a bug or a feature? Is it possible to reconfigure log4j 
properties after the properties file has been downloaded from the driver?

Application was submitted with following script
{noformat}
exec /opt/spark/current/bin/spark-submit \
   --name App \
   --master spark://master:17079 \
   --executor-memory 4G \
   --total-executor-cores 4 \
   --driver-java-options '-Dspark.ui.port=4056 -Dconfig.file=application.conf  
-Dlog4j.configuration=file:"./log4j.properties"' \
   --conf 'spark.executor.extraJavaOptions=-XX:+UseParallelGC 
-Duser.timezone=GMT -Dconfig.file=application.conf 
-Dlog4j.configuration=file:"log4j.properties"' \
   --files application.conf,log4j.properties \
   --class default.Main \
   App.jar $*
{noformat}

Worker logs:
{noformat}
log4j:ERROR Could not read configuration file from URL [file:log4j.properties].
java.io.FileNotFoundException: log4j.properties (No such file or directory)
at java.io.FileInputStream.open(Native Method)
at java.io.FileInputStream.(FileInputStream.java:146)
at java.io.FileInputStream.(FileInputStream.java:101)
at 
sun.net.www.protocol.file.FileURLConnection.connect(FileURLConnection.java:90)
at 
sun.net.www.protocol.file.FileURLConnection.getInputStream(FileURLConnection.java:188)
at 
org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:557)
at 
org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
at org.apache.log4j.LogManager.(LogManager.java:127)
at org.apache.spark.Logging$class.initializeLogging(Logging.scala:122)
at 
org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:107)
at org.apache.spark.Logging$class.log(Logging.scala:51)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.log(CoarseGrainedExecutorBackend.scala:136)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:147)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:250)
at 
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
log4j:ERROR Ignoring configuration file [file:log4j.properties].
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
15/11/23 11:47:30 INFO CoarseGrainedExecutorBackend: Registered signal handlers 
for [TERM, HUP, INT]
15/11/23 11:47:30 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
15/11/23 11:47:30 INFO SecurityManager: Changing view acls to: root
15/11/23 11:47:30 INFO SecurityManager: Changing modify acls to: root
15/11/23 11:47:30 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/11/23 11:47:31 INFO Slf4jLogger: Slf4jLogger started
15/11/23 11:47:31 INFO Remoting: Starting remoting
15/11/23 11:47:31 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://driverPropsFetcher@10.1.1.102:33953]
15/11/23 11:47:31 INFO Utils: Successfully started service 'driverPropsFetcher' 
on port 33953.
15/11/23 11:47:31 INFO SecurityManager: Changing view acls to: root
15/11/23 11:47:31 INFO SecurityManager: Changing modify acls to: root
15/11/23 11:47:31 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users with view permissions: Set(root); users with 
modify permissions: Set(root)
15/11/23 11:47:31 INFO RemoteActorRefProvider$RemotingTerminator: Shutting down 
remote daemon.
15/11/23 11:47:32 INFO RemoteActorRefProvider$RemotingTerminator: Remote daemon 
shut down; proceeding with flushing remote transports.
15/11/23 11:47:32 INFO RemoteActorRefProvider$RemotingTerminator: Remoting shut 
down.
15/11/23 11:47:32 INFO Slf4jLogger: Slf4jLogger started
15/11/23 11:47:32 INFO Remoting: Starting remoting
15/11/23 11:47:32 INFO Remoting: Remoting started; listening on addresses 
:[akka.tcp://sparkExecutor@10.1.1.102:39111]
15/11/23 11:47:32 INFO Utils: Successfully started service 

[jira] [Commented] (SPARK-10872) Derby error (XSDB6) when creating new HiveContext after restarting SparkContext

2015-11-23 Thread Dmytro Bielievtsov (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15022061#comment-15022061
 ] 

Dmytro Bielievtsov commented on SPARK-10872:


Removing {{metastore_db/dbex.lck}} right before {{sc = 
SparkContext("local\[*]", "app2")}} precludes the error, but it's a dangerous 
workaround. Having something like {{HiveContext.stop()}} that releases the 
locks would be best.

> Derby error (XSDB6) when creating new HiveContext after restarting 
> SparkContext
> ---
>
> Key: SPARK-10872
> URL: https://issues.apache.org/jira/browse/SPARK-10872
> Project: Spark
>  Issue Type: Bug
>  Components: PySpark, SQL
>Affects Versions: 1.4.0, 1.4.1, 1.5.0
>Reporter: Dmytro Bielievtsov
>
> Starting from spark 1.4.0 (works well on 1.3.1), the following code fails 
> with "XSDB6: Another instance of Derby may have already booted the database 
> ~/metastore_db":
> {code:python}
> from pyspark import SparkContext, HiveContext
> sc = SparkContext("local[*]", "app1")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()
> sc.stop()
> sc = SparkContext("local[*]", "app2")
> sql = HiveContext(sc)
> sql.createDataFrame([[1]]).collect()  # Py4J error
> {code}
> This is related to [#SPARK-9539], and I intend to restart spark context 
> several times for isolated jobs to prevent cache cluttering and GC errors.
> Here's a larger part of the full error trace:
> {noformat}
> Failed to start database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
> org.datanucleus.exceptions.NucleusDataStoreException: Failed to start 
> database 'metastore_db' with class loader 
> org.apache.spark.sql.hive.client.IsolatedClientLoader$$anon$1@13015ec0, see 
> the next exception for details.
>   at 
> org.datanucleus.store.rdbms.ConnectionFactoryImpl$ManagedConnectionImpl.getConnection(ConnectionFactoryImpl.java:516)
>   at 
> org.datanucleus.store.rdbms.RDBMSStoreManager.(RDBMSStoreManager.java:298)
>   at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>   at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>   at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
>   at 
> org.datanucleus.plugin.NonManagedPluginRegistry.createExecutableExtension(NonManagedPluginRegistry.java:631)
>   at 
> org.datanucleus.plugin.PluginManager.createExecutableExtension(PluginManager.java:301)
>   at 
> org.datanucleus.NucleusContext.createStoreManagerForProperties(NucleusContext.java:1187)
>   at org.datanucleus.NucleusContext.initialise(NucleusContext.java:356)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.freezeConfiguration(JDOPersistenceManagerFactory.java:775)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.createPersistenceManagerFactory(JDOPersistenceManagerFactory.java:333)
>   at 
> org.datanucleus.api.jdo.JDOPersistenceManagerFactory.getPersistenceManagerFactory(JDOPersistenceManagerFactory.java:202)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:606)
>   at javax.jdo.JDOHelper$16.run(JDOHelper.java:1965)
>   at java.security.AccessController.doPrivileged(Native Method)
>   at javax.jdo.JDOHelper.invoke(JDOHelper.java:1960)
>   at 
> javax.jdo.JDOHelper.invokeGetPersistenceManagerFactoryOnImplementation(JDOHelper.java:1166)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:808)
>   at javax.jdo.JDOHelper.getPersistenceManagerFactory(JDOHelper.java:701)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPMF(ObjectStore.java:365)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.getPersistenceManager(ObjectStore.java:394)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:291)
>   at 
> org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:258)
>   at 
> org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
>   at 
> org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.(RawStoreProxy.java:57)
>   at 
> org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:66)
>   at 
> 

[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: 
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear 
Regression (when #features < 4096). But if the dataset is very ill-conditioned 
(for example, a 0/1 classification label and an underdetermined equation), WLS 
fails. The failure is caused by the underlying LAPACK library returning an error 
value during the Cholesky decomposition.
This issue is easy to reproduce: train a LinearRegressionModel with the 
"normal" solver on the example dataset 
(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

  was:
Weighted Least Squares (WLS) is one of the optimization method for solve Linear 
Regression (when #feature < 4096). But if the dataset is very ill condition 
(such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.
This issue is easy to reproduce, you can train a LinearRegressionModel by 
"normal" solver with the example 
dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}
It's caused by the underneath lapack library return error value. 


> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed. The failure is caused by the underneath 
> lapack library return error value when Cholesky decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: 
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear 
Regression (when #features < 4096). But if the dataset is very ill-conditioned 
(for example, a 0/1 classification label and an underdetermined equation), WLS 
fails (while "l-bfgs" can still train and produce a model). The failure is caused 
by the underlying LAPACK library returning an error value during the Cholesky 
decomposition.
This issue is easy to reproduce: train a LinearRegressionModel with the 
"normal" solver on the example dataset 
(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

  was:
Weighted Least Squares (WLS) is one of the optimization method for solve Linear 
Regression (when #feature < 4096). But if the dataset is very ill condition 
(such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed (But the "l-bfgs" can train and get the 
model). The failure is caused by the underneath lapack library return error 
value when Cholesky decomposition.
This issue is easy to reproduce, you can train a LinearRegressionModel by 
"normal" solver with the example 
dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}


> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed (But "l-bfgs" can train and get the 
> model). The failure is caused by the underneath lapack library return error 
> value when Cholesky decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread benedict jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

benedict jin updated SPARK-11919:
-
Issue Type: New Feature  (was: Bug)

> graphx should be supported with java
> 
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples, GraphX, Java API
>Reporter: benedict jin
>
> Please make the graphx component to be supported with java, hope appear demo 
> and java api for graphx as soon as possible :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021756#comment-15021756
 ] 

Yanbo Liang commented on SPARK-11918:
-

So far, I suspect this is not a bug in MLlib; rather, a very ill-conditioned 
problem may simply not be suitable for the "normal" equation method. If this 
assumption is right, I think we should document this issue. Looking forward 
to your comments, [~mengxr].
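
To illustrate why the "normal" equation path breaks down on such data, here is a 
minimal breeze sketch (the matrix values are made up for the example; only breeze 
is assumed on the classpath). With more features than observations the Gram matrix 
X^T X is singular, and solving the normal equations fails just like the 
Cholesky-based WLS solve:
{code}
import breeze.linalg.{DenseMatrix, DenseVector, inv}

// Toy data: 2 observations, 3 features, so the 3 x 3 Gram matrix has rank <= 2.
val X = DenseMatrix((1.0, 2.0, 3.0),
                    (2.0, 4.0, 6.5))
val y = DenseVector(1.0, 0.0)

val gram = X.t * X  // singular Gram matrix of the underdetermined system
try {
  val coefs = inv(gram) * (X.t * y)  // normal-equation solve
  println(coefs)
} catch {
  // breeze reports the singularity, analogous to lapack.dpotrs returning a
  // non-zero value in the WLS Cholesky solve
  case e: Exception => println(s"normal equations failed: $e")
}
{code}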

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed (But "l-bfgs" can train and get the 
> model). The failure is caused by the underneath lapack library return error 
> value when Cholesky decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11920:

Description: 
ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the 
examples and the user guide doc, but it is actually a classification dataset rather 
than a regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
The deeper cause is that LinearRegression with the "normal" solver cannot solve 
this dataset correctly, possibly because of the ill-conditioning and unreasonable 
labels. This issue has been reported as SPARK-11918.
So we should make this change in the examples and user guides, which can then 
clearly illustrate the usage of the LinearRegression algorithm.
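
For reference, a minimal sketch of the intended example (assuming a SparkContext 
{{sc}} and SQLContext {{sqlCtx}}, as in the other snippets on these issues): load 
the regression dataset and fit with the "normal" solver, the configuration that 
currently fails on the classification data:
{code}
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.mllib.util.MLUtils
import sqlCtx.implicits._

// Regression dataset instead of the 0/1-labelled classification dataset
val training = MLUtils.loadLibSVMFile(sc,
  "data/mllib/sample_linear_regression_data.txt").toDF()

val lr = new LinearRegression()
  .setSolver("normal")  // the solver that breaks on sample_libsvm_data.txt
  .setRegParam(0.3)

val model = lr.fit(training)
println(s"Coefficients: ${model.coefficients} Intercept: ${model.intercept}")
{code}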

  was:
ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
examples and user guide doc, but it's actually classification dataset rather 
than regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
The deeper level reason is that LinearRegression with "normal" solver can not 
solve this dataset correctly, may be due to the ill condition and unreasonable 
label. This issue has been reported at SPARK-11918.
So we should make this change in examples and user guides, that can clearly 
illustrate the usage of LinearRegression algorithm.


> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
> examples and user guide doc, but it's actually classification dataset rather 
> than regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper causes is that LinearRegression with "normal" solver can not solve 
> this dataset correctly, may be due to the ill condition and unreasonable 
> label. This issue has been reported at SPARK-11918.
> So we should make this change in examples and user guides, that can clearly 
> illustrate the usage of LinearRegression algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11920:

Description: 
ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the 
examples and the user guide doc, but it is actually a classification dataset rather 
than a regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
The deeper-level reason is that LinearRegression with the "normal" solver cannot 
solve this dataset correctly, possibly because of the ill-conditioning and 
unreasonable labels. This issue has been reported as SPARK-11918.
So we should make this change in the examples and user guides, which can then 
clearly illustrate the usage of the LinearRegression algorithm.

  was:
ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
examples and user guide doc, but it's actually classification dataset rather 
than regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
Another reason is that LinearRegression with "normal" solver can not solve this 
dataset correctly, may be due to the ill condition and unreasonable label. This 
issue has been reported at SPARK-11918.
So we should make this change in examples and user guides, that can clearly 
illustrate the usage of LinearRegression algorithm.


> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
> examples and user guide doc, but it's actually classification dataset rather 
> than regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper level reason is that LinearRegression with "normal" solver can not 
> solve this dataset correctly, may be due to the ill condition and 
> unreasonable label. This issue has been reported at SPARK-11918.
> So we should make this change in examples and user guides, that can clearly 
> illustrate the usage of LinearRegression algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11920:


Assignee: (was: Apache Spark)

> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
> examples and user guide doc, but it's actually classification dataset rather 
> than regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper causes is that LinearRegression with "normal" solver can not solve 
> this dataset correctly, may be due to the ill condition and unreasonable 
> label. This issue has been reported at SPARK-11918.
> So we should make this change in examples and user guides, that can clearly 
> illustrate the usage of LinearRegression algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Assigned] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11920:


Assignee: Apache Spark

> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Assignee: Apache Spark
>Priority: Minor
>
> ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
> examples and user guide doc, but it's actually classification dataset rather 
> than regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper causes is that LinearRegression with "normal" solver can not solve 
> this dataset correctly, may be due to the ill condition and unreasonable 
> label. This issue has been reported at SPARK-11918.
> So we should make this change in examples and user guides, that can clearly 
> illustrate the usage of LinearRegression algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread benedict jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021807#comment-15021807
 ] 

benedict jin commented on SPARK-11919:
--

Thanks a lot, my bad. Please help me get this JIRA closed.

> graphx should be supported with java
> 
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples, GraphX, Java API
>Reporter: benedict jin
>
> Please make the graphx component to be supported with java, hope appear demo 
> and java api for graphx as soon as possible. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Apache Spark (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021806#comment-15021806
 ] 

Apache Spark commented on SPARK-11920:
--

User 'yanboliang' has created a pull request for this issue:
https://github.com/apache/spark/pull/9905

> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
> examples and user guide doc, but it's actually classification dataset rather 
> than regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper causes is that LinearRegression with "normal" solver can not solve 
> this dataset correctly, may be due to the ill condition and unreasonable 
> label. This issue has been reported at SPARK-11918.
> So we should make this change in examples and user guides, that can clearly 
> illustrate the usage of LinearRegression algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11920) ML LinearRegression should use correct dataset in examples and user guide doc

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11920?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11920:

Description: 
ML LinearRegression uses data/mllib/sample_libsvm_data.txt as the dataset in the 
examples and the user guide doc, but it is actually a classification dataset rather 
than a regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
The deeper cause is that LinearRegression with the "normal" solver cannot solve 
this dataset correctly, possibly because of the ill-conditioning and unreasonable 
labels. This issue has been reported as SPARK-11918.
It will confuse users if they run the example code but get an exception, so we 
should make this change, which can clearly illustrate the usage of the 
LinearRegression algorithm.

  was:
ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
examples and user guide doc, but it's actually classification dataset rather 
than regression dataset. We should use 
data/mllib/sample_linear_regression_data.txt instead.
The deeper causes is that LinearRegression with "normal" solver can not solve 
this dataset correctly, may be due to the ill condition and unreasonable label. 
This issue has been reported at SPARK-11918.
So we should make this change in examples and user guides, that can clearly 
illustrate the usage of LinearRegression algorithm.


> ML LinearRegression should use correct dataset in examples and user guide doc
> -
>
> Key: SPARK-11920
> URL: https://issues.apache.org/jira/browse/SPARK-11920
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML
>Reporter: Yanbo Liang
>Priority: Minor
>
> ML LinearRegression use data/mllib/sample_libsvm_data.txt as dataset in 
> examples and user guide doc, but it's actually classification dataset rather 
> than regression dataset. We should use 
> data/mllib/sample_linear_regression_data.txt instead.
> The deeper causes is that LinearRegression with "normal" solver can not solve 
> this dataset correctly, may be due to the ill condition and unreasonable 
> label. This issue has been reported at SPARK-11918.
> It will confuse users if they run the example code but get exception, so we 
> should make this change which can clearly illustrate the usage of 
> LinearRegression algorithm.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread benedict jin (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021807#comment-15021807
 ] 

benedict jin edited comment on SPARK-11919 at 11/23/15 9:16 AM:


Thanks a lot, my bad. This JIRA will be closed right now.


was (Author: benedict jin):
Thanks a lot, my bad. Please help me let this jira to be closed.

> graphx should be supported with java
> 
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples, GraphX, Java API
>Reporter: benedict jin
>
> Please make the graphx component to be supported with java, hope appear demo 
> and java api for graphx as soon as possible. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread benedict jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

benedict jin closed SPARK-11919.


Closing...

> graphx should be supported with java
> 
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples, GraphX, Java API
>Reporter: benedict jin
>
> Please make the graphx component to be supported with java, hope appear demo 
> and java api for graphx as soon as possible. :-)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021830#comment-15021830
 ] 

Yanbo Liang commented on SPARK-11918:
-

[~sowen] Thanks for your comments. I think you have got part of my proposal at 
https://github.com/apache/spark/pull/9905. I also wonder whether we can give a 
better hint to users who run into the same condition.

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed (But "l-bfgs" can train and get the 
> model). The failure is caused by the underneath lapack library return error 
> value when Cholesky decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11921) fix `nullable` of encoder schema

2015-11-23 Thread Wenchen Fan (JIRA)
Wenchen Fan created SPARK-11921:
---

 Summary: fix `nullable` of encoder schema
 Key: SPARK-11921
 URL: https://issues.apache.org/jira/browse/SPARK-11921
 Project: Spark
  Issue Type: Sub-task
  Components: SQL
Reporter: Wenchen Fan






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11326) Support for authentication and encryption in standalone mode

2015-11-23 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021848#comment-15021848
 ] 

Jacek Lewandowski edited comment on SPARK-11326 at 11/23/15 9:41 AM:
-

[~pwendell] - are you (DB) interested in reviewing this patch at all?


was (Author: jlewandowski):
[~pwendell] - are you interested in reviewing this patch at all?

> Support for authentication and encryption in standalone mode
> 
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication and scheduler 
> communication.
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing SASL authentication/encryption for connections between a client 
> (Client or AppClient) and Spark Master, it becomes possible introducing 
> pluggable authentication for standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * Spark driver or submission client do not have to use the same secret as 
> workers use to communicate with Master
> * Master is able to authenticate individual clients with the following rules:
> ** When connecting to the master, the client needs to specify 
> {{spark.authenticate.secret}} which is an authentication token for the user 
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** Master configuration may include additional 
> {{spark.authenticate.secrets.}} entries for specifying 
> authentication token for particular users or 
> {{spark.authenticate.authenticatorClass}} which specify an implementation of 
> external credentials provider (which is able to retrieve the authentication 
> token for a given user).
> ** Workers authenticate with Master as default user {{sparkSaslUser}}. 
> * The authorization rules are as follows:
> ** A regular user is able to manage only his own application (the application 
> which he submitted)
> ** A regular user is not able to register or manager workers
> ** Spark default user {{sparkSaslUser}} can manage all the applications
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations env variable will overwrite conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to 
> do this through env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - 

[jira] [Commented] (SPARK-11326) Support for authentication and encryption in standalone mode

2015-11-23 Thread Jacek Lewandowski (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11326?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021848#comment-15021848
 ] 

Jacek Lewandowski commented on SPARK-11326:
---

[~pwendell] - are you interested in reviewing this patch at all?

> Support for authentication and encryption in standalone mode
> 
>
> Key: SPARK-11326
> URL: https://issues.apache.org/jira/browse/SPARK-11326
> Project: Spark
>  Issue Type: Improvement
>  Components: Spark Core
>Reporter: Jacek Lewandowski
>
> h3.The idea
> Currently, in standalone mode, all components, for all network connections 
> need to use the same secure token if they want to have any security ensured. 
> This ticket is intended to split the communication in standalone mode to make 
> it more like in Yarn mode - application internal communication and scheduler 
> communication.
> Such refactoring will allow for the scheduler (master, workers) to use a 
> distinct secret, which will remain unknown for the users. Similarly, it will 
> allow for better security in applications, because each application will be 
> able to use a distinct secret as well. 
> By providing SASL authentication/encryption for connections between a client 
> (Client or AppClient) and Spark Master, it becomes possible introducing 
> pluggable authentication for standalone deployment mode.
> h3.Improvements introduced by this patch
> This patch introduces the following changes:
> * Spark driver or submission client do not have to use the same secret as 
> workers use to communicate with Master
> * Master is able to authenticate individual clients with the following rules:
> ** When connecting to the master, the client needs to specify 
> {{spark.authenticate.secret}} which is an authentication token for the user 
> specified by {{spark.authenticate.user}} ({{sparkSaslUser}} by default)
> ** Master configuration may include additional 
> {{spark.authenticate.secrets.}} entries for specifying 
> authentication token for particular users or 
> {{spark.authenticate.authenticatorClass}} which specify an implementation of 
> external credentials provider (which is able to retrieve the authentication 
> token for a given user).
> ** Workers authenticate with Master as default user {{sparkSaslUser}}. 
> * The authorization rules are as follows:
> ** A regular user is able to manage only his own application (the application 
> which he submitted)
> ** A regular user is not able to register or manager workers
> ** Spark default user {{sparkSaslUser}} can manage all the applications
> h3.User facing changes when running application
> h4.General principles:
> - conf: {{spark.authenticate.secret}} is *never sent* over the wire
> - env: {{SPARK_AUTH_SECRET}} is *never sent* over the wire
> - In all situations env variable will overwrite conf variable if present. 
> - In all situations when a user has to pass a secret, it is better (safer) to 
> do this through env variable
> - In work modes with multiple secrets we assume encrypted communication 
> between client and master, between driver and master, between master and 
> workers
> 
> h4.Work modes and descriptions
> h5.Client mode, single secret
> h6.Configuration
> - env: {{SPARK_AUTH_SECRET=secret}} or conf: 
> {{spark.authenticate.secret=secret}}
> h6.Description
> - The driver is running locally
> - The driver will neither send env: {{SPARK_AUTH_SECRET}} nor conf: 
> {{spark.authenticate.secret}}
> - The driver will use either env: {{SPARK_AUTH_SECRET}} or conf: 
> {{spark.authenticate.secret}} for connection to the master
> - _ExecutorRunner_ will not find any secret in _ApplicationDescription_ so it 
> will look for it in the worker configuration and it will find it there (its 
> presence is implied). 
> 
> h5.Client mode, multiple secrets
> h6.Configuration
> - env: {{SPARK_APP_AUTH_SECRET=app_secret}} or conf: 
> {{spark.app.authenticate.secret=secret}}
> - env: {{SPARK_SUBMISSION_AUTH_SECRET=scheduler_secret}} or conf: 
> {{spark.submission.authenticate.secret=scheduler_secret}}
> h6.Description
> - The driver is running locally
> - The driver will use either env: {{SPARK_SUBMISSION_AUTH_SECRET}} or conf: 
> {{spark.submission.authenticate.secret}} to connect to the master
> - The driver will neither send env: {{SPARK_SUBMISSION_AUTH_SECRET}} nor 
> conf: {{spark.submission.authenticate.secret}}
> - The driver will use either {{SPARK_APP_AUTH_SECRET}} or conf: 
> {{spark.app.authenticate.secret}} for communication with the executors
> - The driver will send {{spark.executorEnv.SPARK_AUTH_SECRET=app_secret}} so 
> that the executors can use it to communicate with the driver
> - _ExecutorRunner_ will find that secret in _ApplicationDescription_ and it 
> will set it in env: {{SPARK_AUTH_SECRET}} which will be read by 
> 
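
To make the configuration keys above concrete, a sketch of the "client mode, 
multiple secrets" setup (the {{spark.app.authenticate.secret}} and 
{{spark.submission.authenticate.secret}} keys come from this proposal and are not 
part of the released API; the master URL and secrets are placeholders):
{code}
import org.apache.spark.SparkConf

// Placeholder values; the spark.app.* / spark.submission.* keys are the ones
// proposed by this patch, not existing Spark configuration.
val conf = new SparkConf()
  .setMaster("spark://master:7077")
  .setAppName("secured-app")
  .set("spark.authenticate", "true")
  .set("spark.app.authenticate.secret", "app_secret")               // driver <-> executors
  .set("spark.submission.authenticate.secret", "scheduler_secret")  // driver/client <-> master
{code}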

[jira] [Assigned] (SPARK-11520) RegressionMetrics should support instance weights

2015-11-23 Thread Apache Spark (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11520?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Apache Spark reassigned SPARK-11520:


Assignee: (was: Apache Spark)

> RegressionMetrics should support instance weights
> -
>
> Key: SPARK-11520
> URL: https://issues.apache.org/jira/browse/SPARK-11520
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: Joseph K. Bradley
>
> This will be important to improve LinearRegressionSummary, which currently 
> has a mix of weighted and unweighted metrics.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11604) ML 1.6 QA: API: Python API coverage

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11604:

Description: 
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:

* Docs:
** ml.classification SPARK-11875

* Missing classes/methods/parameters
** ml.classification SPARK-11815 SPARK-11820
** ml.feature SPARK-11922

  was:
For new public APIs added to MLlib, we need to check the generated HTML doc and 
compare the Scala & Python versions.  We need to track:
* Inconsistency: Do class/method/parameter names match?
* Docs: Is the Python doc missing or just a stub?  We want the Python doc to be 
as complete as the Scala doc.
* API breaking changes: These should be very rare but are occasionally either 
necessary (intentional) or accidental.  These must be recorded and added in the 
Migration Guide for this release.
** Note: If the API change is for an Alpha/Experimental/DeveloperApi component, 
please note that as well.
* Missing classes/methods/parameters: We should create to-do JIRAs for 
functionality missing from Python, to be added in the next release cycle.  
Please use a *separate* JIRA (linked below) for this list of to-do items.

List the found issues:
* Inconsistency:
** ml.classification SPARK-11815 SPARK-11820

* Docs:
** ml.classification SPARK-11875


> ML 1.6 QA: API: Python API coverage
> ---
>
> Key: SPARK-11604
> URL: https://issues.apache.org/jira/browse/SPARK-11604
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib, PySpark
>Reporter: Joseph K. Bradley
>Assignee: Yanbo Liang
>
> For new public APIs added to MLlib, we need to check the generated HTML doc 
> and compare the Scala & Python versions.  We need to track:
> * Inconsistency: Do class/method/parameter names match?
> * Docs: Is the Python doc missing or just a stub?  We want the Python doc to 
> be as complete as the Scala doc.
> * API breaking changes: These should be very rare but are occasionally either 
> necessary (intentional) or accidental.  These must be recorded and added in 
> the Migration Guide for this release.
> ** Note: If the API change is for an Alpha/Experimental/DeveloperApi 
> component, please note that as well.
> * Missing classes/methods/parameters: We should create to-do JIRAs for 
> functionality missing from Python, to be added in the next release cycle.  
> Please use a *separate* JIRA (linked below) for this list of to-do items.
> List the found issues:
> * Inconsistency:
> * Docs:
> ** ml.classification SPARK-11875
> * Missing classes/methods/parameters
> ** ml.classification SPARK-11815 SPARK-11820
> ** ml.feature SPARK-11922



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021720#comment-15021720
 ] 

Yanbo Liang commented on SPARK-11918:
-

I used the same dataset 
(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt)
 to train a LinearRegressionModel with R:::glm; it did not throw an exception, but 
the result is not trustworthy. The coefficients of the model contain too many 
NA and NaN values, which is not reasonable. Please see the attached file for the 
R:::glm output.

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed. The failure is caused by the underneath 
> Cholesky Decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Attachment: R_GLM_output

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed. The failure is caused by the underneath 
> Cholesky Decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729
 ] 

Yanbo Liang edited comment on SPARK-11918 at 11/23/15 8:31 AM:
---

Furthermore, I used the breeze library to train the model with a local 
normal-equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import breeze.linalg.DenseMatrix
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, 
"/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()


val features = df.select(col("features")).map { r =>
  r.getAs[Vector](0)
}.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r =>
  r.getDouble(0)
}.collect()

val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t

val y = new DenseMatrix[Double](100, 1, labelArray)

val XtXi = inv(Xt * X)
val XtY = Xt * y

val coefs = XtXi * XtY

println(coefs.toString)
{code}
It also throws an exception:
{code}
breeze.linalg.MatrixSingularException: 
at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.inv$.apply(inv.scala:17)
{code}
breeze.linalg.inv also calls the netlib LAPACK package, which is the same 
library Spark uses. Tracing the breeze code, we can see this exception is thrown 
here 
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33)
 and it is also caused by the underlying LAPACK error. 
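
A quick way to confirm the rank deficiency behind both failures (a sketch reusing 
the same {{Xt}} and {{X}} as above) is to inspect the singular values of the Gram 
matrix:
{code}
import breeze.linalg.svd

// Xt is 692 x 100, so Xt * X is 692 x 692 with rank at most 100.
val gram = Xt * X
val svd.SVD(_, s, _) = svd(gram)  // singular values, largest first
val tol = s(0) * 1e-12
println(s"numerical rank = ${s.toArray.count(_ > tol)} of ${gram.cols}")
{code}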


was (Author: yanboliang):
Further more, I use the breeze library to train the model by local normal 
equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import breeze.linalg.DenseMatrix
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, 
"/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()


val features = df.select(col("features")).map { r =>
  r.getAs[Vector](0)
}.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r =>
  r.getDouble(0)
}.collect()

val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t

val y = new DenseMatrix[Double](100, 1, labelArray)

val XtXi = inv(Xt * X)
val XtY = Xt * y

val coefs = XtXi * XtY

println(coefs.toString)
{code}
It also throw exception

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed. The failure is caused by the underneath 
> Cholesky Decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: 
Weighted Least Squares (WLS) is one of the optimization methods for solving Linear 
Regression (when #features < 4096). But if the dataset is very ill-conditioned 
(for example, a 0/1 classification label and an underdetermined equation), WLS 
fails. The failure is caused by the underlying Cholesky decomposition.
This issue is easy to reproduce: train a LinearRegressionModel with the 
"normal" solver on the example dataset 
(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}
It is caused by the underlying LAPACK library returning an error value.

  was:
Weighted Least Squares (WLS) is one of the optimization method for solve Linear 
Regression (when #feature < 4096). But if the dataset is very ill condition 
(such as 0-1 based label used for classification and the equation is 
underdetermined), the WLS failed. The failure is caused by the underneath 
Cholesky Decomposition.
This issue is easy to reproduce, you can train a LinearRegressionModel by 
"normal" solver with the example 
dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}


> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization method for solve 
> Linear Regression (when #feature < 4096). But if the dataset is very ill 
> condition (such as 0-1 based label used for classification and the equation 
> is underdetermined), the WLS failed. The failure is caused by the underneath 
> Cholesky Decomposition.
> This issue is easy to reproduce, you can train a LinearRegressionModel by 
> "normal" solver with the example 
> dataset(https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}
> It is caused by the underlying LAPACK library returning an error value. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11919) graphx should be supported with java

2015-11-23 Thread benedict jin (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11919?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

benedict jin updated SPARK-11919:
-
Description: Please add Java support for the graphx component; hoping a Java 
API and demo for graphx appear as soon as possible. :-)  (was: Please 
make the graphx component to be supported with java, hope appear demo and java 
api for graphx as soon as possible :-))

> graphx should be supported with java
> 
>
> Key: SPARK-11919
> URL: https://issues.apache.org/jira/browse/SPARK-11919
> Project: Spark
>  Issue Type: New Feature
>  Components: Examples, GraphX, Java API
>Reporter: benedict jin
>
> Please add Java support for the graphx component; hoping a Java API and demo 
> for graphx appear as soon as possible. :-)
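For context, a minimal sketch of the current Scala-only entry points that a 
Java-friendly GraphX API (in the spirit of JavaRDD / JavaPairRDD) would need to 
wrap; the graph contents below are illustrative only:
{code}
import org.apache.spark.graphx.{Edge, Graph}

// A tiny graph built with the Scala API; calling this from Java today means
// working with Scala RDD[(Long, V)], Tuple2 and ClassTag evidence directly.
val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob")))
val edges = sc.parallelize(Seq(Edge(1L, 2L, "follows")))
val graph = Graph(vertices, edges)

println(graph.numVertices)  // 2
{code}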



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021729#comment-15021729
 ] 

Yanbo Liang edited comment on SPARK-11918 at 11/23/15 8:44 AM:
---

Furthermore, I used the breeze library to train the model with the local 
normal-equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.sql.functions.col
import breeze.linalg.DenseMatrix
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, "/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()

// Flatten the 100 sample rows (692 features each) into one column-major array.
val features = df.select(col("features")).map { r =>
  r.getAs[Vector](0)
}.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r =>
  r.getDouble(0)
}.collect()

// Each sample is one column of Xt (692 x 100), so X = Xt.t is 100 x 692.
val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t

val y = new DenseMatrix[Double](100, 1, labelArray)

// Normal equation: coefficients = (X^T X)^-1 (X^T y)
val XtXi = inv(Xt * X)
val XtY = Xt * y

val coefs = XtXi * XtY

println(coefs.toString)
{code}
It also throws an exception like:
{code}
breeze.linalg.MatrixSingularException: 
at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.inv$.apply(inv.scala:17)
{code}
breeze.linalg.inv also calls the netlib LAPACK library, which is the same one 
Spark uses. Tracking the breeze code, the exception is thrown here 
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33),
 again caused by the underlying LAPACK error. 
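This is expected for this data: with 100 samples and 692 features, Xt * X is 
692 x 692 but has rank at most 100, so it is singular and neither a plain 
inverse nor a Cholesky solve can succeed. As a sanity check, adding a small 
ridge term makes the local solve go through (sketch only; the lambda value is 
arbitrary):
{code}
// Ridge-regularized normal equation: (X^T X + lambda * I)^-1 (X^T y)
val lambda = 1e-3
val XtXreg = Xt * X + (DenseMatrix.eye[Double](692) * lambda)
val coefsReg = inv(XtXreg) * (Xt * y)
{code}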


was (Author: yanboliang):
Further more, I use the breeze library to train the model by local normal 
equation method.
{code}
import sqlCtx.implicits._
import org.apache.spark.mllib.linalg.Vector
import breeze.linalg.DenseMatrix
import breeze.linalg._

val df = MLUtils.loadLibSVMFile(sqlCtx.sparkContext, 
"/Users/yanboliang/data/trunk/spark/data/mllib/sample_libsvm_data.txt").toDF()


val features = df.select(col("features")).map { r =>
  r.getAs[Vector](0)
}.collect().flatMap { v => v.toArray }
val labelArray = df.select(col("label")).map { r =>
  r.getDouble(0)
}.collect()

val Xt = new DenseMatrix[Double](692, 100, features)
val X = Xt.t

val y = new DenseMatrix[Double](100, 1, labelArray)

val XtXi = inv(Xt * X)
val XtY = Xt * y

val coefs = XtXi * XtY

println(coefs.toString)
{code}
It also throw exception like:
{code}
breeze.linalg.MatrixSingularException: 
at breeze.linalg.inv$$anon$1.apply(inv.scala:36)
at breeze.linalg.inv$$anon$1.apply(inv.scala:19)
at breeze.generic.UFunc$class.apply(UFunc.scala:48)
at breeze.linalg.inv$.apply(inv.scala:17)
{code}
The breeze.linalg.inv is also call netlib LAPACK package which is the same 
library as Spark. Tracking the breeze code, we can get this exception is thrown 
at here 
(https://github.com/scalanlp/breeze/blob/master/math/src/main/scala/breeze/linalg/functions/inv.scala#L33)
 which is also caused by the underneath lapack error. 

> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods used to solve 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0/1 label intended for classification, so the system 
> is underdetermined), WLS fails (while "l-bfgs" can still train and produce a 
> model). The failure is caused by the underlying LAPACK library returning an 
> error value during the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the 
> "normal" solver on the example 
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (SPARK-11922) Python API for ml.feature.QuantileDiscretizer

2015-11-23 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11922:
---

 Summary: Python API for ml.feature.QuantileDiscretizer
 Key: SPARK-11922
 URL: https://issues.apache.org/jira/browse/SPARK-11922
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


Add Python API for ml.feature.QuantileDiscretizer.
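For reference, a rough sketch of the existing Scala usage that the Python 
wrapper would mirror (a spark-shell session with sqlContext is assumed; column 
names and values are illustrative):
{code}
import org.apache.spark.ml.feature.QuantileDiscretizer

// A tiny DataFrame with a numeric column to bucketize.
val df = sqlContext.createDataFrame(
  Seq((0, 18.0), (1, 19.0), (2, 8.0), (3, 5.0), (4, 2.2))).toDF("id", "hour")

val discretizer = new QuantileDiscretizer()
  .setInputCol("hour")
  .setOutputCol("hourBucket")
  .setNumBuckets(3)

val bucketized = discretizer.fit(df).transform(df)  // fit returns a Bucketizer
{code}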



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-11923) Python API for ml.feature.ChiSqSelector

2015-11-23 Thread Yanbo Liang (JIRA)
Yanbo Liang created SPARK-11923:
---

 Summary: Python API for ml.feature.ChiSqSelector
 Key: SPARK-11923
 URL: https://issues.apache.org/jira/browse/SPARK-11923
 Project: Spark
  Issue Type: Improvement
  Components: ML, PySpark
Reporter: Yanbo Liang
Priority: Minor


Add Python API for ml.feature.ChiSqSelector.
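For reference, a rough sketch of the existing Scala usage that the Python 
wrapper would follow (assumes an existing DataFrame df with "features" and 
"label" columns; the parameter value is illustrative):
{code}
import org.apache.spark.ml.feature.ChiSqSelector

// Keep the top 50 features most predictive of the label (chi-squared test).
val selector = new ChiSqSelector()
  .setNumTopFeatures(50)
  .setFeaturesCol("features")
  .setLabelCol("label")
  .setOutputCol("selectedFeatures")

val selected = selector.fit(df).transform(df)
{code}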



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11916) Expression TRIM/LTRIM/RTRIM to support specific trim word

2015-11-23 Thread Sean Owen (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11916?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sean Owen updated SPARK-11916:
--
Component/s: SQL

> Expression TRIM/LTRIM/RTRIM to support specific trim word
> -
>
> Key: SPARK-11916
> URL: https://issues.apache.org/jira/browse/SPARK-11916
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Adrian Wang
>Priority: Minor
>
> supports expressions like `trim('xxxabcxxx', 'x')`
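A sketch of the intended semantics (illustration only, in plain Scala rather 
than the actual expression code; assumes the second argument names the 
character to strip):
{code}
// Illustrative semantics for trim/ltrim/rtrim with an explicit trim character.
def ltrim(s: String, c: Char): String = s.dropWhile(_ == c)
def rtrim(s: String, c: Char): String = s.reverse.dropWhile(_ == c).reverse
def trim(s: String, c: Char): String = rtrim(ltrim(s, c), c)

trim("xxxabcxxx", 'x')   // "abc"
ltrim("xxxabcxxx", 'x')  // "abcxxx"
rtrim("xxxabcxxx", 'x')  // "xxxabc"
{code}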



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-11918) WLS can not resolve some kinds of equation

2015-11-23 Thread Yanbo Liang (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanbo Liang updated SPARK-11918:

Description: 
Weighted Least Squares (WLS) is one of the optimization methods used to solve Linear 
Regression (when #features < 4096). But if the dataset is very ill-conditioned 
(such as a 0/1 label intended for classification, so the system is 
underdetermined), WLS fails (while the "l-bfgs" solver can still train and 
produce a model). The failure is caused by the underlying LAPACK library 
returning an error value during the Cholesky decomposition.
This issue is easy to reproduce: train a LinearRegressionModel with the 
"normal" solver on the example 
dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}

  was:
Weighted Least Squares (WLS) is one of the optimization methods used to solve Linear 
Regression (when #features < 4096). But if the dataset is very ill-conditioned 
(such as a 0/1 label intended for classification, so the system is 
underdetermined), WLS fails. The failure is caused by the underlying 
LAPACK library returning an error value during the Cholesky decomposition.
This issue is easy to reproduce: train a LinearRegressionModel with the 
"normal" solver on the example 
dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
 The following is the exception:
{code}
assertion failed: lapack.dpotrs returned 1.
java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
at scala.Predef$.assert(Predef.scala:179)
at 
org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
at 
org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
at 
org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
{code}


> WLS can not resolve some kinds of equation
> --
>
> Key: SPARK-11918
> URL: https://issues.apache.org/jira/browse/SPARK-11918
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Reporter: Yanbo Liang
> Attachments: R_GLM_output
>
>
> Weighted Least Squares (WLS) is one of the optimization methods used to solve 
> Linear Regression (when #features < 4096). But if the dataset is very 
> ill-conditioned (such as a 0/1 label intended for classification, so the system 
> is underdetermined), WLS fails (while the "l-bfgs" solver can still train and 
> produce a model). The failure is caused by the underlying LAPACK library 
> returning an error value during the Cholesky decomposition.
> This issue is easy to reproduce: train a LinearRegressionModel with the 
> "normal" solver on the example 
> dataset (https://github.com/apache/spark/blob/master/data/mllib/sample_libsvm_data.txt).
>  The following is the exception:
> {code}
> assertion failed: lapack.dpotrs returned 1.
> java.lang.AssertionError: assertion failed: lapack.dpotrs returned 1.
>   at scala.Predef$.assert(Predef.scala:179)
>   at 
> org.apache.spark.mllib.linalg.CholeskyDecomposition$.solve(CholeskyDecomposition.scala:42)
>   at 
> org.apache.spark.ml.optim.WeightedLeastSquares.fit(WeightedLeastSquares.scala:117)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:180)
>   at 
> org.apache.spark.ml.regression.LinearRegression.train(LinearRegression.scala:67)
>   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11782) Master Web UI should link to correct Application UI in cluster mode

2015-11-23 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11782?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15021793#comment-15021793
 ] 

Sean Owen commented on SPARK-11782:
---

Oh right I read right past that. I think that's the difference with what 
[~jbonofre] sees.

> Master Web UI should link to correct Application UI in cluster mode
> ---
>
> Key: SPARK-11782
> URL: https://issues.apache.org/jira/browse/SPARK-11782
> Project: Spark
>  Issue Type: Bug
>  Components: Web UI
>Affects Versions: 1.4.1
>Reporter: Matthias Niehoff
>
> - Running a standalone cluster, with node1 as master
> - Submit an application to the cluster with deploy-mode=cluster
> - The application driver ends up on a node other than node1 (e.g. node3)
> => the master WebUI links to node1:4040 for the Application Detail UI instead 
> of node3:4040
> Since the master knows which worker the driver is running on, it should be 
> possible to show the correct link to the Application Detail UI.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org


