[jira] [Created] (SPARK-25124) VectorSizeHint.size is buggy, breaking streaming pipeline

2018-08-15 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-25124:
--

 Summary: VectorSizeHint.size is buggy, breaking streaming pipeline
 Key: SPARK-25124
 URL: https://issues.apache.org/jira/browse/SPARK-25124
 Project: Spark
  Issue Type: Bug
  Components: ML
Affects Versions: 2.3.1
Reporter: Timothy Hunter


Currently, when using {{VectorSizeHint().setSize(3)}} in an ML pipeline, 
transforming a stream fails with a nondescript exception about the stream not 
having been started. At the core are the following bugs: {{setSize}} and 
{{getSize}} are missing {{return}} statements, so they return {{None}}:

https://github.com/apache/spark/blob/master/python/pyspark/ml/feature.py#L3846

How to reproduce, using the example in the doc:

{code}
from pyspark.ml.linalg import Vectors
from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler, VectorSizeHint
data = [(Vectors.dense([1., 2., 3.]), 4.)]
df = spark.createDataFrame(data, ["vector", "float"])
sizeHint = VectorSizeHint(inputCol="vector", handleInvalid="skip").setSize(3)  # Will fail
vecAssembler = VectorAssembler(inputCols=["vector", "float"], outputCol="assembled")
pipeline = Pipeline(stages=[sizeHint, vecAssembler])
pipelineModel = pipeline.fit(df)
pipelineModel.transform(df).head().assembled
{code}
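
For reference, here is a minimal sketch of what the corrected methods presumably 
look like, assuming the usual PySpark {{Params}} helpers ({{_set}} and 
{{getOrDefault}}); it is illustrative rather than the actual patch, but it can 
also serve as a temporary monkey-patch workaround:

{code}
from pyspark.ml.feature import VectorSizeHint

# Hedged sketch of the fix (illustrative, not the actual patch): both methods
# need to return a value instead of implicitly returning None.
def setSize(self, value):
    """Sets the size param (the expected length of vectors in inputCol)."""
    self._set(size=value)
    return self  # returning self allows chaining: VectorSizeHint(...).setSize(3)

def getSize(self):
    """Gets the value of the size param."""
    return self.getOrDefault(self.size)

# Until the fix lands, the same code also works as a temporary monkey-patch:
VectorSizeHint.setSize = setSize
VectorSizeHint.getSize = getSize
{code}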



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams

2018-04-23 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-23996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16447781#comment-16447781
 ] 

Timothy Hunter commented on SPARK-23996:


[~wm624] yes this is the implementation:

[https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala]

you can see the test suite here:

[https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala]

The current implementation focuses on doubles, but I do not see much of an 
issue in switching to floats. The main entry points are fairly similar:

[https://github.com/DataSketches/sketches-core/blob/master/src/main/java/com/yahoo/sketches/kll/KllFloatsSketch.java#L299]
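
For context, the user-facing entry point backed by QuantileSummaries is 
{{approxQuantile}}; a minimal PySpark example (the column, probabilities and 
error tolerance below are just illustrative):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(0, 1000)

# DataFrame.approxQuantile is backed by QuantileSummaries; the last argument
# is the target relative error of the approximation.
print(df.approxQuantile("id", [0.25, 0.5, 0.75], 0.01))
{code}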

 

> Implement the optimal KLL algorithms for quantiles in streams
> -
>
> Key: SPARK-23996
> URL: https://issues.apache.org/jira/browse/SPARK-23996
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, SQL
>Affects Versions: 2.3.0
>Reporter: Timothy Hunter
>Priority: Major
>
> The current implementation for approximate quantiles - a variant of 
> Greenwald-Khanna, which I implemented - is not the best in light of recent 
> papers:
>  - it is not exactly the one from the paper, for performance reasons, but the 
> changes are not documented beyond comments in the code
>  - there are now more optimal algorithms with proven bounds (unlike q-digest, 
> the other contender at the time)
> I propose that we revisit the current implementation and look at the 
> Karnin-Lang-Liberty (KLL) algorithm, for example:
> [https://arxiv.org/abs/1603.05346]
> [https://edoliberty.github.io//papers/streamingQuantiles.pdf]
> This algorithm seems to have favorable characteristics for streaming and a 
> distributed implementation, and there is a Python implementation for 
> reference.
> It is a fairly standalone piece, and in that respect accessible to people who 
> don't know too much about Spark internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-23996) Implement the optimal KLL algorithms for quantiles in streams

2018-04-16 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-23996:
--

 Summary: Implement the optimal KLL algorithms for quantiles in 
streams
 Key: SPARK-23996
 URL: https://issues.apache.org/jira/browse/SPARK-23996
 Project: Spark
  Issue Type: Improvement
  Components: MLlib, SQL
Affects Versions: 2.3.0
Reporter: Timothy Hunter


The current implementation for approximate quantiles - a variant of 
Greenwald-Khanna, which I implemented - is not the best in light of recent 
papers:

 - it is not exactly the one from the paper, for performance reasons, but the 
changes are not documented beyond comments in the code

 - there are now more optimal algorithms with proven bounds (unlike q-digest, 
the other contender at the time)

I propose that we revisit the current implementation and look at the 
Karnin-Lang-Liberty (KLL) algorithm, for example:
[https://arxiv.org/abs/1603.05346]

[https://edoliberty.github.io//papers/streamingQuantiles.pdf]

This algorithm seems to have favorable characteristics for streaming and a 
distributed implementation, and there is a Python implementation for reference.

It is a fairly standalone piece, and in that respect accessible to people who 
don't know too much about Spark internals.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-30 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16273575#comment-16273575
 ] 

Timothy Hunter commented on SPARK-21866:


[~josephkb] I have created a separate ticket, SPARK-22666, to continue progress 
on the reader interface.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Assignee: Ilya Matiach
>  Labels: SPIP
> Fix For: 2.3.0
>
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 

[jira] [Created] (SPARK-22666) Spark reader source for image format

2017-11-30 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-22666:
--

 Summary: Spark reader source for image format
 Key: SPARK-22666
 URL: https://issues.apache.org/jira/browse/SPARK-22666
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.3.0
Reporter: Timothy Hunter


The current API for the new image format is implemented as a standalone 
feature, so that it resides within the mllib package. As discussed in 
SPARK-21866, users should be able to load images through the more common Spark 
data source reader interface.

This ticket is concerned with adding image reading support to the Spark data 
source API, through either of the following interfaces:
 - {{spark.read.format("image")...}}
 - {{spark.read.image}}
The output is a DataFrame that contains images (and, for example, the file 
names), following the semantics already discussed in SPARK-21866.
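
A hedged sketch of how the first interface could look to users, assuming the 
output column is named {{image}} and follows the schema proposed in SPARK-21866 
(the path and field names below are illustrative):

{code}
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch of the proposed reader interface; the path is hypothetical and
# the nested field names ("origin", "mode") follow the schema proposed in SPARK-21866.
df = spark.read.format("image").load("/data/images/")
df.select("image.origin", "image.mode").show()
{code}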

A few technical notes:
* since the functionality is implemented in {{mllib}}, calling this function 
may fail at runtime if users have not added the {{spark-mllib}} dependency
* how should very flat directories be handled? It is common to have millions of 
files in a single "directory" (as in S3), which seems to have caused issues 
for some users. If this problem is too complex to handle in this ticket, it 
can be dealt with separately.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16248628#comment-16248628
 ] 

Timothy Hunter commented on SPARK-21866:


[~josephkb] if I am not mistaken, the image code is implemented in the 
{{mllib}} package, which depends on {{sql}}. Meanwhile, the data source API is 
implemented in {{sql}}, and if we want it to have an image-specific source, 
like the ones for csv or json, {{sql}} would need to depend on {{mllib}}. This 
dependency should not happen, first because it introduces a circular dependency 
(causing compile-time issues), and second because {{sql}} (one of the core 
modules) should not depend on {{mllib}}, which is large and not related to SQL.

[~rxin] suggested that we add a runtime dependency using reflection instead, 
and I am keen on making that change in a second pull request. What are your 
thoughts?

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * 

[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-11-03 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16237731#comment-16237731
 ] 

Timothy Hunter commented on SPARK-21866:


Adding {{spark.read.image}} is going to create a (soft) dependency between the 
core and {{mllib}}, which hosts the implementation of the current reader methods. 
This is fine and can be dealt with using reflection, but since this would involve 
adding a core API to Spark, I suggest we do it as a follow-up task.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>Priority: Major
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified 

[jira] [Commented] (SPARK-8515) Improve ML attribute API

2017-10-16 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8515?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16206111#comment-16206111
 ] 

Timothy Hunter commented on SPARK-8515:
---

Before we commit to an implementation, we should think about the goal of adding 
metadata in ML, because it comes with its own costs. For instance, there have 
been a number of bug reports around it; see for example SPARK-2008 and 
SPARK-14862.

I see a couple of use cases for metadata:
 - feature indexing: that case should require just longs (or strings) for 
each dimension of a feature vector
 - expressing categorical info: the Estimator -> Model -> Transformer pattern 
is more appropriate, I believe
 - vector dimensions: I think that in all cases, the underlying code should 
be able to proceed without this information, although this is debatable
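
For context, a minimal PySpark sketch of where this attribute metadata surfaces 
today (the exact layout under the {{ml_attr}} key is an implementation detail 
and may vary across versions):

{code}
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("b",), ("a",)], ["label"])

# StringIndexer attaches nominal attribute info to the output column's schema metadata.
indexed = StringIndexer(inputCol="label", outputCol="labelIndex").fit(df).transform(df)
print(indexed.schema["labelIndex"].metadata)  # e.g. {'ml_attr': {'type': 'nominal', ...}}
{code}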

> Improve ML attribute API
> 
>
> Key: SPARK-8515
> URL: https://issues.apache.org/jira/browse/SPARK-8515
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 1.4.0
>Reporter: Xiangrui Meng
>  Labels: advanced
> Attachments: SPARK-8515.pdf
>
>
> In 1.4.0, we introduced ML attribute API to embed feature/label attribute 
> info inside DataFrame's schema. However, the API is not very friendly to use. 
> We should re-visit this API and see how we can improve it.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16175158#comment-16175158
 ] 

Timothy Hunter commented on SPARK-21866:


Putting this code under {{org.apache.spark.ml.image}} sounds good to me. Based 
on the initial exploration, it should not be too hard to integrate this into the 
data source framework. I am going to submit this proposal to a vote on the dev 
mailing list.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is 

[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16154132#comment-16154132
 ] 

Timothy Hunter commented on SPARK-21866:


[~yanboliang] thank you for the comments. Regarding your questions:

1. making {{image}} part of {{ml}} or not: I do not have a strong preference, 
but I think that image support is more general than machine learning.

2. there is no obstacle, but that would create a dependency between the core 
({{spark.read}}) and an external module. This sort of dependency inversion is 
not great design, as any change in a sub-package would have API repercussions 
in the core of Spark. The SQL team is already struggling with such issues.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has 

[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-21866:
---
Attachment: (was: SPIP - Image support for Apache Spark.pdf)

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information 

[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-21866:
---
Attachment: SPIP - Image support for Apache Spark V1.1.pdf

Updated authors' list.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** 

[jira] [Commented] (SPARK-21184) QuantileSummaries implementation is wrong and QuantileSummariesSuite fails with larger n

2017-08-31 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149510#comment-16149510
 ] 

Timothy Hunter commented on SPARK-21184:


[~a1ray] thank you for the report; someone should investigate these reported 
values.

You raise some valid questions about the choice of data structures and 
algorithm, which were discussed during the implementation and that can 
certainly be revisited:

- tree structures: the major constraint here is that this structure gets 
serialized often, due to how UDAFs work. This is why the current implementation 
is amortized over multiple records. Edo Liberty has published some recent work 
that is relevant in that area.

- algorithm: we looked at t-digest (and q-digest). The main concern back then 
was that there was no published worst-case guarantee for a given target precision. 
This is still the case to my knowledge. Because of that, it is hard to 
understand what could happen in some unusual cases - which tend not to be so 
unusual in big data. That being said, t-digest looks like a popular and 
well-maintained choice now, so I am certainly open to relaxing this constraint.

> QuantileSummaries implementation is wrong and QuantileSummariesSuite fails 
> with larger n
> 
>
> Key: SPARK-21184
> URL: https://issues.apache.org/jira/browse/SPARK-21184
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.1.1
>Reporter: Andrew Ray
>
> 1. QuantileSummaries implementation does not match the paper it is supposed 
> to be based on.
> 1a. The compress method 
> (https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/QuantileSummaries.scala#L240)
>  merges neighboring buckets, but thats not what the paper says to do. The 
> paper 
> (http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf) 
> describes an implicit tree structure and the compress method deletes selected 
> subtrees.
> 1b. The paper does not discuss merging these summary data structures at all. 
> The following comment is in the merge method of QuantileSummaries:
> {quote}  // The GK algorithm is a bit unclear about it, but it seems 
> there is no need to adjust the
>   // statistics during the merging: the invariants are still respected 
> after the merge.{quote}
> Unless I'm missing something that needs substantiation, it's not clear that 
> that the invariants hold.
> 2. QuantileSummariesSuite fails with n = 1 (and other non trivial values)
> https://github.com/apache/spark/blob/master/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/QuantileSummariesSuite.scala#L27
> One possible solution if these issues can't be resolved would be to move to 
> an algorithm that explicitly supports merging and is well tested like 
> https://github.com/tdunning/t-digest



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21866) SPIP: Image support in Spark

2017-08-31 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16149186#comment-16149186
 ] 

Timothy Hunter commented on SPARK-21866:


[~srowen] thank you for the comments. Indeed, this proposal is limited in scope 
on purpose, because it aims at achieving consensus across multiple libraries. 
For instance, the MMLSpark project from Microsoft uses this data format to 
interface with OpenCV (wrapped through JNI), and the Deep Learning Pipelines 
package is going to rely on it as its primary mechanism to load and process 
images. Also, nothing precludes adding common transforms to this package later - 
it is easier to start small.

Regarding the Spark package, yes, it will be discontinued, like the CSV parser. 
The aim is to offer a working library that can be tried out without having to 
wait for an implementation to be merged into Spark itself.

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * 

[jira] [Updated] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-21866:
---
Attachment: SPIP - Image support for Apache Spark.pdf

> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
> Attachments: SPIP - Image support for Apache Spark.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to  communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries and has no 
> decoding overhead.
> h2. Target users and personas:
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing 
> images, and will gain from a common interchange format (in alphabetical 
> order):
> * BigDL
> * DeepLearning4J
> * Deep Learning Pipelines
> * MMLSpark
> * TensorFlow (Spark connector)
> * TensorFlowOnSpark
> * TensorFrames
> * Thunder
> h2. Goals:
> * Simple representation of images in Spark DataFrames, based on pre-existing 
> industrial standards (OpenCV)
> * This format should eventually allow the development of high-performance 
> integration points with image processing libraries such as libOpenCV, Google 
> TensorFlow, CNTK, and other C libraries.
> * The reader should be able to read popular formats of images from 
> distributed sources.
> h2. Non-Goals:
> Images are a versatile medium and encompass a very wide range of formats and 
> representations. This SPIP explicitly aims at the most common use case in the 
> industry currently: multi-channel matrices of binary, int32, int64, float or 
> double data that can fit comfortably in the heap of the JVM:
> * the total size of an image should be restricted to less than 2GB (roughly)
> * the meaning of color channels is application-specific and is not mandated 
> by the standard (in line with the OpenCV standard)
> * specialized formats used in meteorology, the medical field, etc. are not 
> supported
> * this format is specialized to images and does not attempt to solve the more 
> general problem of representing n-dimensional tensors in Spark
> h2. Proposed API changes
> We propose to add a new package in the package structure, under the MLlib 
> project:
> {{org.apache.spark.image}}
> h3. Data format
> We propose to add the following structure:
> imageSchema = StructType([
> * StructField("mode", StringType(), False),
> ** The exact representation of the data.
> ** The values are described in the following OpenCV convention. Basically, 
> the type has both "depth" and "number of channels" info: in particular, type 
> "CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 
> (value 32 in the table) with the channel order specified by convention.
> ** The exact channel ordering and meaning of each channel is dictated by 
> convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
> If the image failed to load, the value is the empty string "".
> * StructField("origin", StringType(), True),
> ** Some information about the origin of the image. The content of 

[jira] [Created] (SPARK-21866) SPIP: Image support in Spark

2017-08-29 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-21866:
--

 Summary: SPIP: Image support in Spark
 Key: SPARK-21866
 URL: https://issues.apache.org/jira/browse/SPARK-21866
 Project: Spark
  Issue Type: Improvement
  Components: ML
Affects Versions: 2.2.0
Reporter: Timothy Hunter


h2. Background and motivation
As Apache Spark is being used more and more in the industry, some new use cases 
are emerging for different data formats beyond the traditional SQL types or the 
numerical types (vectors and matrices). Deep Learning applications commonly 
deal with image processing. A number of projects add some Deep Learning 
capabilities to Spark (see list below), but they struggle to  communicate with 
each other or with MLlib pipelines because there is no standard way to 
represent an image in Spark DataFrames. We propose to federate efforts for 
representing images in Spark by defining a representation that caters to the 
most common needs of users and library developers.

This SPIP proposes a specification to represent images in Spark DataFrames and 
Datasets (based on existing industrial standards), and an interface for loading 
sources of images. It is not meant to be a full-fledged image processing 
library, but rather the core description that other libraries and users can 
rely on. Several packages already offer various processing facilities for 
transforming images or doing more complex operations, and each has various 
design tradeoffs that make them better as standalone solutions.

This project is a joint collaboration between Microsoft and Databricks, which 
have been testing this design in two open source packages: MMLSpark and Deep 
Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that 
targets low-level applications. It is significantly more liberal in memory 
usage than compressed image representations such as JPEG, PNG, etc., but it 
allows easy communication with popular image processing libraries and has no 
decoding overhead.

h2. Target users and personas:
Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing images, 
and will gain from a common interchange format (in alphabetical order):
* BigDL
* DeepLearning4J
* Deep Learning Pipelines
* MMLSpark
* TensorFlow (Spark connector)
* TensorFlowOnSpark
* TensorFrames
* Thunder

h2. Goals:
* Simple representation of images in Spark DataFrames, based on pre-existing 
industrial standards (OpenCV)
* This format should eventually allow the development of high-performance 
integration points with image processing libraries such as libOpenCV, Google 
TensorFlow, CNTK, and other C libraries.
* The reader should be able to read popular formats of images from distributed 
sources.

h2. Non-Goals:
Images are a versatile medium and encompass a very wide range of formats and 
representations. This SPIP explicitly aims at the most common use case in the 
industry currently: multi-channel matrices of binary, int32, int64, float or 
double data that can fit comfortably in the heap of the JVM:
* the total size of an image should be restricted to less than 2GB (roughly)
* the meaning of color channels is application-specific and is not mandated by 
the standard (in line with the OpenCV standard)
* specialized formats used in meteorology, the medical field, etc. are not 
supported
* this format is specialized to images and does not attempt to solve the more 
general problem of representing n-dimensional tensors in Spark

h2. Proposed API changes
We propose to add a new package in the package structure, under the MLlib 
project:
{{org.apache.spark.image}}

h3. Data format
We propose to add the following structure:

imageSchema = StructType([
* StructField("mode", StringType(), False),
** The exact representation of the data.
** The values are described in the following OpenCV convention. Basically, the 
type has both "depth" and "number of channels" info: in particular, type 
"CV_8UC3" means "3 channel unsigned bytes". BGRA format would be CV_8UC4 (value 
32 in the table) with the channel order specified by convention.
** The exact channel ordering and meaning of each channel is dictated by 
convention. By default, the order is RGB (3 channels) and BGRA (4 channels).
If the image failed to load, the value is the empty string "".

* StructField("origin", StringType(), True),
** Some information about the origin of the image. The content of this is 
application-specific.
** When the image is loaded from files, users should expect to find the file 
name in this field.

* StructField("height", IntegerType(), False),
** The height of the image, in pixels.
** If the image fails to load, the value is -1.

* StructField("width", IntegerType(), False),
** The width of the image, in pixels.
** If the image fails to load, the value is -1.

* StructField("nChannels", 

[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944215#comment-15944215
 ] 

Timothy Hunter commented on SPARK-19634:


[~sethah], yes, thanks for bringing up these concerns. Regarding the first 
points, the UDAF interface does not let you update arrays in place, which is a 
non-starter in our case. This is why the implementation switches to 
TypedImperativeAggregate (TIA). I have updated the design doc with these comments.

Regarding the performance, I agree that there is a tension between having an 
API that is compatible with structured streaming and the current, RDD-based 
implementation. I will provide some test numbers so that we have a basis for 
discussion. That being said, the RDD API is not going away, so if users care 
about performance and do not need the additional benefit of integrating with 
SQL or structured streaming, they can still use it.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944019#comment-15944019
 ] 

Timothy Hunter commented on SPARK-19634:


[~dongjin] [~wm624] sorry it looks like I missed your comments... I pushed a PR 
for this feature. Please feel free to comment on the PR if you have the time.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-20111) codegen bug surfaced by GraphFrames issue 165

2017-03-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-20111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15944004#comment-15944004
 ] 

Timothy Hunter commented on SPARK-20111:


As Spark SQL is making more and more forays into code generation, I have been 
wondering if it would make sense to start adopting practical compiler 
technologies, such as generating first an intermediate representation, instead 
of doing string manipulation as we currently do. This is of course much beyond 
the scope of this particular ticket.

> codegen bug surfaced by GraphFrames issue 165
> -
>
> Key: SPARK-20111
> URL: https://issues.apache.org/jira/browse/SPARK-20111
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.0.2, 2.1.0, 2.2.0
>Reporter: Joseph K. Bradley
>
> In GraphFrames, test {{test("named edges")}} in {{PatternMatchSuite}} 
> surfaces a SQL codegen bug.
> This is described in https://github.com/graphframes/graphframes/issues/165
> Summary
> * The unit test does a simple motif query on a graph.  Essentially, this 
> means taking 2 DataFrames, doing a few joins, selecting 2 columns, and 
> collecting the (tiny) DataFrame.
> * The test runs, but codegen fails.  See the linked GraphFrames issue for the 
> stacktrace.
> To reproduce this:
> * Check out GraphFrames https://github.com/graphframes/graphframes
> * Run {{sbt assembly}} to compile it and run tests
> Copying [~felixcheung]'s comment from the GraphFrames issue 165:
> {quote}
> Seems like codegen bug; it looks like at least 2 issues:
> 1. At L472, inputadapter_value is not defined within scope
> 2. inputadapter_value is an InternalRow, for this statement to work
> {{bhj_primitiveA = inputadapter_value;}}
> it should be
> {{bhj_primitiveA = inputadapter_value.getLong(0);}}
> {quote}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20077) Documentation for ml.stats.Correlation

2017-03-23 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-20077:
--

 Summary: Documentation for ml.stats.Correlation
 Key: SPARK-20077
 URL: https://issues.apache.org/jira/browse/SPARK-20077
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


Now that (Pearson) correlations are available in spark.ml, we need to write 
some documentation to go along with this feature. For now, it can simply be 
based on the examples in the unit tests.




--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-20076) Python interface for ml.stats.Correlation

2017-03-23 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-20076:
--

 Summary: Python interface for ml.stats.Correlation
 Key: SPARK-20076
 URL: https://issues.apache.org/jira/browse/SPARK-20076
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


The (Pearson) correlation statistics have been exposed with a DataFrame 
interface in Scala as part of SPARK-19636. We should now make these available 
in Python.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-03-13 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15923119#comment-15923119
 ] 

Timothy Hunter commented on SPARK-19634:


I was not able to finish it in time, but the bulk of the code is in this branch:

https://github.com/apache/spark/compare/master...thunterdb:19634?expand=1

Note that it currently includes a (non-working) UDAF and an incomplete 
TypedImperativeAggregate. It turns out that the UDAF interface is not suited for 
this sort of aggregator, which I realized quite late. I started to refactor my 
code to use TypedImperativeAggregate, but did not have time to finish it. If 
someone wants to pick up this task, they are welcome to do so.
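
For anyone picking this up, here is a rough sketch (illustrative only, not the 
code in the branch above) of the kind of buffer the TypedImperativeAggregate 
route allows: a plain JVM object whose arrays are updated in place. The real 
implementation would wrap such a buffer in a class extending Catalyst's 
TypedImperativeAggregate and add buffer serialization.

{code}
// Mutable aggregation state with in-place updates and no per-row allocation.
class SummaryBuffer(n: Int) extends Serializable {
  val currMax: Array[Double] = Array.fill(n)(Double.NegativeInfinity)
  val currMin: Array[Double] = Array.fill(n)(Double.PositiveInfinity)
  var count: Long = 0L

  def update(values: Array[Double]): Unit = {
    var i = 0
    while (i < n) {
      if (values(i) > currMax(i)) currMax(i) = values(i)
      if (values(i) < currMin(i)) currMin(i) = values(i)
      i += 1
    }
    count += 1
  }

  def merge(other: SummaryBuffer): Unit = {
    var i = 0
    while (i < n) {
      if (other.currMax(i) > currMax(i)) currMax(i) = other.currMax(i)
      if (other.currMin(i) < currMin(i)) currMin(i) = other.currMin(i)
      i += 1
    }
    count += other.count
  }
}
{code}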

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19634?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15886900#comment-15886900
 ] 

Timothy Hunter commented on SPARK-19634:


[~wm624] were you able to start to work on this task? I have some time now and 
I can work on it.

> Feature parity for descriptive statistics in MLlib
> --
>
> Key: SPARK-19634
> URL: https://issues.apache.org/jira/browse/SPARK-19634
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.MultivariateOnlineSummarizer over to spark.ml.
> A design has been discussed in SPARK-19208 . Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-23 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881471#comment-15881471
 ] 

Timothy Hunter commented on SPARK-19635:


After working on it, I realized that Column operations do not fit the requested 
operations very well. Hypothesis testing requires chaining a UDAF with a UDF and 
then another UDAF, which is not something that can be expressed inside Catalyst 
by doing {{dataframe.select(test("features"))}}. I am going to propose a 
simpler, dataset-level interface instead (see design doc above).
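
For reference, a minimal sketch of such a dataset-level entry point, assuming a 
SparkSession {{spark}} in scope; the shape shown is roughly what was eventually 
added as {{org.apache.spark.ml.stat.ChiSquareTest}}:

{code}
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.ml.stat.ChiSquareTest
import spark.implicits._

val df = Seq(
  (0.0, Vectors.dense(0.5, 10.0)),
  (0.0, Vectors.dense(1.5, 20.0)),
  (1.0, Vectors.dense(1.5, 30.0)),
  (1.0, Vectors.dense(3.5, 40.0))
).toDF("label", "features")

// The chained aggregate / transform / aggregate steps happen inside the helper;
// the caller only sees a one-row DataFrame with pValues, degreesOfFreedom and
// statistics.
val result = ChiSquareTest.test(df, "features", "label")
result.show(truncate = false)
{code}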

> Feature parity for Chi-square hypothesis testing in MLlib
> -
>
> Key: SPARK-19635
> URL: https://issues.apache.org/jira/browse/SPARK-19635
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of 
> spark.mllib.Statistics.chiSqTest over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-23 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15881457#comment-15881457
 ] 

Timothy Hunter commented on SPARK-19636:


After working on it, I realized that Column operations do not fit the requested 
operations very well. Correlations require chaining a UDAF with a UDF and then 
another UDAF, which is not something that can be expressed inside Catalyst by 
doing {{dataframe.select(corr("features"))}}. I am going to propose a simpler, 
dataset-level interface instead (see design doc above).
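
For reference, a minimal sketch of such a dataset-level entry point, assuming a 
SparkSession {{spark}} in scope; the shape shown is roughly what was eventually 
added as {{org.apache.spark.ml.stat.Correlation}}:

{code}
import org.apache.spark.ml.linalg.{Matrix, Vectors}
import org.apache.spark.ml.stat.Correlation
import org.apache.spark.sql.Row
import spark.implicits._

val df = Seq(
  Vectors.dense(1.0, 0.0, 5.0),
  Vectors.dense(2.0, 3.0, 7.0),
  Vectors.dense(4.0, 6.0, 8.0)
).map(Tuple1.apply).toDF("features")

// The result is a one-row DataFrame carrying the full correlation Matrix, rather
// than a Column expression that could be chained inside a single select.
val Row(pearson: Matrix) = Correlation.corr(df, "features").head
val Row(spearman: Matrix) = Correlation.corr(df, "features", "spearman").head
{code}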

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19573) Make NaN/null handling consistent in approxQuantile

2017-02-22 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19573?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15879387#comment-15879387
 ] 

Timothy Hunter commented on SPARK-19573:


I do not have too strong an opinion, as long as:
 1. we are consistent within Spark, or
 2. we follow the standard for numerical stuff (IEEE-754)

I am not sure what the standard is for SQL, though.


> Make NaN/null handling consistent in approxQuantile
> ---
>
> Key: SPARK-19573
> URL: https://issues.apache.org/jira/browse/SPARK-19573
> Project: Spark
>  Issue Type: Bug
>  Components: SQL
>Affects Versions: 2.2.0
>Reporter: zhengruifeng
>
> As discussed in https://github.com/apache/spark/pull/16776, this jira is used 
> to track the following issue:
> Multi-column version of approxQuantile drop the rows containing *any* 
> NaN/null, the results are not consistent with outputs of the single-version.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15877128#comment-15877128
 ] 

Timothy Hunter commented on SPARK-19636:


Looking more closely at the code, it makes sense to start with a replacement of 
MultivariateStatisticalSummary, which is the basis of PearsonCorrelation and the 
final step of the Spearman correlation. Also, looking at these algorithms, it is 
not practical to write them as UDAFs (unlike the original design), so the 
interface will need to take a {{Dataset[Vector]}} instead of a column.
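
A hypothetical signature sketch of that entry point (the names are illustrative 
only, not a committed API):

{code}
import org.apache.spark.ml.linalg.{Matrix, Vector}
import org.apache.spark.sql.Dataset

object CorrelationSketch {
  // Pearson can be computed from the summary statistics mentioned above; Spearman
  // first replaces each feature with its global rank (a whole-dataset operation)
  // and then applies Pearson to the ranks, which is why a per-column UDAF
  // formulation does not fit.
  def corr(data: Dataset[Vector], method: String = "pearson"): Matrix = ???
}
{code}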

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19636?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15876929#comment-15876929
 ] 

Timothy Hunter commented on SPARK-19636:


Unless someone has started to work on this task, I will take it.

> Feature parity for correlation statistics in MLlib
> --
>
> Key: SPARK-19636
> URL: https://issues.apache.org/jira/browse/SPARK-19636
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Affects Versions: 2.1.0
>Reporter: Timothy Hunter
>
> This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
> over to spark.ml.
> Here is a design doc:
> https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19636) Feature parity for correlation statistics in MLlib

2017-02-16 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-19636:
--

 Summary: Feature parity for correlation statistics in MLlib
 Key: SPARK-19636
 URL: https://issues.apache.org/jira/browse/SPARK-19636
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


This ticket tracks porting the functionality of spark.mllib.Statistics.corr() 
over to spark.ml.

Here is a design doc:
https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19635) Feature parity for Chi-square hypothesis testing in MLlib

2017-02-16 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-19635:
--

 Summary: Feature parity for Chi-square hypothesis testing in MLlib
 Key: SPARK-19635
 URL: https://issues.apache.org/jira/browse/SPARK-19635
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


This ticket tracks porting the functionality of 
spark.mllib.Statistics.chiSqTest over to spark.ml.

Here is a design doc:
https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-19634) Feature parity for descriptive statistics in MLlib

2017-02-16 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-19634:
--

 Summary: Feature parity for descriptive statistics in MLlib
 Key: SPARK-19634
 URL: https://issues.apache.org/jira/browse/SPARK-19634
 Project: Spark
  Issue Type: Sub-task
  Components: ML
Affects Versions: 2.1.0
Reporter: Timothy Hunter


This ticket tracks porting the functionality of 
spark.mllib.MultivariateOnlineSummarizer over to spark.ml.

A design has been discussed in SPARK-19208 . Here is a design doc:

https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-16 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15870655#comment-15870655
 ] 

Timothy Hunter commented on SPARK-19208:


I put together the ideas in this thread into a document. I will update the 
umbrella ticket with subtasks once folks have had a chance to comment:

https://docs.google.com/document/d/1ELVpGV3EBjc2KQPLN9_9_Ge9gWchPZ6SGtDW5tTm_50/edit#

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866768#comment-15866768
 ] 

Timothy Hunter commented on SPARK-19208:


Yes, I meant returning a struct and then projecting this struct. I do not think 
there is any other way right now with the current UDAFs, as you mention. In 
that proposal, {{VectorSummarizer.metrics(...).summary(...)}} returns a struct, 
the fields of which are decided by the arguments in {{.metrics}}, and each of 
the individual functions {{VectorSummarizer.min/max/variance(...)}} returns 
columns of vectors or matrices.
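
Concretely, the struct-and-project idea looks like this (VectorSummarizer is 
still a proposed name at this point, and {{df}} is assumed to have a vector 
column "features"):

{code}
import org.apache.spark.sql.functions.col

// The struct returned by summary() has one field per metric requested in metrics().
val summarized = df.select(
  VectorSummarizer.metrics("min", "variance").summary(col("features")).as("s"))

// Each convenience function (min, variance, ...) would simply project one field
// out of that struct.
val projected = summarized.select(
  col("s.min").as("min(features)"),
  col("s.variance").as("variance(features)"))
{code}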

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714
 ] 

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 9:24 PM:
-

Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("vector_summary(features).min").as("min(features)"), 
col("vector_summary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute their statistics 
independently: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?


was (Author: timhunter):
Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), 
col("VectorSummary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute their statistics 
independently: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866714#comment-15866714
 ] 

Timothy Hunter commented on SPARK-19208:


Thanks for the clarification [~mlnick]. I was a bit unclear in my previous 
comment. What I meant by catalyst rules is supporting the case in which the 
user would naturally request multiple summaries:

{code}
val summaryDF = df.select(VectorSummary.min("features"), 
VectorSummary.variance("features"))
{code}

and have a simple rule that rewrites this logical tree to use a single UDAF 
under the hood:

{code}
val tmpDF = df.select(VectorSummary.summary("features", "min", "variance"))
val df2 = tmpDF.select(col("VectorSummary(features).min").as("min(features)"), 
col("VectorSummary(features).variance").as("variance(features)")
{code}

Of course this is more advanced, and we should probably start with:
 - a UDAF that follows some builder pattern such as 
VectorSummarizer.metrics("min", "max").summary("features")
 - some simple wrappers that (inefficiently) compute their statistics 
independently: {{VectorSummarizer.min("feature")}} is a shortcut for:
{code}
VectorSummarizer.metrics("min").summary("features").getCol("min")
{code}
etc. We can always optimize this use case later using rewrite rules.

What do you think?

> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Comment Edited] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535
 ] 

Timothy Hunter edited comment on SPARK-19208 at 2/14/17 8:04 PM:
-

I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have an API like 
this:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...))
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...))
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features"))
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights"))
{code}

What do you think? I will be happy to put together a proposal.



was (Author: timhunter):
I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have the 
following API:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...))
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...))
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features"))
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights"))
{code}

What do you think? I will be happy to put together a proposal.


> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For 

[jira] [Commented] (SPARK-19208) MultivariateOnlineSummarizer performance optimization

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-19208?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866535#comment-15866535
 ] 

Timothy Hunter commented on SPARK-19208:


I am not sure if we should follow the Estimator API for classical statistics:
 - it does not transform the data, it only gets fitted, so it does not quite 
fit the Estimator API.
 - more generally, I would argue that the use case is to get some information 
about a dataframe for its own sake, rather than being part of a ML pipeline. 
For instance, there was no attempt to fit these algorithms into spark.mllib 
estimator/model API, and basic scalers are already in the transformer API.

I want to second [~josephkb]'s API, because it is the most flexible with 
respect to implementation, and the only one that is compatible with structured 
streaming and groupBy. That means users will be able to use all the summary 
stats without additional work from us to retrofit the API to structured 
streaming. Furthermore, the exact implementation details (a single private 
UDAF, more optimized catalyst-based transforms) can be implemented in the 
future without changing the API.

As an intermediate step, if introducing catalyst rules is too hard for now and 
if we want to address [~mlnick]'s points (a) and (b), we can have the 
following API:

{code}
df.select(VectorSummary.summary("features", "min", "mean", ...))
df.select(VectorSummary.summaryWeighted("features", "weights", "min", "mean", ...))
{code}

or:

{code}
df.select(VectorSummary.summaryStats("min", "mean").summary("features"))
df.select(VectorSummary.summaryStats("min", "mean").summaryWeighted("features", "weights"))
{code}

What do you think? I will be happy to put together a proposal.


> MultivariateOnlineSummarizer performance optimization
> -
>
> Key: SPARK-19208
> URL: https://issues.apache.org/jira/browse/SPARK-19208
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Reporter: zhengruifeng
> Attachments: Tests.pdf, WechatIMG2621.jpeg
>
>
> Now, {{MaxAbsScaler}} and {{MinMaxScaler}} are using 
> {{MultivariateOnlineSummarizer}} to compute the min/max.
> However {{MultivariateOnlineSummarizer}} will also compute extra unused 
> statistics. It slows down the task, moreover it is more prone to cause OOM.
> For example:
> env : --driver-memory 4G --executor-memory 1G --num-executors 4
> data: 
> [http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/binary.html#kdd2010%20(bridge%20to%20algebra)]
>  748401 instances,   and 29,890,095 features
> {{MaxAbsScaler.fit}} fails because of OOM
> {{MultivariateOnlineSummarizer}} maintains 8 arrays:
> {code}
> private var currMean: Array[Double] = _
>   private var currM2n: Array[Double] = _
>   private var currM2: Array[Double] = _
>   private var currL1: Array[Double] = _
>   private var totalCnt: Long = 0
>   private var totalWeightSum: Double = 0.0
>   private var weightSquareSum: Double = 0.0
>   private var weightSum: Array[Double] = _
>   private var nnz: Array[Long] = _
>   private var currMax: Array[Double] = _
>   private var currMin: Array[Double] = _
> {code}
> For {{MaxAbsScaler}}, only 1 array is needed (max of abs value)
> For {{MinMaxScaler}}, only 3 arrays are needed (max, min, nnz)
> After the modification in the PR, the above example runs successfully.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14523) Feature parity for Statistics ML with MLlib

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866295#comment-15866295
 ] 

Timothy Hunter commented on SPARK-14523:


Also, the correlation is missing the multivariate case.

I will take this task over unless someone else expresses interest.

> Feature parity for Statistics ML with MLlib
> ---
>
> Key: SPARK-14523
> URL: https://issues.apache.org/jira/browse/SPARK-14523
> Project: Spark
>  Issue Type: Sub-task
>  Components: ML
>Reporter: yuhao yang
>
> Some statistics functions have been supported by DataFrame directly. Use this 
> jira to discuss/design the statistics package in Spark.ML and its function 
> scope. Hypothesis test and correlation computation may still need to expose 
> independent interfaces.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-4591) Algorithm/model parity for spark.ml (Scala)

2017-02-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-4591?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15866288#comment-15866288
 ] 

Timothy Hunter commented on SPARK-4591:
---

[~josephkb] do you also want some subtasks for KernelDensity and multivariate 
summaries? They are in the stat module but not covered.

> Algorithm/model parity for spark.ml (Scala)
> ---
>
> Key: SPARK-4591
> URL: https://issues.apache.org/jira/browse/SPARK-4591
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML
>Reporter: Xiangrui Meng
>Priority: Critical
>
> This is an umbrella JIRA for porting spark.mllib implementations to use the 
> DataFrame-based API defined under spark.ml.  We want to achieve critical 
> feature parity for the next release.
> h3. Instructions for 3 subtask types
> *Review tasks*: detailed review of a subpackage to identify feature gaps 
> between spark.mllib and spark.ml.
> * Should be listed as a subtask of this umbrella.
> * Review subtasks cover major algorithm groups.  To pick up a review subtask, 
> please:
> ** Comment that you are working on it.
> ** Compare the public APIs of spark.ml vs. spark.mllib.
> ** Comment on all missing items within spark.ml: algorithms, models, methods, 
> features, etc.
> ** Check for existing JIRAs covering those items.  If there is no existing 
> JIRA, create one, and link it to your comment.
> *Critical tasks*: higher priority missing features which are required for 
> this umbrella JIRA.
> * Should be linked as "requires" links.
> *Other tasks*: lower priority missing features which can be completed after 
> the critical tasks.
> * Should be linked as "contains" links.
> h4. Excluded items
> This does *not* include:
> * Python: We can compare Scala vs. Python in spark.ml itself.
> * Moving linalg to spark.ml: [SPARK-13944]
> * Streaming ML: Requires stabilizing some internal APIs of structured 
> streaming first
> h3. TODO list
> *Critical issues*
> * [SPARK-14501]: Frequent Pattern Mining
> * [SPARK-14709]: linear SVM
> * [SPARK-15784]: Power Iteration Clustering (PIC)
> *Lower priority issues*
> * Missing methods within algorithms (see Issue Links below)
> * evaluation submodule
> * stat submodule (should probably be covered in DataFrames)
> * Developer-facing submodules:
> ** optimization (including [SPARK-17136])
> ** random, rdd
> ** util
> *To be prioritized*
> * single-instance prediction: [SPARK-10413]
> * pmml [SPARK-11171]



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-11-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15657825#comment-15657825
 ] 

Timothy Hunter commented on SPARK-8884:
---

I do not have a strong preference either way. We should just either
complete this feature (with DataFrame APIs) or close the open PR.



> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8884) 1-sample Anderson-Darling Goodness-of-Fit test

2016-11-10 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15655563#comment-15655563
 ] 

Timothy Hunter commented on SPARK-8884:
---

[~srowen] this ticket should still be open I believe? [~yuhaoyan] has an open 
PR for it.

> 1-sample Anderson-Darling Goodness-of-Fit test
> --
>
> Key: SPARK-8884
> URL: https://issues.apache.org/jira/browse/SPARK-8884
> Project: Spark
>  Issue Type: New Feature
>  Components: MLlib
>Reporter: Jose Cambronero
>
> We have implemented a 1-sample Anderson-Darling goodness-of-fit test to add 
> to the current hypothesis testing functionality. The current implementation 
> supports various distributions (normal, exponential, gumbel, logistic, and 
> weibull). However, users must provide distribution parameters for all except 
> normal/exponential (in which case they are estimated from the data). In 
> contrast to other tests, such as the Kolmogorov Smirnov test, we only support 
> specific distributions as the critical values depend on the distribution 
> being tested. 
> The distributed implementation of AD takes advantage of the fact that we can 
> calculate a portion of the statistic within each partition of a sorted data 
> set, independent of the global order of those observations. We can then carry 
> some additional information that allows us to adjust the final amounts once 
> we have collected 1 result per partition.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17845) Improve window function frame boundary API in DataFrame

2016-10-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17845?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15566380#comment-15566380
 ] 

Timothy Hunter commented on SPARK-17845:


I like the {{Window.rowsBetween(Long.MinValue, -3)}} syntax, but it is exposing 
a system implementation detail. How about having some static/singleton values 
that define our notion of plus/minus infinity instead of relying on the system 
values?

Here is a suggestion:

{code}
Window.rowsBetween(Window.unboundedBefore, -3)

object Window {
  def unboundedBefore: Long = Int.MinValue.toLong
}
{code}

To get around the different integer sizes in each language, I suggest we treat 
every value above 2^31 as unbounded above (and, symmetrically, every value below 
-2^31 as unbounded below). That should be more than enough and covers at least 
Python, Scala, R, and Java.
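
A small sketch of that normalization (illustrative only, not the API that was 
eventually adopted):

{code}
object WindowBoundsSketch {
  val unboundedPreceding: Long = Int.MinValue.toLong
  val unboundedFollowing: Long = Int.MaxValue.toLong

  // Any boundary whose magnitude does not fit in an Int is treated as unbounded,
  // which sidesteps the different min/max integer values across languages.
  def isUnboundedPreceding(b: Long): Boolean = b <= Int.MinValue.toLong
  def isUnboundedFollowing(b: Long): Boolean = b >= Int.MaxValue.toLong
}

// Usage: Window.rowsBetween(WindowBoundsSketch.unboundedPreceding, -3) would mean
// ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING.
{code}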


> Improve window function frame boundary API in DataFrame
> ---
>
> Key: SPARK-17845
> URL: https://issues.apache.org/jira/browse/SPARK-17845
> Project: Spark
>  Issue Type: Improvement
>  Components: SQL
>Reporter: Reynold Xin
>Assignee: Reynold Xin
>
> ANSI SQL uses the following to specify the frame boundaries for window 
> functions:
> {code}
> ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> {code}
> In Spark's DataFrame API, we use integer values to indicate relative position:
> - 0 means "CURRENT ROW"
> - -1 means "1 PRECEDING"
> - Long.MinValue means "UNBOUNDED PRECEDING"
> - Long.MaxValue to indicate "UNBOUNDED FOLLOWING"
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetween(Long.MinValue, -3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetween(Long.MinValue, 0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetween(0, Long.MaxValue)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetween(Long.MinValue, Long.MaxValue)
> {code}
> I think using numeric values to indicate relative positions is actually a 
> good idea, but the reliance on Long.MinValue and Long.MaxValue to indicate 
> unbounded ends is pretty confusing:
> 1. The API is not self-evident. There is no way for a new user to figure out 
> how to indicate an unbounded frame by looking at just the API. The user has 
> to read the doc to figure this out.
> 2. It is weird Long.MinValue or Long.MaxValue has some special meaning.
> 3. Different languages have different min/max values, e.g. in Python we use 
> -sys.maxsize and +sys.maxsize.
> To make this API less confusing, we have a few options:
> Option 1. Add the following (additional) methods:
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsBetween(-3, +3)  // this one exists already
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsBetweenUnboundedPrecedingAnd(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsBetweenUnboundedPrecedingAndCurrentRow()
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsBetweenCurrentRowAndUnboundedFollowing()
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> Window.rowsBetweenUnboundedPrecedingAndUnboundedFollowing()
> {code}
> This is obviously very verbose, but is very similar to how these functions 
> are done in SQL, and is perhaps the most obvious to end users, especially if 
> they come from SQL background.
> Option 2. Decouple the specification for frame begin and frame end into two 
> functions. Assume the boundary is unlimited unless specified.
> {code}
> // ROWS BETWEEN 3 PRECEDING AND 3 FOLLOWING
> Window.rowsFrom(-3).rowsTo(3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND 3 PRECEDING
> Window.rowsTo(-3)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
> Window.rowsToCurrent() or Window.rowsTo(0)
> // ROWS BETWEEN CURRENT ROW AND UNBOUNDED PRECEDING
> Window.rowsFromCurrent() or Window.rowsFrom(0)
> // ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
> // no need to specify
> {code}
> If we go with option 2, we should throw exceptions if users specify multiple 
> from's or to's. A variant of option 2 is to require explicitly specification 
> of begin/end even in the case of unbounded boundary, e.g.:
> {code}
> Window.rowsFromBeginning().rowsTo(-3)
> or
> Window.rowsFromUnboundedPreceding().rowsTo(-3)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17219) QuantileDiscretizer does strange things with NaN values

2016-10-06 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17219?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15553490#comment-15553490
 ] 

Timothy Hunter commented on SPARK-17219:


If I understand correctly the PR, I am concerned by this approach for a couple 
of reasons:
 - when users set the number of buckets, the general expectation should be that 
(number of returned buckets) <= (number of requested buckets). With the current 
treatment of NaN, you can end up with more buckets than you asked for. Breaking 
this invariant seems troublesome for me.
 - in general, MLLib's policy in regard to NaNs has been to consider them as 
invalid input. This is also the approach followed by sklearn and the reason for 
having an imputer with SPARK-13568. If we start to let NaN values go through, 
they will trigger some other issues down the pipelines.

Why not simply stop with an error at that point, as [~srowen] was 
suggesting at the beginning? [~barrybecker4], I am trying to understand your 
use case here.
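
To illustrate the policy I am arguing for, this is the kind of user-side handling 
I would expect (a sketch only: {{df}} is assumed to be the titanic DataFrame from 
the description, and dropping the NaN rows could be replaced by an imputer, see 
SPARK-13568):

{code}
import org.apache.spark.ml.feature.QuantileDiscretizer
import org.apache.spark.sql.functions.col

// Treat NaN as invalid input: remove (or impute) it before discretizing.
val cleaned = df.filter(!col("age").isNaN)

val discretizer = new QuantileDiscretizer()
  .setInputCol("age")
  .setOutputCol("ageBucket")
  .setNumBuckets(10)

// With the NaNs gone, (number of returned buckets) <= (number of requested buckets)
// holds again.
val bucketed = discretizer.fit(cleaned).transform(cleaned)
{code}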

> QuantileDiscretizer does strange things with NaN values
> ---
>
> Key: SPARK-17219
> URL: https://issues.apache.org/jira/browse/SPARK-17219
> Project: Spark
>  Issue Type: Bug
>  Components: ML
>Affects Versions: 1.6.2
>Reporter: Barry Becker
>Assignee: Vincent
> Fix For: 2.1.0
>
>
> How is the QuantileDiscretizer supposed to handle null values?
> Actual nulls are not allowed, so I replace them with Double.NaN.
> However, when you try to run the QuantileDiscretizer on a column that 
> contains NaNs, it will create (possibly more than one) NaN split(s) before 
> the final PositiveInfinity value.
> I am using the attached titanic csv data and trying to bin the "age" column 
> using the QuantileDiscretizer with 10 bins specified. The age column has a lot 
> of null values.
> These are the splits that I get:
> {code}
> -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, NaN, NaN, Infinity
> {code}
> Is that expected? It seems to imply that NaN is larger than any positive 
> number and less than infinity.
> I'm not sure of the best way to handle nulls, but I think they need a bucket 
> all their own. My suggestion would be to include an initial NaN split value 
> that is always there, just like the sentinel Infinities are. If that were the 
> case, then the splits for the example above might look like this:
> {code}
> NaN, -Infinity, 15.0, 20.5, 24.0, 28.0, 32.5, 38.0, 48.0, Infinity
> {code}
> This does not seem great either because a bucket that is [NaN, -Inf] doesn't 
> make much sense. Not sure if the NaN bucket counts toward numBins or not. I 
> do think it should always be there though in case future data has null even 
> though the fit data did not. Thoughts?



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-17074) generate histogram information for column

2016-09-30 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-17074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15537295#comment-15537295
 ] 

Timothy Hunter commented on SPARK-17074:


We have discussed this through email and either is fine. Regarding the second 
one, even if the result is approximate, you can still get some reasonable 
bounds on the error.
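
For concreteness, a minimal Scala sketch (Spark 2.0+; the column and bin count 
are made up) of how equi-height boundaries can be read off the approximate 
quantile sketch, with the {{relativeError}} argument bounding the rank error of 
every returned boundary:
{code}
// Equi-height bin boundaries are just quantiles of the column, so they can be
// computed with approxQuantile. Each boundary is guaranteed to be within
// relativeError * N ranks of the exact quantile.
val df = spark.range(0, 100000).selectExpr("cast(id as double) as x")

val numBins = 10
val probabilities = (0 to numBins).map(_.toDouble / numBins).toArray

val boundaries = df.stat.approxQuantile("x", probabilities, 0.001)
{code}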

> generate histogram information for column
> -
>
> Key: SPARK-17074
> URL: https://issues.apache.org/jira/browse/SPARK-17074
> Project: Spark
>  Issue Type: Sub-task
>  Components: Optimizer
>Affects Versions: 2.0.0
>Reporter: Ron Hu
>
> We support two kinds of histograms: 
> - Equi-width histogram: We have a fixed width for each column interval in 
> the histogram.  The height of a histogram represents the frequency for those 
> column values in a specific interval.  For this kind of histogram, its height 
> varies for different column intervals. We use the equi-width histogram when 
> the number of distinct values is less than 254.
> - Equi-height histogram: For this histogram, the width of column interval 
> varies.  The heights of all column intervals are the same.  The equi-height 
> histogram is effective in handling skewed data distribution. We use the 
> equi-height histogram when the number of distinct values is equal to or 
> greater than 254.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-16485) Additional fixes to Mllib 2.0 documentation

2016-07-11 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-16485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-16485:
---
Description: 
While reviewing the documentation of MLlib, I found some additional issues.

Important issues that affect the binary signatures:
 - GBTClassificationModel: all the setters should be overridden
 - LogisticRegressionModel: setThreshold(s)
 - RandomForestClassificationModel: all the setters should be overridden
 - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but 
most of the methods are private[ml] -> do we need to expose this class for now?
- GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should 
not be exposed
- sqlDataTypes: name does not follow conventions. Do we need to expose it?

Issues that involve only documentation:
- Evaluator:
  1. inconsistent doc between evaluate and isLargerBetter
- MinMaxScaler: math rendering
- GeneralizedLinearRegressionSummary: aic doc is incorrect


The reference documentation that was used was:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/

  was:
While reviewing the documentation of MLlib, I found some additional issues.

Important issues that affect the binary signatures:
 - GBTClassificationModel: all the setters should be overridden
 - LogisticRegressionModel: setThreshold(s)
 - RandomForestClassificationModel: all the setters should be overridden
 - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but 
most of the methods are private[ml] -> do we need to expose this class for now?
- GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should 
not be exposed
- sqlDataTypes: name does not follow conventions. Do we need to expose it?

Issues that involve only documentation:
- Evaluator:
  1. inconsistent doc between evaluate and isLargerBetter
  2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the 
other method with the same name shows up). This may be a bug in scaladoc.
- MinMaxScaler: math rendering
- GeneralizedLinearRegressionSummary: aic doc is incorrect


The reference documentation that was used was:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/


> Additional fixes to Mllib 2.0 documentation
> ---
>
> Key: SPARK-16485
> URL: https://issues.apache.org/jira/browse/SPARK-16485
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Timothy Hunter
>
> While reviewing the documentation of MLlib, I found some additional issues.
> Important issues that affect the binary signatures:
>  - GBTClassificationModel: all the setters should be overridden
>  - LogisticRegressionModel: setThreshold(s)
>  - RandomForestClassificationModel: all the setters should be overridden
>  - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but 
> most of the methods are private[ml] -> do we need to expose this class for 
> now?
> - GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should 
> not be exposed
> - sqlDataTypes: name does not follow conventions. Do we need to expose it?
> Issues that involve only documentation:
> - Evaluator:
>   1. inconsistent doc between evaluate and isLargerBetter
> - MinMaxScaler: math rendering
> - GeneralizedLinearRegressionSummary: aic doc is incorrect
> The reference documentation that was used was:
> http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-07-11 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15371504#comment-15371504
 ] 

Timothy Hunter commented on SPARK-14816:


Also, in `mllib-guide.md`, let's switch the order between spark.ml and 
spark.mllib to give more prominence to spark.ml.

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>Priority: Blocker
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16485) Additional fixes to Mllib 2.0 documentation

2016-07-11 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-16485:
--

 Summary: Additional fixes to Mllib 2.0 documentation
 Key: SPARK-16485
 URL: https://issues.apache.org/jira/browse/SPARK-16485
 Project: Spark
  Issue Type: Sub-task
Reporter: Timothy Hunter


While reviewing the documentation of MLlib, I found some additional issues.

Important issues that affect the binary signatures:
 - GBTClassificationModel: all the setters should be overridden
 - LogisticRegressionModel: setThreshold(s)
 - RandomForestClassificationModel: all the setters should be overridden
 - org.apache.spark.ml.stat.distribution.MultivariateGaussian is exposed but 
most of the methods are private[ml] -> do we need to expose this class for now?
- GeneralizedLinearRegressionModel: linkObj, familyObj, familyAndLink should 
not be exposed
- sqlDataTypes: name does not follow conventions. Do we need to expose it?

Issues that involve only documentation:
- Evaluator:
  1. inconsistent doc between evaluate and isLargerBetter
  2. missing `def evaluate(dataset: Dataset[_]): Double` from the doc (the 
other method with the same name shows up). This may be a bug in scaladoc.
- MinMaxScaler: math rendering
- GeneralizedLinearRegressionSummary: aic doc is incorrect


The reference documentation that was used was:
http://people.apache.org/~pwendell/spark-releases/spark-2.0.0-rc2-docs/



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-28 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15353374#comment-15353374
 ] 

Timothy Hunter commented on SPARK-12922:


I opened a separate JIRA for that issue: SPARK-16258

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-16258) Automatically append the grouping keys in SparkR's gapply

2016-06-28 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-16258:
--

 Summary: Automatically append the grouping keys in SparkR's gapply
 Key: SPARK-16258
 URL: https://issues.apache.org/jira/browse/SPARK-16258
 Project: Spark
  Issue Type: Improvement
  Components: SparkR
Reporter: Timothy Hunter


While working on the group apply function for python [1], we found it easier to 
depart from SparkR's gapply function in the following way:
 - the keys are appended by default to the spark dataframe being returned
 - the output schema that the users provides is the schema of the R data frame 
and does not include the keys

Here are the reasons for doing so:
 - in most cases, users will want to know the key associated with a result -> 
appending the key is the sensible default
 - most functions in the SQL interface and in MLlib append columns, and gapply 
departs from this philosophy
 - for the cases when they do not need it, adding the key is a fraction of the 
computation time and of the output size
 - from a formal perspective, it makes calling gapply fully transparent to the 
type of the key: it is easier to build a function with gapply because it does 
not need to know anything about the key

This ticket proposes to change SparkR's gapply function to follow the same 
convention as Python's implementation.

cc [~Narine] [~shivaram]

[1] 
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py
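
For comparison, here is a minimal sketch of the Dataset analogue in Scala 
(spark-shell; the case class and data are hypothetical): with 
{{flatMapGroups}} the grouping key is only visible inside the user function, so 
the caller has to re-attach it by hand, which is exactly the boilerplate that 
appending the key by default removes:
{code}
// Hypothetical records grouped by store.
case class Sale(store: String, amount: Double)
val ds = Seq(Sale("a", 1.0), Sale("a", 2.0), Sale("b", 5.0)).toDS()

val totals = ds.groupByKey(_.store).flatMapGroups { (store, rows) =>
  // The key has to be appended manually here; the proposal is for gapply to
  // append the grouping keys to the returned data frame automatically.
  Iterator((store, rows.map(_.amount).sum))
}.toDF("store", "total")

totals.show()
{code}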



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12922) Implement gapply() on DataFrame in SparkR

2016-06-27 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15351311#comment-15351311
 ] 

Timothy Hunter commented on SPARK-12922:


[~Narine] while working on a similar function for python [1], we found it 
easier to have the following changes:
 - the keys are appended by default to the spark dataframe being returned
 - the output schema that the users provides is the schema of the R data frame 
and does not include the keys

Here were our reasons to depart from the R implementation of gapply:
 - in most cases, users will want to know the key associated with a result -> 
appending the key is the sensible default
 - most functions in the SQL interface and in MLlib append columns, and gapply 
departs from this philosophy
 - for the cases when they do not need it, adding the key is a fraction of the 
computation time and of the output size
 - from a formal perspective, it makes calling gapply fully transparent to the 
type of the key: it is easier to build a function with gapply because it does 
not need to know anything about the key

I think it would make sense to make this change to the R's gapply 
implementation. Let me know what you think about it.

[1] 
https://github.com/databricks/spark-sklearn/blob/master/python/spark_sklearn/group_apply.py

> Implement gapply() on DataFrame in SparkR
> -
>
> Key: SPARK-12922
> URL: https://issues.apache.org/jira/browse/SPARK-12922
> Project: Spark
>  Issue Type: Sub-task
>  Components: SparkR
>Affects Versions: 1.6.0
>Reporter: Sun Rui
>Assignee: Narine Kokhlikyan
> Fix For: 2.0.0
>
>
> gapply() applies an R function on groups grouped by one or more columns of a 
> DataFrame, and returns a DataFrame. It is like GroupedDataSet.flatMapGroups() 
> in the Dataset API.
> Two API styles are supported:
> 1.
> {code}
> gd <- groupBy(df, col1, ...)
> gapply(gd, function(grouping_key, group) {}, schema)
> {code}
> 2.
> {code}
> gapply(df, grouping_columns, function(grouping_key, group) {}, schema) 
> {code}
> R function input: grouping keys value, a local data.frame of this grouped 
> data 
> R function output: local data.frame
> Schema specifies the Row format of the output of the R function. It must 
> match the R function's output.
> Note that map-side combination (partial aggregation) is not supported, user 
> could do map-side combination via dapply().



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-15581) MLlib 2.1 Roadmap

2016-06-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-15581?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15342674#comment-15342674
 ] 

Timothy Hunter commented on SPARK-15581:


With respect to deep learning, I think it depends on whether we are comfortable 
having a generic implementation that works for all supported languages but is 
going to be 1-2 orders of magnitude slower than specialized frameworks. 
Unlike BLAS for linear algebra, there is no generic interface in java or C++ to 
interface with specialized deep learning libraries, so just integrating them as 
a plugin will require a significant effort. Also, we are constrained by the 
dependencies we can pull into Spark, as experienced with breeze.
If we decide to roll out our own deep learning stack, we may be facing a 
perception issue that "deep learning on Spark is slow". 

> MLlib 2.1 Roadmap
> -
>
> Key: SPARK-15581
> URL: https://issues.apache.org/jira/browse/SPARK-15581
> Project: Spark
>  Issue Type: Umbrella
>  Components: ML, MLlib
>Reporter: Joseph K. Bradley
>Priority: Blocker
>  Labels: roadmap
>
> This is a master list for MLlib improvements we are working on for the next 
> release. Please view this as a wish list rather than a definite plan, for we 
> don't have an accurate estimate of available resources. Due to limited review 
> bandwidth, features appearing on this list will get higher priority during 
> code review. But feel free to suggest new items to the list in comments. We 
> are experimenting with this process. Your feedback would be greatly 
> appreciated.
> h1. Instructions
> h2. For contributors:
> * Please read 
> https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark 
> carefully. Code style, documentation, and unit tests are important.
> * If you are a first-time Spark contributor, please always start with a 
> [starter task|https://issues.apache.org/jira/issues/?filter=12333209] rather 
> than a medium/big feature. Based on our experience, mixing the development 
> process with a big feature usually causes long delays in code review.
> * Never work silently. Let everyone know on the corresponding JIRA page when 
> you start working on some features. This is to avoid duplicate work. For 
> small features, you don't need to wait to get JIRA assigned.
> * For medium/big features or features with dependencies, please get assigned 
> first before coding and keep the ETA updated on the JIRA. If there is no 
> activity on the JIRA page for a certain amount of time, the JIRA should be 
> released for other contributors.
> * Do not claim multiple (>3) JIRAs at the same time. Try to finish them one 
> after another.
> * Remember to add the `@Since("VERSION")` annotation to new public APIs.
> * Please review others' PRs (https://spark-prs.appspot.com/#mllib). Code 
> review greatly helps to improve others' code as well as yours.
> h2. For committers:
> * Try to break down big features into small and specific JIRA tasks and link 
> them properly.
> * Add a "starter" label to starter tasks.
> * Put a rough estimate for medium/big features and track the progress.
> * If you start reviewing a PR, please add yourself to the Shepherd field on 
> JIRA.
> * If the code looks good to you, please comment "LGTM". For non-trivial PRs, 
> please ping a maintainer to make a final pass.
> * After merging a PR, create and link JIRAs for Python, example code, and 
> documentation if applicable.
> h1. Roadmap (*WIP*)
> This is NOT [a complete list of MLlib JIRAs for 2.1| 
> https://issues.apache.org/jira/issues/?jql=project%20%3D%20SPARK%20AND%20component%20in%20(ML%2C%20MLlib%2C%20SparkR%2C%20GraphX)%20AND%20%22Target%20Version%2Fs%22%20%3D%202.1.0%20AND%20(fixVersion%20is%20EMPTY%20OR%20fixVersion%20!%3D%202.1.0)%20AND%20(Resolution%20is%20EMPTY%20OR%20Resolution%20in%20(Done%2C%20Fixed%2C%20Implemented))%20ORDER%20BY%20priority].
>  We only include umbrella JIRAs and high-level tasks.
> Major efforts in this release:
> * Feature parity for the DataFrames-based API (`spark.ml`), relative to the 
> RDD-based API
> * ML persistence
> * Python API feature parity and test coverage
> * R API expansion and improvements
> * Note about new features: As usual, we expect to expand the feature set of 
> MLlib.  However, we will prioritize API parity, bug fixes, and improvements 
> over new features.
> Note `spark.mllib` is in maintenance mode now.  We will accept bug fixes for 
> it, but new features, APIs, and improvements will only be added to `spark.ml`.
> h2. Critical feature parity in DataFrame-based API
> * Umbrella JIRA: [SPARK-4591]
> h2. Persistence
> * Complete persistence within MLlib
> ** Python tuning (SPARK-13786)
> * MLlib in R format: compatibility with other languages (SPARK-15572)
> * Impose backwards compatibility for persistence 

[jira] [Comment Edited] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-04-29 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264370#comment-15264370
 ] 

Timothy Hunter edited comment on SPARK-14816 at 4/29/16 5:21 PM:
-

Also, add a comment about the {{spark.lapply}} API


was (Author: timhunter):
Also, add a comment about the {{doparallel}} API

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14816) Update MLlib, GraphX, SparkR websites for 2.0

2016-04-29 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14816?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15264370#comment-15264370
 ] 

Timothy Hunter commented on SPARK-14816:


Also, add a comment about the {{doparallel}} API

> Update MLlib, GraphX, SparkR websites for 2.0
> -
>
> Key: SPARK-14816
> URL: https://issues.apache.org/jira/browse/SPARK-14816
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, GraphX, ML, MLlib, SparkR
>Reporter: Joseph K. Bradley
>
> Update the sub-projects' websites to include new features in this release.
> For MLlib, make it clear that the DataFrame-based API is the primary one now.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14571) Log instrumentation in ALS

2016-04-19 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15249058#comment-15249058
 ] 

Timothy Hunter commented on SPARK-14571:


Yes, please feel free to take this task. Thanks!




> Log instrumentation in ALS
> --
>
> Key: SPARK-14571
> URL: https://issues.apache.org/jira/browse/SPARK-14571
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-7264) SparkR API for parallel functions

2016-04-15 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-7264?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243584#comment-15243584
 ] 

Timothy Hunter commented on SPARK-7264:
---

I will have a PR for this soon.

> SparkR API for parallel functions
> -
>
> Key: SPARK-7264
> URL: https://issues.apache.org/jira/browse/SPARK-7264
> Project: Spark
>  Issue Type: New Feature
>  Components: SparkR
>Reporter: Shivaram Venkataraman
>
> This is a JIRA to discuss design proposals for enabling parallel R 
> computation in SparkR without exposing the entire RDD API. 
> The rationale for this is that the RDD API has a number of low level 
> functions and we would like to expose a more light-weight API that is both 
> friendly to R users and easy to maintain.
> http://goo.gl/GLHKZI has a first cut design doc.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14569) Log instrumentation in KMeans

2016-04-15 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15243238#comment-15243238
 ] 

Timothy Hunter commented on SPARK-14569:


[~iamshrek] thanks for taking a look!

> Log instrumentation in KMeans
> -
>
> Key: SPARK-14569
> URL: https://issues.apache.org/jira/browse/SPARK-14569
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14571) Log instrumentation in ALS

2016-04-13 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14571?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239794#comment-15239794
 ] 

Timothy Hunter commented on SPARK-14571:


SPARK-14568 has been merged, so it should be easy to follow the same metrics 
that have been added to LogisticRegression. 

[~yuu.ishik...@gmail.com], [~yinxusen], are you interested?


> Log instrumentation in ALS
> --
>
> Key: SPARK-14571
> URL: https://issues.apache.org/jira/browse/SPARK-14571
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14570) Log instrumentation in Random forests

2016-04-13 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239775#comment-15239775
 ] 

Timothy Hunter commented on SPARK-14570:


SPARK-14568 has been merged, so it should be easy to follow the same pattern as 
in LogisticRegression. In fact, most of the metrics have already been added in 
{{RandomForest.scala}} by [~josephkb]. It is just a matter of surfacing them 
better.

[~yuu.ishik...@gmail.com], [~yinxusen], are you interested?


> Log instrumentation in Random forests
> -
>
> Key: SPARK-14570
> URL: https://issues.apache.org/jira/browse/SPARK-14570
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-14569) Log instrumentation in KMeans

2016-04-13 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-14569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15239751#comment-15239751
 ] 

Timothy Hunter commented on SPARK-14569:


SPARK-14568 has been merged, so it should be easy to follow the same metrics 
that have been added to LogisticRegression. 

[~yuu.ishik...@gmail.com], [~yinxusen], are you interested?


> Log instrumentation in KMeans
> -
>
> Key: SPARK-14569
> URL: https://issues.apache.org/jira/browse/SPARK-14569
> Project: Spark
>  Issue Type: Sub-task
>  Components: MLlib
>Reporter: Timothy Hunter
>




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2016-04-12 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-14567:
---
Description: 
In order to debug performance issues when training mllib algorithms,
it is useful to log some metrics about the training dataset, the training 
parameters, etc.

This ticket is an umbrella to add some simple logging messages to the most 
common MLlib estimators. There should be no performance impact on the current 
implementation, and the output is simply printed in the logs.

Here are some values that are of interest when debugging training tasks:
* number of features
* number of instances
* number of partitions
* number of classes
* input RDD/DF cache level
* hyper-parameters

I suggest starting with the most common algorithms.

  was:
In order to debug performance issues when training mllib algorithms,
it is useful to log some metrics about the training dataset, the training 
parameters, etc.

This ticket is an umbrella to add some simple logging messages to the most 
common MLlib estimators. There should be no performance impact on the current 
implementation, and the output is simply printed in the logs.

Here are some values that are of interest when debugging training tasks:
* number of features
* number of instances
* number of partitions
* number of classes
* input RDD/DF cache level
* hyper-parameters


> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Timothy Hunter
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters
> I suggest starting with the most common algorithms.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2016-04-12 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-14567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-14567:
---
Description: 
In order to debug performance issues when training mllib algorithms,
it is useful to log some metrics about the training dataset, the training 
parameters, etc.

This ticket is an umbrella to add some simple logging messages to the most 
common MLlib estimators. There should be no performance impact on the current 
implementation, and the output is simply printed in the logs.

Here are some values that are of interest when debugging training tasks:
* number of features
* number of instances
* number of partitions
* number of classes
* input RDD/DF cache level
* hyper-parameters

  was:
In order to debug performance issues when training mllib algorithms,
it is useful to log some metrics about the training dataset, the training 
parameters, etc.

This ticket is an umbrella to add some simple logging messages to the most 
common MLlib estimators. There should be no performance impact on the current 
implementation, and the output is simply printed in the logs.

Here are some values that are of interest when debugging training tasks:
* number of features
* number of instances
* number of partitions
* number of classes
* input RDD/DF cache level
* hyper-parameters

I suggest starting with the most common algorithms.


> Add instrumentation logs to MLlib training algorithms
> -
>
> Key: SPARK-14567
> URL: https://issues.apache.org/jira/browse/SPARK-14567
> Project: Spark
>  Issue Type: Umbrella
>  Components: MLlib
>Reporter: Timothy Hunter
>
> In order to debug performance issues when training mllib algorithms,
> it is useful to log some metrics about the training dataset, the training 
> parameters, etc.
> This ticket is an umbrella to add some simple logging messages to the most 
> common MLlib estimators. There should be no performance impact on the current 
> implementation, and the output is simply printed in the logs.
> Here are some values that are of interest when debugging training tasks:
> * number of features
> * number of instances
> * number of partitions
> * number of classes
> * input RDD/DF cache level
> * hyper-parameters



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14571) Log instrumentation in ALS

2016-04-12 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-14571:
--

 Summary: Log instrumentation in ALS
 Key: SPARK-14571
 URL: https://issues.apache.org/jira/browse/SPARK-14571
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Timothy Hunter






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14570) Log instrumentation in Random forests

2016-04-12 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-14570:
--

 Summary: Log instrumentation in Random forests
 Key: SPARK-14570
 URL: https://issues.apache.org/jira/browse/SPARK-14570
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Timothy Hunter






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14569) Log instrumentation in KMeans

2016-04-12 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-14569:
--

 Summary: Log instrumentation in KMeans
 Key: SPARK-14569
 URL: https://issues.apache.org/jira/browse/SPARK-14569
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Timothy Hunter






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14568) Log instrumentation in logistic regression as a first task

2016-04-12 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-14568:
--

 Summary: Log instrumentation in logistic regression as a first task
 Key: SPARK-14568
 URL: https://issues.apache.org/jira/browse/SPARK-14568
 Project: Spark
  Issue Type: Sub-task
  Components: MLlib
Reporter: Timothy Hunter






--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14567) Add instrumentation logs to MLlib training algorithms

2016-04-12 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-14567:
--

 Summary: Add instrumentation logs to MLlib training algorithms
 Key: SPARK-14567
 URL: https://issues.apache.org/jira/browse/SPARK-14567
 Project: Spark
  Issue Type: Umbrella
  Components: MLlib
Reporter: Timothy Hunter


In order to debug performance issues when training mllib algorithms,
it is useful to log some metrics about the training dataset, the training 
parameters, etc.

This ticket is an umbrella to add some simple logging messages to the most 
common MLlib estimators. There should be no performance impact on the current 
implementation, and the output is simply printed in the logs.

Here are some values that are of interest when debugging training tasks:
* number of features
* number of instances
* number of partitions
* number of classes
* input RDD/DF cache level
* hyper-parameters
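
A minimal sketch of the kind of one-line messages this umbrella has in mind 
(plain SLF4J, not tied to any Spark-internal helper; the object and method 
names below are hypothetical), to be called at the beginning of each 
estimator's fit():
{code}
import org.slf4j.LoggerFactory
import org.apache.spark.sql.{Dataset, Row}

// Hypothetical helper: the real hooks would live inside each estimator's fit().
object TrainingInstrumentation {
  private val log = LoggerFactory.getLogger(getClass)

  def logTrainingInfo(name: String, dataset: Dataset[Row], params: Map[String, Any]): Unit = {
    val rdd = dataset.rdd
    log.info(s"$name: numInstances=${dataset.count()}")
    log.info(s"$name: numPartitions=${rdd.getNumPartitions}")
    log.info(s"$name: storageLevel=${rdd.getStorageLevel}")
    log.info(s"$name: params=$params")
  }
}
{code}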



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-14100) Merge StringIndexer and StringIndexerModel

2016-03-23 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-14100:
--

 Summary: Merge StringIndexer and StringIndexerModel
 Key: SPARK-14100
 URL: https://issues.apache.org/jira/browse/SPARK-14100
 Project: Spark
  Issue Type: Sub-task
  Components: ML, MLlib
Reporter: Timothy Hunter


This is an initial task to convert a simple estimator (StringIndexer) to the 
proposed API that merges models and estimators together.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-13986) Make `DeveloperApi`-annotated things public

2016-03-19 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-13986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15200525#comment-15200525
 ] 

Timothy Hunter commented on SPARK-13986:


[~dongjoon] how did you find the conflicting annotation? It would be great to 
automate this as part of the style checks

> Make `DeveloperApi`-annotated things public
> ---
>
> Key: SPARK-13986
> URL: https://issues.apache.org/jira/browse/SPARK-13986
> Project: Spark
>  Issue Type: Improvement
>  Components: MLlib, Spark Core
>Reporter: Dongjoon Hyun
>Priority: Minor
>
> Spark uses `@DeveloperApi` annotation, but sometimes it seems to conflict 
> with its visibility. This issue proposes to fix those conflict. The following 
> is the example.
> {code:title=JobResult.scala|borderStyle=solid}
> @DeveloperApi
> sealed trait JobResult
> @DeveloperApi
> case object JobSucceeded extends JobResult
> @DeveloperApi
> -private[spark] case class JobFailed(exception: Exception) extends JobResult
> +case class JobFailed(exception: Exception) extends JobResult
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

2016-03-10 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15190093#comment-15190093
 ] 

Timothy Hunter commented on SPARK-10931:


Using Python decorators, it is fairly easy to autogenerate at runtime all the 
param wrappers, getters, and setters, and to extract the documentation from the 
Scala side so that the documentation of each parameter is included in the 
docstring of its getter and setter.

There are two issues with that:
 - do we need to specialize the documentation or some of the conversions 
between java and python? In both cases, it is possible to "subclass" and make 
sure that the methods do not get overwritten by some autogenerated stubs
 - the documentation of a class (which relies on the bytecode, not on runtime 
instances) would miss all the params, because they are only generated in 
runtime objects. I believe there are some ways around it, such as inserting 
such methods at import time, but that would require more investigation.

> PySpark ML Models should contain Param values
> -
>
> Key: SPARK-10931
> URL: https://issues.apache.org/jira/browse/SPARK-10931
> Project: Spark
>  Issue Type: Improvement
>  Components: ML, PySpark
>Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not 
> even contain Param values.  This JIRA is for copying the Param values from 
> the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, 
> but should also include proper unit tests.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12566) GLM model family, link function support in SparkR:::glm

2016-03-09 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12566?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15188057#comment-15188057
 ] 

Timothy Hunter commented on SPARK-12566:


[~yuhaoyan] I took a look at the current code; in the implementation of GLM in 
SparkRWrappers, it looks like we only check the solver in the case of the 
gaussian family.

[~mengxr] if users use the 'auto' solver, it means we can swap the 
implementation underneath, right?

If this is the case, here is what I suggest, in pseudo-scala-code:
{code}
(family, solver) match {
  case ("gaussian", "auto") => IRLS // This is a behavioral change
  case ("gaussian", "normal" | "l-bfgs") => LinearRegression
  case ("binomial", "auto") => IRLS // This is a behavioral change
  case ("binomial", "binomial") => LogisticRegression // This is a new option to
  // preserve LogisticRegression if there is a need for that
  case (_, _) => IRLS
}
{code}

> GLM model family, link function support in SparkR:::glm
> ---
>
> Key: SPARK-12566
> URL: https://issues.apache.org/jira/browse/SPARK-12566
> Project: Spark
>  Issue Type: New Feature
>  Components: ML, SparkR
>Reporter: Joseph K. Bradley
>Assignee: yuhao yang
>Priority: Critical
>
> This JIRA is for extending the support of MLlib's Generalized Linear Models 
> (GLMs) to more model families and link functions in SparkR. After 
> SPARK-12811, we should be able to wrap GeneralizedLinearRegression in SparkR 
> with support of popular families and link functions.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-11569) StringIndexer transform fails when column contains nulls

2016-03-08 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-11569?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15185444#comment-15185444
 ] 

Timothy Hunter commented on SPARK-11569:


Also, I suggest looking at Pandas' indexers, which have to deal with the same 
issue.

> StringIndexer transform fails when column contains nulls
> 
>
> Key: SPARK-11569
> URL: https://issues.apache.org/jira/browse/SPARK-11569
> Project: Spark
>  Issue Type: Bug
>  Components: ML, PySpark
>Affects Versions: 1.4.0, 1.5.0, 1.6.0
>Reporter: Maciej Szymkiewicz
>
> Transforming column containing {{null}} values using {{StringIndexer}} 
> results in {{java.lang.NullPointerException}}
> {code}
> from pyspark.ml.feature import StringIndexer
> df = sqlContext.createDataFrame([("a", 1), (None, 2)], ("k", "v"))
> df.printSchema()
> ## root
> ##  |-- k: string (nullable = true)
> ##  |-- v: long (nullable = true)
> indexer = StringIndexer(inputCol="k", outputCol="kIdx")
> indexer.fit(df).transform(df)
> ##  py4j.protocol.Py4JJavaError: An error occurred while calling o75.json.
> ## : java.lang.NullPointerException
> {code}
> Problem disappears when we drop 
> {code}
> df1 = df.na.drop()
> indexer.fit(df1).transform(df1)
> {code}
> or replace {{nulls}}
> {code}
> from pyspark.sql.functions import col, when
> k = col("k")
> df2 = df.withColumn("k", when(k.isNull(), "__NA__").otherwise(k))
> indexer.fit(df2).transform(df2)
> {code}
> and cannot be reproduced using Scala API
> {code}
> import org.apache.spark.ml.feature.StringIndexer
> val df = sc.parallelize(Seq(("a", 1), (null, 2))).toDF("k", "v")
> df.printSchema
> // root
> //  |-- k: string (nullable = true)
> //  |-- v: integer (nullable = false)
> val indexer = new StringIndexer().setInputCol("k").setOutputCol("kIdx")
> indexer.fit(df).transform(df).count
> // 2
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-23 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15070500#comment-15070500
 ] 

Timothy Hunter commented on SPARK-12247:


Sorry for the delay. That sounds great! Let me know when you get a PR out.

On Wed, Dec 23, 2015 at 12:14 AM, Benjamin Fradet (JIRA)


> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-22 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15068908#comment-15068908
 ] 

Timothy Hunter commented on SPARK-12247:


It seems to me that the calculation of false positives is more relevant for the 
movie ratings, and that the RMSE right above in the example is already a good 
example. What do you think?

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066857#comment-15066857
 ] 

Timothy Hunter commented on SPARK-12247:


Thanks for working on it, [~BenFradet]!

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-21 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15066854#comment-15066854
 ] 

Timothy Hunter commented on SPARK-12247:


If we could import all the code that builds the ratings dataframe {{val ratings 
= sc.textFile(params.ratings).map(Rating.parseRating).cache()}}, that would be 
ideal.

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12324:
--

 Summary: The documentation sidebar does not collapse properly
 Key: SPARK-12324
 URL: https://issues.apache.org/jira/browse/SPARK-12324
 Project: Spark
  Issue Type: Bug
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


When the browser's window is reduced horizontally, the sidebar slides under the 
main content and does not collapse:
 - hide the sidebar when the browser's width is not large enough
 - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-12324:
---
Attachment: Screen Shot 2015-12-14 at 12.29.57 PM.png

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12324) The documentation sidebar does not collapse properly

2015-12-14 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15056677#comment-15056677
 ] 

Timothy Hunter commented on SPARK-12324:


I am creating a PR with a fix.

cc [~josephkb]

> The documentation sidebar does not collapse properly
> 
>
> Key: SPARK-12324
> URL: https://issues.apache.org/jira/browse/SPARK-12324
> Project: Spark
>  Issue Type: Bug
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
> Attachments: Screen Shot 2015-12-14 at 12.29.57 PM.png
>
>
> When the browser's window is reduced horizontally, the sidebar slides under 
> the main content and does not collapse:
>  - hide the sidebar when the browser's width is not large enough
>  - add a button to show and hide the sidebar



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-09 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049521#comment-15049521
 ] 

Timothy Hunter commented on SPARK-12247:


[~srowen] would you be interested in this task?

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-09 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12247:
--

 Summary: Documentation for spark.ml's ALS and collaborative 
filtering in general
 Key: SPARK-12247
 URL: https://issues.apache.org/jira/browse/SPARK-12247
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


We need to add a section in the documentation about collaborative filtering in 
the dataframe API:
 - copy explanations about collaborative filtering and ALS from spark.mllib
 - provide an example with spark.ml's ALS
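
As a rough sketch of what the spark.ml ALS example could look like (Scala, 
assuming a Spark 2.x spark-shell session; the toy ratings are made up), 
including an RMSE evaluation wired up with {{RegressionEvaluator}}:
{code}
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator

// Hypothetical (user, item, rating) triplets.
val ratings = Seq((0, 0, 4.0), (0, 1, 2.0), (1, 1, 3.0), (1, 2, 5.0))
  .toDF("user", "item", "rating")

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.1)
  .setUserCol("user")
  .setItemCol("item")
  .setRatingCol("rating")
val model = als.fit(ratings)

// RMSE on the training data, just to show how the evaluator is wired up.
val rmse = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("rating")
  .setPredictionCol("prediction")
  .evaluate(model.transform(ratings))
{code}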



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans

2015-12-09 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12246:
--

 Summary: Add documentation for spark.ml.clustering.kmeans
 Key: SPARK-12246
 URL: https://issues.apache.org/jira/browse/SPARK-12246
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


We should add some documentation for the KMeans implementation in spark.ml.
 - small description about the concept (maybe copy and adapt the documentation 
from spark.mllib)
 - add an example for java, scala, python
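
As a rough sketch of what the Scala variant of such an example could look like 
(assuming a spark-shell session; the toy points are made up):
{code}
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

// Hypothetical two-dimensional points.
val points = Seq((0.0, 0.0), (0.1, 0.1), (9.0, 9.0), (9.1, 9.1)).toDF("x", "y")
val assembled = new VectorAssembler()
  .setInputCols(Array("x", "y"))
  .setOutputCol("features")
  .transform(points)

// Fit a 2-cluster model and print the learned centers.
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")
val model = kmeans.fit(assembled)
model.clusterCenters.foreach(println)
{code}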



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml

2015-12-09 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-12212:
---
Target Version/s:   (was: 1.6.0)

> Clarify the distinction between spark.mllib and spark.ml
> 
>
> Key: SPARK-12212
> URL: https://issues.apache.org/jira/browse/SPARK-12212
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>Assignee: Timothy Hunter
>
> There is confusion in the documentation of MLlib as to what exactly MLlib is: 
> is it the package, or is it the whole effort of ML on Spark, and how does it 
> differ from spark.ml? Is MLlib going to be deprecated?
> We should do the following:
>  - refer to the mllib the code package as spark.mllib across all the 
> documentation. Alternative name is "RDD API of MLlib".
>  - refer to MLlib the project that encompasses spark.ml + spark.mllib as 
> MLlib (it should be the default)
> - replace references to "Pipeline API" with spark.ml or the "Dataframe API of 
> MLlib". I would deemphasize that this API is for building pipelines. Some 
> users are led to believe from the documentation that spark.ml can only be 
> used for building pipelines and that using a single algorithm can only be 
> done with spark.mllib.
> Most relevant places:
>  - {{mllib-guide.md}}
>  - {{mllib-linear-methods.md}}
>  - {{mllib-dimensionality-reduction.md}}
>  - {{mllib-pmml-model-export.md}}
>  - {{mllib-statistics.md}}
> In these files, most references to {{MLlib}} are meant to refer to 
> {{spark.mllib}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml

2015-12-09 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12210?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-12210:
---
Target Version/s:   (was: 1.6.0)

> Small example that shows how to integrate spark.mllib with spark.ml
> ---
>
> Key: SPARK-12210
> URL: https://issues.apache.org/jira/browse/SPARK-12210
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> Since we are missing a number of algorithms in {{spark.ml}} such as 
> clustering or LDA, we should have a small example that shows the recommended 
> way to go back and forth between {{spark.ml}} and {{spark.mllib}}. It is 
> mostly putting together existing pieces, but I feel it is important for new 
> users to see how the interaction plays out in practice.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans

2015-12-09 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter closed SPARK-12246.
--
Resolution: Duplicate

> Add documentation for spark.ml.clustering.kmeans
> 
>
> Key: SPARK-12246
> URL: https://issues.apache.org/jira/browse/SPARK-12246
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We should add some documentation for the KMeans implementation in spark.ml.
>  - small description about the concept (maybe copy and adapt the 
> documentation from spark.mllib)
>  - add an example for java, scala, python



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans

2015-12-09 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15049779#comment-15049779
 ] 

Timothy Hunter commented on SPARK-12246:


It does, thanks [~yuhaoyan]

> Add documentation for spark.ml.clustering.kmeans
> 
>
> Key: SPARK-12246
> URL: https://issues.apache.org/jira/browse/SPARK-12246
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We should add some documentation for the KMeans implementation in spark.ml.
>  - small description about the concept (maybe copy and adapt the 
> documentation from spark.mllib)
>  - add an example for java, scala, python



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12247) Documentation for spark.ml's ALS and collaborative filtering in general

2015-12-09 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-12247:
---
Target Version/s:   (was: 1.6.0)

> Documentation for spark.ml's ALS and collaborative filtering in general
> ---
>
> Key: SPARK-12247
> URL: https://issues.apache.org/jira/browse/SPARK-12247
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We need to add a section in the documentation about collaborative filtering 
> in the dataframe API:
>  - copy explanations about collaborative filtering and ALS from spark.mllib
>  - provide an example with spark.ml's ALS



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-12246) Add documentation for spark.ml.clustering.kmeans

2015-12-09 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-12246?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-12246:
---
Target Version/s:   (was: 1.6.0)

> Add documentation for spark.ml.clustering.kmeans
> 
>
> Key: SPARK-12246
> URL: https://issues.apache.org/jira/browse/SPARK-12246
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, MLlib
>Affects Versions: 1.5.2
>Reporter: Timothy Hunter
>
> We should add some documentation for the KMeans implementation in spark.ml.
>  - small description about the concept (maybe copy and adapt the 
> documentation from spark.mllib)
>  - add an example for java, scala, python



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Updated] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-12-09 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter updated SPARK-8517:
--
Target Version/s:   (was: 1.6.0)

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12208) Abstract the examples into a common place

2015-12-08 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12208:
--

 Summary: Abstract the examples into a common place
 Key: SPARK-12208
 URL: https://issues.apache.org/jira/browse/SPARK-12208
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


When we write examples in the code, we put the generation of the data along 
with the example itself. We typically have either:

{code}
val data = 
sqlContext.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")
...
{code}

or some more esoteric stuff such as:
{code}
val data = Array(
  (0, 0.1),
  (1, 0.8),
  (2, 0.2)
)
val dataFrame: DataFrame = sqlContext.createDataFrame(data).toDF("label", 
"feature")
{code}

{code}
val data = Array(
  Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
  Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
  Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
)
val df = sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}

I suggest we follow the example of scikit-learn and standardize the generation 
of example data inside a few methods, for example in 
{{org.apache.spark.ml.examples.ExampleData}}. One reason is that just reading 
the code is sometimes not enough to figure out what the data is supposed to be: 
for example, when using {{libsvm_data}}, it is unclear what the dataframe 
columns are. This is something we should document somewhere.
It would also help to explain in one place the Scala idiosyncrasies such as 
using {{Tuple1.apply}}.
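
To make the proposal concrete, here is a minimal sketch of what such a helper 
could look like. The object name is the hypothetical one suggested above, and 
the snippet assumes the 1.x-era {{SQLContext}} API used in the examples; it is 
an illustration, not a finished design.

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.sql.{DataFrame, SQLContext}

object ExampleData {
  // Small labeled dataset with "label" and "feature" columns.
  def labeledScalars(sqlContext: SQLContext): DataFrame =
    sqlContext.createDataFrame(Seq((0, 0.1), (1, 0.8), (2, 0.2)))
      .toDF("label", "feature")

  // Dataframe with a single vector column named "features",
  // mixing sparse and dense vectors.
  def featureVectors(sqlContext: SQLContext): DataFrame = {
    val vectors = Seq(
      Vectors.sparse(5, Seq((1, 1.0), (3, 7.0))),
      Vectors.dense(2.0, 0.0, 3.0, 4.0, 5.0),
      Vectors.dense(4.0, 0.0, 0.0, 6.0, 7.0)
    )
    sqlContext.createDataFrame(vectors.map(Tuple1.apply)).toDF("features")
  }
}
{code}

An example would then start with a single call such as 
{{ExampleData.featureVectors(sqlContext)}} and could point to the helper for a 
description of the columns.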



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12210) Small example that shows how to integrate spark.mllib with spark.ml

2015-12-08 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12210:
--

 Summary: Small example that shows how to integrate spark.mllib 
with spark.ml
 Key: SPARK-12210
 URL: https://issues.apache.org/jira/browse/SPARK-12210
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation, MLlib
Affects Versions: 1.5.2
Reporter: Timothy Hunter


Since we are missing a number of algorithms in {{spark.ml}} such as clustering 
or LDA, we should have a small example that shows the recommended way to go 
back and forth between {{spark.ml}} and {{spark.mllib}}. It is mostly putting 
together existing pieces, but I feel it is important for new users to see how 
the interaction plays out in practice.
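
For illustration only (this sketch is not from the ticket), the example could 
take roughly the following shape, assuming the 1.x-era APIs: run spark.mllib's 
KMeans on the vector column of a dataframe, then bring the cluster assignments 
back into a dataframe.

{code}
import org.apache.spark.mllib.clustering.KMeans
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.sql.Row

val df = sqlContext.createDataFrame(Seq(
  Vectors.dense(0.0, 0.1), Vectors.dense(9.0, 9.2), Vectors.dense(0.2, 0.0)
).map(Tuple1.apply)).toDF("features")

// spark.ml -> spark.mllib: extract an RDD[Vector] from the dataframe column.
val vectors = df.select("features").rdd.map { case Row(v: Vector) => v }
val model = new KMeans().setK(2).run(vectors)

// spark.mllib -> spark.ml: attach the cluster assignments back as a column.
val clustered = sqlContext.createDataFrame(
  vectors.map(v => (v, model.predict(v)))
).toDF("features", "cluster")
{code}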



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-12212) Clarify the distinction between spark.mllib and spark.ml

2015-12-08 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-12212:
--

 Summary: Clarify the distinction between spark.mllib and spark.ml
 Key: SPARK-12212
 URL: https://issues.apache.org/jira/browse/SPARK-12212
 Project: Spark
  Issue Type: Sub-task
  Components: Documentation
Affects Versions: 1.5.2
Reporter: Timothy Hunter


There is confusion in the documentation of MLlib as to what exactly MLlib is: 
is it the package, or the whole effort of ML on Spark, and how does it differ 
from spark.ml? Is MLlib going to be deprecated?

We should do the following:
 - refer to the code package as spark.mllib across all the documentation. An 
alternative name is the "RDD API of MLlib".
 - refer to the project that encompasses spark.ml + spark.mllib as MLlib (it 
should be the default)
 - replace references to the "Pipeline API" with spark.ml or the "DataFrame API 
of MLlib". I would de-emphasize that this API is for building pipelines. Some 
users are led to believe from the documentation that spark.ml can only be used 
for building pipelines and that using a single algorithm can only be done with 
spark.mllib.

Most relevant places:
 - {{mllib-guide.md}}
 - {{mllib-linear-methods.md}}
 - {{mllib-dimensionality-reduction.md}}
 - {{mllib-pmml-model-export.md}}
 - {{mllib-statistics.md}}
In these files, most references to {{MLlib}} are meant to refer to 
{{spark.mllib}} instead.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Closed] (SPARK-11601) ML 1.6 QA: API: Binary incompatible changes

2015-12-01 Thread Timothy Hunter (JIRA)

 [ 
https://issues.apache.org/jira/browse/SPARK-11601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Timothy Hunter closed SPARK-11601.
--
Resolution: Done

> ML 1.6 QA: API: Binary incompatible changes
> ---
>
> Key: SPARK-11601
> URL: https://issues.apache.org/jira/browse/SPARK-11601
> Project: Spark
>  Issue Type: Sub-task
>  Components: Documentation, ML, MLlib
>Reporter: Joseph K. Bradley
>Assignee: Timothy Hunter
>
> Generate a list of binary incompatible changes using MiMa and create new 
> JIRAs for issues found. Filter out false positives as needed.
> If you want to take this task, ping [~mengxr] for advice since he did it for 
> 1.5.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-12000) `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation

2015-11-30 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-12000?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15032058#comment-15032058
 ] 

Timothy Hunter commented on SPARK-12000:


Yes, I have this branch with some fixes, but I would need a careful review from 
someone more familiar with SBT to make sure it does not break something else:
https://github.com/apache/spark/compare/master...thunterdb:1511-java8?expand=1

> `sbt publishLocal` hits a Scala compiler bug caused by `Since` annotation
> -
>
> Key: SPARK-12000
> URL: https://issues.apache.org/jira/browse/SPARK-12000
> Project: Spark
>  Issue Type: Bug
>  Components: Build, Documentation, MLlib
>Affects Versions: 1.6.0
>Reporter: Xiangrui Meng
>Assignee: Xiangrui Meng
>
> Reported by [~josephkb]. Not sure what is the root cause, but this is the 
> error message when I ran "sbt publishLocal":
> {code}
> [error] (launcher/compile:doc) javadoc returned nonzero exit code
> [error] (mllib/compile:doc) scala.reflect.internal.FatalError:
> [error]  while compiling: 
> /Users/meng/src/spark/mllib/src/main/scala/org/apache/spark/mllib/util/modelSaveLoad.scala
> [error] during phase: global=terminal, atPhase=parser
> [error]  library version: version 2.10.5
> [error] compiler version: version 2.10.5
> [error]   reconstructed args: -Yno-self-type-checks -groups -classpath 
> 

[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-25 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027375#comment-15027375
 ] 

Timothy Hunter commented on SPARK-8517:
---

Here are a few high-level comments:
 - branding confusion between spark.mllib, spark.ml and the union of the two: 
it is currently hard to see the difference when you land on the first page
 - the focus of spark.ml is on pipelines; it should be on dataframes. That 
clearly separates it from spark.mllib, which is built on RDDs
 - make pipelines a sub-concept of spark.ml (instead of saying that spark.ml 
is the pipeline API). Say that you can build pipelines with spark.ml
 - make sure that all algorithms in spark.ml have the same level of usability 
as in spark.mllib. You should not be forced to build a pipeline to use a 
single algorithm
 - reorganize the spark.ml menu around the user's goal rather than the 
content. Users want to solve problems (clustering, regression, 
classification), while we organize by theoretical concepts (decision trees, 
ensembles, linear methods). We should do as spark.mllib and scikit-learn do:
{code}
- MLlib: machine learning on RDDs
...
- SparkML: machine learning with (Spark) Dataframes
  - General concepts and overview
  - Building and transforming features
  - Classification and Regression
  - Clustering
  - Collaborative filtering
  - Chaining transforms with pipelines
  - Advanced: Evaluation, import/export, developer APIs
  - Examples
{code}
Some pieces, such as dimensionality reduction, are missing from this outline. 
Also, the scikit-learn guide has a more academic focus, splitting roughly 
along supervised vs. unsupervised.
I am going to drill down further into the individual sections with more 
suggestions.

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-25 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027491#comment-15027491
 ] 

Timothy Hunter commented on SPARK-8517:
---

- We need a whole page about best practices with dataframes containing 
numerical data (vector UDTs). That was a big pain point for me. We have a 
whole page on this for spark.mllib and we should have something similar for 
dataframes.
- In `ml-guide`, I would split the high-level concepts (`fit`, `transform`, 
etc.) from chaining them together with a pipeline. From reading the current 
document, spark.ml seems harder to use than spark.mllib because it introduces 
complicated examples right at the start (model selection with 
cross-validation).
- Small nit: the links under each example should point to the github file; 
right now they are not very useful. Do we have a ticket for that?


Building examples:
The current way to build a dead-simple dataframe is shown below. It is rather 
noisy compared to python. I would recommend we move all the example code 
generation to a library, and thoroughly explain there what the dataframes 
contain (or make it part of the graph). For example:
{code}
val data = Array(-0.5, -0.3, 0.0, 0.2)
val dataFrame = 
sqlContext.createDataFrame(data.map(Tuple1.apply)).toDF("features")
{code}
This requires some understanding of tuple packing, the synthetic apply method, 
etc., and is definitely more complicated than the python or RDD equivalent. I 
do not have a good solution right now, but I find it a bit unsettling that 
this is the first line I read in an example.

Other examples are easier to read, I find:
{code}
val training = sqlContext.createDataFrame(Seq((1.0, Vectors.dense(0.0, 1.2, 
-0.5)))).toDF("label", "features")
{code}
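
For comparison, a slightly lighter form is possible with the {{toDF}} 
implicits. This is only an illustrative sketch added here, not a proposal from 
the comment above, and it assumes a {{SQLContext}} named {{sqlContext}} is in 
scope.

{code}
import sqlContext.implicits._
import org.apache.spark.mllib.linalg.Vectors

// Single-column dataframe of doubles; Tuple1 is still needed so each row is a Product.
val dataFrame = Seq(-0.5, -0.3, 0.0, 0.2).map(Tuple1.apply).toDF("features")

// (label, vector) pairs as a two-column dataframe.
val training = Seq((1.0, Vectors.dense(0.0, 1.2, -0.5))).toDF("label", "features")
{code}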

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-8517) Improve the organization and style of MLlib's user guide

2015-11-25 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-8517?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15027512#comment-15027512
 ] 

Timothy Hunter commented on SPARK-8517:
---

 - A couple of pages, such as {{ml-ensembles}} and {{ml-linear-methods}}, refer 
to MLlib "for details", and it is unclear what the differences are. I suggest 
we either copy the content or clearly state that we only mean to refer to the 
mathematical formulation: the parameters may have the same names (number of 
iterations, regularization parameter), but the API is different. By the way, I 
do not see SVM in the spark.ml documentation yet.
 - The perceptron classifier does not need to be a top-level section of 
spark.ml. I would move it to a subsection of classification.

> Improve the organization and style of MLlib's user guide
> 
>
> Key: SPARK-8517
> URL: https://issues.apache.org/jira/browse/SPARK-8517
> Project: Spark
>  Issue Type: Improvement
>  Components: Documentation, ML, MLlib
>Reporter: Xiangrui Meng
>Assignee: Timothy Hunter
>
> The current MLlib's user guide (and spark.ml's), especially the main page, 
> doesn't have a nice style. We could update it and re-organize the content to 
> make it easier to navigate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Created] (SPARK-2762) SparkILoop leaks memory in multi-repl configurations

2014-07-30 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-2762:
-

 Summary: SparkILoop leaks memory in multi-repl configurations
 Key: SPARK-2762
 URL: https://issues.apache.org/jira/browse/SPARK-2762
 Project: Spark
  Issue Type: Bug
Reporter: Timothy Hunter
Priority: Minor


When subclassing SparkILoop and instantiating multiple objects, the SparkILoop 
instances do not get garbage collected.



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Commented] (SPARK-2452) Multi-statement input to spark repl does not work

2014-07-22 Thread Timothy Hunter (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-2452?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14070841#comment-14070841
 ] 

Timothy Hunter commented on SPARK-2452:
---

Excellent, thanks Patrick.



 Multi-statement input to spark repl does not work
 -

 Key: SPARK-2452
 URL: https://issues.apache.org/jira/browse/SPARK-2452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.1
Reporter: Timothy Hunter
Assignee: Prashant Sharma
Priority: Blocker
 Fix For: 1.1.0


 Here is an example:
 {code}
 scala> val x = 4 ; def f() = x
 x: Int = 4
 f: ()Int
 scala> f()
 <console>:11: error: $VAL5 is already defined as value $VAL5
 val $VAL5 = INSTANCE;
 {code}



--
This message was sent by Atlassian JIRA
(v6.2#6252)


[jira] [Created] (SPARK-2452) Multi-statement input to spark repl does not work

2014-07-11 Thread Timothy Hunter (JIRA)
Timothy Hunter created SPARK-2452:
-

 Summary: Multi-statement input to spark repl does not work
 Key: SPARK-2452
 URL: https://issues.apache.org/jira/browse/SPARK-2452
 Project: Spark
  Issue Type: Bug
  Components: Spark Core
Affects Versions: 1.0.0
Reporter: Timothy Hunter


Here is an example:

scala> val x = 4 ; def f() = x
x: Int = 4
f: ()Int

scala> f()
<console>:11: error: $VAL5 is already defined as value $VAL5
val $VAL5 = INSTANCE;




--
This message was sent by Atlassian JIRA
(v6.2#6252)