[jira] [Comment Edited] (SPARK-21866) SPIP: Image support in Spark

2017-09-05 Thread Yanbo Liang (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16153647#comment-16153647
 ] 

Yanbo Liang edited comment on SPARK-21866 at 9/5/17 1:39 PM:
-

Generally, I would support this effort. A common image storage format and data 
source is a good thing for Spark to provide, as it lets users try different 
deep neural network models conveniently. AFAIK, lots of users would be 
interested in applying existing deep neural network models to their own 
datasets, that is to say, model inference, which Spark can run in a 
distributed fashion. Thanks for this proposal.
[~timhunter] I have two questions regarding this SPIP:
1. As you describe above, {{org.apache.spark.image}} is the proposed package, 
under the MLlib project.
If this package will only contain the common image storage format and data 
source support, should we organize it as {{org.apache.spark.ml.image}} or 
{{org.apache.spark.ml.source.image}}? We already have {{libsvm}} support 
under {{org.apache.spark.ml.source}}.
2. From the API perspective, could we follow the existing Spark SQL data 
sources as closely as possible? Even if we don't use a UDT, a familiar API 
would make it easier for users to adopt this feature. For example, the 
following API would be friendlier to Spark users. Is there any obstacle to 
implementing it like this?
{code}
spark.read.image(path, recursive, numPartitions, dropImageFailures, sampleRatio)
spark.write.image(path, ...)
{code}
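For comparison, here is a minimal sketch of how the same options could be 
expressed through Spark's generic data source API. Note that the {{image}} 
format name and all of the option keys below are hypothetical placeholders; 
only {{format}}/{{option}}/{{load}} are existing {{DataFrameReader}} API.
{code}
// Hypothetical sketch, assuming a SparkSession named `spark` (as in spark-shell).
// The "image" format and these option keys do not exist in Spark today.
val images = spark.read
  .format("image")                       // short name of the proposed source
  .option("recursive", "true")           // descend into subdirectories
  .option("numPartitions", "16")         // desired number of partitions
  .option("dropImageFailures", "true")   // skip files that fail to decode
  .option("sampleRatio", "0.1")          // load a 10% sample of the files
  .load("/data/images")
{code}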
If I have misunderstood anything, please feel free to correct me. Thanks.



> SPIP: Image support in Spark
> 
>
> Key: SPARK-21866
> URL: https://issues.apache.org/jira/browse/SPARK-21866
> Project: Spark
>  Issue Type: Improvement
>  Components: ML
>Affects Versions: 2.2.0
>Reporter: Timothy Hunter
>  Labels: SPIP
> Attachments: SPIP - Image support for Apache Spark V1.1.pdf
>
>
> h2. Background and motivation
> As Apache Spark is being used more and more in the industry, some new use 
> cases are emerging for different data formats beyond the traditional SQL 
> types or the numerical types (vectors and matrices). Deep Learning 
> applications commonly deal with image processing. A number of projects add 
> some Deep Learning capabilities to Spark (see list below), but they struggle 
> to communicate with each other or with MLlib pipelines because there is no 
> standard way to represent an image in Spark DataFrames. We propose to 
> federate efforts for representing images in Spark by defining a 
> representation that caters to the most common needs of users and library 
> developers.
> This SPIP proposes a specification to represent images in Spark DataFrames 
> and Datasets (based on existing industrial standards), and an interface for 
> loading sources of images. It is not meant to be a full-fledged image 
> processing library, but rather the core description that other libraries and 
> users can rely on. Several packages already offer various processing 
> facilities for transforming images or doing more complex operations, and each 
> has various design tradeoffs that make them better as standalone solutions.
> This project is a joint collaboration between Microsoft and Databricks, which 
> have been testing this design in two open source packages: MMLSpark and Deep 
> Learning Pipelines.
> The proposed image format is an in-memory, decompressed representation that 
> targets low-level applications. It is significantly more liberal in memory 
> usage than compressed image representations such as JPEG, PNG, etc., but it 
> allows easy communication with popular image processing libraries.
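
As a concrete illustration of the row specification the description above 
refers to, here is a minimal sketch of what an image schema could look like 
in Spark SQL. The field names and types are illustrative assumptions drawn 
from common image representations, not the SPIP's final specification:
{code}
import org.apache.spark.sql.types._

// Illustrative assumption: per-image metadata plus the raw decompressed pixel
// bytes, so downstream libraries can reinterpret the buffer directly.
val imageSchema = StructType(Seq(
  StructField("origin", StringType, nullable = true),      // source URI of the image
  StructField("height", IntegerType, nullable = false),    // rows of pixels
  StructField("width", IntegerType, nullable = false),     // columns of pixels
  StructField("nChannels", IntegerType, nullable = false), // e.g. 1, 3, or 4
  StructField("mode", IntegerType, nullable = false),      // pixel type, OpenCV-style
  StructField("data", BinaryType, nullable = false)        // decompressed image bytes
))
{code}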
