Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-21 Thread Denny Lee
+1

On Thu, Sep 21, 2017 at 11:15 Sean Owen  wrote:

> Am I right that this doesn't mean other packages would use this
> representation, but that they could?
>
> The representation looked fine to me w.r.t. what DL frameworks need.
>
> My previous comment was that this is actually quite lightweight. It's kind
> of like how I/O support is provided for CSV and JSON, so makes enough sense
> to add to Spark. It doesn't really preclude other solutions.
>
> For those reasons I think it's fine. +1
>
> On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter 
> wrote:
>
>> Hello community,
>>
>> I would like to call for a vote on SPARK-21866. It is a short proposal
>> that has important applications for image processing and deep learning.
>> Joseph Bradley has offered to be the shepherd.
>>
>> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
>> PDF version: https://issues.apache.org/jira/secure/attachment/12884792/
>> SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>>
>> Background and motivation
>>
>> As Apache Spark is being used more and more in the industry, some new use
>> cases are emerging for different data formats beyond the traditional SQL
>> types or the numerical types (vectors and matrices). Deep Learning
>> applications commonly deal with image processing. A number of projects add
>> some Deep Learning capabilities to Spark (see list below), but they
>> struggle to communicate with each other or with MLlib pipelines because
>> there is no standard way to represent an image in Spark DataFrames. We
>> propose to federate efforts for representing images in Spark by defining a
>> representation that caters to the most common needs of users and library
>> developers.
>>
>> This SPIP proposes a specification to represent images in Spark
>> DataFrames and Datasets (based on existing industrial standards), and an
>> interface for loading sources of images. It is not meant to be a
>> full-fledged image processing library, but rather the core description that
>> other libraries and users can rely on. Several packages already offer
>> various processing facilities for transforming images or doing more complex
>> operations, and each has various design tradeoffs that make them better as
>> standalone solutions.
>>
>> This project is a joint collaboration between Microsoft and Databricks,
>> which have been testing this design in two open source packages: MMLSpark
>> and Deep Learning Pipelines.
>>
>> The proposed image format is an in-memory, decompressed representation
>> that targets low-level applications. It is significantly more liberal in
>> memory usage than compressed image representations such as JPEG, PNG, etc.,
>> but it allows easy communication with popular image processing libraries
>> and has no decoding overhead.
> Target users and personas:
>>
>> Data scientists, data engineers, library developers.
>> The following libraries define primitives for loading and representing
>> images, and will gain from a common interchange format (in alphabetical
>> order):
>>
>>- BigDL
>>- DeepLearning4J
>>- Deep Learning Pipelines
>>- MMLSpark
>>- TensorFlow (Spark connector)
>>- TensorFlowOnSpark
>>- TensorFrames
>>- Thunder
>>
>> Goals:
>>
>>- Simple representation of images in Spark DataFrames, based on
>>pre-existing industrial standards (OpenCV)
>>- This format should eventually allow the development of
>>high-performance integration points with image processing libraries such 
>> as
>>libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>>- The reader should be able to read popular formats of images from
>>distributed sources.
>>
>> Non-Goals:
>>
>> Images are a versatile medium and encompass a very wide range of formats
>> and representations. This SPIP explicitly aims at the most common use
>> case in the industry currently: multi-channel matrices of binary, int32,
>> int64, float or double data that can fit comfortably in the heap of the JVM:
>>
>>- the total size of an image should be restricted to less than 2GB
>>(roughly)
>>- the meaning of color channels is application-specific and is not
>>mandated by the standard (in line with the OpenCV standard)
>>- specialized formats used in meteorology, the medical field, etc.
>>are not supported
>>- this format is specialized to images and does not attempt to solve
>>the more general problem of representing n-dimensional tensors in Spark
>>
>> Proposed API changes
>>
>> We propose to add a new package in the package structure, under the MLlib
>> project:
>> org.apache.spark.image
>> Data format
>>
>> We propose to add the following structure:
>>
>> imageSchema = StructType([
>>
>>- StructField("mode", StringType(), False),
>>   - The exact representation of the data.
>>   - The values are described in the following OpenCV convention.
>>   Basically, the type has both "depth" and "number 

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Ryan Blue
> input data requirement

Clustering and sorting within partitions are a good start. We can always
add more later when they are needed.

The primary use case I'm thinking of for this is partitioning and
bucketing. If I'm implementing a partitioned table format, I need to tell
Spark to cluster by my partition columns. Should there also be a way to
pass those columns separately, since they may not be stored the same way
partitions are in the current format?
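
To make the idea concrete, here is a rough Scala sketch of the kind of mix-in
being discussed for the input data requirement; the trait and method names are
illustrative assumptions, not the prototype's actual API:

    // Hypothetical mix-in a writer could implement so that Spark clusters and
    // sorts the input before handing rows to the data source. Names are
    // illustrative only.
    trait SupportsInputDataRequirement {
      // Columns Spark should cluster the input by, e.g. a partitioned table's
      // partition columns such as Array("date", "country").
      def requiredClustering(): Array[String]

      // Sort order to apply within each cluster/partition before writing.
      def requiredOrdering(): Array[String]
    }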

On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan  wrote:

> Hi all,
>
> I want to have some discussion about the Data Source V2 write path before
> starting a vote.
>
> The Data Source V1 write path asks implementations to write a DataFrame
> directly, which is painful:
> 1. Exposing an upper-level API like DataFrame to the Data Source API is not
> good for maintenance.
> 2. Data sources may need to preprocess the input data before writing, like
> clustering/sorting the input by some columns. It's better to do the
> preprocessing in Spark instead of in the data source.
> 3. Data sources need to take care of transactions themselves, which is
> hard. And different data sources may come up with very similar approaches
> to transactions, which leads to a lot of duplicated code.
>
>
> To solve these pain points, I'm proposing a data source writing framework
> which is very similar to the reading framework, i.e., WriteSupport ->
> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my
> prototype to see what it looks like: https://github.com/
> apache/spark/pull/19269
>
> There are some other details that need further discussion:
> 1. *partitioning/bucketing*
> Currently only the built-in file-based data sources support them, but
> there is nothing stopping us from exposing them to all data sources. One
> question is, shall we make them mix-in interfaces for the data source v2
> reader/writer, or just encode them into data source options (a
> string-to-string map)? Ideally they are more like options: Spark just
> passes this user-given information to the data sources and doesn't do
> anything with it.
>
> 2. *input data requirement*
> Data sources should be able to ask Spark to preprocess the input data, and
> this can be a mix-in interface for DataSourceV2Writer. I think we need to
> add a clustering request and a sorting-within-partitions request; anything more?
>
> 3. *transaction*
> I think we can just follow `FileCommitProtocol`, which is the internal
> framework Spark uses to guarantee transactions for built-in file-based data
> sources. Generally speaking, we need task-level and job-level commit/abort.
> Again you can see more details in my prototype about it:
> https://github.com/apache/spark/pull/19269
>
> 4. *data source table*
> This is the trickiest one. In Spark you can create a table which points to
> a data source, so you can read/write this data source easily by referencing
> the table name. Ideally a data source table is just a pointer to a data
> source with a list of predefined options, to save users from typing these
> options again and again for each query.
> If that's all, then everything is good and we don't need to add more
> interfaces to Data Source V2. However, data source tables provide special
> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data
> sources to have some extra abilities.
> Currently these special operators only work for built-in file-based data
> sources, and I don't think we will extend them in the near future, so I
> propose to mark them as out of scope.
>
>
> Any comments are welcome!
> Thanks,
> Wenchen
>
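
For readers following along, here is a rough Scala sketch of the WriteSupport ->
DataSourceV2Writer -> WriteTask -> DataWriter chain and the task/job-level
commit/abort described in Wenchen's message above. The method signatures are
assumptions made for illustration; the prototype PR is the authoritative shape.

    // Assumed shapes, for illustration only; see the prototype PR for the
    // actual interfaces.
    import org.apache.spark.sql.Row

    trait WriteSupport {
      // Entry point: create a writer for one write job, given user options.
      def createWriter(options: Map[String, String]): DataSourceV2Writer
    }

    trait DataSourceV2Writer {
      // Produces the task that is serialized and sent to executors.
      def createWriteTask(): WriteTask
      // Job-level commit/abort, driven by Spark on the driver.
      def commit(taskCommitMessages: Seq[AnyRef]): Unit
      def abort(): Unit
    }

    trait WriteTask extends Serializable {
      // One DataWriter per task attempt.
      def createDataWriter(partitionId: Int, attemptNumber: Int): DataWriter
    }

    trait DataWriter {
      def write(row: Row): Unit
      // Task-level commit returns a message collected by the driver.
      def commit(): AnyRef
      def abort(): Unit
    }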



-- 
Ryan Blue
Software Engineer
Netflix


Re: [discuss] Data Source V2 write path

2017-09-21 Thread Reynold Xin
Ah yes, I agree. I was just saying it should be options (rather than
specific constructs). Having them at creation time makes a lot of sense.
One tricky thing, though, is what happens if they need to change, but we
can probably just special-case that.
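
As a sketch of what the options-based approach could look like from the user's
side; the option key names below are made up for illustration and are not a
defined contract:

    // Hypothetical: partitioning/bucketing expressed as plain string options
    // handed to the source when the table is created/configured. Spark would
    // just forward this string-to-string map without interpreting it.
    val writeOptions: Map[String, String] = Map(
      "path"             -> "/warehouse/events",
      "partitionColumns" -> "date,country",
      "bucketColumns"    -> "user_id",
      "numBuckets"       -> "64"
    )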

On Thu, Sep 21, 2017 at 6:28 PM Ryan Blue  wrote:

> I’d just pass them [partitioning/bucketing] as options, until there are
> clear (and strong) use cases to do them otherwise.
>
> I don’t think it makes sense to pass partitioning and bucketing
> information *into* this API. The writer should already know the table
> structure and should pass relevant information back out to Spark so it can
> sort and group data for storage.
>
> I think the idea of passing the table structure into the writer comes from
> the current implementation, where the table may not exist before a data
> frame is written. But that isn’t something that should be carried forward.
> I think the writer should be responsible for writing into an
> already-configured table. That’s the normal case we should design for.
> Creating a table at the same time (CTAS) is a convenience, but should be
> implemented by creating an empty table and then running the same writer
> that would have been used for an insert into an existing table.
>
> Otherwise, there’s confusion about how to handle the options. What should
> the writer do when partitioning passed in doesn’t match the table’s
> partitioning? We already have this situation in the DataFrameWriter API,
> where calling partitionBy and then insertInto throws an exception. I’d
> like to keep that case out of this API by setting the expectation that
> tables this writes to already exist.
>
> rb
> ​
>
> On Wed, Sep 20, 2017 at 9:52 AM, Reynold Xin  wrote:
>
>>
>>
>> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan  wrote:
>>
>>> Hi all,
>>>
>>> I want to have some discussion about the Data Source V2 write path before
>>> starting a vote.
>>>
>>> The Data Source V1 write path asks implementations to write a DataFrame
>>> directly, which is painful:
>>> 1. Exposing an upper-level API like DataFrame to the Data Source API is
>>> not good for maintenance.
>>> 2. Data sources may need to preprocess the input data before writing,
>>> like clustering/sorting the input by some columns. It's better to do the
>>> preprocessing in Spark instead of in the data source.
>>> 3. Data sources need to take care of transactions themselves, which is
>>> hard. And different data sources may come up with very similar approaches
>>> to transactions, which leads to a lot of duplicated code.
>>>
>>>
>>> To solve these pain points, I'm proposing a data source writing
>>> framework which is very similar to the reading framework, i.e.,
>>> WriteSupport -> DataSourceV2Writer -> WriteTask -> DataWriter. You can take
>>> a look at my prototype to see what it looks like:
>>> https://github.com/apache/spark/pull/19269
>>>
>>> There are some other details that need further discussion:
>>> 1. *partitioning/bucketing*
>>> Currently only the built-in file-based data sources support them, but
>>> there is nothing stopping us from exposing them to all data sources. One
>>> question is, shall we make them mix-in interfaces for the data source v2
>>> reader/writer, or just encode them into data source options (a
>>> string-to-string map)? Ideally they are more like options: Spark just
>>> passes this user-given information to the data sources and doesn't do
>>> anything with it.
>>>
>>
>>
>> I'd just pass them as options, until there are clear (and strong) use
>> cases to do them otherwise.
>>
>>
>> +1 on the rest.
>>
>>
>>
>>>
>>> 2. *input data requirement*
>>> Data sources should be able to ask Spark to preprocess the input data,
>>> and this can be a mix-in interface for DataSourceV2Writer. I think we need
>>> to add a clustering request and a sorting-within-partitions request; anything more?
>>>
>>> 3. *transaction*
>>> I think we can just follow `FileCommitProtocol`, which is the internal
>>> framework Spark uses to guarantee transactions for built-in file-based data
>>> sources. Generally speaking, we need task-level and job-level commit/abort.
>>> Again you can see more details in my prototype about it:
>>> https://github.com/apache/spark/pull/19269
>>>
>>> 4. *data source table*
>>> This is the trickiest one. In Spark you can create a table which points
>>> to a data source, so you can read/write this data source easily by
>>> referencing the table name. Ideally a data source table is just a pointer
>>> to a data source with a list of predefined options, to save users from
>>> typing these options again and again for each query.
>>> If that's all, then everything is good and we don't need to add more
>>> interfaces to Data Source V2. However, data source tables provide special
>>> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data
>>> sources to have some extra abilities.
>>> Currently these special operators only work for built-in 

Re: [discuss] Data Source V2 write path

2017-09-21 Thread Ryan Blue
I’d just pass them [partitioning/bucketing] as options, until there are
clear (and strong) use cases to do them otherwise.

I don’t think it makes sense to pass partitioning and bucketing information
*into* this API. The writer should already know the table structure and
should pass relevant information back out to Spark so it can sort and group
data for storage.

I think the idea of passing the table structure into the writer comes from
the current implementation, where the table may not exist before a data
frame is written. But that isn’t something that should be carried forward.
I think the writer should be responsible for writing into an
already-configured table. That’s the normal case we should design for.
Creating a table at the same time (CTAS) is a convenience, but should be
implemented by creating an empty table and then running the same writer
that would have been used for an insert into an existing table.

Otherwise, there’s confusion about how to handle the options. What should
the writer do when partitioning passed in doesn’t match the table’s
partitioning? We already have this situation in the DataFrameWriter API,
where calling partitionBy and then insertInto throws an exception. I’d like
to keep that case out of this API by setting the expectation that tables
this writes to already exist.
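
For reference, a small Scala example of the existing DataFrameWriter behavior
mentioned above; the DataFrame and the "events" table name are assumed for
illustration only:

    import org.apache.spark.sql.SparkSession

    // Local session just for a quick test.
    val spark = SparkSession.builder().master("local[*]").appName("partitionBy-insertInto").getOrCreate()
    val df = spark.range(10).selectExpr("id", "current_date() AS date")

    // Mixing partitionBy with insertInto is rejected: DataFrameWriter throws
    // an AnalysisException, because the target table's partitioning is
    // already fixed by its definition ("events" is assumed to exist).
    df.write.partitionBy("date").insertInto("events")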

rb
​

On Wed, Sep 20, 2017 at 9:52 AM, Reynold Xin  wrote:

>
>
> On Wed, Sep 20, 2017 at 3:10 AM, Wenchen Fan  wrote:
>
>> Hi all,
>>
>> I want to have some discussion about the Data Source V2 write path before
>> starting a vote.
>>
>> The Data Source V1 write path asks implementations to write a DataFrame
>> directly, which is painful:
>> 1. Exposing an upper-level API like DataFrame to the Data Source API is not
>> good for maintenance.
>> 2. Data sources may need to preprocess the input data before writing,
>> like clustering/sorting the input by some columns. It's better to do the
>> preprocessing in Spark instead of in the data source.
>> 3. Data sources need to take care of transactions themselves, which is
>> hard. And different data sources may come up with very similar approaches
>> to transactions, which leads to a lot of duplicated code.
>>
>>
>> To solve these pain points, I'm proposing a data source writing framework
>> which is very similar to the reading framework, i.e., WriteSupport ->
>> DataSourceV2Writer -> WriteTask -> DataWriter. You can take a look at my
>> prototype to see what it looks like: https://github.com/apach
>> e/spark/pull/19269
>>
>> There are some other details that need further discussion:
>> 1. *partitioning/bucketing*
>> Currently only the built-in file-based data sources support them, but
>> there is nothing stopping us from exposing them to all data sources. One
>> question is, shall we make them mix-in interfaces for the data source v2
>> reader/writer, or just encode them into data source options (a
>> string-to-string map)? Ideally they are more like options: Spark just
>> passes this user-given information to the data sources and doesn't do
>> anything with it.
>>
>
>
> I'd just pass them as options, until there are clear (and strong) use
> cases to do them otherwise.
>
>
> +1 on the rest.
>
>
>
>>
>> 2. *input data requirement*
>> Data sources should be able to ask Spark to preprocess the input data,
>> and this can be a mix-in interface for DataSourceV2Writer. I think we need
>> to add a clustering request and a sorting-within-partitions request; anything more?
>>
>> 3. *transaction*
>> I think we can just follow `FileCommitProtocol`, which is the internal
>> framework Spark uses to guarantee transactions for built-in file-based data
>> sources. Generally speaking, we need task-level and job-level commit/abort.
>> Again you can see more details in my prototype about it:
>> https://github.com/apache/spark/pull/19269
>>
>> 4. *data source table*
>> This is the trickiest one. In Spark you can create a table which points
>> to a data source, so you can read/write this data source easily by
>> referencing the table name. Ideally a data source table is just a pointer
>> to a data source with a list of predefined options, to save users from
>> typing these options again and again for each query.
>> If that's all, then everything is good and we don't need to add more
>> interfaces to Data Source V2. However, data source tables provide special
>> operators like ALTER TABLE SCHEMA, ADD PARTITION, etc., which require data
>> sources to have some extra abilities.
>> Currently these special operators only work for built-in file-based data
>> sources, and I don't think we will extend them in the near future, so I
>> propose to mark them as out of scope.
>>
>>
>> Any comments are welcome!
>> Thanks,
>> Wenchen
>>
>
>


-- 
Ryan Blue
Software Engineer
Netflix


Re: New to dev community | Contribution to Mlib

2017-09-21 Thread Venali Sonone
Thank you for your response.

The algorithm that I am proposing is Isolation Forest.
Link to paper: paper. I particularly find that it should be included in Spark
ML because so many applications that use Spark as part of a real-time
streaming engine in industry need anomaly detection, and current Spark ML
supports it in some way by means of clustering. I will probably start to
create the implementation and prepare a proposal as you suggested.

It is interesting to know that Spark is still implementing features in Spark
ML to reach full parity with MLlib. Could I please get connected with the
folks working on it, as I am interested in contributing? I have been a heavy
user of Spark since summer '15.

 Cheers!
-Venali

On Thu, Sep 21, 2017 at 1:33 AM, Seth Hendrickson <
seth.hendrickso...@gmail.com> wrote:

> I'm not exactly clear on what you're proposing, but this sounds like
> something that would live as a Spark package - a framework for anomaly
> detection built on Spark. If there is some specific algorithm you have in
> mind, it would be good to propose it on JIRA and discuss why you think it
> needs to be included in Spark and not live as a Spark package.
>
> In general, there will probably be resistance to including new algorithms
> in Spark ML, especially until the ML package has reached full parity with
> MLlib. Still, if you can provide more details, that will help us understand
> what is best here.
>
> On Thu, Sep 14, 2017 at 1:29 AM, Venali Sonone 
> wrote:
>
>>
>> Hello,
>>
>> I am new to dev community of Spark and also open source in general but
>> have used Spark extensively.
>> I want to contribute a complete anomaly detection component to Spark MLlib.
>> To that end, I want to know if someone could guide me so I can start the
>> development and contribute to Spark MLlib.
>>
>> Sorry if I sound naive, but any help is appreciated.
>>
>> Cheers!
>> -venna
>>
>>
>


Re: [VOTE] Spark 2.1.2 (RC1)

2017-09-21 Thread Marcelo Vanzin
While you're at it, one thing that needs to be done is create a 2.1.3
version on JIRA. Not sure if you have enough permissions to do that.

Fixes after an RC should use the new version, and if you create a new
RC, you'll need to go and backdate the patches that went into the new
RC.

On Mon, Sep 18, 2017 at 8:22 PM, Holden Karau  wrote:
> As per the conversation happening around the signing of releases I'm
> cancelling this vote. If folks agree with the temporary solution there I'll
> try and get a new RC out shortly but if we end up blocking on migrating the
> Jenkins jobs it could take a bit longer.
>
> On Sun, Sep 17, 2017 at 1:30 AM, yuming wang  wrote:
>>
>> Yes, it doesn't work in 2.1.0 or 2.1.1. I created a PR for this:
>> https://github.com/apache/spark/pull/19259.
>>
>>
>> On Sep 17, 2017, at 16:14, Sean Owen  wrote:
>>
>> So, didn't work in 2.1.0 or 2.1.1? If it's not a regression and not
>> critical, it shouldn't block a release. It seems like this can only affect
>> Docker and/or Oracle JDBC? Well, if we need to roll another release anyway,
>> seems OK.
>>
>> On Sun, Sep 17, 2017 at 6:06 AM Xiao Li  wrote:
>>>
>>> This is a bug introduced in 2.1. It works fine in 2.0
>>>
>>> 2017-09-16 16:15 GMT-07:00 Holden Karau :

 Ok :) Was this working in 2.1.1?

 On Sat, Sep 16, 2017 at 3:59 PM Xiao Li  wrote:
>
> Still -1
>
> Unable to pass the tests in my local environment. Open a JIRA
> https://issues.apache.org/jira/browse/SPARK-22041
>
> - SPARK-16625: General data types to be mapped to Oracle *** FAILED ***
>
>   types.apply(9).equals(org.apache.spark.sql.types.DateType) was false
> (OracleIntegrationSuite.scala:158)
>
> Xiao
>
>
> 2017-09-15 17:35 GMT-07:00 Ryan Blue :
>>
>> -1 (with my Apache member hat on, non-binding)
>>
>> I'll continue discussion in the other thread, but I don't think we
>> should share signing keys.
>>
>> On Fri, Sep 15, 2017 at 5:14 PM, Holden Karau 
>> wrote:
>>>
>>> Indeed it's limited to people with login permissions on the Jenkins
>>> host (and perhaps further limited, I'm not certain). Shane probably knows
>>> more about the ACLs, so I'll ask him in the other thread for specifics.
>>>
>>> This is maybe branching a bit from the question of the current RC
>>> though, so I'd suggest we continue this discussion on the thread Sean 
>>> Owen
>>> made.
>>>
>>> On Fri, Sep 15, 2017 at 4:04 PM Ryan Blue  wrote:

 I'm not familiar with the release procedure; can you send a link to
 this Jenkins job? Can anyone run this job, or is it limited to
 committers?

 rb

 On Fri, Sep 15, 2017 at 12:28 PM, Holden Karau
  wrote:
>
> That's a good question. I built the release candidate; however, the
> Jenkins scripts don't take a parameter for configuring who signs the
> artifacts, and instead always sign them with Patrick's key. You can see
> this from previous releases, which were managed by other folks but were
> still signed by Patrick.
>
> On Fri, Sep 15, 2017 at 12:16 PM, Ryan Blue 
> wrote:
>>
>> The signature is valid, but why was the release signed with
>> Patrick Wendell's private key? Did Patrick build the release 
>> candidate?
>>
>> rb
>>
>> On Fri, Sep 15, 2017 at 6:36 AM, Denny Lee 
>> wrote:
>>>
>>> +1 (non-binding)
>>>
>>> On Thu, Sep 14, 2017 at 10:57 PM Felix Cheung
>>>  wrote:

 +1 tested SparkR package on Windows, r-hub, Ubuntu.

 _
 From: Sean Owen 
 Sent: Thursday, September 14, 2017 3:12 PM
 Subject: Re: [VOTE] Spark 2.1.2 (RC1)
 To: Holden Karau , 



 +1
 Very nice. The sigs and hashes look fine, it builds fine for me
 on Debian Stretch with Java 8, yarn/hive/hadoop-2.7 profiles, and 
 passes
 tests.

 Yes as you say, no outstanding issues except for this which
 doesn't look critical, as it's not a regression.

 SPARK-21985 PySpark PairDeserializer is broken for double-zipped
 RDDs


 On Thu, Sep 14, 2017 at 7:47 PM Holden Karau
  wrote:
>

Re: [VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-21 Thread Sean Owen
Am I right that this doesn't mean other packages would use this
representation, but that they could?

The representation looked fine to me w.r.t. what DL frameworks need.

My previous comment was that this is actually quite lightweight. It's kind
of like how I/O support is provided for CSV and JSON, so makes enough sense
to add to Spark. It doesn't really preclude other solutions.

For those reasons I think it's fine. +1

On Thu, Sep 21, 2017 at 6:32 PM Tim Hunter  wrote:

> Hello community,
>
> I would like to call for a vote on SPARK-21866. It is a short proposal
> that has important applications for image processing and deep learning.
> Joseph Bradley has offered to be the shepherd.
>
> JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
> PDF version: https://issues.apache.org/jira/secure/attachment/12884792/
> SPIP%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf
>
> Background and motivation
>
> As Apache Spark is being used more and more in the industry, some new use
> cases are emerging for different data formats beyond the traditional SQL
> types or the numerical types (vectors and matrices). Deep Learning
> applications commonly deal with image processing. A number of projects add
> some Deep Learning capabilities to Spark (see list below), but they
> struggle to communicate with each other or with MLlib pipelines because
> there is no standard way to represent an image in Spark DataFrames. We
> propose to federate efforts for representing images in Spark by defining a
> representation that caters to the most common needs of users and library
> developers.
>
> This SPIP proposes a specification to represent images in Spark
> DataFrames and Datasets (based on existing industrial standards), and an
> interface for loading sources of images. It is not meant to be a
> full-fledged image processing library, but rather the core description that
> other libraries and users can rely on. Several packages already offer
> various processing facilities for transforming images or doing more complex
> operations, and each has various design tradeoffs that make them better as
> standalone solutions.
>
> This project is a joint collaboration between Microsoft and Databricks,
> which have been testing this design in two open source packages: MMLSpark
> and Deep Learning Pipelines.
>
> The proposed image format is an in-memory, decompressed representation
> that targets low-level applications. It is significantly more liberal in
> memory usage than compressed image representations such as JPEG, PNG, etc.,
> but it allows easy communication with popular image processing libraries
> and has no decoding overhead.
> Target users and personas:
>
> Data scientists, data engineers, library developers.
> The following libraries define primitives for loading and representing
> images, and will gain from a common interchange format (in alphabetical
> order):
>
>- BigDL
>- DeepLearning4J
>- Deep Learning Pipelines
>- MMLSpark
>- TensorFlow (Spark connector)
>- TensorFlowOnSpark
>- TensorFrames
>- Thunder
>
> Goals:
>
>- Simple representation of images in Spark DataFrames, based on
>pre-existing industrial standards (OpenCV)
>- This format should eventually allow the development of
>high-performance integration points with image processing libraries such as
>libOpenCV, Google TensorFlow, CNTK, and other C libraries.
>- The reader should be able to read popular formats of images from
>distributed sources.
>
> Non-Goals:
>
> Images are a versatile medium and encompass a very wide range of formats
> and representations. This SPIP explicitly aims at the most common use
> case in the industry currently: multi-channel matrices of binary, int32,
> int64, float or double data that can fit comfortably in the heap of the JVM:
>
>- the total size of an image should be restricted to less than 2GB
>(roughly)
>- the meaning of color channels is application-specific and is not
>mandated by the standard (in line with the OpenCV standard)
>- specialized formats used in meteorology, the medical field, etc. are
>not supported
>- this format is specialized to images and does not attempt to solve
>the more general problem of representing n-dimensional tensors in Spark
>
> Proposed API changes
>
> We propose to add a new package in the package structure, under the MLlib
> project:
> org.apache.spark.image
> Data format
>
> We propose to add the following structure:
>
> imageSchema = StructType([
>
>- StructField("mode", StringType(), False),
>   - The exact representation of the data.
>   - The values are described in the following OpenCV convention.
>   Basically, the type has both "depth" and "number of channels" info: in
>   particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format
>   would be CV_8UC4 (value 32 in the table) with the channel order 
> specified
>   

[VOTE][SPIP] SPARK-21866 Image support in Apache Spark

2017-09-21 Thread Tim Hunter
Hello community,

I would like to call for a vote on SPARK-21866. It is a short proposal that
has important applications for image processing and deep learning. Joseph
Bradley has offered to be the shepherd.

JIRA ticket: https://issues.apache.org/jira/browse/SPARK-21866
PDF version: https://issues.apache.org/jira/secure/attachment/12884792/SPIP
%20-%20Image%20support%20for%20Apache%20Spark%20V1.1.pdf

Background and motivation

As Apache Spark is being used more and more in the industry, some new use
cases are emerging for different data formats beyond the traditional SQL
types or the numerical types (vectors and matrices). Deep Learning
applications commonly deal with image processing. A number of projects add
some Deep Learning capabilities to Spark (see list below), but they
struggle to communicate with each other or with MLlib pipelines because
there is no standard way to represent an image in Spark DataFrames. We
propose to federate efforts for representing images in Spark by defining a
representation that caters to the most common needs of users and library
developers.

This SPIP proposes a specification to represent images in Spark DataFrames
and Datasets (based on existing industrial standards), and an interface for
loading sources of images. It is not meant to be a full-fledged image
processing library, but rather the core description that other libraries
and users can rely on. Several packages already offer various processing
facilities for transforming images or doing more complex operations, and
each has various design tradeoffs that make them better as standalone
solutions.

This project is a joint collaboration between Microsoft and Databricks,
which have been testing this design in two open source packages: MMLSpark
and Deep Learning Pipelines.

The proposed image format is an in-memory, decompressed representation that
targets low-level applications. It is significantly more liberal in memory
usage than compressed image representations such as JPEG, PNG, etc., but it
allows easy communication with popular image processing libraries and has
no decoding overhead.
Target users and personas:

Data scientists, data engineers, library developers.
The following libraries define primitives for loading and representing
images, and will gain from a common interchange format (in alphabetical
order):

   - BigDL
   - DeepLearning4J
   - Deep Learning Pipelines
   - MMLSpark
   - TensorFlow (Spark connector)
   - TensorFlowOnSpark
   - TensorFrames
   - Thunder

Goals:

   - Simple representation of images in Spark DataFrames, based on
   pre-existing industrial standards (OpenCV)
   - This format should eventually allow the development of
   high-performance integration points with image processing libraries such as
   libOpenCV, Google TensorFlow, CNTK, and other C libraries.
   - The reader should be able to read popular formats of images from
   distributed sources.

Non-Goals:

Images are a versatile medium and encompass a very wide range of formats
and representations. This SPIP explicitly aims at the most common use case
in the industry currently: multi-channel matrices of binary, int32, int64,
float or double data that can fit comfortably in the heap of the JVM:

   - the total size of an image should be restricted to less than 2GB
   (roughly)
   - the meaning of color channels is application-specific and is not
   mandated by the standard (in line with the OpenCV standard)
   - specialized formats used in meteorology, the medical field, etc. are
   not supported
   - this format is specialized to images and does not attempt to solve the
   more general problem of representing n-dimensional tensors in Spark

Proposed API changes

We propose to add a new package in the package structure, under the MLlib
project:
org.apache.spark.image
Data format

We propose to add the following structure:

imageSchema = StructType([

   - StructField("mode", StringType(), False),
  - The exact representation of the data.
  - The values are described in the following OpenCV convention.
  Basically, the type has both "depth" and "number of channels" info: in
  particular, type "CV_8UC3" means "3 channel unsigned bytes". BGRA format
  would be CV_8UC4 (value 32 in the table) with the channel order specified
  by convention.
   - The exact channel ordering and meaning of each channel are dictated
   by convention. By default, the order is RGB (3 channels) and BGRA (4
   channels).
   If the image fails to load, the value is the empty string "".


   - StructField("origin", StringType(), True),
  - Some information about the origin of the image. The content of this
  is application-specific.
  - When the image is loaded from files, users should expect to find
  the file name in this field.


   - StructField("height", IntegerType(), False),
  - the height of the image, pixels
  - If the image fails to load, the value is -1.


   - StructField("width", 

Re: doc patch review

2017-09-21 Thread lucas.g...@gmail.com
https://issues.apache.org/jira/browse/SPARK-20448

On 21 September 2017 at 04:09, Hyukjin Kwon  wrote:

> I think it would have been nicer if the JIRA and PR had been included in
> this email.
>
> 2017-09-21 19:44 GMT+09:00 Steve Loughran :
>
>> I have a doc patch on spark streaming & object store sources which has
>> been sitting unreviewed for six months as of this week
>>
>> are there any plans to review this or shall I close it as a wontfix?
>>
>> thanks
>>
>>
>> -
>> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>>
>>
>


Re: doc patch review

2017-09-21 Thread Hyukjin Kwon
I think it would have been nicer if the JIRA and PR had been included in
this email.

2017-09-21 19:44 GMT+09:00 Steve Loughran :

> I have a doc patch on spark streaming & object store sources which has
> been sitting unreviewed for six months as of this week
>
> are there any plans to review this or shall I close it as a wontfix?
>
> thanks
>
>
> -
> To unsubscribe e-mail: dev-unsubscr...@spark.apache.org
>
>


doc patch review

2017-09-21 Thread Steve Loughran
I have a doc patch on spark streaming & object store sources which has been
sitting unreviewed for six months as of this week

are there any plans to review this or shall I close it as a wontfix?

thanks


-
To unsubscribe e-mail: dev-unsubscr...@spark.apache.org