[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged into master. Thanks @WeichenXu123 for the implementation and 
everyone for the review! I created the following JIRAs as follow-ups:

* deprecate ImageSchema: https://issues.apache.org/jira/browse/SPARK-25345
* list built-in data sources in doc site: 
https://issues.apache.org/jira/browse/SPARK-25346
* doc for image data source: 
https://issues.apache.org/jira/browse/SPARK-25347
* data source for binary files: 
https://issues.apache.org/jira/browse/SPARK-25348
* support sample pushdown: https://issues.apache.org/jira/browse/SPARK-25349


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95727/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95727 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95727/testReport)**
 for PR 22328 at commit 
[`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95727 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95727/testReport)**
 for PR 22328 at commit 
[`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2881/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
The image data source tests passed but JVM crashed at the end. Triggered 
another test.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
test this please


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95724/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95724 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95724/testReport)**
 for PR 22328 at commit 
[`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
LGTM pending tests.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95724 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95724/testReport)**
 for PR 22328 at commit 
[`218ce4c`](https://github.com/apache/spark/commit/218ce4cf796308c8705a27889b25100e2b779365).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2879/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95715/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95715 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95715/testReport)**
 for PR 22328 at commit 
[`3c8863c`](https://github.com/apache/spark/commit/3c8863c2bd7a4789dd96e849a72eb0aac708abcf).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95715 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95715/testReport)**
 for PR 22328 at commit 
[`3c8863c`](https://github.com/apache/spark/commit/3c8863c2bd7a4789dd96e849a72eb0aac708abcf).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2874/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95707/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95707 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95707/testReport)**
 for PR 22328 at commit 
[`3fffd7e`](https://github.com/apache/spark/commit/3fffd7e055fad74d7392c88d678935b3e924588f).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95707 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95707/testReport)**
 for PR 22328 at commit 
[`3fffd7e`](https://github.com/apache/spark/commit/3fffd7e055fad74d7392c88d678935b3e924588f).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2868/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95694 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95694/testReport)**
 for PR 22328 at commit 
[`bd6178c`](https://github.com/apache/spark/commit/bd6178c0bff7c5ffade1dce61894191f63a50976).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95694/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95695/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-05 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95695 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95695/testReport)**
 for PR 22328 at commit 
[`4d52754`](https://github.com/apache/spark/commit/4d527548bb7db3f2f3c1ddda206e552d223ca27b).
 * This patch **fails due to an unknown error code, -9**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
That doesn't work for Java, if I remember the issue correctly.

On Tue, Sep 4, 2018, 10:31 PM Wenchen Fan  wrote:

> *@cloud-fan* commented on this pull request.
> --
>
> In
> 
mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala
> :
>
> > + *
> + *   // Java
> + *   Dataset df = spark.read().format("image")
> + * .option("dropImageFailures", "true")
> + * .load("data/mllib/images/imagesWithPartitions");
> + * }}}
> + *
> + * IMAGE data source supports the following options:
> + *  - "dropImageFailures": Whether to drop the files that are not valid 
images from the result.
> + *
> + * @note This IMAGE data source does not support "write".
> + *
> + * @note This class is public for documentation purpose. Please don't 
use this class directly.
> + * Rather, use the data source API as illustrated above.
> + */
> +class ImageDataSource private() {}
>
> Is this a convention? AFAIK in the scala world we usually put document in
> package object.
>
> —
> You are receiving this because you were mentioned.
> Reply to this email directly, view it on GitHub
> , or 
mute
> the thread
> 

> .
>
-- 

Xiangrui Meng

Software Engineer

Databricks Inc. [image: http://databricks.com] 



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95695 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95695/testReport)**
 for PR 22328 at commit 
[`4d52754`](https://github.com/apache/spark/commit/4d527548bb7db3f2f3c1ddda206e552d223ca27b).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2856/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95694 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95694/testReport)**
 for PR 22328 at commit 
[`bd6178c`](https://github.com/apache/spark/commit/bd6178c0bff7c5ffade1dce61894191f63a50976).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2855/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
@mhamilton723 I thought about that option too. Loading general binary files 
is a useful feature but I don't feel it is necessary to pull it into the 
current scope. No matter whether the image data source has its own 
implementation or builds on top of the binary data source, I expect users to use

~~~scala
spark.read.format("image").load("...")
~~~

to read images instead of something like:

~~~scala
spark.read.format("binary").load("...").withColumn("image", 
decode($"binary"))
~~~

So we can definitely add binary file data source later and swap the 
implementation without changing the public interface. But we don't need to 
block this PR getting into 2.4, which will be cut soon.

Sounds good?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mhamilton723
Github user mhamilton723 commented on the issue:

https://github.com/apache/spark/pull/22328
  
@WeichenXu123. Awesome work! I have not had a chance to go through this in 
depth but I did this in the originating project, [MMLSpark](www.aka.ms/spark), 
a while back and have been meaning to ship it up to Spark Core:

https://github.com/Azure/mmlspark/tree/master/src/io/image

We built ours on the same functionality for Binary Files and it might be 
wise to do the same here if you have not already. The Binary File API is a bit 
more general as it supports reading many different kinds of files. 

https://github.com/Azure/mmlspark/tree/master/src/io/binary


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/22328
  
@mhamilton723 could you take a look at this PR?  Mark added some 
performance improvements in MMLSpark that we wanted to merge in and he also 
added support for streaming (this was one of the PRs: 
https://github.com/Azure/mmlspark/pull/134/files , there were a couple more 
after).  He also had some concerns about performance (specifically how we were 
storing the images as OpenCV bytes in the dataframe which he said took a lot of 
memory and we should use the more compressed format instead) and I recall he 
had a few suggestions on how we could improve it in the future.  This seems 
like a good place to discuss how we could improve performance more.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
Yes, the ImageSchema implementation are used by the data source, which we 
cannot remove:) We are only going to mark the public APIs there as deprecated. 
The goal is to provide users a unified approach to load data into Spark. Users 
usually find `ImageSchema.readImages` hard to discover.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/22328
  
I don't think you need duplicate images in the PR - in the old tests you 
can just specify the path up to the cls/date folder.  There might be a few 
minor changes to the tests but I think that would be a better strategy than to 
have duplicate images in source.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/22328
  
@mengxr "is to mark ImageSchema deprecated in 2.4 and remove it in 3.0"
confused, this is still using ImageSchema though, right?  You mean to 
remove the top-level readImages API, but the ImageSchema will remain the same, 
correct?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/22328
  
"specify numPartitions (it will be determined by datasource automatically)"
This feature was because of this bug:

https://issues.apache.org/jira/browse/SPARK-22357

Which was just fixed a week ago in spark 2.4.0.
It can be removed, even in the current API.



---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread mengxr
Github user mengxr commented on the issue:

https://github.com/apache/spark/pull/22328
  
@imatiach-msft @HyukjinKwon The plan is to mark `ImageSchema` deprecated in 
2.4 and remove it in 3.0. So loading images will be the same as loading data 
from other sources.

The gaps are sampling and partition controlling, which might require more 
testing after 2.4. It would be great if you can help. For sampling, I'm 
thinking of allowing data source to handle sample operations. @cloud-fan is it 
feasible?


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread imatiach-msft
Github user imatiach-msft commented on the issue:

https://github.com/apache/spark/pull/22328
  
question: why were the images renamed to images/images folder from just 
images?  Seems a bit strange to me to have the same folder name twice.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95668/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95668 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95668/testReport)**
 for PR 22328 at commit 
[`5164d19`](https://github.com/apache/spark/commit/5164d19deb447e49e87d20f5a1efbb0c2ee177ee).
 * This patch **fails PySpark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95668 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95668/testReport)**
 for PR 22328 at commit 
[`5164d19`](https://github.com/apache/spark/commit/5164d19deb447e49e87d20f5a1efbb0c2ee177ee).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2835/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread HyukjinKwon
Github user HyukjinKwon commented on the issue:

https://github.com/apache/spark/pull/22328
  
cc @imatiach-msft 


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/95667/
Test FAILed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95667 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95667/testReport)**
 for PR 22328 at commit 
[`5b5aee6`](https://github.com/apache/spark/commit/5b5aee66b2ea819341b624164298f0700ee07ddf).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread SparkQA
Github user SparkQA commented on the issue:

https://github.com/apache/spark/pull/22328
  
**[Test build #95667 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/95667/testReport)**
 for PR 22328 at commit 
[`5b5aee6`](https://github.com/apache/spark/commit/5b5aee66b2ea819341b624164298f0700ee07ddf).


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 

https://amplab.cs.berkeley.edu/jenkins//job/testing-k8s-prb-make-spark-distribution-unified/2834/
Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark issue #22328: [SPARK-22666][ML][SQL] Spark datasource for image format

2018-09-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the issue:

https://github.com/apache/spark/pull/22328
  
Merged build finished. Test PASSed.


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org