Repository: spark Updated Branches: refs/heads/master c66eef844 -> 925449283
[SPARK-22666][ML][SQL] Spark datasource for image format ## What changes were proposed in this pull request? Implement an image schema datasource. This image datasource support: - partition discovery (loading partitioned images) - dropImageFailures (the same behavior with `ImageSchema.readImage`) - path wildcard matching (the same behavior with `ImageSchema.readImage`) - loading recursively from directory (different from `ImageSchema.readImage`, but use such path: `/path/to/dir/**`) This datasource **NOT** support: - specify `numPartitions` (it will be determined by datasource automatically) - sampling (you can use `df.sample` later but the sampling operator won't be pushdown to datasource) ## How was this patch tested? Unit tests. ## Benchmark I benchmark and compare the cost time between old `ImageSchema.read` API and my image datasource. **cluster**: 4 nodes, each with 64GB memory, 8 cores CPU **test dataset**: Flickr8k_Dataset (about 8091 images) **time cost**: - My image datasource time (automatically generate 258 partitions): 38.04s - `ImageSchema.read` time (set 16 partitions): 68.4s - `ImageSchema.read` time (set 258 partitions): 90.6s **time cost when increase image number by double (clone Flickr8k_Dataset and loads double number images)**: - My image datasource time (automatically generate 515 partitions): 95.4s - `ImageSchema.read` (set 32 partitions): 109s - `ImageSchema.read` (set 515 partitions): 105s So we can see that my image datasource implementation (this PR) bring some performance improvement compared against old`ImageSchema.read` API. Closes #22328 from WeichenXu123/image_datasource. Authored-by: WeichenXu <weichen...@databricks.com> Signed-off-by: Xiangrui Meng <m...@databricks.com> Project: http://git-wip-us.apache.org/repos/asf/spark/repo Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/92544928 Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/92544928 Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/92544928 Branch: refs/heads/master Commit: 925449283dcaef80e0f77e60aea6ef988bd697b4 Parents: c66eef8 Author: WeichenXu <weichen...@databricks.com> Authored: Wed Sep 5 11:59:00 2018 -0700 Committer: Xiangrui Meng <m...@databricks.com> Committed: Wed Sep 5 11:59:00 2018 -0700 ---------------------------------------------------------------------- .../images/kittens/29.5.a_b_EGDP022204.jpg | Bin 27295 -> 0 bytes data/mllib/images/kittens/54893.jpg | Bin 35914 -> 0 bytes data/mllib/images/kittens/DP153539.jpg | Bin 26354 -> 0 bytes data/mllib/images/kittens/DP802813.jpg | Bin 30432 -> 0 bytes data/mllib/images/kittens/not-image.txt | 1 - data/mllib/images/multi-channel/BGRA.png | Bin 683 -> 0 bytes .../images/multi-channel/BGRA_alpha_60.png | Bin 747 -> 0 bytes data/mllib/images/multi-channel/chr30.4.184.jpg | Bin 59472 -> 0 bytes data/mllib/images/multi-channel/grayscale.jpg | Bin 36728 -> 0 bytes .../origin/kittens/29.5.a_b_EGDP022204.jpg | Bin 0 -> 27295 bytes data/mllib/images/origin/kittens/54893.jpg | Bin 0 -> 35914 bytes data/mllib/images/origin/kittens/DP153539.jpg | Bin 0 -> 26354 bytes data/mllib/images/origin/kittens/DP802813.jpg | Bin 0 -> 30432 bytes data/mllib/images/origin/kittens/not-image.txt | 1 + data/mllib/images/origin/license.txt | 13 ++ data/mllib/images/origin/multi-channel/BGRA.png | Bin 0 -> 683 bytes .../origin/multi-channel/BGRA_alpha_60.png | Bin 0 -> 747 bytes .../images/origin/multi-channel/chr30.4.184.jpg | Bin 0 -> 59472 bytes .../images/origin/multi-channel/grayscale.jpg | Bin 0 -> 36728 bytes .../date=2018-01/29.5.a_b_EGDP022204.jpg | Bin 0 -> 27295 bytes .../cls=kittens/date=2018-01/not-image.txt | 1 + .../cls=kittens/date=2018-02/54893.jpg | Bin 0 -> 35914 bytes .../cls=kittens/date=2018-02/DP153539.jpg | Bin 0 -> 26354 bytes .../cls=kittens/date=2018-02/DP802813.jpg | Bin 0 -> 30432 bytes .../cls=multichannel/date=2018-01/BGRA.png | Bin 0 -> 683 bytes .../date=2018-01/BGRA_alpha_60.png | Bin 0 -> 747 bytes .../date=2018-02/chr30.4.184.jpg | Bin 0 -> 59472 bytes .../cls=multichannel/date=2018-02/grayscale.jpg | Bin 0 -> 36728 bytes ....apache.spark.sql.sources.DataSourceRegister | 1 + .../spark/ml/source/image/ImageDataSource.scala | 53 +++++++++ .../spark/ml/source/image/ImageFileFormat.scala | 100 ++++++++++++++++ .../spark/ml/source/image/ImageOptions.scala | 32 +++++ .../spark/ml/image/ImageSchemaSuite.scala | 2 +- .../ml/source/image/ImageFileFormatSuite.scala | 119 +++++++++++++++++++ python/pyspark/ml/image.py | 2 +- python/pyspark/ml/tests.py | 4 +- 36 files changed, 324 insertions(+), 5 deletions(-) ---------------------------------------------------------------------- http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/kittens/29.5.a_b_EGDP022204.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/kittens/29.5.a_b_EGDP022204.jpg b/data/mllib/images/kittens/29.5.a_b_EGDP022204.jpg deleted file mode 100644 index 435e7df..0000000 Binary files a/data/mllib/images/kittens/29.5.a_b_EGDP022204.jpg and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/kittens/54893.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/kittens/54893.jpg b/data/mllib/images/kittens/54893.jpg deleted file mode 100644 index 825630c..0000000 Binary files a/data/mllib/images/kittens/54893.jpg and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/kittens/DP153539.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/kittens/DP153539.jpg b/data/mllib/images/kittens/DP153539.jpg deleted file mode 100644 index 571efe9..0000000 Binary files a/data/mllib/images/kittens/DP153539.jpg and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/kittens/DP802813.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/kittens/DP802813.jpg b/data/mllib/images/kittens/DP802813.jpg deleted file mode 100644 index 2d12359..0000000 Binary files a/data/mllib/images/kittens/DP802813.jpg and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/kittens/not-image.txt ---------------------------------------------------------------------- diff --git a/data/mllib/images/kittens/not-image.txt b/data/mllib/images/kittens/not-image.txt deleted file mode 100644 index 283e5e9..0000000 --- a/data/mllib/images/kittens/not-image.txt +++ /dev/null @@ -1 +0,0 @@ -not an image http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/multi-channel/BGRA.png ---------------------------------------------------------------------- diff --git a/data/mllib/images/multi-channel/BGRA.png b/data/mllib/images/multi-channel/BGRA.png deleted file mode 100644 index a944c6c..0000000 Binary files a/data/mllib/images/multi-channel/BGRA.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/multi-channel/BGRA_alpha_60.png ---------------------------------------------------------------------- diff --git a/data/mllib/images/multi-channel/BGRA_alpha_60.png b/data/mllib/images/multi-channel/BGRA_alpha_60.png deleted file mode 100644 index 913637c..0000000 Binary files a/data/mllib/images/multi-channel/BGRA_alpha_60.png and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/multi-channel/chr30.4.184.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/multi-channel/chr30.4.184.jpg b/data/mllib/images/multi-channel/chr30.4.184.jpg deleted file mode 100644 index 7068b97..0000000 Binary files a/data/mllib/images/multi-channel/chr30.4.184.jpg and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/multi-channel/grayscale.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/multi-channel/grayscale.jpg b/data/mllib/images/multi-channel/grayscale.jpg deleted file mode 100644 index 621cdd1..0000000 Binary files a/data/mllib/images/multi-channel/grayscale.jpg and /dev/null differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg b/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg new file mode 100644 index 0000000..435e7df Binary files /dev/null and b/data/mllib/images/origin/kittens/29.5.a_b_EGDP022204.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/kittens/54893.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/kittens/54893.jpg b/data/mllib/images/origin/kittens/54893.jpg new file mode 100644 index 0000000..825630c Binary files /dev/null and b/data/mllib/images/origin/kittens/54893.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/kittens/DP153539.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/kittens/DP153539.jpg b/data/mllib/images/origin/kittens/DP153539.jpg new file mode 100644 index 0000000..571efe9 Binary files /dev/null and b/data/mllib/images/origin/kittens/DP153539.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/kittens/DP802813.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/kittens/DP802813.jpg b/data/mllib/images/origin/kittens/DP802813.jpg new file mode 100644 index 0000000..2d12359 Binary files /dev/null and b/data/mllib/images/origin/kittens/DP802813.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/kittens/not-image.txt ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/kittens/not-image.txt b/data/mllib/images/origin/kittens/not-image.txt new file mode 100644 index 0000000..283e5e9 --- /dev/null +++ b/data/mllib/images/origin/kittens/not-image.txt @@ -0,0 +1 @@ +not an image http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/license.txt ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/license.txt b/data/mllib/images/origin/license.txt new file mode 100644 index 0000000..052f302 --- /dev/null +++ b/data/mllib/images/origin/license.txt @@ -0,0 +1,13 @@ +The images in the folder "kittens" are under the creative commons CC0 license, or no rights reserved: +https://creativecommons.org/share-your-work/public-domain/cc0/ +The images are taken from: +https://ccsearch.creativecommons.org/image/detail/WZnbJSJ2-dzIDiuUUdto3Q== +https://ccsearch.creativecommons.org/image/detail/_TlKu_rm_QrWlR0zthQTXA== +https://ccsearch.creativecommons.org/image/detail/OPNnHJb6q37rSZ5o_L5JHQ== +https://ccsearch.creativecommons.org/image/detail/B2CVP_j5KjwZm7UAVJ3Hvw== + +The chr30.4.184.jpg and grayscale.jpg images are also under the CC0 license, taken from: +https://ccsearch.creativecommons.org/image/detail/8eO_qqotBfEm2UYxirLntw== + +The image under "multi-channel" directory is under the CC BY-SA 4.0 license cropped from: +https://en.wikipedia.org/wiki/Alpha_compositing#/media/File:Hue_alpha_falloff.png http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/multi-channel/BGRA.png ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/multi-channel/BGRA.png b/data/mllib/images/origin/multi-channel/BGRA.png new file mode 100644 index 0000000..a944c6c Binary files /dev/null and b/data/mllib/images/origin/multi-channel/BGRA.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/multi-channel/BGRA_alpha_60.png ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/multi-channel/BGRA_alpha_60.png b/data/mllib/images/origin/multi-channel/BGRA_alpha_60.png new file mode 100644 index 0000000..913637c Binary files /dev/null and b/data/mllib/images/origin/multi-channel/BGRA_alpha_60.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/multi-channel/chr30.4.184.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/multi-channel/chr30.4.184.jpg b/data/mllib/images/origin/multi-channel/chr30.4.184.jpg new file mode 100644 index 0000000..7068b97 Binary files /dev/null and b/data/mllib/images/origin/multi-channel/chr30.4.184.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/origin/multi-channel/grayscale.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/origin/multi-channel/grayscale.jpg b/data/mllib/images/origin/multi-channel/grayscale.jpg new file mode 100644 index 0000000..621cdd1 Binary files /dev/null and b/data/mllib/images/origin/multi-channel/grayscale.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=kittens/date=2018-01/29.5.a_b_EGDP022204.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=kittens/date=2018-01/29.5.a_b_EGDP022204.jpg b/data/mllib/images/partitioned/cls=kittens/date=2018-01/29.5.a_b_EGDP022204.jpg new file mode 100644 index 0000000..435e7df Binary files /dev/null and b/data/mllib/images/partitioned/cls=kittens/date=2018-01/29.5.a_b_EGDP022204.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=kittens/date=2018-01/not-image.txt ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=kittens/date=2018-01/not-image.txt b/data/mllib/images/partitioned/cls=kittens/date=2018-01/not-image.txt new file mode 100644 index 0000000..283e5e9 --- /dev/null +++ b/data/mllib/images/partitioned/cls=kittens/date=2018-01/not-image.txt @@ -0,0 +1 @@ +not an image http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=kittens/date=2018-02/54893.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=kittens/date=2018-02/54893.jpg b/data/mllib/images/partitioned/cls=kittens/date=2018-02/54893.jpg new file mode 100644 index 0000000..825630c Binary files /dev/null and b/data/mllib/images/partitioned/cls=kittens/date=2018-02/54893.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP153539.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP153539.jpg b/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP153539.jpg new file mode 100644 index 0000000..571efe9 Binary files /dev/null and b/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP153539.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP802813.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP802813.jpg b/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP802813.jpg new file mode 100644 index 0000000..2d12359 Binary files /dev/null and b/data/mllib/images/partitioned/cls=kittens/date=2018-02/DP802813.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA.png ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA.png b/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA.png new file mode 100644 index 0000000..a944c6c Binary files /dev/null and b/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA_alpha_60.png ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA_alpha_60.png b/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA_alpha_60.png new file mode 100644 index 0000000..913637c Binary files /dev/null and b/data/mllib/images/partitioned/cls=multichannel/date=2018-01/BGRA_alpha_60.png differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=multichannel/date=2018-02/chr30.4.184.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=multichannel/date=2018-02/chr30.4.184.jpg b/data/mllib/images/partitioned/cls=multichannel/date=2018-02/chr30.4.184.jpg new file mode 100644 index 0000000..7068b97 Binary files /dev/null and b/data/mllib/images/partitioned/cls=multichannel/date=2018-02/chr30.4.184.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/data/mllib/images/partitioned/cls=multichannel/date=2018-02/grayscale.jpg ---------------------------------------------------------------------- diff --git a/data/mllib/images/partitioned/cls=multichannel/date=2018-02/grayscale.jpg b/data/mllib/images/partitioned/cls=multichannel/date=2018-02/grayscale.jpg new file mode 100644 index 0000000..621cdd1 Binary files /dev/null and b/data/mllib/images/partitioned/cls=multichannel/date=2018-02/grayscale.jpg differ http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister ---------------------------------------------------------------------- diff --git a/mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister b/mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister index a865cbe..a7dfd2d 100644 --- a/mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister +++ b/mllib/src/main/resources/META-INF/services/org.apache.spark.sql.sources.DataSourceRegister @@ -1 +1,2 @@ org.apache.spark.ml.source.libsvm.LibSVMFileFormat +org.apache.spark.ml.source.image.ImageFileFormat http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala new file mode 100644 index 0000000..a111c95 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageDataSource.scala @@ -0,0 +1,53 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +/** + * `image` package implements Spark SQL data source API for loading image data as `DataFrame`. + * The loaded `DataFrame` has one `StructType` column: `image`. + * The schema of the `image` column is: + * - origin: String (represents the file path of the image) + * - height: Int (height of the image) + * - width: Int (width of the image) + * - nChannels: Int (number of the image channels) + * - mode: Int (OpenCV-compatible type) + * - data: BinaryType (Image bytes in OpenCV-compatible order: row-wise BGR in most cases) + * + * To use image data source, you need to set "image" as the format in `DataFrameReader` and + * optionally specify the data source options, for example: + * {{{ + * // Scala + * val df = spark.read.format("image") + * .option("dropInvalid", true) + * .load("data/mllib/images/partitioned") + * + * // Java + * Dataset<Row> df = spark.read().format("image") + * .option("dropInvalid", true) + * .load("data/mllib/images/partitioned"); + * }}} + * + * Image data source supports the following options: + * - "dropInvalid": Whether to drop the files that are not valid images from the result. + * + * @note This IMAGE data source does not support saving images to files. + * + * @note This class is public for documentation purpose. Please don't use this class directly. + * Rather, use the data source API as illustrated above. + */ +class ImageDataSource private() {} http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala new file mode 100644 index 0000000..c332144 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageFileFormat.scala @@ -0,0 +1,100 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +import com.google.common.io.{ByteStreams, Closeables} +import org.apache.hadoop.conf.Configuration +import org.apache.hadoop.fs.{FileStatus, Path} +import org.apache.hadoop.mapreduce.Job + +import org.apache.spark.ml.image.ImageSchema +import org.apache.spark.sql.SparkSession +import org.apache.spark.sql.catalyst.InternalRow +import org.apache.spark.sql.catalyst.encoders.RowEncoder +import org.apache.spark.sql.catalyst.expressions.{AttributeReference, UnsafeRow} +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap +import org.apache.spark.sql.execution.datasources.{DataSource, FileFormat, OutputWriterFactory, PartitionedFile} +import org.apache.spark.sql.sources.{DataSourceRegister, Filter} +import org.apache.spark.sql.types.StructType +import org.apache.spark.util.SerializableConfiguration + +private[image] class ImageFileFormat extends FileFormat with DataSourceRegister { + + override def inferSchema( + sparkSession: SparkSession, + options: Map[String, String], + files: Seq[FileStatus]): Option[StructType] = Some(ImageSchema.imageSchema) + + override def prepareWrite( + sparkSession: SparkSession, + job: Job, + options: Map[String, String], + dataSchema: StructType): OutputWriterFactory = { + throw new UnsupportedOperationException("Write is not supported for image data source") + } + + override def shortName(): String = "image" + + override protected def buildReader( + sparkSession: SparkSession, + dataSchema: StructType, + partitionSchema: StructType, + requiredSchema: StructType, + filters: Seq[Filter], + options: Map[String, String], + hadoopConf: Configuration): (PartitionedFile) => Iterator[InternalRow] = { + assert( + requiredSchema.length <= 1, + "Image data source only produces a single data column named \"image\".") + + val broadcastedHadoopConf = + sparkSession.sparkContext.broadcast(new SerializableConfiguration(hadoopConf)) + + val imageSourceOptions = new ImageOptions(options) + + (file: PartitionedFile) => { + val emptyUnsafeRow = new UnsafeRow(0) + if (!imageSourceOptions.dropInvalid && requiredSchema.isEmpty) { + Iterator(emptyUnsafeRow) + } else { + val origin = file.filePath + val path = new Path(origin) + val fs = path.getFileSystem(broadcastedHadoopConf.value.value) + val stream = fs.open(path) + val bytes = try { + ByteStreams.toByteArray(stream) + } finally { + Closeables.close(stream, true) + } + val resultOpt = ImageSchema.decode(origin, bytes) + val filteredResult = if (imageSourceOptions.dropInvalid) { + resultOpt.toIterator + } else { + Iterator(resultOpt.getOrElse(ImageSchema.invalidImageRow(origin))) + } + + if (requiredSchema.isEmpty) { + filteredResult.map(_ => emptyUnsafeRow) + } else { + val converter = RowEncoder(requiredSchema) + filteredResult.map(row => converter.toRow(row)) + } + } + } + } +} http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala ---------------------------------------------------------------------- diff --git a/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala new file mode 100644 index 0000000..7ff1969 --- /dev/null +++ b/mllib/src/main/scala/org/apache/spark/ml/source/image/ImageOptions.scala @@ -0,0 +1,32 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap + +private[image] class ImageOptions( + @transient private val parameters: CaseInsensitiveMap[String]) extends Serializable { + + def this(parameters: Map[String, String]) = this(CaseInsensitiveMap(parameters)) + + /** + * Whether to drop invalid images. If true, invalid images will be removed, otherwise + * invalid images will be returned with empty data and all other field filled with `-1`. + */ + val dropInvalid = parameters.getOrElse("dropInvalid", "false").toBoolean +} http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala ---------------------------------------------------------------------- diff --git a/mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala index 527b3f8..e16ec90 100644 --- a/mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala +++ b/mllib/src/test/scala/org/apache/spark/ml/image/ImageSchemaSuite.scala @@ -28,7 +28,7 @@ import org.apache.spark.sql.types._ class ImageSchemaSuite extends SparkFunSuite with MLlibTestSparkContext { // Single column of images named "image" - private lazy val imagePath = "../data/mllib/images" + private lazy val imagePath = "../data/mllib/images/origin" test("Smoke test: create basic ImageSchema dataframe") { val origin = "path" http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala ---------------------------------------------------------------------- diff --git a/mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala b/mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala new file mode 100644 index 0000000..1a6a8d6 --- /dev/null +++ b/mllib/src/test/scala/org/apache/spark/ml/source/image/ImageFileFormatSuite.scala @@ -0,0 +1,119 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.source.image + +import java.nio.file.Paths + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.image.ImageSchema._ +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.sql.Row +import org.apache.spark.sql.functions.{col, substring_index} + +class ImageFileFormatSuite extends SparkFunSuite with MLlibTestSparkContext { + + // Single column of images named "image" + private lazy val imagePath = "../data/mllib/images/partitioned" + + test("image datasource count test") { + val df1 = spark.read.format("image").load(imagePath) + assert(df1.count === 9) + + val df2 = spark.read.format("image").option("dropInvalid", true).load(imagePath) + assert(df2.count === 8) + } + + test("image datasource test: read jpg image") { + val df = spark.read.format("image").load(imagePath + "/cls=kittens/date=2018-02/DP153539.jpg") + assert(df.count() === 1) + } + + test("image datasource test: read png image") { + val df = spark.read.format("image").load(imagePath + "/cls=multichannel/date=2018-01/BGRA.png") + assert(df.count() === 1) + } + + test("image datasource test: read non image") { + val filePath = imagePath + "/cls=kittens/date=2018-01/not-image.txt" + val df = spark.read.format("image").option("dropInvalid", true) + .load(filePath) + assert(df.count() === 0) + + val df2 = spark.read.format("image").option("dropInvalid", false) + .load(filePath) + assert(df2.count() === 1) + val result = df2.head() + assert(result === invalidImageRow( + Paths.get(filePath).toAbsolutePath().normalize().toUri().toString)) + } + + test("image datasource partition test") { + val result = spark.read.format("image") + .option("dropInvalid", true).load(imagePath) + .select(substring_index(col("image.origin"), "/", -1).as("origin"), col("cls"), col("date")) + .collect() + + assert(Set(result: _*) === Set( + Row("29.5.a_b_EGDP022204.jpg", "kittens", "2018-01"), + Row("54893.jpg", "kittens", "2018-02"), + Row("DP153539.jpg", "kittens", "2018-02"), + Row("DP802813.jpg", "kittens", "2018-02"), + Row("BGRA.png", "multichannel", "2018-01"), + Row("BGRA_alpha_60.png", "multichannel", "2018-01"), + Row("chr30.4.184.jpg", "multichannel", "2018-02"), + Row("grayscale.jpg", "multichannel", "2018-02") + )) + } + + // Images with the different number of channels + test("readImages pixel values test") { + val images = spark.read.format("image").option("dropInvalid", true) + .load(imagePath + "/cls=multichannel/").collect() + + val firstBytes20Set = images.map { rrow => + val row = rrow.getAs[Row]("image") + val filename = Paths.get(getOrigin(row)).getFileName().toString() + val mode = getMode(row) + val bytes20 = getData(row).slice(0, 20).toList + filename -> Tuple2(mode, bytes20) // Cannot remove `Tuple2`, otherwise `->` operator + // will match 2 arguments + }.toSet + + assert(firstBytes20Set === expectedFirstBytes20Set) + } + + // number of channels and first 20 bytes of OpenCV representation + // - default representation for 3-channel RGB images is BGR row-wise: + // (B00, G00, R00, B10, G10, R10, ...) + // - default representation for 4-channel RGB images is BGRA row-wise: + // (B00, G00, R00, A00, B10, G10, R10, A10, ...) + private val expectedFirstBytes20Set = Set( + "grayscale.jpg" -> + ((0, List[Byte](-2, -33, -61, -60, -59, -59, -64, -59, -66, -67, -73, -73, -62, + -57, -60, -63, -53, -49, -55, -69))), + "chr30.4.184.jpg" -> ((16, + List[Byte](-9, -3, -1, -43, -32, -28, -75, -60, -57, -78, -59, -56, -74, -59, -57, + -71, -58, -56, -73, -64))), + "BGRA.png" -> ((24, + List[Byte](-128, -128, -8, -1, -128, -128, -8, -1, -128, + -128, -8, -1, 127, 127, -9, -1, 127, 127, -9, -1))), + "BGRA_alpha_60.png" -> ((24, + List[Byte](-128, -128, -8, 60, -128, -128, -8, 60, -128, + -128, -8, 60, 127, 127, -9, 60, 127, 127, -9, 60))) + ) +} http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/python/pyspark/ml/image.py ---------------------------------------------------------------------- diff --git a/python/pyspark/ml/image.py b/python/pyspark/ml/image.py index 5f0c57e..ef6785b 100644 --- a/python/pyspark/ml/image.py +++ b/python/pyspark/ml/image.py @@ -216,7 +216,7 @@ class _ImageSchema(object): :return: a :class:`DataFrame` with a single column of "images", see ImageSchema for details. - >>> df = ImageSchema.readImages('data/mllib/images/kittens', recursive=True) + >>> df = ImageSchema.readImages('data/mllib/images/origin/kittens', recursive=True) >>> df.count() 5 http://git-wip-us.apache.org/repos/asf/spark/blob/92544928/python/pyspark/ml/tests.py ---------------------------------------------------------------------- diff --git a/python/pyspark/ml/tests.py b/python/pyspark/ml/tests.py index 625d992..821e037 100755 --- a/python/pyspark/ml/tests.py +++ b/python/pyspark/ml/tests.py @@ -2186,7 +2186,7 @@ class FPGrowthTests(SparkSessionTestCase): class ImageReaderTest(SparkSessionTestCase): def test_read_images(self): - data_path = 'data/mllib/images/kittens' + data_path = 'data/mllib/images/origin/kittens' df = ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True) self.assertEqual(df.count(), 4) first_row = df.take(1)[0][0] @@ -2253,7 +2253,7 @@ class ImageReaderTest2(PySparkTestCase): def test_read_images_multiple_times(self): # This test case is to check if `ImageSchema.readImages` tries to # initiate Hive client multiple times. See SPARK-22651. - data_path = 'data/mllib/images/kittens' + data_path = 'data/mllib/images/origin/kittens' ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True) ImageSchema.readImages(data_path, recursive=True, dropImageFailures=True) --------------------------------------------------------------------- To unsubscribe, e-mail: commits-unsubscr...@spark.apache.org For additional commands, e-mail: commits-h...@spark.apache.org