Replace this line:

 img_data = sc.parallelize( list(im.getdata()) )

With something like this (sc.defaultParallelism is the parallelism Spark detects, roughly the number of cores available; ~3 partitions per core is a common rule of thumb):

 # ~3 partitions per core, so each task ships only a small slice of the pixels
 num_partitions = 3 * sc.defaultParallelism
 img_data = sc.parallelize( list(im.getdata()), num_partitions )
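
Giving parallelize() an explicit partition count splits the pixel list into smaller serialized chunks, which should get each task under the 100 KB recommended size from that warning. While you're at it, the two map steps can be fused into a single pass; a rough sketch, assuming an RGB source image:

 # Weighted luminosity per pixel, replicated into (R, G, B) in one map
 def to_gray(px):
     g = int(px[0]*0.21 + px[1]*0.72 + px[2]*0.07)
     return (g, g, g)

 grayscale = img_data.map(to_gray)
 im.putdata(grayscale.collect())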


Thanks
Best Regards

On Thu, Jun 4, 2015 at 1:57 AM, Justin Spargur <jmspar...@gmail.com> wrote:

> Hi all,
>
>      I'm playing around with manipulating images via Python and want to
> utilize Spark for scalability. That said, I'm just learning Spark and my
> Python is a bit rusty (been doing PHP coding for the last few years). I
> think I have most of the process figured out. However, the script fails on
> larger images and Spark is sending out the following warning for smaller
> images:
>
> Stage 0 contains a task of very large size (1151 KB). The maximum
> recommended task size is 100 KB.
>
> My code is as follows:
>
> from PIL import Image
> from pyspark import SparkContext
>
> if __name__ == "__main__":
>
>     imageFile = "sample.jpg"
>     outFile   = "sample.gray.jpg"
>
>     sc = SparkContext(appName="Grayscale")
>     im = Image.open(imageFile)
>
>     # Create an RDD for the data from the image file
>     img_data = sc.parallelize( list(im.getdata()) )
>
>     # Create an RDD for the grayscale value
>     gValue = img_data.map( lambda x: int(x[0]*0.21 + x[1]*0.72 + x[2]*0.07) )
>
>     # Put our grayscale value into the RGB channels
>     grayscale = gValue.map( lambda x: (x,x,x)  )
>
>     # Save the output in a new image.
>     im.putdata( grayscale.collect() )
>
>     im.save(outFile)
>
> Obviously, something is amiss. However, I can't figure out where I'm off
> track with this. Any help is appreciated! Thanks in advance!!!
>
