Hi Don,

Good to hear from you. I think the problem is that, regardless of whether
you use yield or a generator, Spark will internally produce the entire
result as a single large JVM object, which will blow up your heap space.

Would it be possible to shrink the overall size of the image object by
storing it as a vector or Array rather than as a large Java class object?

That might be the more prudent approach.
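
As a rough illustration of that suggestion, here is a minimal sketch in plain
Scala. The types Pixel, ObjectImage and PackedImage are hypothetical (the
thread does not show the real image class), and the three-bytes-per-pixel RGB
layout is an assumption. The idea is only that one primitive array avoids
allocating a separate JVM object per element.

```scala
// Hypothetical object-per-pixel layout: every Pixel is its own heap object.
final case class Pixel(r: Byte, g: Byte, b: Byte)
final case class ObjectImage(width: Int, height: Int, pixels: Seq[Pixel])

// Assumed flat layout: one primitive array, three bytes (R, G, B) per pixel,
// stored row by row, with no per-pixel objects.
final case class PackedImage(width: Int, height: Int, rgb: Array[Byte]) {
  // Reads one pixel back out of the flat array.
  def pixel(x: Int, y: Int): (Byte, Byte, Byte) = {
    val i = (y * width + x) * 3
    (rgb(i), rgb(i + 1), rgb(i + 2))
  }
}

object PackedImage {
  // Copies the channel values of each Pixel into a single Array[Byte].
  def fromObjects(img: ObjectImage): PackedImage = {
    val out = new Array[Byte](img.width * img.height * 3)
    var i = 0
    for (p <- img.pixels) {
      out(i) = p.r
      out(i + 1) = p.g
      out(i + 2) = p.b
      i += 3
    }
    PackedImage(img.width, img.height, out)
  }
}
```

How much this saves depends on what the real class looks like, so it is worth
measuring on actual data before committing to the change.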

-RG

Richard Garris

Principal Architect

Databricks, Inc

650.200.0840

rlgar...@databricks.com

On December 14, 2017 at 10:23:00 AM, Marcelo Vanzin (van...@cloudera.com)
wrote:

This sounds like something mapPartitions should be able to do, though I'm
not sure if there's an easier way.
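
A minimal sketch of what the mapPartitions suggestion could look like, written
in plain Scala so it runs without a cluster. InputRecord, the tiles field, the
1 KB payload and the LazyExpand object are hypothetical stand-ins;
BigDataStructure is the name used in the question, but its fields here are
assumed. The function returns an Iterator instead of a Seq, so each element is
built only when the consumer asks for it.

```scala
// Hypothetical stand-ins for the types in the question.
final case class InputRecord(id: Int, tiles: Int)
final case class BigDataStructure(id: Int, tile: Int, payload: Array[Byte])

object LazyExpand {
  // Returns an Iterator rather than a Seq: each output element is constructed
  // only when next() is called, so the full list never exists at once.
  def expand(rec: InputRecord): Iterator[BigDataStructure] =
    Iterator.range(0, rec.tiles).map { t =>
      BigDataStructure(rec.id, t, new Array[Byte](1024)) // assumed 1 KB payload
    }

  // Same shape as the function Dataset.mapPartitions expects:
  // Iterator[T] => Iterator[U]. Iterator.flatMap is itself lazy.
  def expandPartition(rows: Iterator[InputRecord]): Iterator[BigDataStructure] =
    rows.flatMap(expand)
}
```

With a Dataset[InputRecord] named ds (assumed) and spark.implicits._ in scope
for the encoders, this would be passed as
ds.mapPartitions(LazyExpand.expandPartition _). It only avoids building the Seq
inside the function; what Spark does with the rows downstream (serialization,
shuffles, caching) is a separate matter.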

On Thu, Dec 14, 2017 at 10:20 AM, Don Drake <dondr...@gmail.com> wrote:
> I'm looking for some advice when I have a flatMap on a Dataset that is
> creating and returning a sequence of a new case class
> (Seq[BigDataStructure]) that contains a very large amount of data, much
> larger than the single input record (think images).
>
> In Python, you can use generators (yield) to bypass creating a large list
> of structures and returning the list.
>
> I'm programming this in Scala and was wondering if there are any similar
> tricks to optimally return a list of classes? I found the for/yield
> semantics, but it appears the compiler is just creating a sequence for you,
> and this will blow through my heap given the number of elements in the
> list and the size of each element.
>
> Is there anything else I can use?
>
> Thanks.
>
> -Don
>
> --
> Donald Drake
> Drake Consulting
> http://www.drakeconsulting.com/
> https://twitter.com/dondrake
> 800-733-2143



-- 
Marcelo

---------------------------------------------------------------------
To unsubscribe e-mail: user-unsubscr...@spark.apache.org
