Re: How can I create an RDD with millions of entries created programmatically

2014-12-09 Thread Daniel Darabos
Ah... I think you're right about the flatMap then :). Or you could use
mapPartitions. (I'm not sure if it makes a difference.)
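
For the Java side of the question, here is a minimal sketch of the part that keeps memory flat: an Iterable whose iterator builds each object only when next() is called. It is plain Java with no Spark dependency, and the class name LazyGenerator and the factory function are hypothetical stand-ins for the real object generator. As far as I know, Spark 1.x's Java flatMap expects the function to return an Iterable while 2.x and later expect an Iterator, so either the object itself or its iterator() would be returned, depending on the version.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.IntFunction;

/** Lazily yields `count` generated objects for one seed; nothing is stored in a list. */
class LazyGenerator<T> implements Iterable<T> {
    private final int seed;
    private final int count;
    private final IntFunction<T> factory; // hypothetical stand-in for generateRandomObject

    LazyGenerator(int seed, int count, IntFunction<T> factory) {
        this.seed = seed;
        this.count = count;
        this.factory = factory;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int produced = 0;

            @Override
            public boolean hasNext() {
                return produced < count;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Each object is created on demand from a unique index.
                return factory.apply(seed * count + produced++);
            }
        };
    }
}
```

Inside a Spark job, one would parallelize a small list of, say, 1000 seeds and have the flatMap function return a LazyGenerator of 1000 objects for each seed, so a million objects flow through without ever sitting in one list. The same iterator could presumably be returned from a mapPartitions function instead.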

On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis wrote:

> Looks good, but how do I say that in Java?
> As far as I can see, sc.parallelize (in Java) has only one implementation,
> which takes a List and so requires an in-memory representation.
>
> On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <
> daniel.dara...@lynxanalytics.com> wrote:
>
>> Hi,
>> I think you have the right idea. I would not even worry about flatMap.
>>
>> val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x =>
>> generateRandomObject(x))
>>
>> Then when you try to evaluate something on this RDD, it will happen
>> partition-by-partition. So 1000 random objects will be generated at a time
>> per executor thread.
>>
>> On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis 
>> wrote:
>>
>>> I have a function which generates a Java object, and I want to explore
>>> failures which only happen when processing large numbers of these objects.
>>> The real code reads a multi-gigabyte file, but in the test code I can
>>> generate similar objects programmatically. I could create a small list,
>>> parallelize it, and then use flatMap to inflate it several times by a factor
>>> of 1000 (remember, I can hold a list of 1000 items in memory but not a
>>> million).
>>> Are there better ideas? Remember, I want to create more objects than can
>>> be held in memory at once.
>>>
>>>
>>
>
>
> --
> Steven M. Lewis PhD
> 4221 105th Ave NE
> Kirkland, WA 98033
> 206-384-1340 (cell)
> Skype lordjoe_com
>
>


Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Steve Lewis
Looks good, but how do I say that in Java?
As far as I can see, sc.parallelize (in Java) has only one implementation,
which takes a List and so requires an in-memory representation.

On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <
daniel.dara...@lynxanalytics.com> wrote:

> Hi,
> I think you have the right idea. I would not even worry about flatMap.
>
> val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x =>
> generateRandomObject(x))
>
> Then when you try to evaluate something on this RDD, it will happen
> partition-by-partition. So 1000 random objects will be generated at a time
> per executor thread.
>
> On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis wrote:
>
>> I have a function which generates a Java object, and I want to explore
>> failures which only happen when processing large numbers of these objects.
>> The real code reads a multi-gigabyte file, but in the test code I can
>> generate similar objects programmatically. I could create a small list,
>> parallelize it, and then use flatMap to inflate it several times by a factor
>> of 1000 (remember, I can hold a list of 1000 items in memory but not a
>> million).
>> Are there better ideas? Remember, I want to create more objects than can
>> be held in memory at once.
>>
>>
>


-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com


Re: How can I create an RDD with millions of entries created programmatically

2014-12-08 Thread Daniel Darabos
Hi,
I think you have the right idea. I would not even worry about flatMap.

val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x =>
generateRandomObject(x))

Then when you try to evaluate something on this RDD, it will happen
partition-by-partition. So 1000 random objects will be generated at a time
per executor thread.
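
To make the partition arithmetic concrete, here is a small plain-Java sketch of how an index range might be cut into slices. The formula (slice i covers indices from i * length / numSlices up to (i + 1) * length / numSlices) is my understanding of how Spark slices a parallelized collection, so treat it as an assumption rather than a guarantee. The class and method names are made up for illustration.

```java
/** Computes how many indices land in each slice of a parallelized range. */
class SliceSizes {
    /** Returns the size of slice i when `length` indices are cut into `numSlices` slices. */
    static long sliceSize(long length, int numSlices, int i) {
        // Integer division decides where each slice starts and ends.
        long start = (i * length) / numSlices;
        long end = ((i + 1) * length) / numSlices;
        return end - start;
    }
}
```

With 1,000,000 indices and 1000 slices, every slice gets 1000 indices, which matches the picture of 1000 objects being generated at a time per executor thread. With only 100 indices over 1000 slices, most slices come out empty and no slice holds more than one index.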

On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis wrote:

> I have a function which generates a Java object, and I want to explore
> failures which only happen when processing large numbers of these objects.
> The real code reads a multi-gigabyte file, but in the test code I can
> generate similar objects programmatically. I could create a small list,
> parallelize it, and then use flatMap to inflate it several times by a factor
> of 1000 (remember, I can hold a list of 1000 items in memory but not a
> million).
> Are there better ideas? Remember, I want to create more objects than can
> be held in memory at once.
>
>