Re: How can I create an RDD with millions of entries created programmatically
Ah... I think you're right about the flatMap then :). Or you could use mapPartitions. (I'm not sure if it makes a difference.)

On Mon, Dec 8, 2014 at 10:09 PM, Steve Lewis wrote:
> Looks good, but how do I say that in Java?
> As far as I can see, sc.parallelize (in Java) has only one implementation,
> which takes a List, requiring an in-memory representation.
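For the Java side, here is a minimal sketch of the "small list plus flatMap" idea discussed in this thread, written against plain Java only so it runs without a cluster. The class names (SeedExpansion, SeedExpansionDemo) and the string-building generator are hypothetical stand-ins for the real object generator. The assumption is that in a Spark job you would parallelize a short List of seed indices (which does fit in memory) and return an Iterable like this one from the flatMap function. As far as I know, Spark 1.x's Java FlatMapFunction returns an Iterable and later versions return an Iterator, so adjust accordingly.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.LongFunction;

// Lazily expands one seed into `perSeed` generated objects.
// Nothing is materialized up front: each object is created only when next() is called.
final class SeedExpansion<T> implements Iterable<T> {
    private final int seed;
    private final int perSeed;
    private final LongFunction<T> generator; // hypothetical stand-in for the real generator

    SeedExpansion(int seed, int perSeed, LongFunction<T> generator) {
        this.seed = seed;
        this.perSeed = perSeed;
        this.generator = generator;
    }

    @Override
    public Iterator<T> iterator() {
        return new Iterator<T>() {
            private int produced = 0;

            @Override
            public boolean hasNext() {
                return produced < perSeed;
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                // Global index of this object across all seeds.
                long index = (long) seed * perSeed + produced;
                produced++;
                return generator.apply(index);
            }
        };
    }
}

final class SeedExpansionDemo {
    public static void main(String[] args) {
        long count = 0;
        // 1000 seeds (small enough for a List) times 1000 objects each = 1,000,000 objects,
        // but the loop never holds more than one generated object at a time.
        for (int seed = 0; seed < 1000; seed++) {
            for (String s : new SeedExpansion<>(seed, 1000, i -> "object-" + i)) {
                count++;
            }
        }
        System.out.println(count); // prints 1000000
    }
}
```

In an actual Spark job, the iterable and the generator would also need to be Serializable so they can be shipped to executors; that is left out here to keep the sketch short.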
Re: How can I create an RDD with millions of entries created programmatically
Looks good, but how do I say that in Java? As far as I can see, sc.parallelize (in Java) has only one implementation, which takes a List, requiring an in-memory representation.

On Mon, Dec 8, 2014 at 12:06 PM, Daniel Darabos <daniel.dara...@lynxanalytics.com> wrote:
> Hi,
> I think you have the right idea. I would not even worry about flatMap.
>
> val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x =>
>   generateRandomObject(x))
>
> Then when you try to evaluate something on this RDD, it will happen
> partition-by-partition. So 1000 random objects will be generated at a time
> per executor thread.

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
Re: How can I create an RDD with millions of entries created programmatically
Hi,
I think you have the right idea. I would not even worry about flatMap.

val rdd = sc.parallelize(1 to 1000000, numSlices = 1000).map(x => generateRandomObject(x))

Then when you try to evaluate something on this RDD, it will happen partition-by-partition. So 1000 random objects will be generated at a time per executor thread.

On Mon, Dec 8, 2014 at 8:05 PM, Steve Lewis wrote:
> I have a function which generates a Java object, and I want to explore
> failures which only happen when processing large numbers of these objects.
> The real code is reading a many-gigabyte file, but in the test code I can
> generate similar objects programmatically. I could create a small list,
> parallelize it, and then use flatMap to inflate it several times by a factor
> of 1000 (remember, I can hold a list of 1000 items in memory but not a
> million).
> Are there better ideas? Remember, I want to create more objects than can
> be held in memory at once.
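To sanity-check the "1000 objects at a time" figure, here is a rough Java sketch of the slice arithmetic. It assumes the elements are divided evenly by index across the slices, which is my understanding of how parallelize splits a collection; the class and method names (SliceSizes, sliceSize, emptySlices) are made up for illustration. Under that assumption, 1,000,000 elements in 1000 slices gives 1000 elements per slice, while a range of only 100 elements would leave most of the 1000 slices empty.

```java
// Sketch of an even, index-based split of `length` elements into `numSlices` slices.
final class SliceSizes {
    // Returns the number of elements that land in slice i (0-based).
    static long sliceSize(long length, int numSlices, int i) {
        long start = (i * length) / numSlices;
        long end = ((i + 1) * length) / numSlices;
        return end - start;
    }

    // Counts how many slices receive no elements at all.
    static int emptySlices(long length, int numSlices) {
        int empty = 0;
        for (int i = 0; i < numSlices; i++) {
            if (sliceSize(length, numSlices, i) == 0) {
                empty++;
            }
        }
        return empty;
    }

    public static void main(String[] args) {
        System.out.println(sliceSize(1_000_000L, 1000, 0)); // prints 1000
        System.out.println(emptySlices(1_000_000L, 1000));  // prints 0
        System.out.println(emptySlices(100L, 1000));        // prints 900
    }
}
```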