The in-memory and Spark versions are pretty easy; the MR one will be a bit more work. I'll track this at https://issues.apache.org/jira/browse/CRUNCH-489

J

On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears <[email protected]> wrote:

> Hi Josh,
>
> 1) Yes, having a version that allowed a specification of parallelism would
> be very useful! I had been thinking of using scaleFactor to try to force a
> higher degree of parallelism but not sure if that would have worked and
> being able to explicitly specify the parallelism is much cleaner.
>
> 2) Yes, the difference would be a varargs array vs. an iterable as the
> argument so having the analogous overloaded methods to
> MemPipeline.typedCollectionOf would probably be best (sorry, I didn't
> initially notice typedCollectionOf and collectionOf each had two overloaded
> versions).
>
> Thanks again!
>
> -Ben
>
> On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <[email protected]> wrote:
>
>> Hey Ben,
>>
>> Couple of questions:
>>
>> 1) If one potential use case for this was running simulations, wouldn't
>> you want a version of collectionOf that allowed you to specify parallelism,
>> like via NLineFileSource?
>> 2) collectionOf vs. collectionFrom: do you just mean like a varargs array
>> vs. an Iterable as the argument difference here? I also think that whatever
>> version of this I did would have to take a PType so we knew how to
>> serialize the data, so they would look more like typedCollectionOf on
>> MemPipeline.
>>
>> Thanks!
>> J
>>
>> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <[email protected]> wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for the quick reply!
>>>
>>> For me, I think a useful API would be to have an analogous
>>> MRPipeline.collectionOf and also potentially a method like
>>> MRPipeline.collectionFrom that takes in a Java Iterable and returns a
>>> PCollection compatible with MRPipeline.
>>>
>>> -Ben
>>>
>>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <[email protected]> wrote:
>>>
>>>> Hey Ben,
>>>>
>>>> No easy way to do it right now besides writing the data yourself,
>>>> though that sort of simulation-based use case has been in the back of my
>>>> mind ever since we added the NLineFileSource. What would your ideal API
>>>> look like here?
>>>>
>>>> Thanks,
>>>> J
>>>>
>>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <[email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to write a Crunch job to generate a large amount of
>>>>> simulated data. To kick the job off, I need inputs into a do function.
>>>>> These inputs are essentially dummy values that will be ignored in the do
>>>>> fn. To accomplish this, I'd like to create an inmemory PCollection that
>>>>> can then be passed into a MR pipeline, but if I do this with
>>>>> MemPipeline.collectionOf I get an error:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalStateException: named 'null' cannot be serialized
>>>>>     at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>>>>     at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>>>
>>>>> Is it possible to explicitly declare/instantiate a PCollection to pass
>>>>> into an MRPipeline?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Ben
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
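
For anyone who needs the "writing the data yourself" workaround from earlier in the thread in the meantime, here is a minimal sketch of one way it might look. It is not taken from the Crunch codebase: SeedFileWriter and writeSeeds are hypothetical names, it uses only the JDK, and it assumes a local path rather than HDFS. It writes one dummy seed value per line to a text file; reading that file back through a line-oriented source is left out, since the thread does not spell out that part.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not part of Crunch): writes dummy seed values,
// one per line, so a pipeline can read them as its starting input.
public class SeedFileWriter {

    /** Writes numSeeds lines ("0", "1", ...) to path and returns the path. */
    public static Path writeSeeds(Path path, int numSeeds) throws IOException {
        if (numSeeds < 0) {
            throw new IllegalArgumentException("numSeeds must be >= 0");
        }
        List<String> lines = new ArrayList<>(numSeeds);
        for (int i = 0; i < numSeeds; i++) {
            // The values are placeholders; the DoFn is expected to ignore them.
            lines.add(Integer.toString(i));
        }
        return Files.write(path, lines, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) throws IOException {
        // Assumption: a local temp file stands in for wherever the job reads input.
        Path seeds = Files.createTempFile("crunch-seeds", ".txt");
        writeSeeds(seeds, 1000);
        int written = Files.readAllLines(seeds, StandardCharsets.UTF_8).size();
        System.out.println("Wrote " + written + " seed lines to " + seeds);
    }
}
```

The number of lines is the knob here: with a source that splits its input by line count, more seed lines should translate into more input splits, which is the parallelism control discussed above.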
