Hi Josh,

1) Yes, a version that allowed specifying the parallelism would be very useful! I had been thinking of using scaleFactor to try to force a higher degree of parallelism, but I'm not sure that would have worked, and being able to specify the parallelism explicitly is much cleaner.
2) Yes, the difference would be a varargs array vs. an iterable as the argument so having the analogous overloaded methods to MemPipeline.typedCollectionOf would probably be best (sorry, I didn't initially notice typedCollectionOf and collectionOf each had two overloaded versions).

Thanks again!

-Ben

On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <[email protected]> wrote:

> Hey Ben,
>
> Couple of questions:
>
> 1) If one potential use case for this was running simulations, wouldn't
> you want a version of collectionOf that allowed you to specify parallelism,
> like via NLineFileSource?
> 2) collectionOf vs. collectionFrom: do you just mean like a varargs array
> vs. an Iterable as the argument difference here? I also think that whatever
> version of this I did would have to take a PType so we knew how to
> serialize the data, so they would look more like typedCollectionOf on
> MemPipeline.
>
> Thanks!
> J
>
> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <[email protected]>
> wrote:
>
>> Hi Josh,
>>
>> Thanks for the quick reply!
>>
>> For me, I think a useful API would be to have an analogous
>> MRPipeline.collectionOf
>> and also potentially a method like MRPipeline.collectionFrom that takes in
>> a Java Iterable and returns a PCollection compatible with MRPipeline.
>>
>> -Ben
>>
>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <[email protected]> wrote:
>>
>>> Hey Ben,
>>>
>>> No easy way to do it right now besides writing the data yourself, though
>>> that sort of simulation-based use case has been in the back of my mind ever
>>> since we added the NLineFileSource. What would your ideal API look like
>>> here?
>>>
>>> Thanks,
>>> J
>>>
>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <
>>> [email protected]> wrote:
>>>
>>>> Hi,
>>>>
>>>> I'm trying to write a Crunch job to generate a large amount of
>>>> simulated data. To kick the job off, I need inputs into a do function.
>>>> These inputs are essentially dummy values that will be ignored in the do
>>>> fn. To accomplish this, I'd like to create an inmemory PCollection that
>>>> can then be passed into a MR pipeline, but if I do this with
>>>> MemPipeline.collectionOf
>>>> I get an error:
>>>>
>>>> Exception in thread "main" java.lang.IllegalStateException: named 'null' cannot be serialized
>>>> at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>>> at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>>
>>>> Is it possible to explicitly declare/instantiate a PCollection to pass
>>>> into an MRPipeline?
>>>>
>>>> Thanks!
>>>>
>>>> -Ben
>>>>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
>
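
For readers following the thread, the two ideas discussed (a varargs overload paired with an Iterable overload, and an explicit parallelism setting) can be sketched in plain Java. This is only an illustration and not Crunch code: the class SeedCollections and its split method are hypothetical names, a java.util.List stands in for a PCollection, and no PType or serialization is modeled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in, not part of Crunch: a List plays the role of a PCollection.
final class SeedCollections {

    // Varargs overload: wraps the array and delegates to the Iterable overload.
    @SafeVarargs
    static <T> List<T> collectionOf(T... items) {
        return collectionOf(Arrays.asList(items));
    }

    // Iterable overload: copies the elements into a new list.
    static <T> List<T> collectionOf(Iterable<T> items) {
        List<T> out = new ArrayList<>();
        for (T item : items) {
            out.add(item);
        }
        return out;
    }

    // Assumed meaning of "parallelism" for this sketch: deal the items
    // round-robin into that many groups, one group per worker.
    static <T> List<List<T>> split(Iterable<T> items, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        List<List<T>> groups = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            groups.add(new ArrayList<>());
        }
        int next = 0;
        for (T item : items) {
            groups.get(next).add(item);
            next = (next + 1) % parallelism;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Integer> seeds = collectionOf(1, 2, 3, 4, 5); // varargs overload
        List<Integer> copy = collectionOf(seeds);          // Iterable overload
        System.out.println(split(copy, 2));                // prints [[1, 3, 5], [2, 4]]
    }
}
```

Passing a List resolves to the Iterable overload because Java only considers variable-arity invocation after fixed-arity matching fails, so the two overloads can coexist without ambiguity.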
