Hey Ben,

Couple of questions:

1) If one potential use case for this is running simulations, wouldn't you
want a version of collectionOf that lets you specify the parallelism, e.g.
via NLineFileSource?

2) collectionOf vs. collectionFrom: do you just mean a varargs array vs. an
Iterable as the argument? I also think that whatever version of this I wrote
would have to take a PType so that we know how to serialize the data, so the
methods would look more like typedCollectionOf on MemPipeline.

Thanks!
J

On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <[email protected]> wrote:

> Hi Josh,
>
> Thanks for the quick reply!
>
> For me, I think a useful API would be to have an analogous
> MRPipeline.collectionOf
> and also potentially a method like MRPipeline.collectionFrom that takes in
> a Java Iterable and returns a PCollection compatible with MRPipeline.
>
> -Ben
>
> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <[email protected]> wrote:
>
>> Hey Ben,
>>
>> No easy way to do it right now besides writing the data yourself, though
>> that sort of simulation-based use case has been in the back of my mind ever
>> since we added the NLineFileSource. What would your ideal API look like
>> here?
>>
>> Thanks,
>> J
>>
>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <[email protected]> wrote:
>>
>>> Hi,
>>>
>>> I'm trying to write a Crunch job to generate a large amount of simulated
>>> data. To kick the job off, I need inputs into a do function. These inputs
>>> are essentially dummy values that will be ignored in the do fn. To
>>> accomplish this, I'd like to create an in-memory PCollection that can then
>>> be passed into an MR pipeline, but if I do this with MemPipeline.collectionOf
>>> I get an error:
>>>
>>> Exception in thread "main" java.lang.IllegalStateException: named 'null'
>>> cannot be serialized
>>> at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>> at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>
>>> Is it possible to explicitly declare/instantiate a PCollection to pass into
>>> an MRPipeline?
>>>
>>> Thanks!
>>>
>>> -Ben
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>

--
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>
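
To make the two questions at the top of this thread concrete, here is a minimal plain-Java sketch of the API shapes being discussed. It is not Crunch code and not a proposed implementation: ElementType is a hypothetical stand-in for a PType, java.util.List stands in for a PCollection, and the names typedCollectionOf, typedCollectionFrom, and split are chosen only for illustration. It shows a varargs overload next to an Iterable overload, both taking a type descriptor, plus one assumed way a parallelism hint could divide the input.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CollectionSketch {

    /** Hypothetical stand-in for a PType: it only records the element class. */
    static final class ElementType<T> {
        final Class<T> clazz;

        ElementType(Class<T> clazz) {
            this.clazz = clazz;
        }
    }

    /** Varargs flavor: convenient for a handful of literal values. */
    @SafeVarargs
    static <T> List<T> typedCollectionOf(ElementType<T> type, T... values) {
        return typedCollectionFrom(type, Arrays.asList(values));
    }

    /** Iterable flavor: accepts any existing Java collection or generator. */
    static <T> List<T> typedCollectionFrom(ElementType<T> type, Iterable<T> values) {
        List<T> out = new ArrayList<>();
        for (T value : values) {
            // The type descriptor is consulted for every element, which is the
            // point at which a real PType would decide how to serialize it.
            out.add(type.clazz.cast(value));
        }
        return out;
    }

    /** Assumed parallelism hint: deal the values round-robin into N parts. */
    static <T> List<List<T>> split(List<T> values, int parallelism) {
        if (parallelism < 1) {
            throw new IllegalArgumentException("parallelism must be >= 1");
        }
        List<List<T>> parts = new ArrayList<>();
        for (int i = 0; i < parallelism; i++) {
            parts.add(new ArrayList<>());
        }
        for (int i = 0; i < values.size(); i++) {
            parts.get(i % parallelism).add(values.get(i));
        }
        return parts;
    }

    public static void main(String[] args) {
        ElementType<Integer> ints = new ElementType<>(Integer.class);
        List<Integer> fromVarargs = typedCollectionOf(ints, 1, 2, 3);
        List<Integer> fromIterable = typedCollectionFrom(ints, Arrays.asList(1, 2, 3));
        System.out.println(fromVarargs.equals(fromIterable)); // true
        System.out.println(split(fromVarargs, 2));            // [[1, 3], [2]]
    }
}
```

In this sketch the varargs overload simply delegates to the Iterable one, so the only difference between the two is the argument shape. The dummy seed values from the original question would be the elements, and the number of parts would correspond to the number of parallel tasks.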
