Hi Josh,

Thanks for the quick reply!

For me, a useful API would be an MRPipeline.collectionOf analogous to MemPipeline.collectionOf, and potentially also a method like MRPipeline.collectionFrom that takes a Java Iterable and returns a PCollection compatible with MRPipeline.

-Ben

On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills wrote:

> Hey Ben,
>
> No easy way to do it right now besides writing the data yourself, though
> that sort of simulation-based use case has been in the back of my mind ever
> since we added the NLineFileSource. What would your ideal API look like
> here?
>
> Thanks,
> J
>
> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears wrote:
>
>> Hi,
>>
>> I'm trying to write a Crunch job to generate a large amount of simulated
>> data. To kick the job off, I need inputs into a do function. These inputs
>> are essentially dummy values that will be ignored in the do fn. To
>> accomplish this, I'd like to create an in-memory PCollection that can then
>> be passed into an MR pipeline, but if I do this with MemPipeline.collectionOf
>> I get an error:
>>
>> Exception in thread "main" java.lang.IllegalStateException: named 'null' cannot be serialized
>>     at org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>     at org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>
>> Is it possible to explicitly declare/instantiate a PCollection to pass into
>> an MRPipeline?
>>
>> Thanks!
>>
>> -Ben
>
> --
> Director of Data Science
> Cloudera <http://www.cloudera.com>
> Twitter: @josh_wills <http://twitter.com/josh_wills>
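
For reference, the "writing the data yourself" workaround mentioned in the thread could look roughly like the sketch below. It uses only the Java standard library to dump an Iterable of dummy seed values into a text file, one per line. The class name SeedFileWriter, the method writeSeeds, and the "seed-N" values are hypothetical, chosen only for illustration. The sketch assumes the resulting file would then be read as a text input by the MR pipeline, which on a real cluster means copying it to a filesystem the job can see; that step is not shown.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SeedFileWriter {

    /**
     * Writes one line per element of {@code seeds} to a new temporary
     * text file and returns the path of that file.
     */
    static Path writeSeeds(Iterable<String> seeds) throws IOException {
        Path file = Files.createTempFile("crunch-seeds", ".txt");
        Files.write(file, seeds, StandardCharsets.UTF_8);
        return file;
    }

    public static void main(String[] args) throws IOException {
        // Dummy inputs: the DoFn would ignore the values themselves and
        // only use them to trigger generation of simulated records.
        List<String> seeds = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            seeds.add("seed-" + i);
        }

        Path file = writeSeeds(seeds);
        System.out.println("Wrote seed file: " + file);

        // Assumption, not executed here: the file would next be handed to
        // the MR pipeline as an ordinary text input, for example via
        // Pipeline.readTextFile(file.toString()), after copying it to a
        // location the cluster can read.
    }
}
```

The number of seed lines controls how much work there is to distribute; how those lines are split across map tasks depends on the input source used to read the file, which is presumably where something like the NLineFileSource mentioned above would come in.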
