Great, thanks! -Ben
On Thu, Jan 22, 2015 at 10:12 AM, Josh Wills <[email protected]> wrote: > The in-memory and Spark versions are pretty easy, the MR one will be a bit > more work. Will track this at > https://issues.apache.org/jira/browse/CRUNCH-489 > > J > > On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears <[email protected]> > wrote: > >> Hi Josh, >> >> 1) Yes, having a version that allowed a specification of parallelism >> would be very useful! I had been thinking of using scaleFactor to try to >> force a higher degree of parallelism but not sure if that would have worked >> and being able to explicitly specify the parallelism is much cleaner. >> >> 2) Yes, the difference would be a varargs array vs. an iterable as the >> argument so having the analogous overloaded methods to >> MemPipeline.typedCollectionOf would probably be best (sorry, I didn't >> initially notice typedCollectionOf and collectionOf each had two overloaded >> versions). >> >> Thanks again! >> >> -Ben >> >> >> On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <[email protected]> wrote: >> >>> Hey Ben, >>> >>> Couple of questions: >>> >>> 1) If one potential use case for this was running simulations, wouldn't >>> you want a version of collectionOf that allowed you to specify parallelism, >>> like via NLineFileSource? >>> 2) collectionOf vs. collectionFrom: do you just mean like a varargs >>> array vs. an Iterable as the argument difference here? I also think that >>> whatever version of this I did would have to take a PType so we knew how to >>> serialize the data, so they would look more like typedCollectionOf on >>> MemPipeline. >>> >>> Thanks! >>> J >>> >>> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears < >>> [email protected]> wrote: >>> >>>> Hi Josh, >>>> >>>> Thanks for the quick reply! >>>> >>>> For me, I think a useful API would be to have an analogous >>>> MRPipeline.collectionOf >>>> and also potentially a method like MRPipeline.collectionFrom that takes in >>>> a Java Iterable and returns a PCollection compatible with MRPipeline. >>>> >>>> -Ben >>>> >>>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <[email protected]> >>>> wrote: >>>> >>>>> Hey Ben, >>>>> >>>>> No easy way to do it right now besides writing the data yourself, >>>>> though that sort of simulation-based use case has been in the back of my >>>>> mind ever since we added the NLineFileSource. What would your ideal API >>>>> look like here? >>>>> >>>>> Thanks, >>>>> J >>>>> >>>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears < >>>>> [email protected]> wrote: >>>>> >>>>>> Hi, >>>>>> >>>>>> I'm trying to write a Crunch job to generate a large amount of >>>>>> simulated data. To kick the job off, I need inputs into a do function. >>>>>> These inputs are essentially dummy values that will be ignored in the do >>>>>> fn. To accomplish this, I'd like to create an inmemory PCollection that >>>>>> can then be passed into a MR pipeline, but if I do this with >>>>>> MemPipeline.collectionOf >>>>>> I get an error: >>>>>> >>>>>> Exception in thread "main" java.lang.IllegalStateException: named >>>>>> 'null' cannot be serialized >>>>>> at >>>>>> org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110) >>>>>> at >>>>>> org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129) >>>>>> >>>>>> Is it possible to explicitly declare/instantiate a PCollection to pass >>>>>> into an MRPipeline? >>>>>> >>>>>> Thanks! >>>>>> >>>>>> -Ben >>>>>> >>>>>> >>>>> >>>>> >>>>> -- >>>>> Director of Data Science >>>>> Cloudera <http://www.cloudera.com> >>>>> Twitter: @josh_wills <http://twitter.com/josh_wills> >>>>> >>>> >>>> >>> >>> >>> -- >>> Director of Data Science >>> Cloudera <http://www.cloudera.com> >>> Twitter: @josh_wills <http://twitter.com/josh_wills> >>> >> >> > > > -- > Director of Data Science > Cloudera <http://www.cloudera.com> > Twitter: @josh_wills <http://twitter.com/josh_wills> >
