Re: In memory PCollection for use in MRPipeline

Josh Wills Thu, 22 Jan 2015 10:14:08 -0800

The in-memory and Spark versions are pretty easy, the MR one will be a bit
more work. Will track this at
https://issues.apache.org/jira/browse/CRUNCH-489


J

On Wed, Jan 21, 2015 at 9:24 PM, Benjamin Mears <[email protected]>
wrote:

> Hi Josh,
>
> 1) Yes, having a version that allowed a specification of parallelism would
> be very useful!  I had been thinking of using scaleFactor to try to force a
> higher degree of parallelism but not sure if that would have worked and
> being able to explicitly specify the parallelism is much cleaner.
>
> 2) Yes, the difference would be a varargs array vs. an iterable as the
> argument so having the analogous overloaded methods to
> MemPipeline.typedCollectionOf would probably be best (sorry, I didn't
> initially notice typedCollectionOf and collectionOf each had two overloaded
> versions).
>
> Thanks again!
>
> -Ben
>
>
> On Wed, Jan 21, 2015 at 8:58 PM, Josh Wills <[email protected]> wrote:
>
>> Hey Ben,
>>
>> Couple of questions:
>>
>> 1) If one potential use case for this was running simulations, wouldn't
>> you want a version of collectionOf that allowed you to specify parallelism,
>> like via NLineFileSource?
>> 2) collectionOf vs. collectionFrom: do you just mean like a varargs array
>> vs. an Iterable as the argument difference here? I also think that whatever
>> version of this I did would have to take a PType so we knew how to
>> serialize the data, so they would look more like typedCollectionOf on
>> MemPipeline.
>>
>> Thanks!
>> J
>>
>> On Wed, Jan 21, 2015 at 7:19 PM, Benjamin Mears <[email protected]
>> > wrote:
>>
>>> Hi Josh,
>>>
>>> Thanks for the quick reply!
>>>
>>> For me, I think a useful API would be to have an analogous 
>>> MRPipeline.collectionOf
>>> and also potentially a method like MRPipeline.collectionFrom that takes in
>>> a Java Iterable and returns a PCollection compatible with MRPipeline.
>>>
>>> -Ben
>>>
>>> On Wed, Jan 21, 2015 at 11:19 AM, Josh Wills <[email protected]>
>>> wrote:
>>>
>>>> Hey Ben,
>>>>
>>>> No easy way to do it right now besides writing the data yourself,
>>>> though that sort of simulation-based use case has been in the back of my
>>>> mind ever since we added the NLineFileSource. What would your ideal API
>>>> look like here?
>>>>
>>>> Thanks,
>>>> J
>>>>
>>>> On Wed, Jan 21, 2015 at 9:01 AM, Benjamin Mears <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> I'm trying to write a Crunch job to generate a large amount of
>>>>> simulated data.  To kick the job off, I need inputs into a do function.
>>>>> These inputs are essentially dummy values that will be ignored in the do
>>>>> fn.  To accomplish this, I'd like to create an inmemory PCollection that
>>>>> can then be passed into a MR pipeline, but if I do this with 
>>>>> MemPipeline.collectionOf
>>>>> I get an error:
>>>>>
>>>>> Exception in thread "main" java.lang.IllegalStateException:  named 'null' 
>>>>> cannot be serialized
>>>>>   at 
>>>>> org.apache.crunch.impl.mem.collect.MemCollection.verifySerializable(MemCollection.java:110)
>>>>>   at 
>>>>> org.apache.crunch.impl.mem.collect.MemCollection.parallelDo(MemCollection.java:129)
>>>>>
>>>>> Is it possible to explicitly declare/instantiate a PCollection to pass 
>>>>> into an MRPipeline?
>>>>>
>>>>> Thanks!
>>>>>
>>>>> -Ben
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Director of Data Science
>>>> Cloudera <http://www.cloudera.com>
>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>>
>>>
>>>
>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>
>
>


-- 
Director of Data Science
Cloudera <http://www.cloudera.com>
Twitter: @josh_wills <http://twitter.com/josh_wills>

Re: In memory PCollection for use in MRPipeline

Reply via email to