That sounds great, thanks.

On Tue, Nov 26, 2013 at 2:46 PM, Josh Wills <[email protected]> wrote:

> JIRA is here-- https://issues.apache.org/jira/browse/CRUNCH-306
>
> The question I have right off the bat is whether we should restrict these
> outputs to PGroupedTable types, where we know that all of the records for
> the same key will be in the same partition. For arbitrary PTable types, we
> might have multiple partitions containing the same key, and we might need
> to keep a large number of output record writers open at the same time,
> which probably isn't a great idea.
>
>
> On Tue, Nov 26, 2013 at 11:50 AM, Josh Wills <[email protected]> wrote:
>
>> Hey Bryan,
>>
>> This comes up often enough that we need to prioritize the use case-- what
>> we really want is a Target that would take in a PTable<String, T> and would
>> be able to write an output file/directory for each String key. I'll create
>> a JIRA to track this.
>>
>> Josh
>>
>>
>> On Tue, Nov 26, 2013 at 11:25 AM, Bryan Baugher <[email protected]> wrote:
>>
>>> Hi everyone,
>>>
>>> I have a PCollection of avro based objects and I want to categorize
>>> these avro objects by a certain property by writing each category into a
>>> different avro file. The number of distinct categories should be small
>>> (hundreds) and the property I am categorizing on is a String. I was hoping
>>> there was some way to end up with a Map<String, PCollection> but there
>>> didn't seem to be any obvious choice. For now I have gone with a simple
>>> approach of
>>>
>>> - Find all categories (DoFn that returns PCollection<String>)
>>> - Materialize and iterate over this collection
>>> - For each category use a FilterFn to create desired categorized
>>> PCollection
>>> - Write this to avro file
>>>
>>> This works but it seems like there should be a better way to do it. Any
>>> thoughts?
>>>
>>> -Bryan
>>>
>>
>> --
>> Director of Data Science
>> Cloudera <http://www.cloudera.com>
>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>

--
-Bryan
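
To illustrate the concern raised in the thread about open record writers, here is a minimal sketch in plain Python. It does not use the Crunch API. The function names, the in-memory lists standing in for output files, and the sample records are all hypothetical. It only models one partition of (key, value) records and counts how many per-key "writers" would have to be open at once, assuming a writer can be closed as soon as no more records for its key can arrive.

```python
from collections import defaultdict
from itertools import groupby


def write_grouped(records):
    """Records arrive grouped by key, as they would after a group-by-key step.

    Each key's records are contiguous, so one writer is opened per key and
    closed before the next key starts.
    Returns (outputs_by_key, peak_open_writers).
    """
    outputs = {}
    peak = 0
    for key, group in groupby(records, key=lambda kv: kv[0]):
        # One "writer" (a list here) is open for the duration of this group.
        outputs.setdefault(key, []).extend(value for _, value in group)
        peak = max(peak, 1)
    return outputs, peak


def write_ungrouped(records):
    """Records arrive in arbitrary key order.

    A writer must stay open for every key seen so far, because more records
    for that key may still show up later in the partition.
    Returns (outputs_by_key, peak_open_writers).
    """
    open_writers = defaultdict(list)
    peak = 0
    for key, value in records:
        open_writers[key].append(value)
        peak = max(peak, len(open_writers))
    return dict(open_writers), peak


# Hypothetical sample data: three categories interleaved in one partition.
records = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]

# sorted() is stable, so values keep their original order within each key.
grouped_out, grouped_peak = write_grouped(sorted(records, key=lambda kv: kv[0]))
ungrouped_out, ungrouped_peak = write_ungrouped(records)
```

Both functions produce the same per-key outputs. The difference is the peak number of writers held open: one for the grouped input, and one per distinct key for the ungrouped input, which with hundreds of categories per partition is the cost the thread is worried about.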
