I lose at proofreading. Had to completely rewrite a section of one of my pipelines because of that issue.
On Tue, Apr 7, 2015 at 9:25 AM David Ortiz <[email protected]> wrote:

> That would be the expectation. Depending on the number of records though,
> it's possible to start getting OutOfMemoryErrors thrown by the Hadoop
> framework during the shuffle/sort phase. Had to completely a section of
> one of my pipelines because once we ran it on production level data that
> was happening. Depending on what else you're running on the cluster, that
> particular issue will also be very disruptive to other jobs.
>
> On Tue, Apr 7, 2015 at 3:27 AM André Pinto <[email protected]> wrote:
>
>> Hi Josh,
>>
>> Yes. I guess the reasoning to not have the Iterable on Sort.sort but have
>> it on the Secondary Sort was to avoid people using it on the complete
>> data set (and it is assumed that there will never be that many records
>> with the same key, so it will be OK to iterate over those few records).
>> Seems reasonable.
>>
>> Yes, using the Iterable on a single reducer is certainly not the best way
>> to do this, but considering that there is no (simple) access to the
>> global index I think there is really no other way. At least iterating
>> over the Iterable will not move all the data into memory, right? It does
>> lazy loading, so it will just take a lot longer than doing it in parallel.
>>
>> Thanks.
>>
>> On Tue, Apr 7, 2015 at 4:06 AM, Josh Wills <[email protected]> wrote:
>>
>>> Hey Andre,
>>>
>>> Not sure what you mean precisely-- do you mean an option or method in
>>> the Sort API that would include the rank of each item?
>>>
>>> In general, I like to avoid assuming that one reducer can handle all of
>>> the data in a PCollection on API methods, which I think is what you're
>>> saying (i.e., just stream all of the data in sorted order to a single
>>> reducer).
>>>
>>> J
>>>
>>> On Mon, Apr 6, 2015 at 3:19 PM, André Pinto <[email protected]> wrote:
>>>
>>>> Hi Josh,
>>>>
>>>> Thanks for replying.
>>>>
>>>> That really sounds very hacky. I was expecting something with a little
>>>> more support from the API.
>>>>
>>>> I guess we could also use sortAndApply with a randomly generated
>>>> singleton Key for the entire set of values and then use the Iterable on
>>>> the Values to obtain the sorted index. It still looks bad though...
>>>>
>>>> Just out of curiosity, why isn't the Iterable approach also supported
>>>> on the simple Sort.sort? Sorry if this looks obvious to you, but I'm
>>>> still new to Crunch and Hadoop.
>>>>
>>>> Thanks.
>>>>
>>>> On Thu, Apr 2, 2015 at 6:36 PM, Josh Wills <[email protected]> wrote:
>>>>
>>>>> I can't think of a great way to do it-- knowing exactly which record
>>>>> you're processing (in any kind of order) in a distributed processing
>>>>> job is always somewhat fraught. Gun to my head, I would do it in two
>>>>> phases:
>>>>>
>>>>> 1) Get the name of the FileSplit for the current task-- which can be
>>>>> retrieved, although we don't make it easy. You can do it via something
>>>>> like this from inside of a map-side DoFn:
>>>>>
>>>>> InputSplit split = ((MapContext) getContext()).getInputSplit();
>>>>> FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();
>>>>>
>>>>> Then count up the number of records inside of each FileSplit. I'm not
>>>>> sure if you should disable combine files when you do this, but it
>>>>> seems like a good idea.
>>>>>
>>>>> 2) Create a new DoFn that takes the output of the previous job and
>>>>> uses it to determine exactly which record in order the currently
>>>>> processing record is, based on the sorted order of the FileSplit names
>>>>> and an internal counter that gets reset to zero for each new FileSplit.
>>>>>
>>>>> J
>>>>>
>>>>> On Thu, Apr 2, 2015 at 7:39 AM, André Pinto <[email protected]> wrote:
>>>>>
>>>>>> Hi,
>>>>>>
>>>>>> I'm trying to calculate the percentile ranks for the values of a
>>>>>> sorted PTable (i.e. at which % rank each element is within the whole
>>>>>> data set). Is there a way to do this with Crunch? Seems that we would
>>>>>> only need to have access to the global index of the record during an
>>>>>> iteration over the data set.
>>>>>>
>>>>>> Thanks in advance,
>>>>>> André
>>>>>
>>>>> --
>>>>> Director of Data Science
>>>>> Cloudera <http://www.cloudera.com>
>>>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
>>>
>>> --
>>> Director of Data Science
>>> Cloudera <http://www.cloudera.com>
>>> Twitter: @josh_wills <http://twitter.com/josh_wills>
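To make Josh's first phase a bit more concrete, here is a minimal sketch of a map-side DoFn that tags every record with the name of the FileSplit it came from, so a groupByKey plus a sum gives the record count per split. Only the two getInputSplit lines come from Josh's snippet above; the class name, the split-name format, and the Guava Supplier import are my own assumptions.

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.Pair;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.MapContext;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;
    import com.google.common.base.Supplier;

    // Hypothetical phase-1 DoFn: emit (splitName, 1) for every record so the
    // per-split record counts can be summed downstream.
    public class SplitCountFn extends DoFn<String, Pair<String, Long>> {
      @Override
      public void process(String record, Emitter<Pair<String, Long>> emitter) {
        // Josh's snippet for reaching the underlying FileSplit; the exact
        // Supplier class (assumed to be Guava's here) may differ by Crunch version.
        InputSplit split = ((MapContext) getContext()).getInputSplit();
        FileSplit baseSplit = (FileSplit) ((Supplier<InputSplit>) split).get();

        // Path plus a zero-padded start offset, so the names sort lexically in
        // the same order as the data within and across files.
        String splitName = String.format("%s:%020d",
            baseSplit.getPath(), baseSplit.getStart());
        emitter.emit(Pair.of(splitName, 1L));
      }
    }

Something like records.parallelDo(new SplitCountFn(), Writables.tableOf(Writables.strings(), Writables.longs())).groupByKey().combineValues(Aggregators.SUM_LONGS()) should then produce the small per-split count table to feed into the second phase.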

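For the second phase, and for André's original percentile-rank question, the bookkeeping is just cumulative sums over those per-split counts. A small sketch, assuming the phase-1 counts have been materialized into an in-memory map, that the input files are the sorted output so lexically sorted split names reflect the global order, and with all names below made up for illustration:

    import java.util.Map;
    import java.util.TreeMap;

    // Hypothetical phase-2 helper: turn per-split record counts into each
    // split's starting global index, then convert a global index into a
    // percentile rank. Assumes the counts from phase 1 fit in memory.
    public class SplitOffsets {
      private final TreeMap<String, Long> startIndexBySplit = new TreeMap<>();
      private final long totalRecords;

      public SplitOffsets(Map<String, Long> countsBySplit) {
        long runningTotal = 0L;
        // TreeMap iterates split names in sorted order, matching the sorted
        // order of FileSplit names that Josh's second step relies on.
        for (Map.Entry<String, Long> e : new TreeMap<>(countsBySplit).entrySet()) {
          startIndexBySplit.put(e.getKey(), runningTotal);
          runningTotal += e.getValue();
        }
        this.totalRecords = runningTotal;
      }

      // Global index = the split's starting offset plus the per-split counter
      // that the second DoFn resets to zero for each new FileSplit.
      public long globalIndex(String splitName, long indexWithinSplit) {
        return startIndexBySplit.get(splitName) + indexWithinSplit;
      }

      // Percentile rank of the record at a given global index: its 1-based
      // rank over the total record count, expressed as a percentage.
      public double percentileRank(long globalIndex) {
        return 100.0 * (globalIndex + 1) / totalRecords;
      }
    }

The second DoFn would then keep a counter that resets whenever the FileSplit changes, compute globalIndex(splitName, counter) for each record, and emit percentileRank(globalIndex) alongside it.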