I posted the code as a gist link in my earlier mail. It is a simplified version of the real code. Will that script trigger a secondary sort automatically?
If so, are there any other places I should check to understand why the cleanup of the map-reduce job takes so long?

Thanks.

Best wishes,
Stanley Xu

On Wed, Oct 31, 2012 at 11:21 PM, Alan Gates <[email protected]> wrote:

> Seeing your Pig Latin script will help us determine whether this will work
> in your case. But in general Pig uses secondary sort when you do an order
> by in a nested foreach. So if you are grouping you could order within that
> group and then pass it to your UDF.
>
> Alan.
>
> On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:
>
> > Dear buddies,
> >
> > We are trying to write some UDFs to do some machine learning work. We
> > did a simple experiment to calculate the AUC through a UDF like the
> > following code in gist
> >
> > https://gist.github.com/3985764
> >
> > The map-reduce job itself takes only a few minutes, but then waits
> > there for hours doing the cleanup.
> >
> > I guess the reason is that the sort inside the foreach generates lots
> > of data spilled to the local fs, and it takes a long time to clean that up.
> >
> > In a Java map-reduce program, we could do it as a secondary sort. We
> > make the model + ctr the key so that the same model's ctr values are sorted,
> > and group by only the model name part, so the sort is done after shuffling.
> >
> > I am wondering if we could do that kind of optimization in Pig as well?
> >
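
For anyone following the thread, here is a rough sketch of the composite-key idea described above (sort on model + ctr, group on model only). It is plain Python that only simulates the ordering behaviour, not Pig or Hadoop code, and the record layout (model, ctr, clicked), the sample values, and the function name secondary_sort are all assumptions made up for illustration.

```python
from itertools import groupby
from operator import itemgetter

# Hypothetical records: (model, ctr, clicked). The layout is an assumption.
records = [
    ("model_b", 0.30, 1),
    ("model_a", 0.10, 0),
    ("model_a", 0.70, 1),
    ("model_b", 0.05, 0),
    ("model_a", 0.40, 0),
]


def secondary_sort(rows):
    """Sort on the composite key (model, ctr), then group on model only."""
    # Stand-in for the shuffle sort: order rows by the full composite key.
    ordered = sorted(rows, key=itemgetter(0, 1))
    # Stand-in for the grouping step: compare only the model part, so each
    # group arrives with its ctr values already in ascending order.
    for model, group in groupby(ordered, key=itemgetter(0)):
        yield model, [(ctr, clicked) for _, ctr, clicked in group]


grouped = dict(secondary_sort(records))
for model, rows in grouped.items():
    print(model, rows)
```

The point of the pattern is that the per-model consumer (the AUC UDF in our case) never has to sort or buffer its own input, because each group is handed over already ordered by ctr.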
