Seeing your Pig Latin script would help us determine whether this will work in
your case. In general, though, Pig uses a secondary sort when you do an ORDER BY
inside a nested foreach. So if you are grouping, you can order within each group
and then pass the sorted bag to your UDF.
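
Roughly, the pattern looks like the sketch below. The jar name, the UDF name
(myudfs.AUC), the paths, and the field names are all made up for illustration,
since I have not seen your actual script; adjust them to match yours.

```pig
-- Hypothetical jar, UDF, paths, and schema; none of these come from the real script.
REGISTER myudfs.jar;

events = LOAD 'input' AS (model:chararray, ctr:double, clicked:int);

-- Group by model only; ctr is not part of the group key.
by_model = GROUP events BY model;

-- The ORDER BY inside the nested foreach is what lets Pig apply a
-- secondary sort, so each bag is already ordered by ctr when it
-- reaches the UDF.
result = FOREACH by_model {
    sorted = ORDER events BY ctr DESC;
    GENERATE group AS model, myudfs.AUC(sorted) AS auc_value;
};

STORE result INTO 'output';
```

With this shape the UDF itself should not need to sort the bag again.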

Alan.

On Oct 31, 2012, at 1:20 AM, Stanley Xu wrote:

> Dear buddies,
> 
> We are trying to write some UDFs to do machine learning work. We ran a
> simple experiment that calculates the AUC through a UDF, as in the
> following gist:
> 
> https://gist.github.com/3985764
> 
> The map-reduce job itself takes only a few minutes, but then it waits
> for hours doing the cleanup.
> 
> I guess the reason is that the sort inside the foreach spills a lot of
> data to the local fs, which then takes a long time to clean up.
> 
> In a plain Java map-reduce job, we could do this as a secondary sort: we
> make model + ctr the key so that each model's ctr values are sorted, and
> group by only the model name part, so the sort is already done after
> shuffling.
> 
> I am wondering whether we could do that kind of optimization in Pig as well?
