Is it possible to use Pig to do an optimized secondary sort?

Stanley Xu Wed, 31 Oct 2012 01:20:54 -0700

Dear buddies,

We are trying to write some UDFs to do machine learning work. As a simple
experiment, we calculated the AUC through a UDF like the code in the
following gist:


https://gist.github.com/3985764

The map-reduce job itself takes only a few minutes, but then it waits for
hours in the cleanup phase.

I guess the reason is that the sort inside the foreach spills a lot of data
to the local filesystem, and cleaning that up takes a long time.

In a Java map-reduce program, we could handle this with a secondary sort. We
would make model + ctr the key, so that each model's ctr values are sorted,
and group by only the model name part of the key, so the sort is already
done once the shuffle finishes.
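
To make the idea concrete, here is a rough sketch of that secondary sort in
plain Java. It only simulates the shuffle in memory and uses no Hadoop
classes. The names SecondarySortSketch, ModelCtr and groupSorted are made up
for illustration. In a real job, a sort comparator, a partitioner and a
grouping comparator would do this work during the shuffle.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** In-memory simulation of a secondary sort; hypothetical names, no Hadoop. */
class SecondarySortSketch {

    /** Composite key: the natural key (model) plus the value to sort by (ctr). */
    static final class ModelCtr {
        final String model;
        final double ctr;

        ModelCtr(String model, double ctr) {
            this.model = model;
            this.ctr = ctr;
        }
    }

    /** Sort order for the composite key: model first, then ctr within a model. */
    static final Comparator<ModelCtr> SORT_ORDER =
            Comparator.comparing((ModelCtr k) -> k.model)
                      .thenComparingDouble(k -> k.ctr);

    /**
     * Sorts by the full composite key, then groups on the model name only,
     * so each model's ctr values arrive already sorted and need no extra
     * per-group sort.
     */
    static Map<String, List<Double>> groupSorted(List<ModelCtr> records) {
        List<ModelCtr> sorted = new ArrayList<>(records);
        sorted.sort(SORT_ORDER);

        // LinkedHashMap keeps the groups in the order the sorted keys arrive.
        Map<String, List<Double>> groups = new LinkedHashMap<>();
        for (ModelCtr r : sorted) {
            groups.computeIfAbsent(r.model, k -> new ArrayList<>()).add(r.ctr);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<ModelCtr> input = Arrays.asList(
                new ModelCtr("modelB", 0.5),
                new ModelCtr("modelA", 0.3),
                new ModelCtr("modelB", 0.2),
                new ModelCtr("modelA", 0.1));
        // Prints {modelA=[0.1, 0.3], modelB=[0.2, 0.5]}
        System.out.println(groupSorted(input));
    }
}
```

The point is that the per-model code (here, the AUC calculation) can stream
over values that are already in ctr order instead of sorting them itself.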

I am wondering whether we could do that kind of optimization in Pig as well.
