Hello,

I have a parent/child relation, and I would like to get a sample
of this data using Pig's SAMPLE.

My question is whether COGROUP is the optimal way
of getting the child records of the sampled parent data.

orders = LOAD 'orders' AS (order_id, order_date);
order_details = LOAD 'order_details' AS (order_id, product_id, quantity);

sample_orders = SAMPLE orders 0.01;
grpd = COGROUP sample_orders BY order_id, order_details BY order_id;
-- COGROUP emits a group for every order_id found in either relation, so
-- drop the groups that have no sampled order before flattening.
matched = FILTER grpd BY NOT IsEmpty(sample_orders);
-- I want only the order_details fields in sample_order_details.
sample_order_details = FOREACH matched GENERATE FLATTEN(order_details);

STORE sample_orders INTO 'sample_orders';
STORE sample_order_details INTO 'sample_order_details';

Is this a reasonable way of getting a sample of parent records, then getting
the corresponding child records of the sampled parent records?  JOIN seems
difficult or unwieldy because I would need to project only the order_details
fields from the joined relation.
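
For comparison, a minimal sketch of what the JOIN variant might look like,
assuming the same sample_orders and order_details aliases as above. The
aliases joined and sample_order_details_j are hypothetical names used only
for illustration.

```pig
-- Inner join keeps only the order_details rows whose order_id was sampled.
joined = JOIN sample_orders BY order_id, order_details BY order_id;

-- After a JOIN, fields are disambiguated with the alias::field prefix,
-- so each order_details field has to be projected explicitly.
sample_order_details_j = FOREACH joined GENERATE
    order_details::order_id   AS order_id,
    order_details::product_id AS product_id,
    order_details::quantity   AS quantity;
```

The explicit projection is the part that feels unwieldy, since it grows with
the number of child fields.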

Thanks,
--Nate
