I have the following data:

records = foreach std generate
    request_date as request_date,
    SubtractDuration(time, CONCAT('PT', CONCAT((chararray)CEIL(response_time), 'S'))) as time_requested,
    ToString(SubtractDuration(time, CONCAT('PT', CONCAT((chararray)CEIL(response_time), 'S'))), 'yyyy-MM-dd') as date_requested,
    GetHour(SubtractDuration(time, CONCAT('PT', CONCAT((chararray)CEIL(response_time), 'S')))) as hour_requested,
    hour as hour,
    path as path,
    original_path as original_path,
    is_static_resource as is_static_resource,
    is_page as is_page,
    status as status,
    is_internal_host as is_internal_host,
    referrer as referrer,
    content_length as content_length,
    response_time as response_time,
    web_server as web_server,
    app_server as app_server,
    app_server_instance as app_server_instance,
    session_id as session_id,
    sold_to_party_num as sold_to_party_num,
    customer_name as customer_name,
    login_id as login_id,
    employee_id as employee_id,
    first_name as first_name,
    last_name as last_name,
    session_start_date as session_start_date,
    browser as browser,
    browser_version as browser_version,
    outlier_response_time as outlier_response_time,
    is_slow_response
And then this data:

gc_times = foreach data generate
    ToString(SubtractDuration(ToDate(date_time, 'yyyy MMM dd HH:mm:ss'), CONCAT('PT', CONCAT((chararray)stop_time_sec, 'S'))), 'yyyy-MM-dd') as start_date,
    SubtractDuration(ToDate(date_time, 'yyyy MMM dd HH:mm:ss'), CONCAT('PT', CONCAT((chararray)stop_time_sec, 'S'))) as start_time,
    ToDate(date_time, 'yyyy MMM dd HH:mm:ss') as end_time,
    GetHour(ToDate(date_time, 'yyyy MMM dd HH:mm:ss')) as hour,
    server,
    instance,
    process_id,
    stop_time_seconds; -- i.e. 1.03

I want to find the "records" whose time_requested falls between the start_time and end_time of a row in gc_times.

I was thinking of writing a UDF that accepts a bag of gc_times for a given group (date, server, instance, hour), loops through each start_time/end_time pair, and returns 'T' or 'F' depending on whether the given date_time falls between the gc start_time and end_time.

But then I thought maybe I could do the entire thing without a custom UDF. I got as far as:

datag = cogroup records by (app_server, app_server_instance, date_requested, hour_requested),
                gc_times by (server, instance, start_date, hour);
datag = foreach datag { }

In the end, I was hoping for a new field holding the number of seconds spent in gc_time, something like:

2013-04-09|09:00:32|/some/path|32.5|30.1|

where 32.5 is the total time spent and 30.1 is the time spent in gc_time.

Any ideas on how to do this in Pig without a custom UDF?

Thanks,
Christian