[ https://issues.apache.org/jira/browse/PIG-614?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Olga Natkovich resolved PIG-614.
--------------------------------

    Resolution: Duplicate

This issue will be addressed by https://issues.apache.org/jira/browse/PIG-627

> reduce io during sharing scans of the same input datasets
> ----------------------------------------------------------
>
>                 Key: PIG-614
>                 URL: https://issues.apache.org/jira/browse/PIG-614
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: types_branch
>            Reporter: Samuel Guo
>            Priority: Minor
>             Fix For: types_branch
>
>
> If we want to store different results generated from the same input
> dataset, we currently need to write two or more *STORE* clauses. These
> *STORE* clauses are translated into separate MR jobs, even though the
> jobs may share scans of the same input dataset.
>
> For example:
> The dataset 'weather' contains weather records. Each record has three
> parts: wind/air/temperature. We need to process each part of the records
> separately.
>
> We might write a Pig script like this:
>
> weather = load 'weather.txt' as (wind, air, tempreture);
> wind_results = ... wind ...;
> air_results = ...air...;
> temp_results = ...tempreture...;
> store wind_results into 'wind.results';
> store air_results into 'air.results';
> store temp_results into 'temp.results';
>
> Pig currently translates this script into three separate MR jobs, which
> run sequentially: scan 'weather.txt', process the wind data, store the wind
> results; scan 'weather.txt' again, process the air data, store the air
> results; ...
>
> If the input dataset is large, this is inefficient.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
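
As a rough illustration of the I/O cost described in the issue, the sketch below contrasts one scan per stored result with a single shared scan. It is plain Python, not Pig or Hadoop code. The sample records, the function names (scan, separate_scans, shared_scan) and the trivial per-field "processing" are assumptions made up for the example; the issue itself elides the actual processing steps.

```python
# Illustrative sketch only: plain Python, not Pig or Hadoop code.
# The records and the per-field processing below are hypothetical.

def scan(records, counter):
    """Yield every record, counting how many records are read in total."""
    for rec in records:
        counter["reads"] += 1
        yield rec


def separate_scans(records):
    """One full pass over the input per stored result."""
    counter = {"reads": 0}
    wind = [r[0] for r in scan(records, counter)]
    air = [r[1] for r in scan(records, counter)]
    temp = [r[2] for r in scan(records, counter)]
    return (wind, air, temp), counter["reads"]


def shared_scan(records):
    """A single pass over the input that feeds all three outputs."""
    counter = {"reads": 0}
    wind, air, temp = [], [], []
    for w, a, t in scan(records, counter):
        wind.append(w)
        air.append(a)
        temp.append(t)
    return (wind, air, temp), counter["reads"]


# Hypothetical (wind, air, temperature) records.
weather = [(3, 0.9, 21), (5, 0.8, 19), (2, 0.7, 25)]

sep_results, sep_reads = separate_scans(weather)
shared_results, shared_reads = shared_scan(weather)

print("records read with separate scans:", sep_reads)
print("records read with a shared scan: ", shared_reads)
```

Both variants produce the same three outputs. In this toy model the separate-scan variant reads each input record once per output, while the shared scan reads it once.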