[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amir Youssefi updated PIG-460:

    Attachment: sampler2.patch

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

                Key: PIG-460
                URL: https://issues.apache.org/jira/browse/PIG-460
            Project: Pig
         Issue Type: Bug
   Affects Versions: types_branch
           Reporter: Alan Gates
           Assignee: Amir Youssefi
            Fix For: types_branch
        Attachments: sampler.patch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort

It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
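To illustrate the proposal in the issue description, here is a minimal Python sketch of the idea of sampling inside an operator while the data streams through, rather than in a separate loader pass. It is not Pig code and not the contents of the attached patches: the function names, the use of reservoir sampling, and the in-memory lists standing in for MR job inputs and outputs are all assumptions made for illustration.

```python
import bisect
import random


def sample_while_streaming(records, sample_size, rng):
    """Merged jobs 1 and 2 (hypothetical): read each record once, pass it
    through, and keep a uniform reservoir sample along the way."""
    passed_through = []  # stands in for the stored job output
    reservoir = []
    for seen, record in enumerate(records):
        passed_through.append(record)
        if seen < sample_size:
            reservoir.append(record)
        else:
            # Replace a random slot with probability sample_size / (seen + 1).
            slot = rng.randint(0, seen)
            if slot < sample_size:
                reservoir[slot] = record
    return passed_through, reservoir


def find_quantiles(sample, num_partitions):
    """Pick num_partitions - 1 cut points that split the sorted sample evenly."""
    ordered = sorted(sample)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // num_partitions]
            for i in range(1, num_partitions)]


def range_partition_and_sort(records, cut_points):
    """Final job (hypothetical): route records by cut point, sort each partition."""
    partitions = [[] for _ in range(len(cut_points) + 1)]
    for record in records:
        partitions[bisect.bisect_right(cut_points, record)].append(record)
    return [sorted(partition) for partition in partitions]
```

Because the cut points are in ascending order, concatenating the sorted partitions in order gives a totally ordered result. The point of the sketch is only that the sample can be collected during the first read of the data, so no separate sampling pass over a BinStorage copy is needed.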
[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amir Youssefi updated PIG-460:

    Attachment: sampler2.patch

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

                Key: PIG-460
                URL: https://issues.apache.org/jira/browse/PIG-460
            Project: Pig
         Issue Type: Bug
   Affects Versions: types_branch
           Reporter: Alan Gates
           Assignee: Amir Youssefi
            Fix For: types_branch
        Attachments: sampler.patch, sampler2.patch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort

It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amir Youssefi updated PIG-460:

    Attachment: (was: sampler2.patch)

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

                Key: PIG-460
                URL: https://issues.apache.org/jira/browse/PIG-460
            Project: Pig
         Issue Type: Bug
   Affects Versions: types_branch
           Reporter: Alan Gates
           Assignee: Amir Youssefi
            Fix For: types_branch
        Attachments: sampler.patch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort

It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2
[ https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amir Youssefi updated PIG-460:

    Attachment: sampler2.patch

PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

                Key: PIG-460
                URL: https://issues.apache.org/jira/browse/PIG-460
            Project: Pig
         Issue Type: Bug
   Affects Versions: types_branch
           Reporter: Alan Gates
           Assignee: Amir Youssefi
            Fix For: types_branch
        Attachments: sampler.patch, sampler2.patch

Currently order by is done in three MR jobs:
job 1: read data in whatever loader the user requests, store using BinStorage
job 2: load using RandomSampleLoader, find quantiles
job 3: load data again and sort

It is done this way because RandomSampleLoader extends BinStorage, and so needs the data in that format to read it. If the logic in RandomSampleLoader was made into an operator instead of being in a loader then jobs 1 and 2 could be merged. On average job 1 takes about 15% of the time of an order by script.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.