[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-05-26 Thread Alan Gates (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Alan Gates updated PIG-460:
---

Resolution: Won't Fix
Status: Resolved  (was: Patch Available)

After some testing by Amir Youssefi, we determined that making this change 
actually makes performance worse. Changing RandomSampleLoader into an EvalFunc 
means that every record in the file has to be read and parsed. Hadoop 
efficiently supports skipping in the input stream, and a loader-based sampler 
can take advantage of that, so the full read is very expensive by comparison. 
Instead, we will pursue making RandomSampleLoader subsume the user's loader to 
avoid requiring a third MR job (see PIG-820).
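
To illustrate the cost difference described above, here is a minimal Python sketch. It is not Pig or Hadoop code: the function names and the newline-delimited record format are assumptions made for illustration. It contrasts a sampler that seeks within the input with one that must read every record, and counts how many records each one touches.

```python
import io
import random


def sample_by_skipping(stream, size, n_samples, rng):
    """Seek to random byte offsets and read the next complete record.

    Returns (samples, records_read). Only 2 * n_samples lines are touched,
    however large the input is. Simplified: the first record can never be
    chosen, and the choice is not uniform over records.
    """
    samples, records_read = [], 0
    for _ in range(n_samples):
        stream.seek(rng.randrange(size))
        stream.readline()  # discard the partial record we landed in
        line = stream.readline()
        records_read += 2
        if line:  # an empty result means we landed in the last record
            samples.append(line.rstrip(b"\n"))
    return samples, records_read


def sample_by_full_scan(stream, n_samples, rng):
    """Reservoir sampling: uniform, but every record has to be read."""
    reservoir, records_read = [], 0
    for i, line in enumerate(stream):
        records_read += 1
        record = line.rstrip(b"\n")
        if i < n_samples:
            reservoir.append(record)
        else:
            j = rng.randint(0, i)
            if j < n_samples:
                reservoir[j] = record
    return reservoir, records_read


# Hypothetical input: 1000 newline-delimited integer records.
data = b"".join(b"%d\n" % i for i in range(1000))
rng = random.Random(0)
skipped, skip_reads = sample_by_skipping(io.BytesIO(data), len(data), 10, rng)
scanned, scan_reads = sample_by_full_scan(io.BytesIO(data), 10, rng)
```

In this sketch the skipping sampler does a fixed amount of reading per sample, while the full scan grows with the size of the input, which is the trade-off the comment above refers to.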

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: 0.2.0
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Attachments: sampler.patch, sampler2.patch


 Currently order by is done in three MR jobs:
 job 1: read data in whatever loader the user requests, store using BinStorage
 job 2: load using RandomSampleLoader, find quantiles
 job 3: load data again and sort
 It is done this way because RandomSampleLoader extends BinStorage, and so 
 needs the data in that format to read it.
 If the logic in RandomSampleLoader were moved into an operator instead of 
 living in a loader, then jobs 1 and 2 could be merged.  On average job 1 takes 
 about 15% of the time of an order by script.
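
As a rough illustration of what jobs 2 and 3 above do with the sample, the following Python sketch picks quantile boundaries from sampled keys and uses them to route each key to a sort partition. The function names and the boundary-selection rule are assumptions for illustration only, not Pig's actual implementation.

```python
import bisect


def find_quantiles(sample_keys, n_partitions):
    """Pick n_partitions - 1 boundary keys that split the sample evenly."""
    ordered = sorted(sample_keys)
    if not ordered:
        return []
    return [ordered[(i * len(ordered)) // n_partitions]
            for i in range(1, n_partitions)]


def partition_for(key, boundaries):
    """Return the index of the partition whose key range contains key."""
    return bisect.bisect_right(boundaries, key)


# Hypothetical sample of keys, standing in for the output of the sampling job.
boundaries = find_quantiles(range(100), 4)
partitions = [partition_for(k, boundaries) for k in (10, 25, 60, 99)]
```

Because every key in partition i sorts before every key in partition i + 1, sorting each partition independently and concatenating the results in order yields a total order.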

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Assignee: Benjamin Reed  (was: Amir Youssefi)
  Status: Patch Available  (was: Open)

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Benjamin Reed
 Fix For: types_branch

 Attachments: sampler.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Assignee: Amir Youssefi  (was: Benjamin Reed)
  Status: Open  (was: Patch Available)

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch

 Attachments: sampler.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Attachment: (was: sampler2.patch)

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch

 Attachments: sampler.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Attachment: sampler2.patch

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch

 Attachments: sampler.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Attachment: sampler2.patch

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch

 Attachments: sampler.patch, sampler2.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Assignee: Benjamin Francisoud  (was: Amir Youssefi)
  Status: Patch Available  (was: Open)

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Benjamin Francisoud
 Fix For: types_branch

 Attachments: sampler.patch, sampler2.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Attachment: (was: sampler2.patch)

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch

 Attachments: sampler.patch





[jira] Updated: (PIG-460) PERFORMANCE: Order by done in 3 MR jobs, could be done in 2

2009-01-28 Thread Amir Youssefi (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Amir Youssefi updated PIG-460:
--

Attachment: sampler2.patch

 PERFORMANCE:  Order by done in 3 MR jobs, could be done in 2
 

 Key: PIG-460
 URL: https://issues.apache.org/jira/browse/PIG-460
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch

 Attachments: sampler.patch, sampler2.patch

