[jira] Updated: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-554:
---
Attachment: frjofflat.patch

Fragment Replicate Join
---
Key: PIG-554
URL: https://issues.apache.org/jira/browse/PIG-554
Project: Pig
Issue Type: New Feature
Affects Versions: types_branch
Reporter: Shravan Matthur Narayanamurthy
Assignee: Shravan Matthur Narayanamurthy
Fix For: types_branch
Attachments: frjofflat.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge table and a very small table (small enough to fit in memory) and the join doesn't expand the data by much. The idea is to distribute the processing of the huge file by fragmenting it and replicating the small file to all machines receiving a fragment of the huge file. Because the entire small file is available on each machine, the join becomes a trivial task without needing any break in the pipeline.

Exhaustive tests have been done to determine the improvement we get out of FRJ. Will post the details in a wiki and add a link here.

The patch makes changes to parts of the code where new operators are introduced. Currently, when a new operator is introduced, its alias is not set. For schema computation I have modified this behaviour to set the alias of the new operator to that of its predecessor.

The logical side of the patch mimics the cogroup behavior, as the join syntax closely resembles that of cogroup. Currently, this patch doesn't have support for joins other than inner joins. The rest of the code has been documented.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
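The fragment-and-replicate idea in the description can be sketched roughly as below. This is only an illustration, not the code in the patch: the class `FrjSketch`, its methods, and the row layout (string arrays with the join key in column 0) are all assumptions made for the example. Each task would hold the whole small input in a hash map and stream its own fragment of the huge input past it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; names and row layout are hypothetical, not Pig's FRJ operator. */
final class FrjSketch {
    // Index the small (replicated) input in memory: join key -> rows with that key.
    static Map<String, List<String[]>> index(List<String[]> small) {
        Map<String, List<String[]>> byKey = new HashMap<>();
        for (String[] row : small) {
            byKey.computeIfAbsent(row[0], k -> new ArrayList<>()).add(row);
        }
        return byKey;
    }

    // Inner-join one fragment of the huge input against the replicated index.
    static List<String[]> joinFragment(List<String[]> fragment,
                                       Map<String, List<String[]>> byKey) {
        List<String[]> out = new ArrayList<>();
        for (String[] big : fragment) {
            List<String[]> matches = byKey.get(big[0]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String[] small : matches) {
                String[] joined = new String[big.length + small.length];
                System.arraycopy(big, 0, joined, 0, big.length);
                System.arraycopy(small, 0, joined, big.length, small.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> small = Arrays.asList(
            new String[]{"1", "x"}, new String[]{"2", "y"});
        List<String[]> fragment = Arrays.asList(
            new String[]{"1", "a"}, new String[]{"3", "b"}, new String[]{"1", "c"});
        for (String[] row : joinFragment(fragment, index(small))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Because the index is built once per task and probed per row, no shuffle of the huge input is needed, which is the "no break in the pipeline" property the description refers to.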
[jira] Updated: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-554:
---
Attachment: frjofflat1.patch

Attachments: frjofflat.patch, frjofflat1.patch

Exhaustive tests have been done to determine the improvement we get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin
[jira] Commented: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655938#action_12655938 ]

Shravan Matthur Narayanamurthy commented on PIG-554:

The latest one has the fixes mentioned above. Please take a look.
[jira] Commented: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658834#action_12658834 ]

Shravan Matthur Narayanamurthy commented on PIG-554:

1) Consider the following script:

    A = load 'file1';
    B = load 'file2';
    C = filter A by $0 > 10;
    D = filter B by $0 > 10;
    E = join C by $0, D by $0 using replicated;

We need to materialize the result of D before we can use it as the replicated input. Also, DC (the distributed cache) has not been used, as it doesn't support directories iirc (we would have to handle many complications manually) and the load specification in Pig can contain regexps too. Also, as the size of the replicated file is small, it doesn't make too much difference.

2) Instead of writing all the code to handle the various combinations of the group item specification, I chose to use LR (local rearrange), which already does it. I think I store only the plain tuple (extracted from the LR output) and not the LR output in the hashtables, so it doesn't add any memory overhead. The LR is used only to separate out the key and the value, and these are stored as a mapping from key to value (plain tuples).
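Point 2 of the comment above (separate key from value, then keep only plain tuples in the table) might look roughly like this. It is a sketch under stated assumptions: `ReplicatedTableSketch`, `keyOf`, and `build` are hypothetical names, rows are modelled as `List<Object>` rather than Pig tuples, and the key-column array stands in for whatever the local rearrange derives from the join specification.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Illustrative only; does not use Pig's tuple or local rearrange classes. */
final class ReplicatedTableSketch {
    // Project the key columns out of a row. A List is used as the key so that
    // equals/hashCode work for composite (multi-column) keys.
    static List<Object> keyOf(List<Object> row, int[] keyCols) {
        List<Object> key = new ArrayList<>(keyCols.length);
        for (int c : keyCols) {
            key.add(row.get(c));
        }
        return key;
    }

    // Store only the plain row under its key; no intermediate wrapper is retained.
    static Map<List<Object>, List<List<Object>>> build(List<List<Object>> rows,
                                                       int[] keyCols) {
        Map<List<Object>, List<List<Object>>> table = new HashMap<>();
        for (List<Object> row : rows) {
            table.computeIfAbsent(keyOf(row, keyCols), k -> new ArrayList<>()).add(row);
        }
        return table;
    }

    public static void main(String[] args) {
        List<List<Object>> rows = Arrays.asList(
            Arrays.<Object>asList("a", 1, "x"),
            Arrays.<Object>asList("a", 1, "y"),
            Arrays.<Object>asList("b", 2, "z"));
        Map<List<Object>, List<List<Object>>> table = build(rows, new int[]{0, 1});
        // Number of rows stored under the composite key ("a", 1).
        System.out.println(table.get(Arrays.<Object>asList("a", 1)).size());
    }
}
```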
[jira] Commented: (PIG-554) Fragment Replicate Join
[ https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661449#action_12661449 ]

Shravan Matthur Narayanamurthy commented on PIG-554:

(1) is a good catch! Really hadn't thought about this.

(2) Hashtable to HashMap is fine, but should we be storing DataBag instead of List? I thought DataBag took more space than List, because of which the number of tuples we can handle decreases.

I think you forgot to include TestFRJoin in the patch. Rest looks good to me.

Attachments: frjofflat.patch, frjofflat1.patch, PIG-554-v3.patch
[jira] Commented: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF
[ https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663306#action_12663306 ]

Shravan Matthur Narayanamurthy commented on PIG-597:

The exception is thrown from ARITY, where it tries to convert the first field of the tuple into a tuple. However, since we have a star, the tuple is not wrapped inside another tuple, hence the exception. This was done to model the trunk behavior, which is that there is an implicit flatten in front of a *.

If we want to retain this behavior, then we need to change ARITY and the other functions that were written with the assumption that POUserFunc wraps everything inside a tuple, though most of these functions will be useless when we have a UDF that outputs a tuple. To give an example, say we have a function that returns a tuple and we want to find its arity: ARITY(TupleRetUDF(*)) will always return one, since POUserFunc will wrap the output of TupleRetUDF into another tuple and ARITY is changed to return just the size of the input tuple and not the size of the first field.

However, if we comment out this code, then we need to modify FindQuantiles to account for the fact that everything will be wrapped inside a tuple and the behavior is not conditional upon the use of a star. I think this is better, and Olga seems to agree as per her previous comment.

Any other thoughts? Retain the trunk behavior or change it?

Pig does not handdle correctly the case where * is passed to UDF
--
Key: PIG-597
URL: https://issues.apache.org/jira/browse/PIG-597
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Shravan Matthur Narayanamurthy

Script:
==
    A = LOAD 'foo' USING PigStorage('\t');
    B = FILTER A BY ARITY(*) < 5;
    DUMP B;

Error:
=
2009-01-05 21:46:56,355 [main] ERROR org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc - Caught error from UDF org.apache.pig.builtin.ARITY[org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple [org.apache.pig.data.DataByteArray cannot be cast to org.apache.pig.data.Tuple]

Problem:
===
Santhosh tracked this to the following code in POUserFunc.java:

    if(op instanceof POProject && op.getResultType() == DataType.TUPLE){
        POProject projOp = (POProject)op;
        if(projOp.isStar()){
            Tuple trslt = (Tuple) temp.result;
            Tuple rslt = (Tuple) res.result;
            for(int i=0;i<trslt.size();i++)
                rslt.append(trslt.get(i));
            continue;
        }
    }

It seems to be unwrapping the tuple before passing it to the function. There are no comments, so we are not sure why it is there; we will need to run tests to see if removing it would solve this issue and not create others.
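The difference between the two behaviors discussed above can be illustrated with a small stand-in. This is a sketch only: `StarArgSketch` and its methods are hypothetical, tuples are modelled as `List<Object>`, and `arityOfFirstField` merely imitates the behavior the comment attributes to ARITY (reading the first field as a tuple); it is not Pig's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only; plain Java lists stand in for Pig tuples. */
final class StarArgSketch {
    // Imitates a UDF that expects its first field to be a tuple and returns its size.
    static int arityOfFirstField(List<Object> input) {
        return ((List<?>) input.get(0)).size();
    }

    // Wrapped: the whole record becomes the single field of the UDF input.
    static List<Object> wrap(List<Object> record) {
        List<Object> input = new ArrayList<>();
        input.add(record);
        return input;
    }

    // Flattened: the record's fields are appended one by one, as the star code does.
    static List<Object> flatten(List<Object> record) {
        List<Object> input = new ArrayList<>();
        for (Object field : record) {
            input.add(field);
        }
        return input;
    }

    public static void main(String[] args) {
        List<Object> record = Arrays.<Object>asList("a", "b", "c");
        System.out.println(arityOfFirstField(wrap(record)));
        try {
            arityOfFirstField(flatten(record));
        } catch (ClassCastException e) {
            // The first field is a plain value, not a tuple, so the cast fails,
            // which is analogous to the cast error reported in this issue.
            System.out.println("cast failed on flattened input");
        }
    }
}
```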
[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF
[ https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-597:
---
Status: Patch Available (was: Open)

Fixed by commenting out the POStar code in POUserFunc, with minor changes to FindQuantiles and TestFRJoin. Changed TestBuiltin to include the commented assert statement for arity.

Attachments: 597.patch
[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF
[ https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-597:
---
Attachment: 597.patch
[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF
[ https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-597:
---
Attachment: (was: 597.patch)
[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF
[ https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-597:
---
Attachment: 597.patch
[jira] Updated: (PIG-553) EvalFunc.finish() not getting called
[ https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-553:
---
Status: Patch Available (was: Open)

Fixed the local mode by writing a visitor that calls the EvalFunc.finish() method. Waiting for comments from others on my earlier query.

EvalFunc.finish() not getting called
---
Key: PIG-553
URL: https://issues.apache.org/jira/browse/PIG-553
Project: Pig
Issue Type: Bug
Affects Versions: types_branch
Environment: local mode
Reporter: Christopher Olston
Assignee: Shravan Matthur Narayanamurthy

My EvalFunc's finish() method doesn't seem to get invoked.
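The visitor approach mentioned above could be sketched as follows. Everything here is hypothetical and simplified for illustration: `Udf` and `Operator` are stand-ins, not Pig's `EvalFunc` or physical operator classes, and the plan is assumed to be a tree (in a plan with shared operators, a real visitor would need to avoid visiting the same operator twice).

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only; Udf and Operator are stand-ins, not Pig classes. */
final class FinishVisitorSketch {
    interface Udf {
        void finish();
    }

    static final class Operator {
        final Udf udf; // null when the operator wraps no UDF
        final List<Operator> inputs = new ArrayList<>();

        Operator(Udf udf) {
            this.udf = udf;
        }
    }

    // Depth-first walk over the plan, calling finish() on every operator that
    // carries a UDF. Returns how many finish() calls were made.
    static int finishAll(Operator root) {
        int called = 0;
        for (Operator input : root.inputs) {
            called += finishAll(input);
        }
        if (root.udf != null) {
            root.udf.finish();
            called++;
        }
        return called;
    }

    public static void main(String[] args) {
        Operator load = new Operator(null);
        Operator udfOp = new Operator(() -> System.out.println("finish() called"));
        Operator store = new Operator(null);
        udfOp.inputs.add(load);
        store.inputs.add(udfOp);
        System.out.println(finishAll(store));
    }
}
```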
[jira] Updated: (PIG-553) EvalFunc.finish() not getting called
[ https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-553:
---
Attachment: 553.patch
[jira] Updated: (PIG-615) Wrong number of jobs with limit
[ https://issues.apache.org/jira/browse/PIG-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-615:
---
Status: Patch Available (was: Open)

Here is the change I suggested, as a patch. Hope this is what was expected.

Wrong number of jobs with limit
---
Key: PIG-615
URL: https://issues.apache.org/jira/browse/PIG-615
Project: Pig
Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Shravan Matthur Narayanamurthy
Attachments: 615.patch
[jira] Updated: (PIG-553) EvalFunc.finish() not getting called
[ https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-553:
---
Attachment: 553.patch

Created a test case that tests the calling of finish() in both MR and local mode, and in each mode covers both scenarios: one where the UDF is in the map phase and another where the UDF is in the reduce phase. Note that I have not used the minicluster, because the MR job will run in another VM and it's overkill to use IPC to figure out whether finish() was called.
[jira] Updated: (PIG-553) EvalFunc.finish() not getting called
[ https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-553:
---
Attachment: (was: 553.patch)
[jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-545:
---
Status: Patch Available (was: Open)

PERFORMANCE: Sampler for order bys does not produce a good distribution
---
Key: PIG-545
URL: https://issues.apache.org/jira/browse/PIG-545
Project: Pig
Issue Type: Bug
Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
Fix For: types_branch

In running tests on actual data, I've noticed that the final reduce of an order by has skewed partitions. Some reduces finish in a few seconds while some run for 20 minutes. Getting a better distribution should lead to much better performance for order by.
[jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671222#action_12671222 ]

Shravan Matthur Narayanamurthy commented on PIG-545:

It worked with the changes to sort. It produced an even distribution and took about 3 minutes less. There is still some slight tuning to be done, as the first partition is not getting enough data. I will try to tweak it a bit and check whether it's better than the current one.

Assignee: Pradeep Kamath
Attachments: WRP.patch
[jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution
[ https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shravan Matthur Narayanamurthy updated PIG-545:
---
Attachment: WRP1.patch

Ran some tests and this quantiles scheme seems to have the least deviation from perfect distribution. Also, the time taken for L10 has reduced. It took 8 mins vs 7 mins for the old code. But it produces a good distribution as shown below.

The patch also modifies MRCompiler to fix sort on multiple fields with different order for each column.

New algorithm:
{noformat}
/part-0r 3396866140
/part-1r 3388565356
/part-2r 3412419093
/part-3r 3404673062
/part-4r 3407805613
/part-5r 3399685590
/part-6r 3374470156
/part-7r 3407210410
/part-8r 3392022575
/part-9r 3403592598
/part-00010r 3407005509
/part-00011r 3392739807
/part-00012r 3407132246
/part-00013r 3393974442
/part-00014r 3394310422
/part-00015r 3397676923
/part-00016r 3408960794
/part-00017r 3407120924
/part-00018r 339878
/part-00019r 3398831802
/part-00020r 3381319493
/part-00021r 3397961816
/part-00022r 3408716378
/part-00023r 3401850651
/part-00024r 3394624621
/part-00025r 3411533286
/part-00026r 3397598333
/part-00027r 3402013011
/part-00028r 3412664722
/part-00029r 3390615865
/part-00030r 3402257701
/part-00031r 3404278892
/part-00032r 3408376085
/part-00033r 3403230193
/part-00034r 3396062725
/part-00035r 3403166437
/part-00036r 3396123295
/part-00037r 3400208557
/part-00038r 3396028297
/part-00039r 3428541846
{noformat}

Old algorithm:
{noformat}
/part-0r 339703
/part-1r 3396917259
/part-2r 3388958263
/part-3r 3412109839
/part-4r 3405626251
/part-5r 3411808194
/part-6r 3385084639
/part-7r 3618796205
/part-8r 359754649
/part-9r 3506719655
/part-00010r 3403039137
/part-00011r 3406540458
/part-00012r 3395629722
/part-00013r 3404795418
/part-00014r 3394881722
/part-00015r 3393959841
/part-00016r 3398194260
/part-00017r 3408370148
/part-00018r 3334248039
/part-00019r 3260118680
/part-00020r 3642453106
/part-00021r 3383168594
/part-00022r 3364791108
/part-00023r 3408601454
/part-00024r 3404588449
/part-00025r 3392940424
/part-00026r 3413354408
/part-00027r 3412538285
/part-00028r 3385894942
/part-00029r 3412674723
/part-00030r 3392572446
/part-00031r 3403012671
/part-00032r 3398679596
/part-00033r 3410864380
/part-00034r 3405389743
/part-00035r 3397248129
/part-00036r 3401438264
/part-00037r 3396456821
/part-00038r 3402122621
/part-00039r 3816408998
{noformat}

Attachments: WRP.patch, WRP1.patch
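One simple way to put a number on "deviation from perfect distribution" is the largest relative distance of any partition from the mean partition size. The sketch below is an assumption about how such a comparison could be made, not the measurement used in the comment above; `PartitionSkewSketch` is a hypothetical name and the sizes in `main` are made-up values, not the figures listed in the message.

```java
/** Illustrative only; the metric and the sample sizes are assumptions. */
final class PartitionSkewSketch {
    // Largest relative deviation of any partition from the mean size.
    // 0.0 means every partition is the same size (a perfect distribution).
    static double maxRelativeDeviation(long[] sizes) {
        double mean = 0;
        for (long s : sizes) {
            mean += s;
        }
        mean /= sizes.length;

        double worst = 0;
        for (long s : sizes) {
            worst = Math.max(worst, Math.abs(s - mean) / mean);
        }
        return worst;
    }

    public static void main(String[] args) {
        long[] even = {100, 100, 100, 100};   // hypothetical, perfectly balanced
        long[] skewed = {50, 100, 150, 100};  // hypothetical, unbalanced
        System.out.println(maxRelativeDeviation(even));
        System.out.println(maxRelativeDeviation(skewed));
    }
}
```

A lower value for one partitioning scheme than for another on the same data would indicate a more even spread of work across reducers.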