from:"Shravan Matthur Narayanamurthy $JIRA$"

[jira] Updated: (PIG-554) Fragment Replicate Join

2008-12-02 Thread Shravan Matthur Narayanamurthy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shravan Matthur Narayanamurthy updated PIG-554:
---

Attachment: frjofflat.patch

Fragment Replicate Join
---

Key: PIG-554
URL: https://issues.apache.org/jira/browse/PIG-554
Project: Pig
Issue Type: New Feature
Affects Versions: types_branch
Reporter: Shravan Matthur Narayanamurthy
Assignee: Shravan Matthur Narayanamurthy
Fix For: types_branch

Attachments: frjofflat.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge
table and a very small table (small enough to fit in memory) and the join
doesn't expand the data by much. The idea is to distribute the processing of
the huge file by fragmenting it and replicating the small file to all machines
receiving a fragment of the huge file. Because the entire small file is
available on each machine, the join becomes a trivial task that needs no break
in the pipeline. Exhaustive tests have been done to determine the improvement
we get out of FRJ. Will post the details in a wiki and add a link here.
The patch makes changes to parts of the code where new operators are
introduced. Currently, when a new operator is introduced, its alias is not
set. For schema computation I have modified this behaviour to set the alias
of the new operator to that of its predecessor. The logical side of the patch
mimics the cogroup behavior, as join syntax closely resembles that of cogroup.
Currently, this patch doesn't support joins other than inner joins.
The rest of the code has been documented.
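
The idea described above can be sketched in a few lines of Java. This is only
an illustration under stated assumptions, not the code in the patch: the class
and method names (FragmentReplicateJoinSketch, buildIndex, joinFragment) and
the use of String[] rows are made up for the example. Each worker would hold
the whole small input in a hash map and stream its fragment of the huge input
past it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; all names here are hypothetical, not Pig classes.
class FragmentReplicateJoinSketch {

    // Index the small (replicated) input in memory: join key -> matching rows.
    static Map<String, List<String[]>> buildIndex(List<String[]> small, int keyCol) {
        Map<String, List<String[]>> index = new HashMap<>();
        for (String[] row : small) {
            index.computeIfAbsent(row[keyCol], k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Inner-join one fragment of the huge input against the in-memory index.
    static List<String[]> joinFragment(List<String[]> fragment, int keyCol,
                                       Map<String, List<String[]>> index) {
        List<String[]> out = new ArrayList<>();
        for (String[] row : fragment) {
            List<String[]> matches = index.get(row[keyCol]);
            if (matches == null) {
                continue; // inner join: rows without a match are dropped
            }
            for (String[] match : matches) {
                // Output row = fragment row followed by the matching small row.
                String[] joined = Arrays.copyOf(row, row.length + match.length);
                System.arraycopy(match, 0, joined, row.length, match.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> small = Arrays.asList(
                new String[] {"1", "x"}, new String[] {"2", "y"});
        List<String[]> fragment = Arrays.asList(
                new String[] {"1", "a"}, new String[] {"3", "b"}, new String[] {"1", "c"});
        for (String[] row : joinFragment(fragment, 0, buildIndex(small, 0))) {
            System.out.println(String.join(",", row));
        }
    }
}
```

Because the index is built once per worker and every fragment row is handled
independently, no shuffle or reduce step is needed in this sketch.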

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Updated: (PIG-554) Fragment Replicate Join

2008-12-12 Thread Shravan Matthur Narayanamurthy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shravan Matthur Narayanamurthy updated PIG-554:
---

Attachment: frjofflat1.patch

Fragment Replicate Join
---

Attachments: frjofflat.patch, frjofflat1.patch

Fragment Replicate Join (FRJ) is useful when we want a join between a huge
table and a very small table (small enough to fit in memory) and the join
doesn't expand the data by much. The idea is to distribute the processing of
the huge file by fragmenting it and replicating the small file to all machines
receiving a fragment of the huge file. Because the entire small file is
available on each machine, the join becomes a trivial task that needs no break
in the pipeline. Exhaustive tests have been done to determine the improvement
we get out of FRJ. Here are the details: http://wiki.apache.org/pig/PigFRJoin
The patch makes changes to parts of the code where new operators are
introduced. Currently, when a new operator is introduced, its alias is not
set. For schema computation I have modified this behaviour to set the alias
of the new operator to that of its predecessor. The logical side of the patch
mimics the cogroup behavior, as join syntax closely resembles that of cogroup.
Currently, this patch doesn't support joins other than inner joins.
The rest of the code has been documented.


[jira] Commented: (PIG-554) Fragment Replicate Join

2008-12-12 Thread Shravan Matthur Narayanamurthy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12655938#action_12655938
]

Shravan Matthur Narayanamurthy commented on PIG-554:

The latest one has the fixes mentioned above. Please take a look.


[jira] Commented: (PIG-554) Fragment Replicate Join

2008-12-23 Thread Shravan Matthur Narayanamurthy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12658834#action_12658834
]

Shravan Matthur Narayanamurthy commented on PIG-554:

1) Consider the following script:
A = load 'file1';
B = load 'file2';
C = filter A by $010;
D = filter B by $010;
E = join C by $0, D by $0 using replicated;

We need to materialize the result of D before we can use it as the replicated
input. Also, DC has not been used as it doesn't support directories iirc (we
would have to handle many complications manually), and the load specification
in pig can contain regexps too. Also, as the replicated file is small, it
doesn't make much difference.

2) Instead of writing all the code to handle the various combinations of the
group item specification, I chose to use LR, which already does it. I think I
store only the plain tuple (extracted from the LR output) and not the LR output
in the hashtables, so it doesn't add any memory overhead. The LR is used
only to separate out key and value, and these are stored as a mapping from key
to value (plain tuples).
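
To make point 2 concrete, here is a rough sketch of the key-to-tuples mapping.
It is an assumption-laden illustration, not the patch: ReplicatedTableSketch,
extractKey and load are hypothetical names, List<Object> stands in for a Pig
tuple, and extractKey only imitates the key/value separation that LR performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; hypothetical names, not Pig's LR or tuple classes.
class ReplicatedTableSketch {

    // Project the key columns out of a tuple (the role LR plays in the comment above).
    static List<Object> extractKey(List<Object> tuple, int[] keyCols) {
        List<Object> key = new ArrayList<>(keyCols.length);
        for (int col : keyCols) {
            key.add(tuple.get(col));
        }
        return key;
    }

    // Store only the plain tuples, grouped under their extracted key.
    static Map<List<Object>, List<List<Object>>> load(List<List<Object>> tuples,
                                                      int[] keyCols) {
        Map<List<Object>, List<List<Object>>> table = new HashMap<>();
        for (List<Object> tuple : tuples) {
            table.computeIfAbsent(extractKey(tuple, keyCols), k -> new ArrayList<>())
                 .add(tuple);
        }
        return table;
    }

    public static void main(String[] args) {
        List<List<Object>> tuples = new ArrayList<>();
        tuples.add(Arrays.<Object>asList("us", 1, "a"));
        tuples.add(Arrays.<Object>asList("us", 1, "b"));
        tuples.add(Arrays.<Object>asList("uk", 2, "c"));
        // Composite key made of the first two columns.
        Map<List<Object>, List<List<Object>>> table = load(tuples, new int[] {0, 1});
        System.out.println(table.get(Arrays.<Object>asList("us", 1)));
    }
}
```

The map values hold references to the original tuples, so the only extra
memory in this sketch is the small key list per distinct key.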


[jira] Commented: (PIG-554) Fragment Replicate Join

2009-01-06 Thread Shravan Matthur Narayanamurthy (JIRA)

[
https://issues.apache.org/jira/browse/PIG-554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12661449#action_12661449
]

Shravan Matthur Narayanamurthy commented on PIG-554:

(1) is a good catch! Really hadn't thought about this.

(2) Hashtable to HashMap is fine but should we be storing DataBag instead of
List? I thought DataBag took more space than List because of which the number
of tuples we can handle decreases.

I think you forgot to include TestFRJoin in the patch. Rest looks good to me.

Fragment Replicate Join
---

Attachments: frjofflat.patch, frjofflat1.patch, PIG-554-v3.patch


[jira] Commented: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF

2009-01-13 Thread Shravan Matthur Narayanamurthy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12663306#action_12663306
 ] 

Shravan Matthur Narayanamurthy commented on PIG-597:


The exception is being thrown from ARITY, where it tries to convert the
first field of the tuple into a tuple. However, since we have a star, the tuple
is not wrapped inside another tuple, hence the exception.

This was done in order to model the trunk behavior, which is that there is an
implicit flatten in front of a *. If we want to retain this behavior, then we
need to change ARITY and other functions which were written with the assumption
that POUserFunc will wrap anything inside a tuple, though most of these
functions will be useless when we have a UDF which outputs a tuple. To give an
example, say we have a function which returns a tuple and we want to find its
arity: ARITY(TupleRetUDF(*)) will always return one, since POUserFunc will wrap
the output of TupleRetUDF into another tuple and ARITY is changed to return
just the size of the input tuple and not the size of the first field.

However, if we comment out this code, then we need to modify FindQuantiles to
account for the fact that everything will be wrapped inside a tuple and the
behavior is not conditional upon the use of a star. I think this is better, and
Olga seems to agree as per her previous comment. Any other thoughts? Retain
trunk behavior or change it?
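
The two options can be illustrated with a small stand-alone sketch. Everything
in it is an assumption for the sake of the example: List<Object> stands in for
a Pig Tuple, and arityOfFirstField / arityOfInput are hypothetical names for
the two ways ARITY could be written, not the actual builtin.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only; List<Object> stands in for a Pig Tuple and the
// method names are hypothetical.
class StarArgumentSketch {

    // ARITY written on the assumption that the argument is wrapped in an outer tuple.
    static int arityOfFirstField(List<Object> input) {
        return ((List<?>) input.get(0)).size();
    }

    // ARITY changed to return the size of the input tuple itself.
    static int arityOfInput(List<Object> input) {
        return input.size();
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.<Object>asList("a", "b", "c");
        List<Object> wrapped = Collections.<Object>singletonList(row);

        // Wrapped argument: the first field is the row, so its size is reported.
        System.out.println(arityOfFirstField(wrapped));
        // Flattened argument: the input is the row itself.
        System.out.println(arityOfInput(row));
        // A wrapped tuple always has exactly one field.
        System.out.println(arityOfInput(wrapped));
        try {
            // Flattened argument with the wrapping assumption: first field is not a tuple.
            arityOfFirstField(row);
        } catch (ClassCastException e) {
            System.out.println("cast failed, as in the reported error");
        }
    }
}
```

The last call mirrors the reported failure: with an implicit flatten, the first
field is a plain value, so a function that expects a wrapped tuple cannot cast
it.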

 Pig does not handdle correctly the case where * is passed to UDF
 --

 Key: PIG-597
 URL: https://issues.apache.org/jira/browse/PIG-597
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Shravan Matthur Narayanamurthy

 Script:
 ==
 A = LOAD 'foo' USING PigStorage('\t');
 B = FILTER A BY ARITY(*)  5;
 DUMP B;
 Error:
 =
 2009-01-05 21:46:56,355 [main] ERROR
 org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POUserFunc
 - Caught error from UDF
 org.apache.pig.builtin.ARITY[org.apache.pig.data.DataByteArray cannot be cast 
 to org.apache.pig.data.Tuple [org.apache.pig.data.DataByteArray cannot be 
 cast to org.apache.pig.data.Tuple]
 Problem:
 ===
 Santhosh tracked this to the following code in POUserFunc.java:
 if(op instanceof POProject &&
         op.getResultType() == DataType.TUPLE){
     POProject projOp = (POProject)op;
     if(projOp.isStar()){
         Tuple trslt = (Tuple) temp.result;
         Tuple rslt = (Tuple) res.result;
         for(int i=0;i<trslt.size();i++)
             rslt.append(trslt.get(i));
         continue;
     }
 }
 It seems to be unwrapping the tuple before passing it to the function. There
 are no comments, so we are not sure why it is there; we will need to run tests
 to see if removing it would solve this issue and not create others.


[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF

2009-01-16 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-597:
---

Status: Patch Available  (was: Open)

Fixed by commenting out the POStar code in POUserFunc and making minor changes
to FindQuantiles and TestFRJoin. Changed TestBuiltin to include the previously
commented-out assert statement for arity.


[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF

2009-01-16 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-597:
---

Attachment: 597.patch


[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF

2009-01-16 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-597:
---

Attachment: (was: 597.patch)


[jira] Updated: (PIG-597) Pig does not handdle correctly the case where * is passed to UDF

2009-01-16 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-597?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-597:
---

Attachment: 597.patch


[jira] Updated: (PIG-553) EvalFunc.finish() not getting called

2009-01-20 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-553:
---

Status: Patch Available  (was: Open)

Fixed the local mode by writing a visitor that calls the EvalFunc.finish()
method. Waiting for comments from others on my earlier query.

 EvalFunc.finish() not getting called
 

 Key: PIG-553
 URL: https://issues.apache.org/jira/browse/PIG-553
 Project: Pig
  Issue Type: Bug
Affects Versions: types_branch
 Environment: local mode
Reporter: Christopher Olston
Assignee: Shravan Matthur Narayanamurthy

 My EvalFunc's finish() method doesn't seem to get invoked.


[jira] Updated: (PIG-553) EvalFunc.finish() not getting called

2009-01-20 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-553:
---

Attachment: 553.patch


[jira] Updated: (PIG-615) Wrong number of jobs with limit

2009-01-20 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-615?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-615:
---

Status: Patch Available  (was: Open)

Here is the change I suggested, as a patch. Hope this is what was expected.

 Wrong number of jobs with limit
 ---

 Key: PIG-615
 URL: https://issues.apache.org/jira/browse/PIG-615
 Project: Pig
  Issue Type: Bug
Reporter: Olga Natkovich
Assignee: Shravan Matthur Narayanamurthy
 Attachments: 615.patch





[jira] Updated: (PIG-553) EvalFunc.finish() not getting called

2009-01-28 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-553:
---

Attachment: 553.patch

Created a TestCase that tests the calling of finish() in both MR and local
mode, and in each mode both scenarios: one where the udf is in the map phase
and another where the udf is in the reduce phase.

Note that I have not used the minicluster because the MR job will run in
another VM and it's overkill to use IPC to figure out whether finish() was
called.
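
A rough sketch of why running in the same VM makes such a test simple: a UDF
can record the finish() call in a static field that the test reads back
directly. The classes below (Udf, RecordingUdf, Node, finishAll) are
hypothetical stand-ins invented for this illustration; they are not Pig's
EvalFunc, its physical operators, or the visitor in the patch.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; hypothetical stand-ins, not Pig classes.
class FinishCallSketch {

    // Stand-in for a UDF with a cleanup hook.
    static class Udf {
        void finish() { }
    }

    // Records the call in a static field, which is only visible to the test
    // because the UDF and the test run in the same VM.
    static class RecordingUdf extends Udf {
        static boolean finishCalled = false;

        @Override
        void finish() {
            finishCalled = true;
        }
    }

    // Stand-in for a plan operator that may hold a UDF.
    static class Node {
        final Udf udf; // null when the operator has no UDF
        final List<Node> children = new ArrayList<>();

        Node(Udf udf) {
            this.udf = udf;
        }
    }

    // Visitor-style walk: call finish() on every UDF found in the plan.
    static void finishAll(Node node) {
        for (Node child : node.children) {
            finishAll(child);
        }
        if (node.udf != null) {
            node.udf.finish();
        }
    }

    public static void main(String[] args) {
        Node root = new Node(null);
        root.children.add(new Node(new RecordingUdf()));
        finishAll(root);
        System.out.println("finish() called: " + RecordingUdf.finishCalled);
    }
}
```

If the job ran in a separate VM, the static field in the test's VM would never
change, which is the reason given above for avoiding the minicluster.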


[jira] Updated: (PIG-553) EvalFunc.finish() not getting called

2009-01-28 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-553:
---

Attachment: (was: 553.patch)


[jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

2009-02-05 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-545:
---

Status: Patch Available  (was: Open)

 PERFORMANCE: Sampler for order bys does not produce a good distribution
 ---

 Key: PIG-545
 URL: https://issues.apache.org/jira/browse/PIG-545
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Amir Youssefi
 Fix For: types_branch


 In running tests on actual data, I've noticed that the final reduce of an 
 order by has skewed partitions.  Some reduces finish in a few seconds while 
 some run for 20 minutes.  Getting a better distribution should lead to much 
 better performance for order by.


[jira] Commented: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

2009-02-06 Thread Shravan Matthur Narayanamurthy (JIRA)


[ 
https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12671222#action_12671222
 ] 

Shravan Matthur Narayanamurthy commented on PIG-545:


It worked with the changes to sort. It produced an even distribution and took
about 3 mins less. There is still some slight tuning to be done, as the first
partition is not getting enough data. I will try to tweak it a bit and check if
it's better than the current one.

 PERFORMANCE: Sampler for order bys does not produce a good distribution
 ---

 Key: PIG-545
 URL: https://issues.apache.org/jira/browse/PIG-545
 Project: Pig
  Issue Type: Bug
  Components: impl
Affects Versions: types_branch
Reporter: Alan Gates
Assignee: Pradeep Kamath
 Fix For: types_branch

 Attachments: WRP.patch



[jira] Updated: (PIG-545) PERFORMANCE: Sampler for order bys does not produce a good distribution

2009-02-09 Thread Shravan Matthur Narayanamurthy (JIRA)


 [ 
https://issues.apache.org/jira/browse/PIG-545?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Shravan Matthur Narayanamurthy updated PIG-545:
---

Attachment: WRP1.patch

Ran some tests and this quantiles scheme seems to have the least deviation from
perfect distribution. Also, the time taken for L10 has reduced. It took 8 mins
vs 7 mins for the old code. But it produces a good distribution as shown below.
The patch also modifies MRCompiler to fix sort on multiple fields with
different order for each column.
New algorithm:
{noformat}
/part-0r 3396866140
/part-1r 3388565356
/part-2r 3412419093
/part-3r 3404673062
/part-4r 3407805613
/part-5r 3399685590
/part-6r 3374470156
/part-7r 3407210410
/part-8r 3392022575
/part-9r 3403592598
/part-00010r 3407005509
/part-00011r 3392739807
/part-00012r 3407132246
/part-00013r 3393974442
/part-00014r 3394310422
/part-00015r 3397676923
/part-00016r 3408960794
/part-00017r 3407120924
/part-00018r 339878
/part-00019r 3398831802
/part-00020r 3381319493
/part-00021r 3397961816
/part-00022r 3408716378
/part-00023r 3401850651
/part-00024r 3394624621
/part-00025r 3411533286
/part-00026r 3397598333
/part-00027r 3402013011
/part-00028r 3412664722
/part-00029r 3390615865
/part-00030r 3402257701
/part-00031r 3404278892
/part-00032r 3408376085
/part-00033r 3403230193
/part-00034r 3396062725
/part-00035r 3403166437
/part-00036r 3396123295
/part-00037r 3400208557
/part-00038r 3396028297
/part-00039r 3428541846
{noformat}
Old Algorithm:
{noformat}
/part-0r 339703
/part-1r 3396917259
/part-2r 3388958263
/part-3r 3412109839
/part-4r 3405626251
/part-5r 3411808194
/part-6r 3385084639
/part-7r 3618796205
/part-8r 359754649
/part-9r 3506719655
/part-00010r 3403039137
/part-00011r 3406540458
/part-00012r 3395629722
/part-00013r 3404795418
/part-00014r 3394881722
/part-00015r 3393959841
/part-00016r 3398194260
/part-00017r 3408370148
/part-00018r 3334248039
/part-00019r 3260118680
/part-00020r 3642453106
/part-00021r 3383168594
/part-00022r 3364791108
/part-00023r 3408601454
/part-00024r 3404588449
/part-00025r 3392940424
/part-00026r 3413354408
/part-00027r 3412538285
/part-00028r 3385894942
/part-00029r 3412674723
/part-00030r 3392572446
/part-00031r 3403012671
/part-00032r 3398679596
/part-00033r 3410864380
/part-00034r 3405389743
/part-00035r 3397248129
/part-00036r 3401438264
/part-00037r 3396456821
/part-00038r 3402122621
/part-00039r 3816408998
{noformat}
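
One possible way to put a number on "deviation from perfect distribution" for
listings like the two above is the largest relative distance of any partition
size from the mean. This metric is an assumption chosen for illustration, not
necessarily what was used in the tests, and the class name, method name and
sample sizes below are hypothetical.

```java
// Illustrative sketch only; the class, method and sample sizes are hypothetical.
class PartitionSkewSketch {

    // Largest relative distance of any partition size from the mean size.
    // A perfectly even distribution gives 0.0.
    static double maxRelativeDeviation(long[] sizes) {
        double total = 0;
        for (long size : sizes) {
            total += size;
        }
        double mean = total / sizes.length;
        double worst = 0;
        for (long size : sizes) {
            worst = Math.max(worst, Math.abs(size - mean) / mean);
        }
        return worst;
    }

    public static void main(String[] args) {
        long[] even = {100, 100, 100, 100};
        long[] skewed = {50, 150, 100, 100};
        System.out.println(maxRelativeDeviation(even));
        System.out.println(maxRelativeDeviation(skewed));
    }
}
```

Feeding the byte counts of each part file into such a function would give one
figure per algorithm to compare.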

