[jira] [Commented] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

2016-10-13 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574140#comment-15574140
 ] 

Daniel Dai commented on PIG-5041:
-

I don't mean the fetch, I mean the order of bag.iterator

> RoundRobinPartitioner is not deterministic when order of input records change
> -
>
> Key: PIG-5041
> URL: https://issues.apache.org/jira/browse/PIG-5041
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.16.1
>
> Attachments: PIG-5041-1.patch, PIG-5041-2.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
> - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
> - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning.
> Partitioning based on hashcode is required for correctness, but disadvantage 
> is that it
> - adds a lot of performance overhead with hashcode computation
> - with the random distribution due to hashcode (as opposed to batched 
> round robin) input records sorted on some column could get distributed to 
> different reducers and if union is followed by a store, the output can have 
> bad compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

2016-10-13 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15574013#comment-15574013
 ] 

Rohini Palaniswamy commented on PIG-5041:
-

bq. Bag can also be a key. 
  In distinct? That definitely is a possibility and I think we will have to try 
fix for that case. I don't see folks using it as a key in group by, order by or 
join. 

bq. I think that only happens when client/server running different jvm
   Mapreduce/Tez shuffle has no guarantee on the order of the values in a 
Key,List input to the reducer. It all depends on which map's shuffle 
output is fetched first during merge process. So the values in the bag can be 
of any order and it does not depend on the jvm.

> RoundRobinPartitioner is not deterministic when order of input records change
> -
>
> Key: PIG-5041
> URL: https://issues.apache.org/jira/browse/PIG-5041
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.16.1
>
> Attachments: PIG-5041-1.patch, PIG-5041-2.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
> - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
> - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning.
> Partitioning based on hashcode is required for correctness, but disadvantage 
> is that it
> - adds a lot of performance overhead with hashcode computation
> - with the random distribution due to hashcode (as opposed to batched 
> round robin) input records sorted on some column could get distributed to 
> different reducers and if union is followed by a store, the output can have 
> bad compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

2016-10-13 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573661#comment-15573661
 ] 

Daniel Dai commented on PIG-5041:
-

Bag can also be a key. This is a problem more than HashValuePartitioner. I 
think that only happens when client/server running different jvm. But if that 
happens, I believe there will be more headache than non-deterministic 
partitioner.

> RoundRobinPartitioner is not deterministic when order of input records change
> -
>
> Key: PIG-5041
> URL: https://issues.apache.org/jira/browse/PIG-5041
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.16.1
>
> Attachments: PIG-5041-1.patch, PIG-5041-2.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
> - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
> - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning.
> Partitioning based on hashcode is required for correctness, but disadvantage 
> is that it
> - adds a lot of performance overhead with hashcode computation
> - with the random distribution due to hashcode (as opposed to batched 
> round robin) input records sorted on some column could get distributed to 
> different reducers and if union is followed by a store, the output can have 
> bad compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

2016-10-13 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15573080#comment-15573080
 ] 

Koji Noguchi commented on PIG-5041:
---

Rohini, you seem to have one debug statement in the patch.

As for using hashcode, I'm not sure if it'll work for Tuple with a bag.
Looking at DefaultAbstractBag, I see that hashcode is taken without sorting.
If same logic of non-determistic order of bag applies here, I'm afraid hashcode 
may differ by retries.

Outside of this jira, I'm worried with the implementation of DefaultAbstractBag 
where taking hashcode does NOT sort but compareTo() does.

> RoundRobinPartitioner is not deterministic when order of input records change
> -
>
> Key: PIG-5041
> URL: https://issues.apache.org/jira/browse/PIG-5041
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.16.1
>
> Attachments: PIG-5041-1.patch
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
> - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
> - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning.
> Partitioning based on hashcode is required for correctness, but disadvantage 
> is that it
> - adds a lot of performance overhead with hashcode computation
> - with the random distribution due to hashcode (as opposed to batched 
> round robin) input records sorted on some column could get distributed to 
> different reducers and if union is followed by a store, the output can have 
> bad compression.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-5041) RoundRobinPartitioner is not deterministic when order of input records change

2016-10-12 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-5041?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15569348#comment-15569348
 ] 

Rohini Palaniswamy commented on PIG-5041:
-

Currently UnorderedPartitionedKVOutput+UnorderedKVInput+RoundRobinPartitioner 
only comes into play only when union optimizer is turned off or union optimizer 
could not be applied due to some condition.   We hit the issue with a script 
that had two levels of union followed by replicate join which does not have 
union optimization with vertex groups due to  PIG-3856.

{code}
a = load 'file:///tmp/input' as (x:int, y:chararray);
b = load 'file:///tmp/input' as (y:chararray, x:int);
c = union onschema a, b;" +
d = load 'file:///tmp/input1' as (x:int, z:chararray);
e = join c by x, d by x using 'replicated';
f = load 'file:///tmp/input' as (y:chararray, x:int);
g = union onschema e, f;" +
h = join g by y, d by y using 'replicated';
i = group h by x;
store i into 'file:///tmp/pigoutput';
{code}

Vertex processing c,e had unorderedinput and produced unorderedoutput with 
RoundRobinPartitioner. Rerun of one of the tasks in this vertex due to shuffle 
fetch failure caused incorrect results to be sent to vertex processing g,h. 
This is because first run processed in the order of a,b and second run 
processed in the order of b,a based on whichever shuffle file was fetched 
first. Map output of vertices processing a and b could also run into the 
non-deterministic partitioning issue if they had any operation that changed 
order of records but in this case since they read records from input file and 
output in same order without any intermediate processing those will not have 
issue on rerun even with RoundRobinPartitioner.  


> RoundRobinPartitioner is not deterministic when order of input records change
> -
>
> Key: PIG-5041
> URL: https://issues.apache.org/jira/browse/PIG-5041
> Project: Pig
>  Issue Type: Bug
>Reporter: Rohini Palaniswamy
>Assignee: Rohini Palaniswamy
>Priority: Critical
> Fix For: 0.16.1
>
>
> Maps can be rerun due to shuffle fetch failures. Half of the reducers can end 
> up successfully pulling partitions from first run of the map while other half 
> could pull from the rerun after shuffle fetch failures. If the data is not 
> partitioned by the Partitioner exactly the same way every time then it could 
> lead to incorrect results (loss of records and duplicated records). 
> There is a good probability of order of input records changing
> - With OrderedGroupedMergedKVInput (shuffle input), they keys are sorted 
> but values can be in any order as the shuffle and merge depends on the order 
> in which inputs are fetched. Anything involving FLATTEN can produce different 
> order of output records.
> - With UnorderedKVInput, the records could be in any order depending on 
> order of shuffle fetch. 
> RoundRobinPartitioner can partition records differently everytime as order of 
> input records change which is very bad. We need to get rid of 
> RoundRobinPartitioner. Since the key is empty whenever we use  
> RoundRobinPartitioner we need to partitioning based on hashcode of values to 
> produce consistent partitioning. It adds a lot of performance overhead, but 
> required for correctness.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)