[jira] Created: (PIG-1264) Skewed join sampler misses out the key with the highest frequency

2010-02-26 Thread Sriranjan Manjunath (JIRA)
Skewed join sampler misses out the key with the highest frequency
-

 Key: PIG-1264
 URL: https://issues.apache.org/jira/browse/PIG-1264
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Richard Ding
 Fix For: 0.7.0


I am noticing two issues with the sampler used in skewed join:
1. It does not allocate multiple reducers to the key with the highest frequency.
2. It seems to be allocating the same number of reducers to every key (8 in 
this case).

Query:

a = load 'studenttab10k' using PigStorage() as (name, age, gpa);
b = load 'votertab10k' as (name, age, registration, contributions);
e = join a by name right, b by name using "skewed" parallel 8;
store e into 'SkewedJoin_9.out';


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1266) Show spill count on the pig console at the end of the job

2010-02-26 Thread Sriranjan Manjunath (JIRA)
Show spill count on the pig console at the end of the job
-

 Key: PIG-1266
 URL: https://issues.apache.org/jira/browse/PIG-1266
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath


Currently the spill count is displayed only on the job tracker log. It should 
be displayed on the console as well.




[jira] Updated: (PIG-1266) Show spill count on the pig console at the end of the job

2010-02-26 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1266:
-

Status: Patch Available  (was: Open)

 Show spill count on the pig console at the end of the job
 -

 Key: PIG-1266
 URL: https://issues.apache.org/jira/browse/PIG-1266
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1266.patch





[jira] Updated: (PIG-1266) Show spill count on the pig console at the end of the job

2010-02-26 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1266:
-

Attachment: PIG_1266.patch

The patch does not contain any unit tests. The change is cosmetic and I have 
manually verified that the spill count is displayed at the end of script 
execution.

 Show spill count on the pig console at the end of the job
 -

 Key: PIG-1266
 URL: https://issues.apache.org/jira/browse/PIG-1266
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: PIG_1266.patch





[jira] Created: (PIG-1209) Port POJoinPackage to proactively spill

2010-01-28 Thread Sriranjan Manjunath (JIRA)
Port POJoinPackage to proactively spill
---

 Key: PIG-1209
 URL: https://issues.apache.org/jira/browse/PIG-1209
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath


POPackage proactively spills the bag, whereas POJoinPackage still relies on the 
SpillableMemoryManager. We should port POJoinPackage to use InternalCachedBag, 
which spills proactively.




[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-23 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794217#action_12794217
 ] 

Sriranjan Manjunath commented on PIG-1102:
--

(3) refers to the case where we try to guess the number of records that fit 
into memory and start spilling the other records. InternalCachedBag.java 
addresses this case:

+if (cacheLimit != 0 && mContents.size() % cacheLimit == 0) {
+    /* Increment the spill count */
+    incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT);
+}
 }

cacheLimit holds the number of records that can be held in memory, whereas 
mContents holds all the records. Here, I do not increment the counter for every 
record. Instead, I count every n'th record, n being the cacheLimit.

This, however, does not increment the counter by the buffer size. Incrementing 
it by the buffer size would give us a value that is approximately equal to the 
number of spilled records.
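
The two counting schemes discussed above can be contrasted in a small sketch. 
This is only an illustration and not the code from the patch: the class 
SpillCountSketch and its members are hypothetical, a list of strings stands in 
for the bag contents, and plain long fields stand in for the Hadoop counter.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a bag that counts proactive spills. */
final class SpillCountSketch {
    private final int cacheLimit;                 // records assumed to fit in memory
    private final List<String> contents = new ArrayList<>();
    private long spillEvents = 0;                 // +1 for every cacheLimit-th record
    private long spilledRecordsEstimate = 0;      // +cacheLimit for every cacheLimit-th record

    SpillCountSketch(int cacheLimit) {
        this.cacheLimit = cacheLimit;
    }

    void add(String record) {
        contents.add(record);
        // Same guard as in the diff above: a cacheLimit of 0 never counts.
        if (cacheLimit != 0 && contents.size() % cacheLimit == 0) {
            spillEvents++;                        // scheme used in the patch
            spilledRecordsEstimate += cacheLimit; // alternative: add the buffer size
        }
    }

    long getSpillEvents() {
        return spillEvents;
    }

    long getSpilledRecordsEstimate() {
        return spilledRecordsEstimate;
    }

    public static void main(String[] args) {
        SpillCountSketch bag = new SpillCountSketch(3);
        for (int i = 0; i < 10; i++) {
            bag.add("record-" + i);
        }
        System.out.println("events=" + bag.getSpillEvents()
                + " estimate=" + bag.getSpilledRecordsEstimate());
    }
}
```

With a cacheLimit of 3 and 10 records, the first field ends at 3 and the second 
at 9, which shows why adding the buffer size tracks the record count more 
closely than counting events.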

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to the disk is useful for understanding query performance and also to 
 see how certain changes in Pig affect that.
 Other interesting stats to collect would be average CPU usage and max mem 
 usage but I am not sure if this information is easily retrievable.
 Using Hadoop counters for this would make sense.




[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Open  (was: Patch Available)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Patch Available  (was: Open)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Attachment: PIG_1102.patch.1

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Open  (was: Patch Available)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Patch Available  (was: Open)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793771#action_12793771
 ] 

Sriranjan Manjunath commented on PIG-1102:
--

1. The default is -1 to distinguish it from the case where there is no spill. 
The value is set to -1 if the counters could not be initialized, which is an 
exceptional case.
2. warn is a misnomer. I reused an existing function which updates the counter 
if it is initialized. If the counter is not initialized, it logs a warning.
3,4. I have fixed these and submitted a new patch.


 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Open  (was: Patch Available)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Attachment: (was: PIG_1102.patch.1)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Attachment: PIG_1102.patch.1

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Patch Available  (was: Open)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch, PIG_1102.patch.1





[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-18 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792680#action_12792680
 ] 

Sriranjan Manjunath commented on PIG-1102:
--

I ran the test again on my local machine, and it passes. The earlier failure 
was caused by too many open file descriptors. Is this a Hudson-related issue?

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Attachment: (was: PIG_1102.patch)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Patch Available  (was: Open)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Open  (was: Patch Available)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Status: Patch Available  (was: Open)

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1102) Collect number of spills per job

2009-12-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1102:
-

Attachment: PIG_1102.patch

There are no test cases included in the patch since it was difficult to 
consistently spill in a unit test case. I have manually tested the change. The 
easiest way to test this is to load a huge data bag (1 GB or so) and watch the 
MapReduce UI. The UI will show one of the new counters, 
SPILLABLE_MEMORY_MANAGER_SPILL_COUNT or PROACTIVE_SPILL_COUNT, depending on the 
type of POPackage used.

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_1102.patch





[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Attachment: PIG_1143.patch.1

I have added the successive join and multiquery unit tests.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch, PIG_1143.patch.1


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.




[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Status: Patch Available  (was: Open)

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch, PIG_1143.patch.1





[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Attachment: PIG_1143.patch

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch





[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1143:
-

Status: Patch Available  (was: Open)

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch





[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971
 ] 

Sriranjan Manjunath commented on PIG-1143:
--

To describe the problem in more detail, the current implementation does not 
handle a glob efficiently. When the sample loader encounters a directory (or 
combinations thereof), it gets the element descriptors of all the files inside 
the directory to compute the file sizes.
For example, A = load {view, click} will result in computing the file sizes of 
all the files underneath both the view and click directories. If we have a 
large number of mappers, this will result in a ton of HDFS file system calls, 
clogging the name node.

I intend to modify Poisson Sample Loader as follows. The algorithm for 
computing the total number of samples remains the same. However, it will not be 
computed by every mapper. I will be using the UDFContext object to share this 
information across mappers. Since mappers/reducers can only read the 
information from UDFContext, the slicer will store this information. The slicer 
will compute the sample count for the first map. As before, PigSlice will call 
computeSamples() for the first map. It will then store this value as a property 
in the UDFContext object. The slicer will check UDFContext to see if this value 
is set and, if it is, it will use it instead of computing it again. I intend to 
use pig.input.0.sampleCount as the key.

This solution will reduce the fileSize() invocations to a minimum and should 
reduce the burden on the name node.
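
The compute-once idea described above can be sketched as follows. This is a 
minimal illustration under stated assumptions, not the patch itself: a plain 
java.util.Properties object stands in for the properties held by UDFContext, 
and SampleCountCache, getOrCompute, and the supplier passed in are hypothetical 
names. Only the key pig.input.0.sampleCount is taken from the comment above.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

/** Hypothetical helper: compute the sample count once, then reuse it. */
final class SampleCountCache {
    // Key proposed in the comment above.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    private SampleCountCache() {
    }

    /**
     * Returns the stored sample count if present; otherwise runs the
     * (expensive) computation once and stores the result as a property.
     */
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);   // later callers skip the computation
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties();  // stand-in for the shared properties
        int[] calls = {0};
        // Stand-in for the file-size based computation; the value is made up.
        LongSupplier expensive = () -> {
            calls[0]++;
            return 1234L;
        };
        long first = getOrCompute(props, expensive);
        long second = getOrCompute(props, expensive);
        System.out.println(first + " " + second + " computed " + calls[0] + " time(s)");
    }
}
```

In this sketch the second call reads the stored property, so the expensive 
computation (and the file size lookups it stands for) runs only once.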

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789023#action_12789023
 ] 

Sriranjan Manjunath commented on PIG-1143:
--

The file size in the documentation refers to the size on disk. In order to 
account for compression, encoding etc., a configurable parameter, 
pig.inputfile.conversionfactor, is provided. I agree that this cannot be set to 
a good value for compressed data. It is only a guideline. The implications of 
setting it to a bad value are minimal: we will end up sampling a little more 
than the required number of samples (unless you set it to a fraction).

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath

 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789063#action_12789063
 ] 

Sriranjan Manjunath commented on PIG-1143:
--

I am OK with using InputSplits.getLength() as long as it provides a good 
estimate of the file size. Without the population size, Poisson samplers do not 
work well.

Samplers expect the data to be in BinStorage. If it is not, the first job reads 
it and stores it into BinStorage. The only exception is when the join follows a 
load/store-only MR job.


 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath

 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.




[jira] Created: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-09 Thread Sriranjan Manjunath (JIRA)
Poisson Sample Loader should compute the number of samples required only once
-

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath


The current poisson sampler forces each of the maps to compute the sample 
number. This is redundant and causes issues when a large directory is specified 
in the join. The sampler should be changed to calculate the sample count only 
once and this information should be shared with the remaining mappers.
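
The difference between the two approaches can be sketched with a small simulation. Everything here is hypothetical: the CountingNameNode class, the "sample.count" key, and the bytes-per-sample rule are stand-ins used only to count file-size lookups, not Pig or Hadoop APIs.

```python
class CountingNameNode:
    """Stand-in for the name node that counts file-size lookups."""

    def __init__(self, file_sizes):
        self._file_sizes = dict(file_sizes)
        self.calls = 0

    def file_size(self, path):
        self.calls += 1
        return self._file_sizes[path]


def sample_count_for(total_bytes, bytes_per_sample=1000):
    # Hypothetical rule: one sample per `bytes_per_sample` bytes of input.
    return max(1, total_bytes // bytes_per_sample)


def run_per_mapper(name_node, paths, num_mappers):
    """Current behaviour: every map sizes the whole input itself."""
    counts = []
    for _ in range(num_mappers):
        total = sum(name_node.file_size(p) for p in paths)
        counts.append(sample_count_for(total))
    return counts


def run_compute_once(name_node, paths, num_mappers):
    """Proposed behaviour: size the input once and share the result."""
    total = sum(name_node.file_size(p) for p in paths)
    job_conf = {"sample.count": sample_count_for(total)}  # hypothetical key
    return [job_conf["sample.count"] for _ in range(num_mappers)]
```

With F files and M maps, the first variant issues F x M lookups and the second issues F, while every map still sees the same sample count.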




[jira] Created: (PIG-1134) Skewed Join sampling job overwhelms the name node

2009-12-08 Thread Sriranjan Manjunath (JIRA)
Skewed Join sampling job overwhelms the name node
-

 Key: PIG-1134
 URL: https://issues.apache.org/jira/browse/PIG-1134
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath


The map tasks of the sampling job estimate the file size. For a large directory 
and a large number of maps, the file system calls overwhelm the name node. 




[jira] Updated: (PIG-1134) Skewed Join sampling job overwhelms the name node

2009-12-08 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1134:
-

Attachment: PIG-1134.patch

As a stopgap, I have replaced PoissonSampleLoader with RandomSampleLoader. 
This does not obtain the input size. Instead it obtains 100 samples per block. 
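
One way to draw a fixed number of samples without knowing the input size is reservoir sampling. The sketch below is the standard technique (Algorithm R), shown only for illustration; it is not taken from the PIG-1134 patch, and RandomSampleLoader may be implemented differently. The function name and the default of 100 are assumptions that mirror the "100 samples per block" figure above.

```python
import random


def reservoir_sample(records, k=100, rng=None):
    """Return up to k records chosen uniformly, in one pass, size unknown."""
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Keep the new record with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    block = range(10_000)  # stands in for the records of one block
    print(len(reservoir_sample(block, 100, random.Random(0))))
```

Because the reservoir never needs the total record count, no file-size lookup is required before sampling starts.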

 Skewed Join sampling job overwhelms the name node
 -

 Key: PIG-1134
 URL: https://issues.apache.org/jira/browse/PIG-1134
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG-1134.patch


 The map tasks of the sampling job estimate the file size. For a large 
 directory and a large number of maps, the file system calls overwhelm the 
 name node. 




[jira] Updated: (PIG-1134) Skewed Join sampling job overwhelms the name node

2009-12-08 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1134:
-

Status: Patch Available  (was: Open)

 Skewed Join sampling job overwhelms the name node
 -

 Key: PIG-1134
 URL: https://issues.apache.org/jira/browse/PIG-1134
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG-1134.patch


 The map tasks of the sampling job estimate the file size. For a large 
 directory and a large number of maps, the file system calls overwhelm the 
 name node. 




[jira] Commented: (PIG-1135) skewed join partitioner returns negative partition index

2009-12-08 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787912#action_12787912
 ] 

Sriranjan Manjunath commented on PIG-1135:
--

Ran the skewed join end-to-end and unit tests against this patch and the random 
sample loader, and they passed.

 skewed join partitioner returns negative partition index 
 -

 Key: PIG-1135
 URL: https://issues.apache.org/jira/browse/PIG-1135
 Project: Pig
  Issue Type: Improvement
Reporter: Ying He
Assignee: Sriranjan Manjunath
 Attachments: PIG_1135.patch


 skewed join returns negative reducer index
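
The issue does not state the cause, but one common way a partitioner ends up with a negative index is taking the remainder of a negative hash code. The sketch below emulates Java's remainder semantics in Python to show the effect and the usual sign-bit mask; the function names are hypothetical and this is not the PIG-1135 patch.

```python
import math


def java_rem(a: int, n: int) -> int:
    """Java-style remainder: the result takes the sign of the dividend."""
    return int(math.fmod(a, n))


def naive_partition(hash_code: int, num_reducers: int) -> int:
    # Equivalent to `hashCode % numReducers` in Java; negative for negative hashes.
    return java_rem(hash_code, num_reducers)


def safe_partition(hash_code: int, num_reducers: int) -> int:
    # Clear the sign bit first, as in `(hashCode & Integer.MAX_VALUE) % n`.
    return java_rem(hash_code & 0x7FFFFFFF, num_reducers)


if __name__ == "__main__":
    print(naive_partition(-7, 8), safe_partition(-7, 8))
```

Masking the sign bit keeps the index in the range 0 to num_reducers - 1 for any 32-bit hash value.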




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Attachment: PIG-1105.patch

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch, PIG-1105.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.
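
The distinction can be sketched as follows. The class and method names are hypothetical and only mirror the roles named in the description: the accumulate path receives raw tuples and must count them, whereas a sum-style step belongs to the intermediate and final stages, which receive partial counts.

```python
class CountStar:
    """Toy model of a count function with algebraic and accumulate paths."""

    def __init__(self):
        self._count = 0

    @staticmethod
    def initial(bag):
        # Algebraic initial stage: count the raw tuples in one bag.
        return len(bag)

    @staticmethod
    def sum_partials(partial_counts):
        # Algebraic intermediate/final stage: add up partial counts.
        return sum(partial_counts)

    def accumulate(self, bag):
        # Accumulate path: the bag holds raw tuples, so count them directly.
        # Passing this bag to sum_partials would treat tuples as numbers.
        self._count += len(bag)

    def get_value(self):
        return self._count
```

In this model, feeding raw tuples to sum_partials would fail or miscount, which is the kind of mismatch the description points at.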




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Attachment: (was: PIG-1105.patch)

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Status: Open  (was: Patch Available)

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch, PIG-1105.2.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Status: Patch Available  (was: Open)

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch, PIG-1105.2.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Status: Open  (was: Patch Available)

Cancelling since the patch does not have all the changes.

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch, PIG-1105.2.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Attachment: (was: PIG-1105.2.patch)

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch, PIG-1105.2.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.




[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure

2009-12-04 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1105:
-

Status: Patch Available  (was: Open)

 COUNT_STAR accumulate interface implementation cases failure
 

 Key: PIG-1105
 URL: https://issues.apache.org/jira/browse/PIG-1105
 Project: Pig
  Issue Type: Bug
  Components: impl
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG-1105.1.patch, PIG-1105.2.patch


 COUNT_STAR.accumulate is calling sum() which is supposed to be used by 
 intermediate and final parts of algebraic interface.




[jira] Commented: (PIG-1102) Collect number of spills per job

2009-12-01 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12784148#action_12784148
 ] 

Sriranjan Manjunath commented on PIG-1102:
--

Hadoop currently does not provide average CPU usage or memory usage per job. It 
does not even provide the number of spills per job. I have created a JIRA 
requesting this: https://issues.apache.org/jira/browse/MAPREDUCE-1257

The only information we can currently gather is the number of spill records.

 Collect number of spills per job
 

 Key: PIG-1102
 URL: https://issues.apache.org/jira/browse/PIG-1102
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0


 Memory shortage is one of the main performance issues in Pig. Knowing when we 
 spill to the disk is useful for understanding query performance and also to 
 see how certain changes in Pig affect that.
 Other interesting stats to collect would be average CPU usage and max mem 
 usage but I am not sure if this information is easily retrievable.
 Using Hadoop counters for this would make sense.




[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-872:


Attachment: (was: PIG_872.patch)

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath

 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)
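
A back-of-the-envelope sketch of why the cache helps: if every map opens the replicated file on DFS, the number of opens equals the number of map tasks, whereas a per-node cache needs only one copy per machine. The function names and the task-to-node assignment below are made up for illustration and do not use any Hadoop API.

```python
def dfs_opens_direct(task_nodes):
    """Each map task opens the replicated file on DFS itself."""
    return len(task_nodes)


def dfs_opens_with_cache(task_nodes):
    """The file is copied once per node and reused by all local tasks."""
    return len(set(task_nodes))


if __name__ == "__main__":
    # Ten map tasks spread over three hypothetical nodes.
    tasks = ["n1"] * 4 + ["n2"] * 3 + ["n3"] * 3
    print(dfs_opens_direct(tasks), dfs_opens_with_cache(tasks))
```

The saving grows with the number of concurrent maps per node, which is the case the description is worried about.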




[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-872:


Status: Open  (was: Patch Available)

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath

 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)




[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-872:


Status: Patch Available  (was: Open)

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch.1


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)




[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-22 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-872:


Attachment: PIG_872.patch.1

Fixed both the issues.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch.1


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)




[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-19 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779867#action_12779867
 ] 

Sriranjan Manjunath commented on PIG-872:
-

Olga, I agree with your 1st point. I will get rid of the test case.
To rectify 2, shouldn't maprReduceOper.getReplFiles() return only the 
replicated files? What's the rationale behind returning a null for the 
fragmented input? I could change it to what Ashutosh suggested, but it would 
just be cleaner if fragmented input was not represented by a null.


 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)




[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-872:


Attachment: PIG_872.patch

I have verified that the job.xml has mapred.cache.files set to the replicated 
files.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and then reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)




[jira] Created: (PIG-1079) Modify merge join to use distributed cache to maintain the index

2009-11-10 Thread Sriranjan Manjunath (JIRA)
Modify merge join to use distributed cache to maintain the index


 Key: PIG-1079
 URL: https://issues.apache.org/jira/browse/PIG-1079
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath







[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Attachment: 1035.patch

The attached patch contains modifications to support outer skewed join. It 
follows the same semantics as regular join. Some of the code used by regular 
join is moved to a common file - CompilerUtils and used by both.

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory
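
The general idea behind a skewed join can be sketched as follows: rows of a hot key on the skewed side are spread round-robin over several reducers, and the matching rows from the other side are sent to every one of those reducers. This is a generic illustration with hypothetical names (SkewedPartitioner, hot_keys); it is not Pig's implementation, and the hot-key map is assumed to come from a sampling step.

```python
import zlib
from collections import defaultdict


def stable_hash(key: str) -> int:
    """Deterministic non-negative hash (Python's built-in hash is salted)."""
    return zlib.crc32(key.encode("utf-8"))


class SkewedPartitioner:
    def __init__(self, num_reducers, hot_keys):
        # hot_keys maps a key to the list of reducer indexes reserved for it.
        self.num_reducers = num_reducers
        self.hot_keys = hot_keys
        self._next = defaultdict(int)

    def partition_skewed_side(self, key):
        """Spread rows of a hot key round-robin over its reducers."""
        reducers = self.hot_keys.get(key)
        if reducers is None:
            return stable_hash(key) % self.num_reducers
        i = self._next[key]
        self._next[key] = (i + 1) % len(reducers)
        return reducers[i]

    def partitions_other_side(self, key):
        """Rows from the other input go to every reducer holding the key."""
        reducers = self.hot_keys.get(key)
        if reducers is None:
            return [stable_hash(key) % self.num_reducers]
        return list(reducers)
```

No single reducer then has to hold all rows of a hot key in memory, at the cost of replicating the other input's rows for that key.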




[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Status: Patch Available  (was: Open)

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory




[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Status: Patch Available  (was: Open)

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035new.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory




[jira] Updated: (PIG-1035) support for skewed outer join

2009-10-30 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1035:
-

Status: Open  (was: Patch Available)

 support for skewed outer join
 -

 Key: PIG-1035
 URL: https://issues.apache.org/jira/browse/PIG-1035
 Project: Pig
  Issue Type: New Feature
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: 1035new.patch


 Similarly to skewed inner join, skewed outer join will help to scale in the 
 presence of join keys that don't fit into memory




[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Attachment: (was: pig_1048.patch)

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath

 ${code}
 grunt> cat students.txt
 asdfxc  M   23  12.44
 qwer    F   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwer    F   23  145.5
 oiue    M   54  23.33
 l1 = load 'students.txt';
 l2 = load 'students.txt';
 j = join l1 by $0, l2 by $0 ;
 store j into 'tmp.txt'
 grunt> cat tmp.txt
 oiue    M   54  23.33   oiue    M   54  23.33
 oiue    M   54  23.33   oiue    M   54  23.33
 qwer    F   21  14.44   qwer    F   21  14.44
 qwer    F   21  14.44   qwer    F   23  145.5
 qwer    F   23  145.5   qwer    F   21  14.44
 qwer    F   23  145.5   qwer    F   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}




[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Status: Open  (was: Patch Available)

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath

 ${code}
 grunt> cat students.txt
 asdfxc  M   23  12.44
 qwer    F   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwer    F   23  145.5
 oiue    M   54  23.33
 l1 = load 'students.txt';
 l2 = load 'students.txt';
 j = join l1 by $0, l2 by $0 ;
 store j into 'tmp.txt'
 grunt> cat tmp.txt
 oiue    M   54  23.33   oiue    M   54  23.33
 oiue    M   54  23.33   oiue    M   54  23.33
 qwer    F   21  14.44   qwer    F   21  14.44
 qwer    F   21  14.44   qwer    F   23  145.5
 qwer    F   23  145.5   qwer    F   21  14.44
 qwer    F   23  145.5   qwer    F   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}




[jira] Commented: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12771693#action_12771693
 ] 

Sriranjan Manjunath commented on PIG-1048:
--

I have also modified a skewed join test case to check whether at least one key 
is present in more than one partition, instead of checking that all keys are 
present in multiple partitions. Because the dataset was too small, the sampler 
with the RLR change did not detect these small keys, causing the unit test to fail.
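
The relaxed check described above can be sketched like this. The helper names and the key-to-partitions map are hypothetical; the real test case in the patch may be structured differently.

```python
def some_key_is_spread(key_partitions):
    """True if at least one key appears in more than one partition."""
    return any(len(parts) > 1 for parts in key_partitions.values())


def all_keys_are_spread(key_partitions):
    """The stricter condition the test used to require."""
    return all(len(parts) > 1 for parts in key_partitions.values())


if __name__ == "__main__":
    # Hypothetical outcome: only the frequent key was detected by the sampler.
    observed = {"hot": {0, 1, 2}, "small1": {3}, "small2": {5}}
    print(some_key_is_spread(observed), all_keys_are_spread(observed))
```

With a small dataset, infrequent keys stay in a single partition, so only the weaker condition is a reasonable expectation.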

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt> cat students.txt
 asdfxc  M   23  12.44
 qwer    F   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwer    F   23  145.5
 oiue    M   54  23.33
 l1 = load 'students.txt';
 l2 = load 'students.txt';
 j = join l1 by $0, l2 by $0 ;
 store j into 'tmp.txt'
 grunt> cat tmp.txt
 oiue    M   54  23.33   oiue    M   54  23.33
 oiue    M   54  23.33   oiue    M   54  23.33
 qwer    F   21  14.44   qwer    F   21  14.44
 qwer    F   21  14.44   qwer    F   23  145.5
 qwer    F   23  145.5   qwer    F   21  14.44
 qwer    F   23  145.5   qwer    F   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}




[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Attachment: pig_1048.patch

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt> cat students.txt
 asdfxc  M   23  12.44
 qwer    F   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwer    F   23  145.5
 oiue    M   54  23.33
 l1 = load 'students.txt';
 l2 = load 'students.txt';
 j = join l1 by $0, l2 by $0 ;
 store j into 'tmp.txt'
 grunt> cat tmp.txt
 oiue    M   54  23.33   oiue    M   54  23.33
 oiue    M   54  23.33   oiue    M   54  23.33
 qwer    F   21  14.44   qwer    F   21  14.44
 qwer    F   21  14.44   qwer    F   23  145.5
 qwer    F   23  145.5   qwer    F   21  14.44
 qwer    F   23  145.5   qwer    F   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Status: Patch Available  (was: Open)

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt cat students.txt   
 asdfxc  M   23  12.44
 qwerF   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwerF   23  145.5
 oiueM   54  23.33
  l1 = load 'students.txt';
 l2 = load 'students.txt';  
 j = join l1 by $0, l2 by $0 ; 
 store j into 'tmp.txt' 
 grunt cat tmp.txt
 oiueM   54  23.33   oiueM   54  23.33
 oiueM   54  23.33   oiueM   54  23.33
 qwerF   21  14.44   qwerF   21  14.44
 qwerF   21  14.44   qwerF   23  145.5
 qwerF   23  145.5   qwerF   21  14.44
 qwerF   23  145.5   qwerF   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Status: Open  (was: Patch Available)

Re-uploaded the same patch. Attaching a new one.

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath

 ${code}
 grunt cat students.txt   
 asdfxc  M   23  12.44
 qwerF   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwerF   23  145.5
 oiueM   54  23.33
  l1 = load 'students.txt';
 l2 = load 'students.txt';  
 j = join l1 by $0, l2 by $0 ; 
 store j into 'tmp.txt' 
 grunt cat tmp.txt
 oiueM   54  23.33   oiueM   54  23.33
 oiueM   54  23.33   oiueM   54  23.33
 qwerF   21  14.44   qwerF   21  14.44
 qwerF   21  14.44   qwerF   23  145.5
 qwerF   23  145.5   qwerF   21  14.44
 qwerF   23  145.5   qwerF   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Attachment: pig_1048.patch

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt cat students.txt   
 asdfxc  M   23  12.44
 qwerF   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwerF   23  145.5
 oiueM   54  23.33
  l1 = load 'students.txt';
 l2 = load 'students.txt';  
 j = join l1 by $0, l2 by $0 ; 
 store j into 'tmp.txt' 
 grunt cat tmp.txt
 oiueM   54  23.33   oiueM   54  23.33
 oiueM   54  23.33   oiueM   54  23.33
 qwerF   21  14.44   qwerF   21  14.44
 qwerF   21  14.44   qwerF   23  145.5
 qwerF   23  145.5   qwerF   21  14.44
 qwerF   23  145.5   qwerF   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Status: Patch Available  (was: Open)

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt cat students.txt   
 asdfxc  M   23  12.44
 qwerF   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwerF   23  145.5
 oiueM   54  23.33
  l1 = load 'students.txt';
 l2 = load 'students.txt';  
 j = join l1 by $0, l2 by $0 ; 
 store j into 'tmp.txt' 
 grunt cat tmp.txt
 oiueM   54  23.33   oiueM   54  23.33
 oiueM   54  23.33   oiueM   54  23.33
 qwerF   21  14.44   qwerF   21  14.44
 qwerF   21  14.44   qwerF   23  145.5
 qwerF   23  145.5   qwerF   21  14.44
 qwerF   23  145.5   qwerF   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-27 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Status: Patch Available  (was: Open)

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt cat students.txt   
 asdfxc  M   23  12.44
 qwerF   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwerF   23  145.5
 oiueM   54  23.33
  l1 = load 'students.txt';
 l2 = load 'students.txt';  
 j = join l1 by $0, l2 by $0 ; 
 store j into 'tmp.txt' 
 grunt cat tmp.txt
 oiueM   54  23.33   oiueM   54  23.33
 oiueM   54  23.33   oiueM   54  23.33
 qwerF   21  14.44   qwerF   21  14.44
 qwerF   21  14.44   qwerF   23  145.5
 qwerF   23  145.5   qwerF   21  14.44
 qwerF   23  145.5   qwerF   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations

2009-10-27 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1048:
-

Attachment: pig_1048.patch

The patch solves this issue.

 inner join using 'skewed' produces multiple rows for keys with single row in 
 both input relations
 -

 Key: PIG-1048
 URL: https://issues.apache.org/jira/browse/PIG-1048
 Project: Pig
  Issue Type: Bug
Reporter: Thejas M Nair
Assignee: Sriranjan Manjunath
 Attachments: pig_1048.patch


 ${code}
 grunt cat students.txt   
 asdfxc  M   23  12.44
 qwerF   21  14.44
 uhsdf   M   34  12.11
 zxldf   M   21  12.56
 qwerF   23  145.5
 oiueM   54  23.33
  l1 = load 'students.txt';
 l2 = load 'students.txt';  
 j = join l1 by $0, l2 by $0 ; 
 store j into 'tmp.txt' 
 grunt cat tmp.txt
 oiueM   54  23.33   oiueM   54  23.33
 oiueM   54  23.33   oiueM   54  23.33
 qwerF   21  14.44   qwerF   21  14.44
 qwerF   21  14.44   qwerF   23  145.5
 qwerF   23  145.5   qwerF   21  14.44
 qwerF   23  145.5   qwerF   23  145.5
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 uhsdf   M   34  12.11   uhsdf   M   34  12.11
 zxldf   M   21  12.56   zxldf   M   21  12.56
 zxldf   M   21  12.56   zxldf   M   21  12.56
 asdfxc  M   23  12.44   asdfxc  M   23  12.44
 asdfxc  M   23  12.44   asdfxc  M   23  12.44$
 ${code}
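
For reference, a minimal sketch of the row count a correct inner join should give for this input. It assumes the fields of students.txt are tab-separated (the archive has collapsed some of the tabs) and that $0 is the name column; the helper name inner_join is made up for illustration and is not Pig code.

```python
from collections import defaultdict

# Rows of students.txt as (name, gender, age, gpa); field boundaries are assumed.
students = [
    ("asdfxc", "M", 23, 12.44),
    ("qwer", "F", 21, 14.44),
    ("uhsdf", "M", 34, 12.11),
    ("zxldf", "M", 21, 12.56),
    ("qwer", "F", 23, 145.5),
    ("oiue", "M", 54, 23.33),
]


def inner_join(left, right, key=0):
    """Reference hash join: each left row pairs with every right row sharing its key."""
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)
    return [l + r for l in left for r in index[l[key]]]


joined = inner_join(students, students)

# Count the joined rows produced for each key.
rows_per_key = defaultdict(int)
for row in joined:
    rows_per_key[row[0]] += 1

for name, count in sorted(rows_per_key.items()):
    print(name, count)
```

Under these assumptions a key that occurs once on each side yields one row and 'qwer' (two rows on each side) yields four, so 8 rows in total, against the 12 rows shown in tmp.txt above.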

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1017) Converts strings to text in Pig

2009-10-23 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1017:
-

Status: Open  (was: Patch Available)

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1017) Converts strings to text in Pig

2009-10-23 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1017:
-

Attachment: (was: stotext.patch)

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath

 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1017) Converts strings to text in Pig

2009-10-23 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1017:
-

Attachment: stotext.patch

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1017) Converts strings to text in Pig

2009-10-20 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767952#action_12767952
 ] 

Sriranjan Manjunath commented on PIG-1017:
--

The release audit warnings are related to html files.

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1017) Converts strings to text in Pig

2009-10-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1017:
-

Status: Patch Available  (was: Open)

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-1017) Converts strings to text in Pig

2009-10-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-1017:
-

Attachment: stotext.patch

The patch will fail the MRCompiler and LogToPhyTranslator unit tests since we need 
to replace the golden files. The rest should pass.

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Assigned: (PIG-1017) Converts strings to text in Pig

2009-10-16 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath reassigned PIG-1017:


Assignee: Sriranjan Manjunath

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: stotext.patch


 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1017) Converts strings to text in Pig

2009-10-13 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765380#action_12765380
 ] 

Sriranjan Manjunath commented on PIG-1017:
--

Pigmix results before and after converting strings to text:

||Pigmix query||Trunk||Modified code||
|L1| 3:2|2:24|
|L2| 2:6|1:23|
|L3| 3:36|3:49|
|L4| 1:42|1:49|
|L5| 1:49|1:49|
|L6| 1:47|3:3|
|L7| 1:44|1:49|
|L8| 1:19|1:18|
|L9| 4:6|5:35|
|L10| 8:52|7:56|
|L11| 2:26|1:34|
|L12| 1:57|1:54|
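
To make the table easier to compare, here is a small sketch that converts each timing to seconds and prints the per-query difference. It assumes the values are minutes:seconds (so "3:2" is 3 min 2 s), which the comment does not state; the names PIGMIX, to_seconds and deltas are made up for illustration.

```python
# Timings copied from the table above as (trunk, modified code).
# "m:s" is assumed to mean minutes:seconds.
PIGMIX = {
    "L1": ("3:2", "2:24"),
    "L2": ("2:6", "1:23"),
    "L3": ("3:36", "3:49"),
    "L4": ("1:42", "1:49"),
    "L5": ("1:49", "1:49"),
    "L6": ("1:47", "3:3"),
    "L7": ("1:44", "1:49"),
    "L8": ("1:19", "1:18"),
    "L9": ("4:6", "5:35"),
    "L10": ("8:52", "7:56"),
    "L11": ("2:26", "1:34"),
    "L12": ("1:57", "1:54"),
}


def to_seconds(timing):
    """Convert an 'm:s' string to a number of seconds."""
    minutes, seconds = timing.split(":")
    return int(minutes) * 60 + int(seconds)


def deltas(results):
    """Seconds gained (negative) or lost (positive) by the modified code, per query."""
    return {q: to_seconds(new) - to_seconds(old) for q, (old, new) in results.items()}


for query, delta in deltas(PIGMIX).items():
    print(f"{query}: {delta:+d}s")
```

These are single runs, so the per-query differences should be read with caution.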


 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath

 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1017) Converts strings to text in Pig

2009-10-13 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765381#action_12765381
 ] 

Sriranjan Manjunath commented on PIG-1017:
--

Something fishy is going on. I ran L6 a couple more times with the modified 
code and it completed in 1:8.

 Converts strings to text in Pig
 ---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath

 Strings in Java are UTF-16 and take 2 bytes per char. Text 
 (org.apache.hadoop.io.Text) stores the data in UTF-8 and could show 
 significant reductions in memory.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-1017) Converts strings to text in Pig

2009-10-12 Thread Sriranjan Manjunath (JIRA)
Converts strings to text in Pig
---

 Key: PIG-1017
 URL: https://issues.apache.org/jira/browse/PIG-1017
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath


Strings in Java are UTF-16 and take 2 bytes per char. Text (org.apache.hadoop.io.Text) 
stores the data in UTF-8 and could show significant reductions in memory.
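
A rough illustration of the size difference, written in Python rather than Java so it runs without Hadoop on the classpath. It only compares encoded payload bytes (UTF-16 versus UTF-8) and ignores JVM object overhead; the helper name encoded_sizes is made up for this sketch.

```python
def encoded_sizes(s):
    """Return (utf16_bytes, utf8_bytes) for the string's encoded payload."""
    return len(s.encode("utf-16-be")), len(s.encode("utf-8"))


# ASCII data, typical of the keys in the test tables: UTF-8 halves the size.
print(encoded_sizes("asdfxc"))

# Non-ASCII text narrows or even reverses the gain.
print(encoded_sizes("naïve"))
print(encoded_sizes("日本語"))
```

So the saving should be largest for mostly-ASCII data; for text dominated by CJK characters UTF-8 can be larger than UTF-16.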

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-964) Handling null keys in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)
Handling null keys in skewed join
-

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath


The tuple size is calculated incorrectly and thus the skewed join ends up 
expecting a large number of reducers. Further, skewed join should not bail out 
after the second job if the number of reducers specified by the user is low. It 
should print a warning message and continue execution.
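
A minimal sketch of the warn-and-continue behaviour asked for here, in Python for brevity. The function name clamp_reducers and its parameters are hypothetical and do not correspond to anything in the Pig code base.

```python
import warnings


def clamp_reducers(estimated, available):
    """Return the reducer count to use for a skewed key.

    If the sampler's estimate exceeds what the user allowed via 'parallel',
    warn and fall back to the available count instead of failing the job.
    """
    if estimated > available:
        warnings.warn(
            f"skewed key needs {estimated} reducers but only {available} "
            f"are available; continuing with {available}"
        )
        return available
    return max(estimated, 1)


print(clamp_reducers(3, 8))
print(clamp_reducers(20, 8))
```

The join may then run slower for that key, but the script completes.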

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Attachment: skjoin2b.patch

The attached patch solves both issues.

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoin2b.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Description: For null tuples, the tuple size is calculated incorrectly and 
thus  skewed join ends up expecting a large number of reducers. Further, skewed 
join should not bail out after the second job if the number of reducers 
specified by the user is low. It should print a warning message and continue 
execution.  (was: The tuple size is calculated incorrectly and thus the skewed 
join ends up expecting a large number of reducers. Further, skewed join should 
not bail out after the second job if the number of reducers specified by the 
user is low. It should print a warning message and continue execution.)
Summary: Handling null  in skewed join  (was: Handling null keys in 
skewed join)

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoin2b.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Assignee: Sriranjan Manjunath
  Status: Patch Available  (was: Open)

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skjoin2b.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Status: Open  (was: Patch Available)

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skjoin2b.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Attachment: (was: skjoin2b.patch)

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skewedjoinnull.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Attachment: skewedjoinnull.patch

Passed the end-to-end tests and added a new unit test to check for nulls in the 
dataset.

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skewedjoinnull.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-964) Handling null in skewed join

2009-09-17 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-964:


Status: Patch Available  (was: Open)

 Handling null  in skewed join
 -

 Key: PIG-964
 URL: https://issues.apache.org/jira/browse/PIG-964
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skewedjoinnull.patch


 For null tuples, the tuple size is calculated incorrectly and thus  skewed 
 join ends up expecting a large number of reducers. Further, skewed join 
 should not bail out after the second job if the number of reducers specified 
 by the user is low. It should print a warning message and continue execution.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-962) Skewed join creates 3 map reduce jobs

2009-09-15 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-962:


Description: The first job is a load / store job which loads the data from 
PigStorage and stores it in BinStorage. This hampers performance. The desired 
behavior is for the sampler to read from PigStorage instead of relying on the 
first load/store job. Skewed join should thus be 2 M/R jobs and not 3.  (was: 
The first job is a load / store job which loads the data from PigStorage and 
stores it in BinStorage. This hampers performance. The desired behavior is for 
the sampler to read from PigStorage instead of relying on the first load/store 
job.)

 Skewed join creates 3 map reduce jobs
 -

 Key: PIG-962
 URL: https://issues.apache.org/jira/browse/PIG-962
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath

 The first job is a load / store job which loads the data from PigStorage and 
 stores it in BinStorage. This hampers performance. The desired behavior is 
 for the sampler to read from PigStorage instead of relying on the first 
 load/store job. Skewed join should thus be 2 M/R jobs and not 3.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Created: (PIG-962) Skewed join creates 3 map reduce jobs

2009-09-15 Thread Sriranjan Manjunath (JIRA)
Skewed join creates 3 map reduce jobs
-

 Key: PIG-962
 URL: https://issues.apache.org/jira/browse/PIG-962
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath


The first job is a load / store job which loads the data from PigStorage and 
stores it in BinStorage. This hampers performance. The desired behavior is for 
the sampler to read from PigStorage instead of relying on the first load/store 
job.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-962) Skewed join creates 3 map reduce jobs

2009-09-15 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-962:


Status: Patch Available  (was: Open)

 Skewed join creates 3 map reduce jobs
 -

 Key: PIG-962
 URL: https://issues.apache.org/jira/browse/PIG-962
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: skewedjoin2job.patch


 The first job is a load / store job which loads the data from PigStorage and 
 stores it in BinStorage. This hampers performance. The desired behavior is 
 for the sampler to read from PigStorage instead of relying on the first 
 load/store job. Skewed join should thus be 2 M/R jobs and not 3.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-935) Skewed join throws an exception when used with map keys

2009-09-02 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750297#action_12750297
 ] 

Sriranjan Manjunath commented on PIG-935:
-

The unit tests are unrelated to my patch.

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
     at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
     ... 27 more

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Updated: (PIG-890) Create a sampler interface and improve the skewed join sampler

2009-09-02 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-890:


Status: Open  (was: Patch Available)

 Create a sampler interface and improve the skewed join sampler
 --

 Key: PIG-890
 URL: https://issues.apache.org/jira/browse/PIG-890
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
 Attachments: sampler.patch


 We need a different sampler for order by and skewed join. We thus need a 
 better sampling interface. The design of the same is described here: 
 http://wiki.apache.org/pig/PigSampler




[jira] Updated: (PIG-890) Create a sampler interface and improve the skewed join sampler

2009-09-02 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-890:


Attachment: (was: sampler.patch)

 Create a sampler interface and improve the skewed join sampler
 --

 Key: PIG-890
 URL: https://issues.apache.org/jira/browse/PIG-890
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
 Attachments: samplerinterface.patch


 We need a different sampler for order by and skewed join. We thus need a 
 better sampling interface. The design of the same is described here: 
 http://wiki.apache.org/pig/PigSampler




[jira] Updated: (PIG-890) Create a sampler interface and improve the skewed join sampler

2009-09-02 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-890:


Status: Patch Available  (was: Open)

 Create a sampler interface and improve the skewed join sampler
 --

 Key: PIG-890
 URL: https://issues.apache.org/jira/browse/PIG-890
 Project: Pig
  Issue Type: Improvement
Reporter: Sriranjan Manjunath
 Attachments: samplerinterface.patch


 We need a different sampler for order by and skewed join. We thus need a 
 better sampling interface. The design of the same is described here: 
 http://wiki.apache.org/pig/PigSampler




[jira] Commented: (PIG-942) Maps are not implicitly casted

2009-09-02 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750796#action_12750796
 ] 

Sriranjan Manjunath commented on PIG-942:
-

Here's the complete script:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'100', B by (chararray)f#'100';
dump C;

 Maps are not implicitly casted
 --

 Key: PIG-942
 URL: https://issues.apache.org/jira/browse/PIG-942
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath

 A = load 'foo' as (m) throws the following exception when foo has maps.
 java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.util.Map
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
 at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
 at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
 at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
 at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
 The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])
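
A sketch of that workaround applied to the join script from the comment, assuming map.txt holds one map per record. The only change is the declared map type in each load schema; that this removes the exception for the join as well as for a plain load is an assumption drawn from the description, not something verified here.

```
-- Declare e and f as maps so the # lookup receives a map, not a bytearray.
A = load 'map.txt' as (e:[]);
B = load 'map.txt' as (f:[]);
C = join A by (chararray)e#'100', B by (chararray)f#'100';
dump C;
```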




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-09-01 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Attachment: skmapbug.patch

Added code to explicitly check for -1 in orderby

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-09-01 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Attachment: (was: skjoinmapbug.patch)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-09-01 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Status: Patch Available  (was: Open)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-09-01 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Status: Open  (was: Patch Available)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Attachment: skjoinmapbug.patch

Fixed the issue with unit tests

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch, skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Status: Open  (was: Patch Available)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch, skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Attachment: (was: skjoinmapbug.patch)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Attachment: (was: skjoinmapbug.patch)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Attachment: skjoinmapbug.patch

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-29 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Status: Patch Available  (was: Open)

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more




[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys

2009-08-26 Thread Sriranjan Manjunath (JIRA)

 [ 
https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sriranjan Manjunath updated PIG-935:


Status: Patch Available  (was: Open)

The attached patch solves this issue.

 Skewed join throws an exception when used with map keys
 ---

 Key: PIG-935
 URL: https://issues.apache.org/jira/browse/PIG-935
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
 Attachments: skjoinmapbug.patch


 Skewed join throws a runtime exception for the following query:
 A = load 'map.txt' as (e);
 B = load 'map.txt' as (f);
 C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
 explain C;
 Exception:
 Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
 at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
 ... 27 more



