[jira] Created: (PIG-1264) Skewed join sampler misses out the key with the highest frequency
Skewed join sampler misses out the key with the highest frequency - Key: PIG-1264 URL: https://issues.apache.org/jira/browse/PIG-1264 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Richard Ding Fix For: 0.7.0

I am noticing two issues with the sampler used in skewed join:
1. It does not allocate multiple reducers to the key with the highest frequency.
2. It seems to be allocating the same number of reducers to every key (8 in this case).

Query:
a = load 'studenttab10k' using PigStorage() as (name, age, gpa);
b = load 'votertab10k' as (name, age, registration, contributions);
e = join a by name right, b by name using skewed parallel 8;
store e into 'SkewedJoin_9.out';

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
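For readers unfamiliar with the expected behaviour in PIG-1264, here is a minimal Java sketch of what "allocating multiple reducers to the hottest key" could look like. It is not Pig's partitioner or sampler code: the class name, the method name and the allocation rule (a proportional share of the parallelism, rounded up, with at least one and at most `parallel` reducers per key) are all assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative only: NOT Pig's skewed-join implementation.
final class SkewAllocationSketch {

    // Hypothetical rule: a key gets reducers in proportion to its share of
    // the sampled tuples, never fewer than 1 and never more than `parallel`.
    static Map<String, Integer> allocate(Map<String, Long> sampledCounts, int parallel) {
        long total = 0;
        for (long count : sampledCounts.values()) {
            total += count;
        }
        Map<String, Integer> reducersPerKey = new LinkedHashMap<>();
        for (Map.Entry<String, Long> entry : sampledCounts.entrySet()) {
            long count = entry.getValue();
            int share = 1;
            if (total > 0) {
                share = (int) Math.ceil((double) count * parallel / total);
            }
            reducersPerKey.put(entry.getKey(), Math.max(1, Math.min(parallel, share)));
        }
        return reducersPerKey;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new LinkedHashMap<>();
        counts.put("alice", 700L); // hottest key in the sample
        counts.put("bob", 200L);
        counts.put("carol", 100L);
        System.out.println(allocate(counts, 8));
    }
}
```

Under this assumed rule the most frequent key receives the largest allocation and rare keys receive a single reducer, which is the opposite of the two symptoms reported above.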
[jira] Created: (PIG-1266) Show spill count on the pig console at the end of the job
Show spill count on the pig console at the end of the job - Key: PIG-1266 URL: https://issues.apache.org/jira/browse/PIG-1266 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Currently the spill count is displayed only on the job tracker log. It should be displayed on the console as well.
[jira] Updated: (PIG-1266) Show spill count on the pig console at the end of the job
[ https://issues.apache.org/jira/browse/PIG-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1266: - Status: Patch Available (was: Open) Show spill count on the pig console at the end of the job - Key: PIG-1266 URL: https://issues.apache.org/jira/browse/PIG-1266 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1266.patch Currently the spill count is displayed only on the job tracker log. It should be displayed on the console as well.
[jira] Updated: (PIG-1266) Show spill count on the pig console at the end of the job
[ https://issues.apache.org/jira/browse/PIG-1266?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1266: - Attachment: PIG_1266.patch The patch does not contain any unit tests. The change is cosmetic and I have manually verified that the spill count is displayed at the end of script execution. Show spill count on the pig console at the end of the job - Key: PIG-1266 URL: https://issues.apache.org/jira/browse/PIG-1266 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: PIG_1266.patch Currently the spill count is displayed only on the job tracker log. It should be displayed on the console as well.
[jira] Created: (PIG-1209) Port POJoinPackage to proactively spill
Port POJoinPackage to proactively spill --- Key: PIG-1209 URL: https://issues.apache.org/jira/browse/PIG-1209 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath POPackage proactively spills the bag, whereas POJoinPackage still uses the SpillableMemoryManager. We should port POJoinPackage to use InternalCachedBag, which spills proactively.
[jira] Commented: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12794217#action_12794217 ] Sriranjan Manjunath commented on PIG-1102: -- (3) refers to the case where we try to guess the number of records that fit into memory and start spilling the other records. InternalCachedBag.java addresses this case:

+if (cacheLimit != 0 && mContents.size() % cacheLimit == 0) {
+    /* Increment the spill count */
+    incSpillCount(PigCounters.PROACTIVE_SPILL_COUNT);
+}
}

cacheLimit holds the number of records that can be held in memory, whereas mContents is the tuple that holds all the records. Here, I do not increment the counter for every record. Instead I count every n'th record, n being the cacheLimit. This, however, does not increment the counter by the buffer size. Incrementing it by the buffer size would give us a value which is approximately equal to the number of spilled records.

Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
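The counting scheme described in the PIG-1102 comment above can be illustrated with a small standalone Java sketch. This is not the code from the patch: the class `SpillCountSketch`, its fields and its methods are hypothetical stand-ins, a plain list replaces the bag contents, and a local `long` replaces the Hadoop counter. It only shows the "count every n'th record" rule and the buffer-size variant mentioned at the end of the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: a stand-in for the counting rule, not InternalCachedBag.
final class SpillCountSketch {
    private final int cacheLimit;                 // records assumed to fit in memory
    private final List<String> contents = new ArrayList<>();
    private long spillCount = 0;                  // stand-in for the Hadoop counter

    SpillCountSketch(int cacheLimit) {
        this.cacheLimit = cacheLimit;
    }

    void add(String record) {
        contents.add(record);
        // Count once per cacheLimit records rather than once per record.
        // The cacheLimit != 0 guard avoids a division by zero in the modulo.
        if (cacheLimit != 0 && contents.size() % cacheLimit == 0) {
            spillCount++;
        }
    }

    long getSpillCount() {
        return spillCount;
    }

    // The alternative from the comment: scaling by the buffer size gives a
    // rough estimate of how many records were involved in spills.
    long approxSpilledRecords() {
        return spillCount * cacheLimit;
    }

    public static void main(String[] args) {
        SpillCountSketch bag = new SpillCountSketch(100);
        for (int i = 0; i < 1050; i++) {
            bag.add("record-" + i);
        }
        System.out.println(bag.getSpillCount() + " counted, ~"
                + bag.approxSpilledRecords() + " records");
    }
}
```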
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Open (was: Patch Available) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Patch Available (was: Open) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Attachment: PIG_1102.patch.1 Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Open (was: Patch Available) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Patch Available (was: Open) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Commented: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793771#action_12793771 ] Sriranjan Manjunath commented on PIG-1102: -- 1. The default is -1 to distinguish it from the case where there is no spill. The value is set to -1 if the counters could not be initialized, which is an exceptional case. 2. warn is a misnomer. I reused an existing function which updates the counter if it is initialized. If the counter is not initialized, it logs a warning. 3,4. I have fixed these and submitted a new patch. Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Open (was: Patch Available) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Attachment: (was: PIG_1102.patch.1) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Attachment: PIG_1102.patch.1 Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Patch Available (was: Open) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch, PIG_1102.patch.1 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Commented: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12792680#action_12792680 ] Sriranjan Manjunath commented on PIG-1102: -- I ran the test again on my local machine, and it passes. The test failed because of too many open file descriptors. Is this a Hudson-related issue? Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Attachment: (was: PIG_1102.patch) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Patch Available (was: Open) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Open (was: Patch Available) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Status: Patch Available (was: Open) Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1102: - Attachment: PIG_1102.patch There are no test cases included in the patch since it was difficult to spill consistently in a unit test case. I have manually tested the change. The easiest way to test this is to load a huge data bag (1 GB or so) and watch the map reduce UI. The UI will show new counters - SPILLABLE_MEMORY_MANAGER_SPILL_COUNT or PROACTIVE_SPILL_COUNT, depending on the type of POPackage used. Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Attachments: PIG_1102.patch Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense.
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1143: - Attachment: PIG_1143.patch.1 I have added the successive join and multiquery unit tests. Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch, PIG_1143.patch.1 The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1143: - Status: Patch Available (was: Open) Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch, PIG_1143.patch.1 The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1143: - Attachment: PIG_1143.patch Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Updated: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1143: - Status: Patch Available (was: Open) Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG_1143.patch The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971 ] Sriranjan Manjunath commented on PIG-1143: -- To describe the problem in more detail, the current implementation does not handle a glob efficiently. When the sample loader encounters a directory (or combinations thereof), it gets the element descriptors of all the files inside the directory to compute the file sizes. For example, A = load {view, click} will result in computing the file sizes of all the files underneath both the view and click directories. If we have a large number of mappers, this will result in a ton of HDFS system calls, clogging the name node. I intend to modify Poisson Sample Loader as follows. The algorithm for computing the total number of samples remains the same. However, it will not be computed by every mapper. I will be using the UDFContext object to share this information across mappers. Since mappers/reducers can only read the information from UDFContext, the slicer will store this information. The slicer will compute the sample count for the first map. As before, PigSlice will call computeSamples() for the first map. It will then store this value as a property in the UDFContext object. The Slicer will check UDFContext to see if this value is set and, if it is, it will use it instead of computing it again. I intend to use pig.input.0.sampleCount as the key. This solution will reduce the fileSize() invocations to a minimum and should reduce the burden on the name node. Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
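The compute-once scheme proposed in the PIG-1143 comment above (check a shared property, compute only if it is missing, then store the result under the key pig.input.0.sampleCount) can be sketched in a few lines of Java. This is a minimal illustration under stated assumptions, not the patch itself: a plain `java.util.Properties` object stands in for the properties held by UDFContext, and the class `SampleCountCacheSketch`, the method `getOrCompute` and the supplier standing in for computeSamples() are hypothetical names.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

// Illustrative only: Properties stands in for the UDFContext property store.
final class SampleCountCacheSketch {
    // Key name taken from the comment above.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    // Returns the cached sample count if present; otherwise runs the
    // (expensive) computation once and stores the result for later callers.
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        // First call computes; the second reuses the stored value.
        long first = getOrCompute(props, () -> 1234L);
        long second = getOrCompute(props, () -> {
            throw new IllegalStateException("should not be recomputed");
        });
        System.out.println(first + " " + second);
    }
}
```

The point of the design is that the file-size lookups behind the computation happen once, on the side that builds the splits, rather than once per map task.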
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789023#action_12789023 ] Sriranjan Manjunath commented on PIG-1143: -- The file size in the documentation refers to the size on disk. In order to account for compression, encoding etc., a configurable parameter - pig.inputfile.conversionfactor - is provided. I agree that this cannot be set to a good value for compressed data. It is only a guide. The implications of setting it to a bad value are minimal - we will end up sampling a little more than the required number of samples (unless you set it to a fraction). Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789063#action_12789063 ] Sriranjan Manjunath commented on PIG-1143: -- I am OK with using InputSplits.getLength() as long as it provides a good estimate of the file size. Without the population size, poisson samplers do not work well. Samplers expect the data to be in BinStorage. If it is not, the first job reads it and stores it into BinStorage. The only exception is when the join follows a load/store-only MR job. Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Created: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
Poisson Sample Loader should compute the number of samples required only once - Key: PIG-1143 URL: https://issues.apache.org/jira/browse/PIG-1143 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath The current poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once and this information should be shared with the remaining mappers.
[jira] Created: (PIG-1134) Skewed Join sampling job overwhelms the name node
Skewed Join sampling job overwhelms the name node - Key: PIG-1134 URL: https://issues.apache.org/jira/browse/PIG-1134 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath The map tasks of the sampling job estimate the file size. For a large directory and a large number of maps, the file system calls overwhelm the name node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1134) Skewed Join sampling job overwhelms the name node
[ https://issues.apache.org/jira/browse/PIG-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1134: - Attachment: PIG-1134.patch As a stopgap, I have replaced PoissonSampleLoader with RandomSampleLoader. This does not obtain the input size. Instead it obtains 100 samples per block. Skewed Join sampling job overwhelms the name node - Key: PIG-1134 URL: https://issues.apache.org/jira/browse/PIG-1134 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG-1134.patch The map tasks of the sampling job estimate the file size. For a large directory and a large number of maps, the file system calls overwhelm the name node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
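Taking a fixed number of samples per block without knowing the input size can be done with reservoir sampling. The sketch below shows that technique only as an assumption about how such a sampler could work; the comment above does not say which algorithm RandomSampleLoader uses, and `reservoir_sample` is a hypothetical helper.

```python
import random


def reservoir_sample(records, k, rng):
    """Return up to k records chosen uniformly at random in a single pass."""
    sample = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            sample.append(record)
        else:
            # Replace a reservoir slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                sample[j] = record
    return sample


block = range(10_000)  # stand-in for the records of one block
picked = reservoir_sample(block, 100, random.Random(0))
```

The sampler never asks how large the block or the file is, so no file-system size calls are needed.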
[jira] Updated: (PIG-1134) Skewed Join sampling job overwhelms the name node
[ https://issues.apache.org/jira/browse/PIG-1134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1134: - Status: Patch Available (was: Open) Skewed Join sampling job overwhelms the name node - Key: PIG-1134 URL: https://issues.apache.org/jira/browse/PIG-1134 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: PIG-1134.patch The map tasks of the sampling job estimate the file size. For a large directory and a large number of maps, the file system calls overwhelm the name node. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1135) skewed join partitioner returns negative partition index
[ https://issues.apache.org/jira/browse/PIG-1135?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12787912#action_12787912 ] Sriranjan Manjunath commented on PIG-1135: -- Ran the skewed join end-to-end and unit tests against this patch and the random sample loader, and they passed. skewed join partitioner returns negative partition index - Key: PIG-1135 URL: https://issues.apache.org/jira/browse/PIG-1135 Project: Pig Issue Type: Improvement Reporter: Ying He Assignee: Sriranjan Manjunath Attachments: PIG_1135.patch skewed join returns negative reducer index -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
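The issue does not say why the partitioner produced a negative index. One common cause in Java partitioners, shown here purely as an assumption, is taking the remainder of a negative hash code. Because Python's `%` floors rather than truncates, the sketch emulates Java's remainder with a hypothetical helper `java_rem`; `naive_partition` and `safe_partition` are likewise illustrative names.

```python
def java_rem(a, n):
    """Remainder that truncates toward zero, as Java's % does for ints."""
    r = abs(a) % n
    return -r if a < 0 else r


def naive_partition(hash_code, num_reducers):
    # A negative hash code yields a negative partition index.
    return java_rem(hash_code, num_reducers)


def safe_partition(hash_code, num_reducers):
    # Clear the sign bit of the 32-bit hash before taking the remainder,
    # so the result always falls in [0, num_reducers).
    return java_rem(hash_code & 0x7FFFFFFF, num_reducers)
```

For example, `naive_partition(-7, 8)` gives -7, while `safe_partition(-7, 8)` gives 1.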
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Attachment: PIG-1105.patch COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch, PIG-1105.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Attachment: (was: PIG-1105.patch) COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Status: Open (was: Patch Available) COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch, PIG-1105.2.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Status: Patch Available (was: Open) COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch, PIG-1105.2.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Status: Open (was: Patch Available) Cancelling since the patch does not have all the changes. COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch, PIG-1105.2.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Attachment: (was: PIG-1105.2.patch) COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch, PIG-1105.2.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1105) COUNT_STAR accumulate interface implementation cases failure
[ https://issues.apache.org/jira/browse/PIG-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1105: - Status: Patch Available (was: Open) COUNT_STAR accumulate interface implementation cases failure Key: PIG-1105 URL: https://issues.apache.org/jira/browse/PIG-1105 Project: Pig Issue Type: Bug Components: impl Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Fix For: 0.6.0 Attachments: PIG-1105.1.patch, PIG-1105.2.patch COUNT_STAR.accumulate is calling sum() which is supposed to be used by intermediate and final parts of algebraic interface. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
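The distinction behind PIG-1105 can be illustrated with a small sketch: the accumulate step receives raw tuples and should count them, whereas a sum over partial counts belongs to the intermediate and final algebraic steps. The class below is not Pig's COUNT_STAR; `CountStar`, `sum_partials`, and `get_value` are hypothetical names.

```python
class CountStar:
    """Illustrative only; not Pig's COUNT_STAR implementation."""

    def __init__(self):
        self._count = 0

    @staticmethod
    def sum_partials(partials):
        # Algebraic intermediate/final step: every input tuple already
        # holds a partial count in its first field, so the counts are summed.
        return sum(p[0] for p in partials)

    def accumulate(self, bag):
        # Accumulator step: every input is a raw tuple, so each one is
        # counted rather than summed.
        self._count += len(bag)

    def get_value(self):
        return self._count


counter = CountStar()
counter.accumulate([("alice", 20), ("bob", 31)])
counter.accumulate([("carol", 27)])
```

Feeding the raw tuples to `sum_partials` instead would try to add up names, which is the kind of mismatch the issue describes.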
[jira] Commented: (PIG-1102) Collect number of spills per job
[ https://issues.apache.org/jira/browse/PIG-1102?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12784148#action_12784148 ] Sriranjan Manjunath commented on PIG-1102: -- Hadoop currently does not provide us average CPU usage / mem usage per job. It does not even provide the number of spills per job. I have created a jira requesting the same: https://issues.apache.org/jira/browse/MAPREDUCE-1257 The only information we can currently gather is the number of spill records. Collect number of spills per job Key: PIG-1102 URL: https://issues.apache.org/jira/browse/PIG-1102 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Fix For: 0.7.0 Memory shortage is one of the main performance issues in Pig. Knowing when we spill to the disk is useful for understanding query performance and also to see how certain changes in Pig affect that. Other interesting stats to collect would be average CPU usage and max mem usage but I am not sure if this information is easily retrievable. Using Hadoop counters for this would make sense. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-872: Attachment: (was: PIG_872.patch) use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls. Using distributed cache will address the issue and might also give a performance boost since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend and on the backend, the tasks would need to refer to the data using path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-872: Status: Open (was: Patch Available) use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls. Using distributed cache will address the issue and might also give a performance boost since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend and on the backend, the tasks would need to refer to the data using path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-872: Status: Patch Available (was: Open) use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: PIG_872.patch.1 Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls. Using distributed cache will address the issue and might also give a performance boost since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend and on the backend, the tasks would need to refer to the data using path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-872: Attachment: PIG_872.patch.1 Fixed both the issues. use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: PIG_872.patch.1 Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls. Using distributed cache will address the issue and might also give a performance boost since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend and on the backend, the tasks would need to refer to the data using path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12779867#action_12779867 ] Sriranjan Manjunath commented on PIG-872: - Olga, I agree with your 1st point. I will get rid of the test case. To rectify 2, shouldn't maprReduceOper.getReplFiles() return only the replicated files? What's the rationale behind returning a null for the fragmented input? I could change it to what Ashutosh suggested, but it would be cleaner if the fragmented input were not represented by a null. use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: PIG_872.patch Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls. Using distributed cache will address the issue and might also give a performance boost since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend and on the backend, the tasks would need to refer to the data using path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-872: Attachment: PIG_872.patch I have verified that the job.xml has mapred.cache.files set to the replicated files. use distributed cache for the replicated data set in FR join Key: PIG-872 URL: https://issues.apache.org/jira/browse/PIG-872 Project: Pig Issue Type: Improvement Reporter: Olga Natkovich Attachments: PIG_872.patch Currently, the replicated file is read directly from DFS by all maps. If the number of the concurrent maps is huge, we can overwhelm the NameNode with open calls. Using distributed cache will address the issue and might also give a performance boost since the file will be copied locally once and then reused by all tasks running on the same machine. The basic approach would be to use cacheArchive to place the file into the cache on the frontend and on the backend, the tasks would need to refer to the data using path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now as we don't use it.) -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-1079) Modify merge join to use distributed cache to maintain the index
Modify merge join to use distributed cache to maintain the index Key: PIG-1079 URL: https://issues.apache.org/jira/browse/PIG-1079 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Attachment: 1035.patch The attached patch contains modifications to support outer skewed join. It follows the same semantics as regular join. Some of the code used by regular join is moved to a common file, CompilerUtils, and used by both. support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035.patch Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Status: Patch Available (was: Open) support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035.patch Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Status: Patch Available (was: Open) support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035new.patch Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1035) support for skewed outer join
[ https://issues.apache.org/jira/browse/PIG-1035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1035: - Status: Open (was: Patch Available) support for skewed outer join - Key: PIG-1035 URL: https://issues.apache.org/jira/browse/PIG-1035 Project: Pig Issue Type: New Feature Reporter: Olga Natkovich Assignee: Sriranjan Manjunath Attachments: 1035new.patch Similarly to skewed inner join, skewed outer join will help to scale in the presence of join keys that don't fit into memory -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Attachment: (was: pig_1048.patch) inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Status: Open (was: Patch Available) inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12771693#action_12771693 ] Sriranjan Manjunath commented on PIG-1048: -- I have also modified a skewed join test case to check that at least one key is present in more than one partition, instead of checking that all the keys are present in multiple partitions. Since the dataset was too small, the sampler with the RLR change did not detect these small keys, causing the unit test to fail. inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
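The relaxed test condition described in the comment above (at least one key spread over more than one partition, rather than every key) can be sketched as a pair of checks. The helper names and the sample `assignments` list are hypothetical; they are not taken from Pig's test suite.

```python
from collections import defaultdict


def partitions_per_key(assignments):
    """Map each key to the set of partitions it was routed to."""
    seen = defaultdict(set)
    for key, partition in assignments:
        seen[key].add(partition)
    return seen


def any_key_split(assignments):
    # Relaxed check: some key appears in more than one partition.
    return any(len(p) > 1 for p in partitions_per_key(assignments).values())


def all_keys_split(assignments):
    # Strict check: every key appears in more than one partition.
    return all(len(p) > 1 for p in partitions_per_key(assignments).values())


# Made-up routing: one frequent key is spread out, one rare key is not.
assignments = [("hot", 0), ("hot", 1), ("hot", 2), ("rare", 3)]
```

On this made-up input the relaxed check passes and the strict one fails, which matches the situation where a sampler does not flag small keys as skewed.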
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Attachment: pig_1048.patch inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Status: Patch Available (was: Open) inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Status: Open (was: Patch Available) Re-uploaded the same patch. Attaching a new one. inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Attachment: pig_1048.patch inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Status: Patch Available (was: Open) inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch ${code} grunt cat students.txt asdfxc M 23 12.44 qwerF 21 14.44 uhsdf M 34 12.11 zxldf M 21 12.56 qwerF 23 145.5 oiueM 54 23.33 l1 = load 'students.txt'; l2 = load 'students.txt'; j = join l1 by $0, l2 by $0 ; store j into 'tmp.txt' grunt cat tmp.txt oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 oiueM 54 23.33 qwerF 21 14.44 qwerF 21 14.44 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 qwerF 21 14.44 qwerF 23 145.5 qwerF 23 145.5 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 uhsdf M 34 12.11 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 zxldf M 21 12.56 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44 asdfxc M 23 12.44$ ${code} -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-1048) inner join using 'skewed' produces multiple rows for keys with single row in both input relations
[ https://issues.apache.org/jira/browse/PIG-1048?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1048: - Attachment: pig_1048.patch The patch solves this issue. inner join using 'skewed' produces multiple rows for keys with single row in both input relations - Key: PIG-1048 URL: https://issues.apache.org/jira/browse/PIG-1048 Project: Pig Issue Type: Bug Reporter: Thejas M Nair Assignee: Sriranjan Manjunath Attachments: pig_1048.patch
${code}
grunt> cat students.txt
asdfxc M 23 12.44
qwerF 21 14.44
uhsdf M 34 12.11
zxldf M 21 12.56
qwerF 23 145.5
oiueM 54 23.33
l1 = load 'students.txt';
l2 = load 'students.txt';
j = join l1 by $0, l2 by $0 ;
store j into 'tmp.txt'
grunt> cat tmp.txt
oiueM 54 23.33 oiueM 54 23.33
oiueM 54 23.33 oiueM 54 23.33
qwerF 21 14.44 qwerF 21 14.44
qwerF 21 14.44 qwerF 23 145.5
qwerF 23 145.5 qwerF 21 14.44
qwerF 23 145.5 qwerF 23 145.5
uhsdf M 34 12.11 uhsdf M 34 12.11
uhsdf M 34 12.11 uhsdf M 34 12.11
zxldf M 21 12.56 zxldf M 21 12.56
zxldf M 21 12.56 zxldf M 21 12.56
asdfxc M 23 12.44 asdfxc M 23 12.44
asdfxc M 23 12.44 asdfxc M 23 12.44$
${code}
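To make the symptom in PIG-1048 concrete, here is a small sketch of how many rows an inner join should produce: for every key, the number of matching rows on the left times the number on the right. The class and method names are made up for illustration and are not part of Pig; the key strings are placeholders where only the multiplicity matters (five distinct keys in students.txt, one of them occurring twice).

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: not Pig code.
class JoinCardinality {
    // Counts how many rows carry each join key.
    static Map<String, Integer> countKeys(List<String> keys) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (String k : keys) {
            counts.merge(k, 1, Integer::sum);
        }
        return counts;
    }

    // An inner join emits left(k) * right(k) rows for each key k found on both sides.
    static long expectedInnerJoinRows(List<String> left, List<String> right) {
        Map<String, Integer> l = countKeys(left);
        Map<String, Integer> r = countKeys(right);
        long rows = 0;
        for (Map.Entry<String, Integer> e : l.entrySet()) {
            rows += (long) e.getValue() * r.getOrDefault(e.getKey(), 0);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Join keys ($0) in the order they appear in students.txt; "qwer" occurs twice.
        List<String> keys = Arrays.asList("asdfxc", "qwer", "uhsdf", "zxldf", "qwer", "oiue");
        System.out.println(expectedInnerJoinRows(keys, keys));
    }
}
```

Under these assumptions the self-join should yield 8 rows (four keys contributing 1 row each, plus 2 x 2 for the repeated key), whereas the tmp.txt listing in the report appears to contain 12, because every single-row key shows up twice.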
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1017: - Status: Open (was: Patch Available) Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: stotext.patch Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1017: - Attachment: (was: stotext.patch) Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1017: - Attachment: stotext.patch Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: stotext.patch Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Commented: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12767952#action_12767952 ] Sriranjan Manjunath commented on PIG-1017: -- The release audit warnings are related to HTML files. Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: stotext.patch Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1017: - Status: Patch Available (was: Open) Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: stotext.patch Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Updated: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-1017: - Attachment: stotext.patch The patch will fail the MRCompiler and LogToPhyTranslator unit tests because the golden files need to be replaced. The rest should pass. Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Attachments: stotext.patch Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Assigned: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath reassigned PIG-1017: Assignee: Sriranjan Manjunath Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: stotext.patch Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Commented: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765380#action_12765380 ] Sriranjan Manjunath commented on PIG-1017: -- Pigmix results before and after converting strings to text:
||Pigmix query||Trunk||Modified code||
|L1|3:2|2:24|
|L2|2:6|1:23|
|L3|3:36|3:49|
|L4|1:42|1:49|
|L5|1:49|1:49|
|L6|1:47|3:3|
|L7|1:44|1:49|
|L8|1:19|1:18|
|L9|4:6|5:35|
|L10|8:52|7:56|
|L11|2:26|1:34|
|L12|1:57|1:54|
Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
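The Pigmix timings above are easier to compare once converted to seconds. The sketch below assumes each value is written as minutes:seconds without zero padding (so "3:2" would mean 3 min 2 s); that reading is an assumption, since the comment does not state the unit. The class name is hypothetical.

```java
// Illustrative only: compares the Trunk and Modified code columns in seconds.
class PigmixTimes {
    // Parses "m:s", e.g. "3:2" -> 182. Assumes a minutes:seconds format.
    static int toSeconds(String t) {
        String[] parts = t.trim().split(":");
        return Integer.parseInt(parts[0]) * 60 + Integer.parseInt(parts[1]);
    }

    // Sums a column of "m:s" values, in seconds.
    static int total(String[] times) {
        int sum = 0;
        for (String t : times) {
            sum += toSeconds(t);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Values copied from the table, queries L1 through L12.
        String[] trunk = {"3:2", "2:6", "3:36", "1:42", "1:49", "1:47",
                          "1:44", "1:19", "4:6", "8:52", "2:26", "1:57"};
        String[] modified = {"2:24", "1:23", "3:49", "1:49", "1:49", "3:3",
                             "1:49", "1:18", "5:35", "7:56", "1:34", "1:54"};
        for (int i = 0; i < trunk.length; i++) {
            int delta = toSeconds(modified[i]) - toSeconds(trunk[i]);
            System.out.printf("L%d: %+d s%n", i + 1, delta);
        }
        System.out.printf("total: trunk=%d s, modified=%d s%n", total(trunk), total(modified));
    }
}
```

A negative delta means the modified code was faster on that query. Single runs like these are noisy, so the per-query differences should be read with caution.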
[jira] Commented: (PIG-1017) Converts strings to text in Pig
[ https://issues.apache.org/jira/browse/PIG-1017?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12765381#action_12765381 ] Sriranjan Manjunath commented on PIG-1017: -- Something fishy is going on. I ran L6 a couple more times with the modified code and it completed in 1:8. Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
[jira] Created: (PIG-1017) Converts strings to text in Pig
Converts strings to text in Pig --- Key: PIG-1017 URL: https://issues.apache.org/jira/browse/PIG-1017 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Strings in Java are UTF-16 and take at least 2 bytes per character. Text (org.apache.hadoop.io.Text) stores its data in UTF-8, which could significantly reduce memory use for mostly-ASCII data.
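The size difference behind PIG-1017 can be checked with the JDK alone. The sketch below compares the encoded length of the same text in UTF-16 and UTF-8; it does not use org.apache.hadoop.io.Text itself and ignores per-object overhead, so it only illustrates the encoding argument. The class name and sample strings are made up.

```java
import java.nio.charset.StandardCharsets;

// Illustrative only: encoded sizes, not actual heap usage.
class EncodingSize {
    // Bytes of character data in UTF-16 (big-endian, no byte-order mark).
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // Bytes of the same text in UTF-8.
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        // ASCII, Latin with one accented letter, and CJK samples.
        String[] samples = {"students.txt", "caf\u00e9", "\u65e5\u672c\u8a9e"};
        for (String s : samples) {
            System.out.println(s + ": utf16=" + utf16Bytes(s) + " utf8=" + utf8Bytes(s));
        }
    }
}
```

For ASCII text UTF-8 needs half the bytes, which is the case the issue has in mind. For scripts such as CJK, UTF-8 needs three bytes per character and can be larger than UTF-16, so the saving depends on the data.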
[jira] Created: (PIG-964) Handling null keys in skewed join
Handling null keys in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath The tuple size is calculated incorrectly and thus the skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Attachment: skjoin2b.patch The attached patch solves both issues. Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skjoin2b.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Description: For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. (was: The tuple size is calculated incorrectly and thus the skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution.) Summary: Handling null in skewed join (was: Handling null keys in skewed join) Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skjoin2b.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Assignee: Sriranjan Manjunath Status: Patch Available (was: Open) Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skjoin2b.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Status: Open (was: Patch Available) Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skjoin2b.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Attachment: (was: skjoin2b.patch) Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skewedjoinnull.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Attachment: skewedjoinnull.patch Passed the end-to-end tests and added a new unit test that checks for nulls in the dataset. Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skewedjoinnull.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution.
[jira] Updated: (PIG-964) Handling null in skewed join
[ https://issues.apache.org/jira/browse/PIG-964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-964: Status: Patch Available (was: Open) Handling null in skewed join - Key: PIG-964 URL: https://issues.apache.org/jira/browse/PIG-964 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skewedjoinnull.patch For null tuples, the tuple size is calculated incorrectly and thus skewed join ends up expecting a large number of reducers. Further, skewed join should not bail out after the second job if the number of reducers specified by the user is low. It should print a warning message and continue execution. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
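The second request in PIG-964, warning and continuing instead of bailing out when the user asks for too few reducers, could look roughly like the sketch below. This is not the patch and not Pig's API: the class, the method name reducersForKey, and the message text are all hypothetical, and the real sampler works from size estimates that are not modelled here.

```java
// Illustrative only: hypothetical helper, not taken from skewedjoinnull.patch.
class SkewedReducerSketch {
    // Caps the reducers a hot key would like at the number the user requested.
    // Prints a warning and continues rather than failing the job.
    static int reducersForKey(int wanted, int userParallel) {
        if (userParallel < 1) {
            throw new IllegalArgumentException("parallel must be at least 1");
        }
        if (wanted > userParallel) {
            System.err.println("WARN: key wants " + wanted + " reducers, only "
                    + userParallel + " requested; continuing with " + userParallel);
            return userParallel;
        }
        // Every key needs at least one reducer.
        return Math.max(wanted, 1);
    }

    public static void main(String[] args) {
        System.out.println(reducersForKey(20, 8)); // capped, with a warning on stderr
        System.out.println(reducersForKey(3, 8));  // fits within the requested parallelism
    }
}
```

The design choice is simply to degrade gracefully: a capped allocation may leave the hot key unevenly loaded, but the join still completes.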
[jira] Updated: (PIG-962) Skewed join creates 3 map reduce jobs
[ https://issues.apache.org/jira/browse/PIG-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-962: Description: The first job is a load / store job which loads the data from PigStorage and stores it in BinStorage. This hampers performance. The desired behavior is for the sampler to read from PigStorage instead of relying on the first load/store job. Skewed join should thus be 2 M/R jobs and not 3. (was: The first job is a load / store job which loads the data from PigStorage and stores it in BinStorage. This hampers performance. The desired behavior is for the sampler to read from PigStorage instead of relying on the first load/store job.) Skewed join creates 3 map reduce jobs - Key: PIG-962 URL: https://issues.apache.org/jira/browse/PIG-962 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath The first job is a load / store job which loads the data from PigStorage and stores it in BinStorage. This hampers performance. The desired behavior is for the sampler to read from PigStorage instead of relying on the first load/store job. Skewed join should thus be 2 M/R jobs and not 3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Created: (PIG-962) Skewed join creates 3 map reduce jobs
Skewed join creates 3 map reduce jobs - Key: PIG-962 URL: https://issues.apache.org/jira/browse/PIG-962 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath The first job is a load / store job which loads the data from PigStorage and stores it in BinStorage. This hampers performance. The desired behavior is for the sampler to read from PigStorage instead of relying on the first load/store job. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-962) Skewed join creates 3 map reduce jobs
[ https://issues.apache.org/jira/browse/PIG-962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-962: Status: Patch Available (was: Open) Skewed join creates 3 map reduce jobs - Key: PIG-962 URL: https://issues.apache.org/jira/browse/PIG-962 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Assignee: Sriranjan Manjunath Attachments: skewedjoin2job.patch The first job is a load / store job which loads the data from PigStorage and stores it in BinStorage. This hampers performance. The desired behavior is for the sampler to read from PigStorage instead of relying on the first load/store job. Skewed join should thus be 2 M/R jobs and not 3. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750297#action_12750297 ] Sriranjan Manjunath commented on PIG-935: - The unit test failures are unrelated to my patch. Skewed join throws an exception when used with map keys --- Key: PIG-935 URL: https://issues.apache.org/jira/browse/PIG-935 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skmapbug.patch Skewed join throws a runtime exception for the following query:
A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;
Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-890) Create a sampler interface and improve the skewed join sampler
[ https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-890: Status: Open (was: Patch Available) Create a sampler interface and improve the skewed join sampler -- Key: PIG-890 URL: https://issues.apache.org/jira/browse/PIG-890 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Attachments: sampler.patch We need a different sampler for order by and skewed join. We thus need a better sampling interface. The design of the same is described here: http://wiki.apache.org/pig/PigSampler -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-890) Create a sampler interface and improve the skewed join sampler
[ https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-890: Attachment: (was: sampler.patch) Create a sampler interface and improve the skewed join sampler -- Key: PIG-890 URL: https://issues.apache.org/jira/browse/PIG-890 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Attachments: samplerinterface.patch We need a different sampler for order by and skewed join. We thus need a better sampling interface. The design of the same is described here: http://wiki.apache.org/pig/PigSampler -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Updated: (PIG-890) Create a sampler interface and improve the skewed join sampler
[ https://issues.apache.org/jira/browse/PIG-890?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-890: Status: Patch Available (was: Open) Create a sampler interface and improve the skewed join sampler -- Key: PIG-890 URL: https://issues.apache.org/jira/browse/PIG-890 Project: Pig Issue Type: Improvement Reporter: Sriranjan Manjunath Attachments: samplerinterface.patch We need a different sampler for order by and skewed join. We thus need a better sampling interface. The design of the same is described here: http://wiki.apache.org/pig/PigSampler -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-942) Maps are not implicitly casted
[ https://issues.apache.org/jira/browse/PIG-942?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12750796#action_12750796 ] Sriranjan Manjunath commented on PIG-942: - Here's the complete script:
A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'100', B by (chararray)f#'100';
dump C;
Maps are not implicitly casted -- Key: PIG-942 URL: https://issues.apache.org/jira/browse/PIG-942 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath A = load 'foo' as (m) throws the following exception when foo has maps.
java.lang.ClassCastException: org.apache.pig.data.DataByteArray cannot be cast to java.util.Map
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:98)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POMapLookUp.getNext(POMapLookUp.java:115)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:612)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.processPlan(POForEach.java:278)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POForEach.getNext(POForEach.java:204)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:231)
    at org.apache.pig.backend.hadoop.executionengine.physicalLayer.relationalOperators.POLocalRearrange.getNext(POLocalRearrange.java:240)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.runPipeline(PigMapBase.java:249)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapBase.map(PigMapBase.java:240)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.PigMapReduce$Map.map(PigMapReduce.java:93)
    at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:227)
    at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2198)
The same works if I explicitly cast m to a map: A = load 'foo' as (m:[])
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-935: Attachment: skmapbug.patch Added code to explicitly check for -1 in orderby. Skewed join throws an exception when used with map keys --- Key: PIG-935 URL: https://issues.apache.org/jira/browse/PIG-935 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skmapbug.patch Skewed join throws a runtime exception for the following query:
A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;
Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
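The trace in PIG-935 shows an unconditional cast to POProject failing because the join key is wrapped in a POCast, and the patch note mentions checking for -1. One possible reading, sketched here with stand-in classes rather than Pig's real operators, is to return -1 when the key expression is not a plain projection and let the caller handle that value. Everything below (class names, the sortCol method) is hypothetical and is not taken from skmapbug.patch.

```java
// Illustrative only: stand-ins for Pig's physical expression operators.
class SortColSketch {
    static class Expr {}

    // Plain column projection, e.g. $0.
    static class Project extends Expr {
        final int column;
        Project(int column) { this.column = column; }
    }

    // A cast wrapped around another expression, e.g. (chararray)e#'a'.
    static class Cast extends Expr {
        final Expr input;
        Cast(Expr input) { this.input = input; }
    }

    // Returns the projected column, or -1 when the key is not a plain projection.
    // A blind ((Project) keyRoot) cast here would throw ClassCastException for Cast.
    static int sortCol(Expr keyRoot) {
        if (keyRoot instanceof Project) {
            return ((Project) keyRoot).column;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(sortCol(new Project(2)));
        System.out.println(sortCol(new Cast(new Project(0))));
    }
}
```

Callers would then need to treat -1 as "no simple sort column" rather than as a column index.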
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-935: Attachment: (was: skjoinmapbug.patch) Skewed join throws an exception when used with map keys --- Key: PIG-935 URL: https://issues.apache.org/jira/browse/PIG-935 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skmapbug.patch Skewed join throws a runtime exception for the following query:
A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;
Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-935: Status: Patch Available (was: Open) Skewed join throws an exception when used with map keys --- Key: PIG-935 URL: https://issues.apache.org/jira/browse/PIG-935 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skmapbug.patch Skewed join throws a runtime exception for the following query:
A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;
Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Sriranjan Manjunath updated PIG-935: Status: Open (was: Patch Available) Skewed join throws an exception when used with map keys --- Key: PIG-935 URL: https://issues.apache.org/jira/browse/PIG-935 Project: Pig Issue Type: Bug Reporter: Sriranjan Manjunath Attachments: skjoinmapbug.patch Skewed join throws a runtime exception for the following query:
A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;
Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Attachment: skjoinmapbug.patch

Fixed the issue with unit tests

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch, skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Status: Open (was: Patch Available)

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch, skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Attachment: (was: skjoinmapbug.patch)

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Attachment: (was: skjoinmapbug.patch)

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Attachment: skjoinmapbug.patch

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Status: Patch Available (was: Open)

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more
[jira] Updated: (PIG-935) Skewed join throws an exception when used with map keys
[ https://issues.apache.org/jira/browse/PIG-935?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sriranjan Manjunath updated PIG-935:

    Status: Patch Available (was: Open)

The attached patch solves this issue.

Skewed join throws an exception when used with map keys
-------------------------------------------------------

    Key: PIG-935
    URL: https://issues.apache.org/jira/browse/PIG-935
    Project: Pig
    Issue Type: Bug
    Reporter: Sriranjan Manjunath
    Attachments: skjoinmapbug.patch

Skewed join throws a runtime exception for the following query:

A = load 'map.txt' as (e);
B = load 'map.txt' as (f);
C = join A by (chararray)e#'a', B by (chararray)f#'a' using skewed;
explain C;

Exception:
Caused by: java.lang.ClassCastException: org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast cannot be cast to org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSortCols(MRCompiler.java:1492)
    at org.apache.pig.backend.hadoop.executionengine.mapReduceLayer.MRCompiler.getSamplingJob(MRCompiler.java:1894)
    ... 27 more