[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793296#action_12793296 ]
Olga Natkovich commented on PIG-1143:
+1 on the code changes. There is an extra debug trace in the code that I will remove as part of the commit.

Poisson Sample Loader should compute the number of samples required only once
Key: PIG-1143
URL: https://issues.apache.org/jira/browse/PIG-1143
Project: Pig
Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
Attachments: PIG_1143.patch, PIG_1143.patch.1

The current Poisson sampler forces each of the maps to compute the sample number. This is redundant and causes issues when a large directory is specified in the join. The sampler should be changed to calculate the sample count only once, and this information should be shared with the remaining mappers.

-- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12793322#action_12793322 ]
Olga Natkovich commented on PIG-1143:
Patch committed to the trunk. Will commit to the 0.6 branch tomorrow.
Fix For: 0.6.0
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12791872#action_12791872 ]
Hadoop QA commented on PIG-1143:
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12428266/PIG_1143.patch.1 against trunk revision 891499.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 9 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/console
This message is automatically generated.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790900#action_12790900 ]
Olga Natkovich commented on PIG-1143:
I am reviewing this patch.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790983#action_12790983 ]
Olga Natkovich commented on PIG-1143:
I think this needs to be tested with multiple skew joins, both in the single-store case and in the multiquery case. Please add unit tests.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12790566#action_12790566 ]
Hadoop QA commented on PIG-1143:
+1 overall. Here are the results of testing the latest attachment http://issues.apache.org/jira/secure/attachment/12427980/PIG_1143.patch against trunk revision 890553.
+1 @author. The patch does not contain any @author tags.
+1 tests included. The patch appears to include 6 new or modified tests.
+1 javadoc. The javadoc tool did not generate any warning messages.
+1 javac. The applied patch does not increase the total number of javac compiler warnings.
+1 findbugs. The patch does not introduce any new Findbugs warnings.
+1 release audit. The applied patch does not increase the total number of release audit warnings.
+1 core tests. The patch passed core unit tests.
+1 contrib tests. The patch passed contrib unit tests.
Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/console
This message is automatically generated.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788971#action_12788971 ]
Sriranjan Manjunath commented on PIG-1143:
To describe the problem in more detail, the current implementation does not handle a glob efficiently. When the sample loader encounters a directory (or combinations thereof), it gets the element descriptors of all the files inside the directory to compute the file sizes. For example, A = load {view, click} will result in computing the file sizes of all the files underneath both the view and click directories. If we have a large number of mappers, this will result in a ton of HDFS calls, clogging the name node.

I intend to modify Poisson Sample Loader as follows. The algorithm for computing the total number of samples remains the same. However, it will not be computed by every mapper. I will be using the UDFContext object to share this information across mappers. Since mappers/reducers can only read the information from UDFContext, the slicer will store this information.

The slicer will compute the sample count for the first map. As before, PigSlice will call computeSamples() for the first map. It will then store this value as a property in the UDFContext object. The slicer will check UDFContext to see if this value is set and, if it is, use it instead of computing it again. I intend to use pig.input.0.sampleCount as the key.

This solution will reduce the fileSize() invocations to a minimum and should reduce the burden on the name node.
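The compute-once scheme proposed in the comment above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the contents of the patch: a plain java.util.Properties object stands in for the property bag held by UDFContext, and the class name SampleCountCache, the method getOrCompute, and the supplier are hypothetical. Only the key pig.input.0.sampleCount is taken from the comment.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

// Hypothetical illustration; Properties stands in for the UDFContext property bag.
class SampleCountCache {
    // Key proposed in the comment above.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    // Returns the stored sample count if one is present; otherwise runs the
    // expensive computation once and stores the result for later callers.
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        int[] calls = {0};
        // Stand-in for the file-size based computation; the value is arbitrary.
        LongSupplier expensive = () -> { calls[0]++; return 1000L; };

        long first = getOrCompute(props, expensive);
        long second = getOrCompute(props, expensive);
        System.out.println(first + " " + second + " computed " + calls[0] + " time(s)");
    }
}
```

In this sketch the second call never touches the supplier, which is the property the proposal relies on to avoid repeated fileSize() calls against the name node.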
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788998#action_12788998 ]
Olga Natkovich commented on PIG-1143:
Sounds like a good approach. We need to figure out how this will translate into Load-Store redesign and make sure to port it there once the patch is available.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789013#action_12789013 ]
Thejas M Nair commented on PIG-1143:
The PoissonSampleLoader implementation in load-store redesign does not check the file size and takes a different approach, for the following reason (as mentioned in PIG-1062):

With the new interfaces in load-store redesign, Pig can compute the file size by adding up the size of each split (from InputSplit.getLength()). But the documentation of that function does not make it clear whether this is the size on disk, compressed or uncompressed, etc. It looks like it just needs to be some number proportional to the size of the file. Assuming it is the size on disk (uncompressed), using it to estimate the total memory required is tricky; one has to make assumptions about the compression ratio and the serialization method. Using Tuple.getMemorySize() while sampling will give more accurate numbers for the reducer memory that will be consumed.
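For illustration, adding up split lengths as described above might look like the sketch below. The Split interface is a hypothetical stand-in that models only the getLength() accessor of Hadoop's InputSplit, and SplitSizeEstimate and totalLength are made-up names; the sketch says nothing about what unit or encoding a real split length reflects, which is exactly the ambiguity raised in the comment.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of summing split lengths to approximate input size.
class SplitSizeEstimate {
    // Stand-in for Hadoop's InputSplit; only the length accessor is modeled.
    interface Split {
        long getLength();
    }

    // Adds up the reported length of every split.
    static long totalLength(List<? extends Split> splits) {
        long total = 0;
        for (Split split : splits) {
            total += split.getLength();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Split> splits = new ArrayList<>();
        splits.add(() -> 64L);
        splits.add(() -> 64L);
        splits.add(() -> 10L);
        System.out.println(totalLength(splits)); // prints 138
    }
}
```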
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789015#action_12789015 ]
Thejas M Nair commented on PIG-1143:
To summarize my above comment, the approach in load-store redesign of not using the file size at all is better.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789023#action_12789023 ]
Sriranjan Manjunath commented on PIG-1143:
The file size in the documentation refers to the size on disk. To account for compression, encoding, etc., a configurable parameter, pig.inputfile.conversionfactor, is provided. I agree that this cannot be set to a good value for compressed data. It is just guidance. The implications of setting it to a bad value are minimal: we will end up sampling a little more than the required number of samples (unless you set it to a fraction).
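As a rough illustration of why a fractional value matters, the sketch below scales an on-disk size by the configured factor. It is an assumption-laden example, not Pig's actual code: the thread does not say how the factor enters the sample-count computation, the default of 1.0 is assumed, and ConversionFactorEstimate and estimatedBytes are hypothetical names. Only the property name pig.inputfile.conversionfactor comes from the comment.

```java
import java.util.Properties;

// Hypothetical illustration of scaling an on-disk size by a conversion factor.
class ConversionFactorEstimate {
    // Property name taken from the comment above.
    static final String FACTOR_KEY = "pig.inputfile.conversionfactor";

    // Multiplies the on-disk size by the configured factor.
    // The default of 1.0 is an assumption made for this sketch.
    static long estimatedBytes(long onDiskBytes, Properties props) {
        double factor = Double.parseDouble(props.getProperty(FACTOR_KEY, "1.0"));
        return (long) Math.ceil(onDiskBytes * factor);
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.setProperty(FACTOR_KEY, "2");
        System.out.println(estimatedBytes(1000L, props)); // prints 2000

        props.setProperty(FACTOR_KEY, "0.5");
        System.out.println(estimatedBytes(1000L, props)); // prints 500
    }
}
```

A factor above 1 inflates the estimate, which would err toward oversampling; a fraction shrinks it, which is the case the comment warns about.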
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789063#action_12789063 ]
Sriranjan Manjunath commented on PIG-1143:
I am OK with using InputSplit.getLength() as long as it provides a good estimate of the file size. Without the population size, Poisson samplers do not work well. Samplers expect the data to be in BinStorage. If it is not, the first job reads it and stores it into BinStorage. The only exception is when the join follows a load/store-only MR job.
[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once
[ https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789069#action_12789069 ]
Thejas M Nair commented on PIG-1143:
If the data is going to be in BinStorage, my comments regarding the approach for this patch are not applicable. But the patch does not need to be ported to the load-store redesign branch.