[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12783728#action_12783728 ]

Olga Natkovich commented on PIG-872:
------------------------------------

Patch committed to the 0.6.0 branch.

use distributed cache for the replicated data set in FR join
------------------------------------------------------------

                Key: PIG-872
                URL: https://issues.apache.org/jira/browse/PIG-872
            Project: Pig
         Issue Type: Improvement
           Reporter: Olga Natkovich
           Assignee: Sriranjan Manjunath
            Fix For: 0.7.0
        Attachments: PIG_872.patch.1

Currently, the replicated file is read directly from DFS by all maps. If the number of concurrent maps is huge, we can overwhelm the NameNode with open calls. Using the distributed cache will address the issue and might also give a performance boost, since the file will be copied locally once and then reused by all tasks running on the same machine.

The basic approach would be to use cacheArchive to place the file into the cache on the frontend; on the backend, the tasks would refer to the data using the path from the cache. Note that cacheArchive does not work in Hadoop local mode. (Not a problem for us right now, as we don't use it.)

--
This message is automatically generated by JIRA. You can reply to this email to add a comment to the issue online.
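As an illustrative sketch only (not Pig's actual code; the class and method names here are invented), the frontend/backend split described above can be mimicked in plain Java: the frontend names the cached file with a "#symlink" fragment, and the backend task then reads the same data through a task-local relative path.

```java
// Illustrative sketch only -- "CacheUriSketch", "cacheUri", and "localPath"
// are invented names, not part of Pig or Hadoop.
import java.net.URI;

public class CacheUriSketch {
    // Frontend: attach a "#symlink" fragment so every task sees a stable local name.
    static URI cacheUri(String dfsPath, String symlink) {
        return URI.create(dfsPath + "#" + symlink);
    }

    // Backend: the task's working directory contains the symlink, so the
    // replicated input can be opened as a plain relative path.
    static String localPath(URI cacheUri) {
        return "./" + cacheUri.getFragment();
    }

    public static void main(String[] args) {
        URI u = cacheUri("hdfs://nn:8020/tmp/replicated.dat", "repl_0");
        System.out.println(u);            // hdfs://nn:8020/tmp/replicated.dat#repl_0
        System.out.println(localPath(u)); // ./repl_0
    }
}
```

This is also why cacheArchive does not help in Hadoop local mode: there is no per-task working directory into which the symlink is materialized.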
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781415#action_12781415 ]

Hadoop QA commented on PIG-872:
-------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12425805/PIG_872.patch.1
against trunk revision 882818.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    -1 core tests. The patch failed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/console

This message is automatically generated.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12781560#action_12781560 ]

Hadoop QA commented on PIG-872:
-------------------------------

-1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12425805/PIG_872.patch.1
against trunk revision 882818.

    +1 @author. The patch does not contain any @author tags.

    -1 tests included. The patch doesn't appear to include any new or modified tests. Please justify why no tests are needed for this patch.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/console
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779867#action_12779867 ]

Sriranjan Manjunath commented on PIG-872:
-----------------------------------------

Olga, I agree with your first point; I will get rid of the test case. To rectify the second, shouldn't MapReduceOper.getReplFiles() return only the replicated files? What's the rationale behind returning a null for the fragmented input? I could change it to what Ashutosh suggested, but it would just be cleaner if the fragmented input were not represented by a null.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780148#action_12780148 ]

Olga Natkovich commented on PIG-872:
------------------------------------

I am fine if you want to remove it, as long as it does not break any existing functionality. I am not sure why it is present in the list.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779110#action_12779110 ]

Olga Natkovich commented on PIG-872:
------------------------------------

I am reviewing this patch.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779238#action_12779238 ]

Olga Natkovich commented on PIG-872:
------------------------------------

Patch looks good. A couple of comments:

(1) It looks like the newly added test just has a plain replicated join. I think this is already covered in TestFRJoin.java.

(2) JobControlCompiler.java, line 943: the comment says that the fragmented input is usually null and that it is the first one. Would it be clearer if we just started with the second file and did not have to check for null every time, since a null in any other position should be treated as an error?
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12779256#action_12779256 ]

Ashutosh Chauhan commented on PIG-872:
--------------------------------------

There already exists MapReduceOper.getFragment(), which returns the index of the fragmented input in replFiles[]. This index should be used to identify the fragmented input instead of a null check. Then, as Olga suggested, any other null is an error.
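The suggestion above can be sketched in a few lines. This is illustrative only (not the actual JobControlCompiler code; the class and method names are invented): the fragmented input is identified by its index, as MapReduceOper.getFragment() would supply, and a null anywhere else is treated as an error.

```java
// Illustrative sketch only -- "ReplFilesSketch" and "replicatedInputs" are
// invented names; only MapReduceOper.getFragment()'s role is taken from the
// discussion above.
import java.util.ArrayList;
import java.util.List;

public class ReplFilesSketch {
    static List<String> replicatedInputs(String[] replFiles, int fragmentIdx) {
        List<String> toCache = new ArrayList<>();
        for (int i = 0; i < replFiles.length; i++) {
            if (i == fragmentIdx) {
                continue; // the fragmented input is streamed, not cached
            }
            if (replFiles[i] == null) {
                // no null check needed to find the fragment; a null here is a bug
                throw new IllegalStateException("missing replicated input at index " + i);
            }
            toCache.add(replFiles[i]);
        }
        return toCache;
    }
}
```

With the fragment at index 0, replicatedInputs(new String[]{null, "hdfs://nn/a", "hdfs://nn/b"}, 0) yields the two replicated files, while a null at any other index fails fast instead of being silently skipped.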
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12778732#action_12778732 ]

Hadoop QA commented on PIG-872:
-------------------------------

+1 overall. Here are the results of testing the latest attachment
http://issues.apache.org/jira/secure/attachment/12425174/PIG_872.patch
against trunk revision 881008.

    +1 @author. The patch does not contain any @author tags.

    +1 tests included. The patch appears to include 3 new or modified tests.

    +1 javadoc. The javadoc tool did not generate any warning messages.

    +1 javac. The applied patch does not increase the total number of javac compiler warnings.

    +1 findbugs. The patch does not introduce any new Findbugs warnings.

    +1 release audit. The applied patch does not increase the total number of release audit warnings.

    +1 core tests. The patch passed core unit tests.

    +1 contrib tests. The patch passed contrib unit tests.

Test results: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/console
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727648#action_12727648 ]

Pradeep Kamath commented on PIG-872:
------------------------------------

Distributed cache can be used for the case where the replicated input for the fragment-replicate join is a file present on DFS at query start time. When the replicated input is an intermediate output in the query (say, the output of a filter), the client side will know the filename of this intermediate output - but we have to check that Hadoop will honor the distributed cache property only while launching the job corresponding to the FR join (so that, at that time, the file is present on DFS) - if this is not the case, the approach may not work for intermediate-output scenarios.
[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join
[ https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12727660#action_12727660 ]

Milind Bhandarkar commented on PIG-872:
---------------------------------------

A couple of things. First, as Pradeep says, only the Hadoop job that performs the FR join needs to add the replicated dataset to the distributed cache. Second, make sure that the replicated dataset has high replication, such as 10 (or the same replication as job.jar). For an already materialized dataset, Pig need not do anything except warn if the replication factor is small (e.g., 3). But if the replicated dataset is being produced as an intermediate output by Pig, it needs to be generated with a high replication factor.
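The replication policy above can be encoded in a small sketch. This is illustrative only (the names and the threshold value are assumptions, not Pig's actual code): warn for a pre-existing dataset with low replication, and raise the replication up front when Pig itself writes the intermediate output.

```java
// Illustrative sketch only -- "ReplicationSketch", "desiredReplication", and
// the HIGH_REPLICATION value are assumptions for the sake of the example.
public class ReplicationSketch {
    static final short HIGH_REPLICATION = 10; // e.g. the same replication as job.jar

    static short desiredReplication(short current, boolean producedByPig) {
        if (producedByPig) {
            // Pig controls the write, so request high replication directly.
            return (short) Math.max(current, HIGH_REPLICATION);
        }
        if (current < HIGH_REPLICATION) {
            // Already materialized: warn only, leave the file as-is.
            System.err.println("warning: replicated input has replication factor " + current);
        }
        return current;
    }
}
```

So a Pig-generated intermediate with the default replication of 3 would be written with replication 10, while a pre-existing input with replication 3 is left alone and merely flagged.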