[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12783728#action_12783728
 ] 

Olga Natkovich commented on PIG-872:


patch committed to 0.6.0 branch

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Fix For: 0.7.0

 Attachments: PIG_872.patch.1


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781415#action_12781415
 ] 

Hadoop QA commented on PIG-872:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425805/PIG_872.patch.1
  against trunk revision 882818.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

-1 core tests.  The patch failed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/165/console

This message is automatically generated.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch.1


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-23 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12781560#action_12781560
 ] 

Hadoop QA commented on PIG-872:
---

-1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425805/PIG_872.patch.1
  against trunk revision 882818.

+1 @author.  The patch does not contain any @author tags.

-1 tests included.  The patch doesn't appear to include any new or modified 
tests.
Please justify why no tests are needed for this patch.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/51/console

This message is automatically generated.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch.1


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-19 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12779867#action_12779867
 ] 

Sriranjan Manjunath commented on PIG-872:
-

Olga, I agree with your 1st point. I will get rid of the test case.
To rectify 2, shouldn't maprReduceOper.getReplFiles() return only the 
replicated files? What's the rationale behind returning a null for the 
fragmented input? I could change it to what Ashutosh suggested, but it would 
just be cleaner if fragmented input was not represented by a null.


 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-19 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12780148#action_12780148
 ] 

Olga Natkovich commented on PIG-872:


I am fine if you want to remove it as long as it does not break any existing 
functionality. I am not sure why it is present in the list.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12779110#action_12779110
 ] 

Olga Natkovich commented on PIG-872:


I am reviewing this patch

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-17 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12779238#action_12779238
 ] 

Olga Natkovich commented on PIG-872:


Patch looks good. A couple of comments:

(1) Looks like the newly added test just have a plain replicated join. I think 
this is already covered in TestFRJoin.java.
(2) JobControlCompiler.java 943 lines: the comment says that the fragmented 
input is usually null and that it is the first one. Would it be more clear if 
we just started with the second file and did not have to check for null every 
time since null in any other position should be treated as error.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-17 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12779256#action_12779256
 ] 

Ashutosh Chauhan commented on PIG-872:
--

There already exists MapReduceOper.getFragment() which will return the index of 
fragmented input in replFiles[]. This index should be used to identify the 
fragmented input instead of null check. And then as Olga suggested any other 
null is an error.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-11-16 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12778732#action_12778732
 ] 

Hadoop QA commented on PIG-872:
---

+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12425174/PIG_872.patch
  against trunk revision 881008.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/157/console

This message is automatically generated.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Sriranjan Manjunath
 Attachments: PIG_872.patch


 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-07-06 Thread Pradeep Kamath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727648#action_12727648
 ] 

Pradeep Kamath commented on PIG-872:


Distributed cache can be used for the case where the replicate input for the 
fragment-replicate join is a file present on DFS at query start time. When the 
replicate input is an intermediate output in the query (say the output of a 
filter), the client side will know the filename of this intermediate output - 
but we have to check that hadoop will honor the ditributed cache property only 
while launching the job corresponding to the FR join (so that at that time, the 
file is present on dfs) - if this is not the case, the approach may not work 
for intermediate output scenarios.

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-872) use distributed cache for the replicated data set in FR join

2009-07-06 Thread Milind Bhandarkar (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12727660#action_12727660
 ] 

Milind Bhandarkar commented on PIG-872:
---

A couple of things:

As Pradeep says, only the hadoop job that performs FR join needs to add the 
replicated dataset to distributed cache.

Second, make sure that the replicated dataset has high replication, such as 10 
(or the same replication as job.jar). For already materialized dataset, Pig 
need not do anything but only warn if the replication factor is small (e.g. 3) 
But if the replicated dataset is being produced as an intermediate output by 
Pig, it needs to be generated with high replication factor. 

 use distributed cache for the replicated data set in FR join
 

 Key: PIG-872
 URL: https://issues.apache.org/jira/browse/PIG-872
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich

 Currently, the replicated file is read directly from DFS by all maps. If the 
 number of the concurrent maps is huge, we can overwhelm the NameNode with 
 open calls.
 Using distributed cache will address the issue and might also give a 
 performance boost since the file will be copied locally once and the reused 
 by all tasks running on the same machine.
 The basic approach would be to use cacheArchive to place the file into the 
 cache on the frontend and on the backend, the tasks would need to refer to 
 the data using path from the cache.
 Note that cacheArchive does not work in Hadoop local mode. (Not a problem for 
 us right now as we don't use it.)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.