[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793296#action_12793296
 ] 

Olga Natkovich commented on PIG-1143:
-

+1 on the code changes. There is an extra debug trace in the code that I will 
remove as part of the commit.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch, PIG_1143.patch.1


 The current poisson sampler forces each of the maps to compute the sample 
 number. This is redundant and causes issues when a large directory is 
 specified in the join. The sampler should be changed to calculate the sample 
 count only once and this information should be shared with the remaining 
 mappers.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-21 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12793322#action_12793322
 ] 

Olga Natkovich commented on PIG-1143:
-

Patch committed to trunk. Will commit to the 0.6 branch tomorrow.

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Fix For: 0.6.0

 Attachments: PIG_1143.patch, PIG_1143.patch.1





[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-17 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12791872#action_12791872
 ] 

Hadoop QA commented on PIG-1143:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12428266/PIG_1143.patch.1
  against trunk revision 891499.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 9 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/135/console

This message is automatically generated.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-15 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790900#action_12790900
 ] 

Olga Natkovich commented on PIG-1143:
-

I am reviewing this patch

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath
 Attachments: PIG_1143.patch





[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-15 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790983#action_12790983
 ] 

Olga Natkovich commented on PIG-1143:
-

I think this needs to be tested with multiple skew joins, both in the case of a 
single store and with multiquery. Please add unit tests.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-14 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12790566#action_12790566
 ] 

Hadoop QA commented on PIG-1143:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12427980/PIG_1143.patch
  against trunk revision 890553.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 6 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h8.grid.sp2.yahoo.net/123/console

This message is automatically generated.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788971#action_12788971
 ] 

Sriranjan Manjunath commented on PIG-1143:
--

To describe the problem in more detail, the current implementation does not 
handle a glob efficiently. When the sample loader encounters a directory (or 
combinations thereof), it gets the element descriptors of all the files inside 
the directory to compute the file sizes.
For example, A = load {view, click} will result in computing the file sizes of 
all the files underneath both the view and click directories. If we have a 
large number of mappers, this will result in a ton of HDFS calls, clogging the 
name node.

I intend to modify Poisson Sample Loader as follows. The algorithm for 
computing the total number of samples remains the same. However, it will not be 
computed by every mapper. I will be using the UDFContext object to share this 
information across mappers. Since mappers/reducers can only read the 
information from UDFContext, the slicer will store this information. The slicer 
will compute the sample count for the first map. As before, PigSlice will call 
computeSamples() for the first map. It will then store this value as a property 
in the UDFContext object. The slicer will check UDFContext to see if this value 
is set and, if it is, use it instead of computing it again. I intend to 
use pig.input.0.sampleCount as the key.

This solution will reduce the fileSize() invocations to a minimum and should 
reduce the burden on the name node.
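
The compute-once idea described above can be sketched roughly as follows. This 
is only an illustration, not the code in the patch: a plain java.util.Properties 
object stands in for the properties held by UDFContext, and the class name 
SampleCountCache and the method getOrCompute are hypothetical. Only the key 
pig.input.0.sampleCount comes from the comment itself.

```java
import java.util.Properties;
import java.util.function.LongSupplier;

/** Hypothetical compute-once cache for the sample count. */
final class SampleCountCache {
    // Key proposed in the comment; everything else in this class is illustrative.
    static final String SAMPLE_COUNT_KEY = "pig.input.0.sampleCount";

    private SampleCountCache() {}

    /**
     * Returns the stored sample count if the property is already set;
     * otherwise runs the expensive computation once and stores the result.
     */
    static long getOrCompute(Properties props, LongSupplier computeSamples) {
        String cached = props.getProperty(SAMPLE_COUNT_KEY);
        if (cached != null) {
            // Value was stored by an earlier call: no file-size lookups needed.
            return Long.parseLong(cached);
        }
        long count = computeSamples.getAsLong();
        props.setProperty(SAMPLE_COUNT_KEY, Long.toString(count));
        return count;
    }
}
```

With this shape, only the first caller pays for the file-size lookups; later 
callers read the stored property instead.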

 Poisson Sample Loader should compute the number of samples required only once
 -

 Key: PIG-1143
 URL: https://issues.apache.org/jira/browse/PIG-1143
 Project: Pig
  Issue Type: Bug
Reporter: Sriranjan Manjunath
Assignee: Sriranjan Manjunath




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12788998#action_12788998
 ] 

Olga Natkovich commented on PIG-1143:
-

Sounds like a good approach. We need to figure out how this will translate into 
Load-Store redesign and make sure to port it there once the patch is available.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789013#action_12789013
 ] 

Thejas M Nair commented on PIG-1143:


The PoissonSampleLoader implementation in the load-store redesign does not check 
the file size and takes a different approach, for the following reason (as 
mentioned in PIG-1062):

With the new interfaces in the load-store redesign, Pig can compute the file 
size by adding up the size of each split (from InputSplit.getLength()). But the 
documentation of that function does not make it clear whether this is the size 
on disk, compressed or uncompressed, etc. It looks like it just needs to be 
some number proportional to the size of the file. Assuming it is size on disk 
(uncompressed), using this to estimate the total memory required is tricky: one 
has to make assumptions about the compression ratio and the serialization method.
Using Tuple.getMemorySize() while sampling will give more accurate numbers for 
the reducer memory that will be consumed.
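
As a rough illustration of the "add up the size of each split" idea, here is a 
minimal sketch. SizedSplit is a hypothetical stand-in for a split type that 
reports its length in bytes (it is not the Hadoop InputSplit class), and 
SplitSizeEstimator is an invented name.

```java
import java.util.List;

/** Hypothetical stand-in for a split that reports its length in bytes. */
interface SizedSplit {
    long getLength();
}

final class SplitSizeEstimator {
    private SplitSizeEstimator() {}

    /** Sums the reported length of every split to approximate the input size. */
    static long totalBytes(List<? extends SizedSplit> splits) {
        long total = 0L;
        for (SizedSplit split : splits) {
            total += split.getLength();
        }
        return total;
    }
}
```

The sum is only as meaningful as the lengths the splits report, which is the 
ambiguity raised above.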





[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789015#action_12789015
 ] 

Thejas M Nair commented on PIG-1143:


To summarize my above comment, the approach in the load-store redesign of not 
using the file size at all is better.




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789023#action_12789023
 ] 

Sriranjan Manjunath commented on PIG-1143:
--

The file size in the documentation refers to the size on disk. In order to 
account for compression, encoding, etc., a configurable parameter, 
pig.inputfile.conversionfactor, is provided. I agree that this cannot be set to 
a good value for compressed data. It is just a guideline. The implications of 
setting it to a bad value are minimal: we will end up sampling a little more 
than the required number of samples (unless you set it to a fraction).
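
To make the role of the conversion factor concrete, here is a minimal sketch 
that assumes the factor is a plain multiplier applied to the on-disk size. That 
is an assumption for illustration; how Pig actually applies 
pig.inputfile.conversionfactor is not shown in this thread, and the class and 
method names below are hypothetical.

```java
final class ConversionFactorSketch {
    // Property name taken from the comment above; its use here is illustrative.
    static final String CONVERSION_FACTOR_KEY = "pig.inputfile.conversionfactor";

    private ConversionFactorSketch() {}

    /**
     * Scales the on-disk size by the conversion factor to guess the in-memory size.
     * Assumes a simple multiplicative model, which may not match Pig's formula.
     */
    static long estimatedInMemoryBytes(long onDiskBytes, double conversionFactor) {
        if (conversionFactor <= 0.0) {
            throw new IllegalArgumentException("conversionFactor must be positive");
        }
        return (long) Math.ceil(onDiskBytes * conversionFactor);
    }
}
```

Under this model, a factor above 1 inflates the size estimate and a fractional 
factor shrinks it, which matches the over- and under-sampling cases mentioned 
above.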




[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Sriranjan Manjunath (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789063#action_12789063
 ] 

Sriranjan Manjunath commented on PIG-1143:
--

I am OK with using InputSplit.getLength() as long as it provides a good 
estimate of the file size. Without the population size, Poisson samplers do not 
work well.

Samplers expect the data to be in BinStorage. If it is not, the first job reads 
it and stores it into BinStorage. The only exception is when the join follows a 
load/store-only MR job.





[jira] Commented: (PIG-1143) Poisson Sample Loader should compute the number of samples required only once

2009-12-10 Thread Thejas M Nair (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12789069#action_12789069
 ] 

Thejas M Nair commented on PIG-1143:


If the data is going to be in BinStorage, my comments regarding the approach 
for this patch are not applicable. But the patch does not need to be ported to 
load-store redesign branch.

