[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-02 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772890#action_12772890
 ] 

Ashutosh Chauhan commented on PIG-1037:
---

Thanks for the explanation, Alan. 

> better memory layout and spill for sorted and distinct bags
> ---
>
> Key: PIG-1037
> URL: https://issues.apache.org/jira/browse/PIG-1037
> Project: Pig
> Issue Type: Improvement
> Reporter: Olga Natkovich
> Assignee: Ying He
> Fix For: 0.6.0
>
> Attachments: PIG-1037.patch, PIG-1037.patch2, PIG-1037.patch3
>
>


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-02 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772772#action_12772772
 ] 

Alan Gates commented on PIG-1037:
-

The difference is much more than switching from dumping one tuple at a time to 
multiple tuples.  It is about how spilling is activated.  In the past, spilling 
was passive; it was done when the JVM informed us that memory was getting low.  
This did not work well as the JVM only checks memory usage when it garbage 
collects.  So by the time Pig was notified of a low memory condition it was 
often too late.  We often ran out of memory while trying to spill.  Now 
instead, spilling is active.  Pig sets aside a buffer for a bag to put its 
tuples in.  For default bags, once this buffer is full any additional tuples 
are written to disk.  For sorted or distinct bags, once the buffer is full it 
is sorted and dumped to disk, and new records go into the buffer.

This particular patch only adds the change for sorted and distinct bags.  
PIG-975 contains the original patch for default bags.
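
As a rough illustration of the active spilling described above for sorted 
bags, here is a minimal Java sketch.  It is not the code from the patch: the 
class name ActiveSpillSortedBag, the record-count buffer limit, and the use of 
plain strings in place of tuples are all assumptions made for the example.  
When the buffer is full, it is sorted and written out as one run, and new 
records start filling the emptied buffer.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch of active spilling; not Pig's InternalSortedBag. */
class ActiveSpillSortedBag {
    private final int bufferLimit;                        // assumed: limit counted in records
    private final List<String> buffer = new ArrayList<>();
    private final List<Path> spillFiles = new ArrayList<>();

    ActiveSpillSortedBag(int bufferLimit) {
        this.bufferLimit = bufferLimit;
    }

    /** Adds a record; a full buffer is sorted and spilled before the record goes in. */
    void add(String record) {
        if (buffer.size() >= bufferLimit) {
            spill();
        }
        buffer.add(record);
    }

    /** Sorts the buffer, writes it to a temp file as one sorted run, then empties it. */
    private void spill() {
        Collections.sort(buffer);
        try {
            Path file = Files.createTempFile("bag-spill", ".txt");
            file.toFile().deleteOnExit();
            Files.write(file, buffer);                    // one record per line
            spillFiles.add(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        buffer.clear();
    }

    int spillCount() {
        return spillFiles.size();
    }

    int inMemoryCount() {
        return buffer.size();
    }

    /**
     * Returns every record in sorted order. For brevity this reloads all runs
     * and sorts them in memory; a real bag would stream-merge the sorted runs.
     */
    List<String> sortedContents() {
        List<String> all = new ArrayList<>(buffer);
        for (Path file : spillFiles) {
            try {
                all.addAll(Files.readAllLines(file));
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }
        Collections.sort(all);
        return all;
    }
}
```

The point of the sketch is that the spill decision is made inside add(), by 
the thread that owns the bag, rather than in response to a low memory 
notification from the JVM.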





[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-11-01 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12772410#action_12772410
 ] 

Ashutosh Chauhan commented on PIG-1037:
---

I am a bit late to this, but I would appreciate it if someone could provide a 
brief description of how this patch improves the memory layout and alleviates 
the spill problem. I took a quick look at the patch. 
As I understand it, previously, when memory was about to be exhausted, Pig 
would start writing to disk one tuple at a time. With this new patch, once the 
memory limit is hit the whole bag is spilled to disk, and at that point the 
in-memory bag contains no tuples. If the in-memory bag fills again, all of its 
contents are spilled to disk again, and so on. So this patch ensures that we 
are not spilling one tuple at a time, but a full bag at a time. Is this 
correct, or am I missing something?




[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-27 Thread Hadoop QA (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770693#action_12770693
 ] 

Hadoop QA commented on PIG-1037:


+1 overall.  Here are the results of testing the latest attachment 
  http://issues.apache.org/jira/secure/attachment/12423339/PIG-1037.patch3
  against trunk revision 830034.

+1 @author.  The patch does not contain any @author tags.

+1 tests included.  The patch appears to include 3 new or modified tests.

+1 javadoc.  The javadoc tool did not generate any warning messages.

+1 javac.  The applied patch does not increase the total number of javac 
compiler warnings.

+1 findbugs.  The patch does not introduce any new Findbugs warnings.

+1 release audit.  The applied patch does not increase the total number of 
release audit warnings.

+1 core tests.  The patch passed core unit tests.

+1 contrib tests.  The patch passed contrib unit tests.

Test results: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/116/testReport/
Findbugs warnings: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/116/artifact/trunk/build/test/findbugs/newPatchFindbugsWarnings.html
Console output: 
http://hudson.zones.apache.org/hudson/job/Pig-Patch-h7.grid.sp2.yahoo.net/116/console

This message is automatically generated.




[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-26 Thread Ying He (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770246#action_12770246
 ] 

Ying He commented on PIG-1037:
--

Alan, thanks for the feedback.

For the calculation of average size, I think the cost of calculating it 100 
times should be minimal and should have no noticeable performance impact, so 
I'd like to keep it logically correct.  Very big tuples are possible, such as 
those with Map-type fields.  

For the comments and synchronization, I am going to make the change.




[jira] Commented: (PIG-1037) better memory layout and spill for sorted and distinct bags

2009-10-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-1037?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12770206#action_12770206
 ] 

Alan Gates commented on PIG-1037:
-

Comments:

In InternalSortedBag.add, you are calculating the average size every time you 
add a tuple for the first 100 tuples.  Rather than do the calculation every 
time, wouldn't it be better to wait until you get to 100 tuples and then 
calculate the average?  This would miss the case where you can store fewer 
than 100 tuples, but that seems unlikely.
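
To make the trade-off concrete, here is a small Java sketch of the two 
strategies.  It is not the code in InternalSortedBag: the class 
TupleSizeEstimator, its methods, and the byte budget are hypothetical names 
chosen for the example.  With updateEveryAdd set, the average is refreshed on 
each of the first 100 adds; otherwise it is computed once, when the 100th 
tuple arrives.

```java
/** Hypothetical sketch of the two averaging strategies; not Pig's actual code. */
class TupleSizeEstimator {
    static final int SAMPLE_SIZE = 100;    // "the first 100 tuples" from the comment

    private final boolean updateEveryAdd;  // true: recompute on each add; false: once at 100
    private long sampledBytes = 0;
    private int sampled = 0;
    private long avgBytes = 0;             // 0 means "no estimate yet"

    TupleSizeEstimator(boolean updateEveryAdd) {
        this.updateEveryAdd = updateEveryAdd;
    }

    /** Records the size of one added tuple; ignored once the sample is complete. */
    void observe(long tupleBytes) {
        if (sampled >= SAMPLE_SIZE) {
            return;
        }
        sampledBytes += tupleBytes;
        sampled++;
        if (updateEveryAdd || sampled == SAMPLE_SIZE) {
            avgBytes = sampledBytes / sampled;
        }
    }

    /** Tuples that fit in the budget, or Long.MAX_VALUE while no estimate exists. */
    long maxTuples(long budgetBytes) {
        return avgBytes == 0 ? Long.MAX_VALUE : budgetBytes / avgBytes;
    }
}
```

In this sketch the deferred variant has no estimate at all until 100 tuples 
have been seen, so a bag whose memory budget holds fewer than 100 large tuples 
would never learn its limit in time.  The per-add variant pays one extra 
division on each of the first 100 adds to avoid that.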

Some of the comments in InternalSortedBag that were copied over from the 
previous code, such as dealing with spills in the midst of reading, are no 
longer true.  They should be removed since they will cause confusion about how 
the code works.

I think the synchronized blocks in InternalSortedBag can be removed.  They were 
there before because spills could be triggered by a separate thread.  Since 
that is no longer true we should be able to remove these.  This will remove a 
lock/unlock on every read of a record out of the bag and should provide some 
speed up.


