[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-09-14 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12755019#action_12755019
 ] 

Alan Gates commented on PIG-793:


Sri is looking into the array vs. ArrayList changes as well.

 Improving memory efficiency of Tuple implementation
 ---

 Key: PIG-793
 URL: https://issues.apache.org/jira/browse/PIG-793
 Project: Pig
  Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Alan Gates

 Currently, our tuple is a real pig and uses a lot of extra memory. 
 There are several places where we can improve memory efficiency:
 (1) Laying out memory for the fields rather than using Java objects, since 
 each object for a numeric field takes 16 bytes.
 (2) For the cases where we know the schema, using Java arrays rather than 
 ArrayList.
 There might be more.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-09-12 Thread Ashutosh Chauhan (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754491#action_12754491
 ] 

Ashutosh Chauhan commented on PIG-793:
--

In addition to String vs. Text, Alan also mentioned using an array instead of 
ArrayList<Object>. Did anyone take a look at that? I think that change should 
also help. When I benchmarked merge join, nearly 20-30% of CPU time was spent in 
ArrayList operations, which should benefit a lot if an array is used instead. 
So, changing to arrays should help both memory and CPU runtime, at the cost 
of more expensive appends.

Also, some small benefits can be gained by very simple changes introduced in 
https://issues.apache.org/jira/browse/PIG-513




[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-09-10 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12753916#action_12753916
 ] 

Olga Natkovich commented on PIG-793:


Clarification from Alan on the String vs. Text comparison:

The 16/36 24/52 numbers noted in the bug are correct.  Let me explain them.  
Text has a 16 byte overhead in and of itself, plus 16 bytes for the array that 
holds the data, plus 20 bytes for the data.  String has a 24 byte overhead for 
itself, plus 12 bytes for whatever it holds the data in, plus 40 bytes for the 
data.  So overall, I guess it would have been clearer had I said Text has a 32 
byte overhead and String 36, and then Text stores the data in one byte per 
character (assuming ASCII) while String stores it in 2 (ASCII or not).  There 
is some guesswork involved here, since I'm just looking at output from Java 
memory tools.  We could retest this with larger strings and make sure the 
results are consistent.
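
The accounting above can be reproduced by hand. The following is a minimal sketch, assuming Text holds UTF-8 bytes and String holds two-byte UTF-16 code units as described in the comment; the overhead constants are the figures quoted above, not something this code measures, and the class name is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

final class StringVsTextSizes {
    // Overheads quoted in the comment above (read off memory-tool output), not measured here.
    static final int TEXT_OVERHEAD = 16 + 16;   // Text object + the byte[] it holds
    static final int STRING_OVERHEAD = 24 + 12; // String object + whatever holds its data

    // Bytes needed for the characters themselves in each encoding.
    static int utf8Bytes(String s) { return s.getBytes(StandardCharsets.UTF_8).length; }
    static int utf16Bytes(String s) { return s.getBytes(StandardCharsets.UTF_16BE).length; }

    static int textTotal(String s) { return TEXT_OVERHEAD + utf8Bytes(s); }
    static int stringTotal(String s) { return STRING_OVERHEAD + utf16Bytes(s); }

    public static void main(String[] args) {
        String s = "abcdefghijklmnopqrst"; // 20 ASCII characters
        System.out.println("Text total:   " + textTotal(s));
        System.out.println("String total: " + stringTotal(s));
    }
}
```

For the 20 character ASCII example this gives 32 + 20 for Text and 36 + 40 for String, which matches the 16/36 and 24/52 pairs once each pair is summed.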





[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-27 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724880#action_12724880
 ] 

Alan Gates commented on PIG-793:


The cost for storing data raw is:

16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short[] to hold offsets into the byte[]
Then as you say above for the data itself, plus 1 byte per field to store type 
and nullness.

So our example tuple would take ~85 bytes.
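
The ~85 byte figure can be checked with a little arithmetic. This sketch assumes "as you say above" refers to the 36 byte compact encoding proposed by David Ciemiewicz elsewhere in this thread; the class and method names are hypothetical.

```java
final class RawTupleCost {
    // Sums the per-tuple costs listed above for a tuple stored as raw bytes.
    static int rawCost(int fields, int dataBytes) {
        int tupleObject = 16;           // the tuple object itself
        int byteArrayObject = 12;       // header of the byte[] holding the data
        int offsets = 12 + 2 * fields;  // short[] of offsets into the byte[]
        int typeAndNull = fields;       // 1 byte per field for type and nullness
        return tupleObject + byteArrayObject + offsets + dataBytes + typeAndNull;
    }

    public static void main(String[] args) {
        // 3 fields (int, double, 20 character string), 36 bytes of encoded data (assumed).
        System.out.println(rawCost(3, 36));
    }
}
```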

But in general, yes you can do much better with raw bytes.  We played with this 
some and we found that the cost of Tuple.get/set goes up 10x because of the 
need to turn the bytes into objects.  In a typical query this added about 2x to 
the overall run time.  The solution to this would be to rewrite all the Pig 
operators to work on byte data instead of objects.  This is a large project, 
and doesn't solve the UDFs.  We could pay the performance penalty for UDFs, or 
we could change the UDFs to take byte data.  Currently many of our users are 
asking for the ability to write UDFs in Python or other scripting languages.  
If we instead go the other way and basically make them write C style Java I 
don't think that will be popular.

What we're playing with now (changing ArrayList<Object> to Object[] and String 
to Text) will reap somewhere around 50% of the memory savings we would get by 
going to fully raw data.  But it's around 10% of the work.  I'm not 
excluding moving to storing everything in a byte[] in the future.  But I want 
to see if for a little work now we can get a decent amount of improvement.




[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-26 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724594#action_12724594
 ] 

Alan Gates commented on PIG-793:


Using jmap, I've been toying around with our DefaultTuple implementation to see 
how much memory it takes.  For a tuple with 3 elements (one int, one double, and 
one 20 character string), I see it taking:

16 bytes for the Tuple object
24 bytes for the ArrayList<Object> in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data

Pointers in the ArrayList and character data in the String appear to be padded 
and vary somewhat depending on how I run the experiments.
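
Adding up the breakdown gives the 174 byte total that other comments in this thread refer to. A small sketch of that sum follows; the figures are the approximate jmap readings listed above, and the class name is made up for illustration.

```java
final class DefaultTupleFootprint {
    // Approximate per-component sizes from the jmap breakdown above, in listed order:
    // Tuple, ArrayList, ArrayList pointers, Integer, Double, String overhead, String data.
    static final int[] BYTES = {16, 24, 26, 16, 16, 24, 52};

    static int total() {
        int sum = 0;
        for (int b : BYTES) {
            sum += b;
        }
        return sum;
    }

    // Swapping the 24 byte ArrayList for a 12 byte Object[]; ignores any padding savings.
    static int totalWithObjectArray() {
        return total() - 24 + 12;
    }

    public static void main(String[] args) {
        System.out.println("ArrayList-backed: " + total());
        System.out.println("Object[]-backed:  " + totalWithObjectArray());
    }
}
```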

I played with changing the ArrayList<Object> in DefaultTuple to an Object[].  
There are two advantages: the 24 bytes of ArrayList shrinks to 12 for the 
Object[], and as I wrote it to always have the Object[] be exactly the right 
size there is no padding cost.  The downside to this is that append becomes a 
more expensive operation because it grows the Object[] by one every time.  
However, after some investigation I believe that most places we use append can 
be changed to use set, thus alleviating this issue.  I'm working on a patch to 
change this.  Once I have that done I'll report on how that changes memory 
usage as well as any performance gains or losses.
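
To make the append versus set trade-off concrete, here is a minimal sketch of an array-backed tuple. It is not Pig's DefaultTuple and not the patch mentioned above; the class name and methods are hypothetical and only illustrate the idea of an exactly sized Object[].

```java
import java.util.Arrays;

final class ArrayBackedTuple {
    // Always exactly as long as the number of fields, so there are no spare slots.
    private Object[] fields;

    ArrayBackedTuple(int size) {
        fields = new Object[size];
    }

    int size() { return fields.length; }

    Object get(int i) { return fields[i]; }

    // Constant time: overwrites a slot that already exists.
    void set(int i, Object value) { fields[i] = value; }

    // Linear time: allocates a new array one element longer and copies everything.
    void append(Object value) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = value;
    }

    public static void main(String[] args) {
        // When the field count is known up front, set() avoids all copying.
        ArrayBackedTuple t = new ArrayBackedTuple(3);
        t.set(0, 42);
        t.set(1, 3.14);
        t.set(2, "twenty characters...");
        System.out.println(t.size());
    }
}
```

Callers that know the field count in advance can allocate once and use set, which is the change described above; append remains available but copies the whole array on each call.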

A related item I would like to look into is using Hadoop's Text instead of 
String to back chararray.  Text takes 16 bytes of overhead + 36 bytes for 
string data to store 20 characters, versus the 24 / 52 of String.  Obviously 
this would be a huge change and needs to have very impressive results to be 
considered.  I'll play with it and report results here.





[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-06-26 Thread David Ciemiewicz (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12724703#action_12724703
 ] 

David Ciemiewicz commented on PIG-793:
--

Alan,

This sounds good, but it sounds like you are saving only 12 out of 174 bytes, 
or less than 10%.

Amdahl's law says this isn't sufficient in the grand scheme of things, so I 
wouldn't expect a huge payback.

It seems like an optimal encoding of the same tuple would be something like:

1 or 2 bytes for an index to the structure describing the contents of the tuple 
(keep a list of these tuple structures)
4 bytes for the int
8 bytes for the double
1 or 2 bytes for string length in fixed positions
20 bytes for string

Total is 36 bytes or an 80% reduction in memory versus 174 bytes.
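
The layout above can be sketched with java.nio.ByteBuffer. This is only an illustration under stated assumptions: it takes the 2 byte option for both the schema index and the string length, places the length directly before the string, and uses made-up class and method names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

final class CompactTupleEncoding {
    // Layout (assumed): 2-byte schema index, 4-byte int, 8-byte double,
    // 2-byte string length, then the ASCII string bytes.
    static byte[] encode(short schemaIndex, int i, double d, String s) {
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(2 + 4 + 8 + 2 + str.length);
        buf.putShort(schemaIndex);
        buf.putInt(i);
        buf.putDouble(d);
        buf.putShort((short) str.length);
        buf.put(str);
        return buf.array();
    }

    // Fixed-width fields sit at fixed positions, so they can be read without scanning.
    static int decodeInt(byte[] raw) {
        return ByteBuffer.wrap(raw).getInt(2);
    }

    static String decodeString(byte[] raw) {
        ByteBuffer buf = ByteBuffer.wrap(raw);
        int len = buf.getShort(2 + 4 + 8);
        return new String(raw, 2 + 4 + 8 + 2, len, StandardCharsets.US_ASCII);
    }

    // Fraction of memory saved relative to the original footprint.
    static double reduction(int compactBytes, int originalBytes) {
        return 1.0 - (double) compactBytes / originalBytes;
    }

    public static void main(String[] args) {
        byte[] raw = encode((short) 0, 7, 2.5, "abcdefghijklmnopqrst");
        System.out.println(raw.length + " bytes, reduction " + reduction(raw.length, 174));
    }
}
```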

If memory and not CPU is what is slowing down Pig processing, then Hong Tang's 
LazyTuple or something like it is ultimately going to be what is needed.





[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-05-01 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705169#action_12705169
 ] 

Olga Natkovich commented on PIG-793:


This would require compiling code on the fly, right? Up till now we have been 
trying to avoid that.




[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation

2009-05-01 Thread Hong Tang (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12705188#action_12705188
 ] 

Hong Tang commented on PIG-793:
---

Two ideas:

# when loading tuple from serialized data, keep it as a byte array and only 
instantiate datums when get/set calls are made. This would help if we are 
moving tuples from one container to another container.
{code}
class LazyTuple implements Tuple {
  ArrayList<Object> fields; // null if not deserialized
  DataByteArray lazyBytes; // e.g. serialized bytes of tuple in avro format.
}
{code} 
# improving DataByteArray. It may be changed to an interface (needing get(), 
offset(), and length()), with a DataByteArrayFactory to create instances in 
two ways: 
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to 
keep a private copy of the buffer.
## DataByteArrayFactory.createShared(), if the input buffer can be shared with 
the data byte array object. In this case, the contract would be that the caller 
will no longer access the portion of the byte array from offset to offset+length 
(exclusive).

There could be three different implementations of this:
- The current implementation will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in 
short/short).
- An implementation for large buffers (offset/length are int/int, and length is 
large enough)
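
The shape of this proposal might look roughly like the sketch below. It follows the names used above, but it is not Pig's actual DataByteArray (which is a class, not an interface); the argument list for createShared, the three nested implementation classes, and the rule for choosing between them are all assumptions made for illustration.

```java
import java.util.Arrays;

interface DataByteArray {
    byte[] get();   // backing buffer, possibly shared with the caller
    int offset();   // index of the first valid byte in get()
    int length();   // number of valid bytes
}

final class DataByteArrayFactory {
    private DataByteArrayFactory() {}

    // Copies the requested range, so later writes to buf are not visible.
    static DataByteArray createPrivate(byte[] buf, int offset, int length) {
        return new WholeArray(Arrays.copyOfRange(buf, offset, offset + length));
    }

    // Wraps buf without copying; the caller promises not to touch the range afterwards.
    static DataByteArray createShared(byte[] buf, int offset, int length) {
        if (offset <= Short.MAX_VALUE && length <= Short.MAX_VALUE) {
            return new SmallSlice(buf, (short) offset, (short) length);
        }
        return new LargeSlice(buf, offset, length);
    }

    // Current semantics: offset is always 0 and length is the whole buffer.
    private static final class WholeArray implements DataByteArray {
        private final byte[] buf;
        WholeArray(byte[] buf) { this.buf = buf; }
        public byte[] get() { return buf; }
        public int offset() { return 0; }
        public int length() { return buf.length; }
    }

    // Small buffers: offset and length each fit in a short.
    private static final class SmallSlice implements DataByteArray {
        private final byte[] buf;
        private final short offset;
        private final short length;
        SmallSlice(byte[] buf, short offset, short length) {
            this.buf = buf; this.offset = offset; this.length = length;
        }
        public byte[] get() { return buf; }
        public int offset() { return offset; }
        public int length() { return length; }
    }

    // Large buffers: full int offset and length.
    private static final class LargeSlice implements DataByteArray {
        private final byte[] buf;
        private final int offset;
        private final int length;
        LargeSlice(byte[] buf, int offset, int length) {
            this.buf = buf; this.offset = offset; this.length = length;
        }
        public byte[] get() { return buf; }
        public int offset() { return offset; }
        public int length() { return length; }
    }

    public static void main(String[] args) {
        byte[] src = {1, 2, 3, 4, 5};
        DataByteArray shared = createShared(src, 1, 3);
        System.out.println(shared.offset() + ".." + (shared.offset() + shared.length()));
    }
}
```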

Note that the change to DataByteArray would break the current semantics where 
the offset is always 0, and length is always the length of the buffer.
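
The first idea, deferring deserialization until a field is actually touched, can be sketched as follows. This is a toy, not the LazyTuple outlined above: it handles only int fields, uses a made-up serialization format (a 4 byte count followed by 4 byte ints) rather than Avro, and all names are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

final class LazyIntTuple {
    private List<Object> fields; // null until the first get/set
    private byte[] lazyBytes;    // serialized form; dropped once fields are materialized

    LazyIntTuple(byte[] serialized) {
        this.lazyBytes = serialized;
    }

    // Toy format (assumed): field count, then each field as a 4-byte int.
    static byte[] serialize(int... values) {
        ByteBuffer buf = ByteBuffer.allocate(4 + 4 * values.length);
        buf.putInt(values.length);
        for (int v : values) {
            buf.putInt(v);
        }
        return buf.array();
    }

    boolean isDeserialized() { return fields != null; }

    // Moving a tuple between containers only needs this; no objects are created.
    byte[] rawBytes() { return lazyBytes; }

    private void materialize() {
        if (fields != null) {
            return;
        }
        ByteBuffer buf = ByteBuffer.wrap(lazyBytes);
        int n = buf.getInt();
        fields = new ArrayList<>(n);
        for (int i = 0; i < n; i++) {
            fields.add(buf.getInt());
        }
        lazyBytes = null; // the objects are now the source of truth
    }

    Object get(int i) { materialize(); return fields.get(i); }

    void set(int i, Object value) { materialize(); fields.set(i, value); }

    public static void main(String[] args) {
        LazyIntTuple t = new LazyIntTuple(serialize(10, 20, 30));
        System.out.println(t.isDeserialized()); // still raw bytes
        System.out.println(t.get(1));           // forces deserialization
    }
}
```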

