[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12755019#action_12755019 ]

Alan Gates commented on PIG-793:

Sri is looking into the array vs. ArrayList changes as well.

Improving memory efficiency of Tuple implementation
---
Key: PIG-793
URL: https://issues.apache.org/jira/browse/PIG-793
Project: Pig
Issue Type: Improvement
Reporter: Olga Natkovich
Assignee: Alan Gates

Currently, our tuple is a real pig and uses a lot of extra memory. There are several places where we can improve memory efficiency:
(1) Laying out memory for the fields rather than using Java objects, since each object for a numeric field takes 16 bytes.
(2) For the cases where we know the schema, using Java arrays rather than ArrayList.
There might be more.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12754491#action_12754491 ]

Ashutosh Chauhan commented on PIG-793:

In addition to String vs. Text, Alan also mentioned using an array instead of ArrayList<Object>. Did anyone take a look at that? I think that change should also help. When I benchmarked merge join, nearly 20-30% of CPU time was spent in ArrayList operations, which should benefit a lot if an array is used instead. So changing to arrays should help both memory and CPU run time, at the cost of expensive appends. Also, some small benefits can be gained from the very simple changes introduced in https://issues.apache.org/jira/browse/PIG-513
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12753916#action_12753916 ]

Olga Natkovich commented on PIG-793:

Clarification from Alan on the String vs. Text comparison:

The 16/36 and 24/52 numbers noted in the bug are correct. Let me explain them. Text has a 16 byte overhead in and of itself, plus 16 bytes for the array that holds the data, plus 20 bytes for the data. String has a 24 byte overhead for itself, plus 12 bytes for whatever it holds the data in, plus 40 bytes for the data. So overall, I guess it would have been clearer had I said Text has a 32 byte overhead and String 36, and then Text stores the data in one byte per character (assuming ASCII) while String stores it in 2 (ASCII or not). There is some guesswork involved here, since I'm just looking at output from Java memory tools. We could retest this with larger strings and make sure the results are consistent.
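The String vs. Text arithmetic in this comment can be written out as a small estimate. This is only a sketch: the overhead constants are the figures quoted in the comment (read from JVM memory tools, so not guaranteed on other JVMs), and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper that reproduces the per-value estimates quoted in the comment. */
final class StringVsTextEstimate {
    // Overheads in bytes as reported in the comment; assumptions, not measured here.
    static final int TEXT_OVERHEAD = 16 + 16;   // Text object + the byte[] holding the data
    static final int STRING_OVERHEAD = 24 + 12; // String object + whatever holds the chars

    /** Text keeps UTF-8 bytes, so ASCII data costs one byte per character. */
    static int textBytes(String s) {
        return TEXT_OVERHEAD + s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** String is modelled at two bytes per character, as stated in the comment. */
    static int stringBytes(String s) {
        return STRING_OVERHEAD + 2 * s.length();
    }

    public static void main(String[] args) {
        String twentyChars = "abcdefghijklmnopqrst";
        System.out.println("Text:   " + textBytes(twentyChars) + " bytes");
        System.out.println("String: " + stringBytes(twentyChars) + " bytes");
    }
}
```

For a 20 character ASCII value this gives 32 + 20 bytes for Text and 36 + 40 bytes for String, matching the 16/36 and 24/52 splits discussed in the thread.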
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724880#action_12724880 ]

Alan Gates commented on PIG-793:

The cost for storing data raw is:
16 bytes for the tuple object
12 bytes for the byte array object
12 bytes + 2 bytes/field for a short[] to hold offsets into the byte[]
Then, as you say above, the data itself, plus 1 byte per field to store type and nullness. So our example tuple would take ~85 bytes.

But in general, yes, you can do much better with raw bytes. We played with this some and found that the cost of Tuple.get/set goes up 10x because of the need to turn the bytes into objects. In a typical query this added about 2x to the overall run time.

The solution to this would be to rewrite all the Pig operators to work on byte data instead of objects. This is a large project, and it doesn't solve the problem for UDFs. We could pay the performance penalty for UDFs, or we could change the UDFs to take byte data. Currently many of our users are asking for the ability to write UDFs in Python or other scripting languages. If we instead go the other way and basically make them write C-style Java, I don't think that will be popular.

What we're playing with now (changing ArrayList<Object> to Object[] and String to Text) will reap somewhere around 50% of the memory savings of going to fully raw data, but it's around 10% of the work. I'm not excluding moving to storing everything in a byte[] in the future. But I want to see if, for a little work now, we can get a decent amount of improvement.
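To make the trade-off in this comment concrete, here is a minimal sketch of a raw layout: one byte[] for the data, a short[] of offsets, and one type byte per field. The class name RawTuple, the type tags, and the exact encoding are assumptions made for illustration, not Pig's implementation; nulls are not handled. Note how every get() has to decode bytes into a fresh object, which is where the extra get/set cost described above comes from.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical raw-layout tuple: type tag + value per field, packed into one byte[]. */
final class RawTuple {
    static final byte INT = 1, DOUBLE = 2, CHARARRAY = 3; // illustrative type tags

    private final byte[] data;     // all fields back to back
    private final short[] offsets; // start of each field within data

    RawTuple(Object... fields) {
        int size = 0;
        for (Object f : fields) {
            size += 1 + valueSize(f); // 1 byte for the type tag
        }
        ByteBuffer buf = ByteBuffer.allocate(size);
        offsets = new short[fields.length];
        for (int i = 0; i < fields.length; i++) {
            offsets[i] = (short) buf.position();
            Object f = fields[i];
            if (f instanceof Integer) {
                buf.put(INT).putInt((Integer) f);
            } else if (f instanceof Double) {
                buf.put(DOUBLE).putDouble((Double) f);
            } else {
                buf.put(CHARARRAY).put(((String) f).getBytes(StandardCharsets.US_ASCII));
            }
        }
        data = buf.array();
    }

    private static int valueSize(Object f) {
        if (f instanceof Integer) return 4;
        if (f instanceof Double) return 8;
        if (f instanceof String) return ((String) f).length(); // one byte per char, assuming ASCII
        throw new IllegalArgumentException("unsupported field type: " + f);
    }

    /** Decodes one field; a new object is created on every call. */
    Object get(int field) {
        int start = offsets[field];
        int end = field + 1 < offsets.length ? offsets[field + 1] : data.length;
        ByteBuffer buf = ByteBuffer.wrap(data, start, end - start);
        byte type = buf.get();
        switch (type) {
            case INT:
                return buf.getInt();
            case DOUBLE:
                return buf.getDouble();
            case CHARARRAY:
                return new String(data, start + 1, end - start - 1, StandardCharsets.US_ASCII);
            default:
                throw new IllegalStateException("unknown type tag " + type);
        }
    }
}
```

With this layout, new RawTuple(42, 3.5, "abcdefghijklmnopqrst").get(2) rebuilds the String from bytes each time it is called, whereas an object-backed tuple would just return a reference.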
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724594#action_12724594 ]

Alan Gates commented on PIG-793:

Using jmap, I've been toying around with our DefaultTuple implementation to see how much memory it takes. For a tuple with 3 elements (one int, one double, one 20 character string) I see it taking:
16 bytes for the Tuple object
24 bytes for the ArrayList<Object> in the tuple
~26 bytes for pointers in the ArrayList
16 bytes for the Integer
16 bytes for the Double
24 bytes for the String overhead
~52 bytes for the String data
Pointers in the ArrayList and character data in the String appear to be padded and vary somewhat depending on how I run the experiments.

I played with changing the ArrayList<Object> in DefaultTuple to an Object[]. There are two advantages: the 24 bytes of ArrayList shrinks to 12 for the Object[], and, as I wrote it to always have the Object[] be exactly the right size, there is no padding cost. The downside is that append becomes a more expensive operation, because it grows the Object[] by one every time. However, after some investigation I believe that most places we use append can be changed to use set, thus alleviating this issue. I'm working on a patch to change this. Once I have that done I'll report on how that changes memory usage as well as any performance gains or losses.

A related item I would like to look into is using Hadoop's Text instead of String to back chararray. Text takes 16 bytes of overhead + 36 bytes for string data to store 20 characters, versus the 24 / 52 of String. Obviously this would be a huge change and needs to have very impressive results to be considered. I'll play with it and report results here.
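The Object[] idea from this comment can be sketched as follows. This is an illustration under assumptions, not the actual DefaultTuple patch: the class name ArrayTuple is hypothetical, and it only shows why set is cheap and append is expensive when the array is always kept at exactly the right size.

```java
import java.util.Arrays;

/** Hypothetical tuple backed by an exactly-sized Object[] instead of an ArrayList. */
final class ArrayTuple {
    private Object[] fields;

    /** Allocates the array at its final size when the number of fields is known up front. */
    ArrayTuple(int size) {
        fields = new Object[size];
    }

    Object get(int i) {
        return fields[i];
    }

    /** Cheap: writes into the preallocated slot, no copying. */
    void set(int i, Object value) {
        fields[i] = value;
    }

    /** Expensive: reallocates and copies the whole array to grow it by one. */
    void append(Object value) {
        fields = Arrays.copyOf(fields, fields.length + 1);
        fields[fields.length - 1] = value;
    }

    int size() {
        return fields.length;
    }
}
```

A caller that knows the schema would create new ArrayTuple(3) and call set(0, ...), set(1, ...), set(2, ...), so the array is allocated once. Building the same tuple with three append calls would allocate and copy three times.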
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12724703#action_12724703 ]

David Ciemiewicz commented on PIG-793:

Alan,

This sounds good, but it sounds like you are saving only 12 out of 174 bytes, or less than 10%. Amdahl's law says this isn't sufficient in the grand scheme of things, so I wouldn't expect a huge payback.

It seems like an optimal encoding of the same tuple would be something like:
1 or 2 bytes for an index to the structure describing the contents of the tuple (keep a list of these tuple structures)
4 bytes for the int
8 bytes for the double
1 or 2 bytes for string length in fixed positions
20 bytes for string
Total is 36 bytes, or an 80% reduction in memory versus 174 bytes.

If memory and not CPU is what is slowing down Pig processing, then Hong Tang's LazyTuple or something like it is ultimately going to be what is needed.
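The 36 byte figure in this comment can be checked by actually packing the example tuple. The sketch below assumes 2 bytes for both the schema index and the string length (the upper end of "1 or 2 bytes"), ASCII string data, and hypothetical class and method names.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

/** Hypothetical compact encoding of an (int, double, chararray) tuple. */
final class CompactEncoding {
    // Sum of the per-component sizes reported earlier in the thread for the object-based tuple.
    static final int CURRENT_BYTES = 16 + 24 + 26 + 16 + 16 + 24 + 52;

    /** Layout: 2-byte schema index, 4-byte int, 8-byte double, 2-byte length, string bytes. */
    static byte[] encode(short schemaIndex, int i, double d, String s) {
        byte[] str = s.getBytes(StandardCharsets.US_ASCII);
        return ByteBuffer.allocate(2 + 4 + 8 + 2 + str.length)
                .putShort(schemaIndex)
                .putInt(i)
                .putDouble(d)
                .putShort((short) str.length)
                .put(str)
                .array();
    }

    /** Fraction of memory saved when going from 'before' bytes to 'after' bytes. */
    static double reduction(int before, int after) {
        return 1.0 - (double) after / before;
    }

    public static void main(String[] args) {
        byte[] packed = encode((short) 0, 7, 2.5, "abcdefghijklmnopqrst");
        System.out.println(packed.length + " bytes packed vs " + CURRENT_BYTES + " bytes now");
        System.out.println("reduction: " + reduction(CURRENT_BYTES, packed.length));
    }
}
```

The packed form is 2 + 4 + 8 + 2 + 20 = 36 bytes, and 1 - 36/174 is roughly 0.79, which is the "80% reduction" quoted above. By the same arithmetic, saving 12 of 174 bytes is about 7%.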
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705169#action_12705169 ]

Olga Natkovich commented on PIG-793:

This would require compiling code on the fly, right? Up till now we have been trying to avoid that.
[jira] Commented: (PIG-793) Improving memory efficiency of Tuple implementation
[ https://issues.apache.org/jira/browse/PIG-793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12705188#action_12705188 ]

Hong Tang commented on PIG-793:

Two ideas:
# When loading a tuple from serialized data, keep it as a byte array and only instantiate datums when get/set calls are made. This would help if we are moving tuples from one container to another.
{code}
class LazyTuple implements Tuple {
  ArrayList<Object> fields;  // null if not deserialized
  DataByteArray lazyBytes;   // e.g. serialized bytes of the tuple in Avro format
  // Tuple methods omitted in this sketch
}
{code}
# Improving DataByteArray. It may be changed to an interface (it needs get(), offset(), and length()), with a DataByteArrayFactory used to create instances in two ways:
## DataByteArrayFactory.createPrivate(byte[], offset, length), if we need to keep a private copy of the buffer.
## DataByteArrayFactory.createShared(), if the input buffer can be shared with the data byte array object. In this case, the contract would be that the caller will no longer access the portion of the byte array from offset to offset+length (exclusive).

There could be three different implementations of this:
- The current implementation, which will be used for createPrivate().
- An implementation for small buffers (offset/length can be represented in short/short).
- An implementation for large buffers (offset/length are int/int, and length is large enough).

Note that the change to DataByteArray would break the current semantics, where the offset is always 0 and the length is always the length of the buffer.
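The second idea in this comment (an interface with get(), offset(), and length(), plus private and shared factory methods) might look roughly like the sketch below. All names here (ByteRange, ByteRanges, Range) are hypothetical stand-ins rather than Pig classes, the createShared parameters are assumed to mirror createPrivate, and only one int/int implementation is shown; the short/short variant for small buffers is omitted.

```java
import java.util.Arrays;

/** Hypothetical stand-in for the proposed DataByteArray interface. */
interface ByteRange {
    byte[] get();  // backing buffer
    int offset();  // first valid byte within the buffer
    int length();  // number of valid bytes
}

/** Hypothetical factory with the two creation modes described in the comment. */
final class ByteRanges {
    private ByteRanges() {}

    /** Copies the range, so offset is 0 and length equals the buffer length (today's semantics). */
    static ByteRange createPrivate(byte[] buf, int offset, int length) {
        byte[] copy = Arrays.copyOfRange(buf, offset, offset + length);
        return new Range(copy, 0, length);
    }

    /**
     * Wraps the caller's buffer without copying. By contract the caller must no longer
     * access bytes in [offset, offset + length) after this call.
     */
    static ByteRange createShared(byte[] buf, int offset, int length) {
        return new Range(buf, offset, length);
    }

    /** Single int/int implementation used for both modes in this sketch. */
    private static final class Range implements ByteRange {
        private final byte[] buf;
        private final int offset;
        private final int length;

        Range(byte[] buf, int offset, int length) {
            this.buf = buf;
            this.offset = offset;
            this.length = length;
        }

        @Override public byte[] get() { return buf; }
        @Override public int offset() { return offset; }
        @Override public int length() { return length; }
    }
}
```

The shared form avoids one allocation and one copy per value, at the price of the aliasing contract: if the caller later writes into the wrapped region, the change is visible through the ByteRange, while a private copy is unaffected.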