[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-30 Thread Laukik Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669099#action_12669099
 ] 

Laukik Chitnis commented on PIG-560:


In the current patch, when the length is < 65536, the string-to-UTF-8 
conversion happens twice: once with String::getBytes() and once 
with DataOutput::writeUTF().
Instead of writeUTF(), how about using writeShort() followed by 
writeBytes(), since we would already have the length and the UTF-8 bytes?
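
A minimal sketch of the encode-once idea above, not the actual patch. The class 
and method names are hypothetical. Note that DataOutput.writeBytes() takes a 
String, so the sketch uses DataOutput.write(byte[]) for the already-encoded 
bytes. It also reads the value back with readUnsignedShort() and readFully() 
rather than readUTF(), because readUTF() expects Java's modified UTF-8, which 
differs from standard UTF-8 for U+0000 and for supplementary characters.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a 2-byte length prefix followed by standard UTF-8 bytes.
class ShortPrefixedString {
    // Writes bytes that were already encoded, so the string is converted only once.
    static void write(DataOutput out, byte[] utf8) throws IOException {
        if (utf8.length > 0xFFFF) {
            throw new IllegalArgumentException("too long for a 2-byte length: " + utf8.length);
        }
        out.writeShort(utf8.length); // writes the low 16 bits, high byte first
        out.write(utf8);             // write(byte[]), not writeBytes(String)
    }

    static String read(DataInput in) throws IOException {
        int len = in.readUnsignedShort(); // unsigned, so lengths above 32767 survive
        byte[] utf8 = new byte[len];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Convenience wrappers for in-memory use; they rethrow IOException unchecked.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf), s.getBytes(StandardCharsets.UTF_8));
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decode(byte[] data) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Whether data written this way stays readable by existing readUTF() callers 
depends on the strings involved, so a matching reader like the one sketched 
here is the safer assumption.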




> UTFDataFormatException (encoded string too long) is thrown when storing 
> strings > 65536 bytes (in UTF8 form) using BinStorage()
> ---
>
> Key: PIG-560
> URL: https://issues.apache.org/jira/browse/PIG-560
> Project: Pig
>  Issue Type: Bug
>Affects Versions: types_branch
>Reporter: Pradeep Kamath
> Fix For: types_branch
>
> Attachments: PIG-560.patch, utf-limit-patch.diff
>
>
> BinStorage() uses the DataOutput.writeUTF() and DataInput.readUTF() Java APIs 
> to write out Strings as UTF-8 bytes and to read them back. From the Javadoc: 
> "First, the total number of bytes needed to represent all the characters of s 
> is calculated. If this number is larger than 65535, then a 
> UTFDataFormatException is thrown." (This is because the writeUTF() API uses 
> 2 bytes to represent the number of bytes.) A way to get around this would be 
> to not use writeUTF()/readUTF() and instead hand-convert the string to the 
> corresponding UTF-8 byte[] (using String.getBytes("UTF-8")) and then write 
> the length of the byte array as an int. This allows sizes up to 2^31 - 1 
> bytes, since a Java int is signed.
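
The workaround described in the issue can be sketched roughly as follows. This 
is an illustration under assumptions, not the attached patch: the class and 
method names are hypothetical, and StandardCharsets.UTF_8 stands in for 
String.getBytes("UTF-8").

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: a 4-byte length prefix followed by standard UTF-8 bytes.
class IntPrefixedString {
    static void write(DataOutput out, String s) throws IOException {
        byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
        out.writeInt(utf8.length); // 4-byte length instead of writeUTF()'s 2 bytes
        out.write(utf8);
    }

    static String read(DataInput in) throws IOException {
        int len = in.readInt();
        if (len < 0) {
            throw new IOException("negative string length: " + len);
        }
        byte[] utf8 = new byte[len];
        in.readFully(utf8);
        return new String(utf8, StandardCharsets.UTF_8);
    }

    // Writes and reads back through in-memory streams.
    static String roundTrip(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf), s);
            return read(new DataInputStream(new ByteArrayInputStream(buf.toByteArray())));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

A 70000-character ASCII string, which writeUTF() would reject, round-trips 
through roundTrip() unchanged.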

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.



[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669098#action_12669098
 ] 

Olga Natkovich commented on PIG-560:


+1




[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-30 Thread Laukik Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669097#action_12669097
 ] 

Laukik Chitnis commented on PIG-560:


In the current patch, when the length is < 65536, the string-to-UTF-8 
conversion happens twice: once with String::getBytes() and once with 
DataOutput::writeUTF().

To avoid that, instead of writeUTF(), how about using writeShort() followed by 
writeBytes(), since we would already have the length and the UTF-8 bytes? 





[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-30 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12669060#action_12669060
 ] 

Olga Natkovich commented on PIG-560:


I think the unit test was added for BinaryStorage not BinStorage.




[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-29 Thread Olga Natkovich (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668710#action_12668710
 ] 

Olga Natkovich commented on PIG-560:


Since we put these changes into production, only one other person has 
complained, so we are pretty certain this is a very rare case. I agree with 
Alan that we should only pay the penalty on long strings.




[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-29 Thread Laukik Chitnis (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668688#action_12668688
 ] 

Laukik Chitnis commented on PIG-560:


The writeUTF() method was adding 2 bytes per string; we would actually be 
adding an int (32 bits) with this solution.

The new long string would then need to be a new DataType, right? To make it 
transparent to the user, this DataType can be used only internally. Also, to 
keep things efficient, maybe we can write the string as this datatype only on 
getting the encoded-string-too-long UTFDataFormatException.

By the way, though it seems quite likely that the average string length is far 
less than 64K, do we have any statistics on the average length of 
(UTF-converted) CHARARRAYs? That would also help us determine how big an 
overhead the additional 16 bits actually is. 
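
As a rough way to frame the overhead question: if a string takes n UTF-8 bytes, 
a 2-byte prefix stores it in 2 + n bytes and a 4-byte prefix in 4 + n bytes, so 
the relative growth is 2 / (2 + n). The sketch below only does that arithmetic. 
The class name is hypothetical, and it ignores any per-field type tag or other 
framing that BinStorage may add.

```java
// Hypothetical helper for estimating the cost of two extra length bytes.
class PrefixOverhead {
    // Relative size increase when a 2-byte length prefix becomes a 4-byte one,
    // for strings averaging avgUtf8Bytes bytes of payload.
    static double extraFraction(double avgUtf8Bytes) {
        return 2.0 / (2.0 + avgUtf8Bytes);
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000}) {
            System.out.printf("avg %d bytes: +%.2f%%%n", n, 100.0 * extraFraction(n));
        }
    }
}
```

The shorter the typical CHARARRAY, the more the two extra bytes matter, which 
is why the average-length statistic asked for above would settle the question.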




[jira] Commented: (PIG-560) UTFDataFormatException (encoded string too long) is thrown when storing strings > 65536 bytes (in UTF8 form) using BinStorage()

2009-01-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12668682#action_12668682
 ] 

Alan Gates commented on PIG-560:


I'm concerned that we're adding 2 bytes to every string we store for a case 
that should be quite rare (how often do people have strings longer than 64K?). 
Would it be better to have BinStorage define a long string type that uses 
4 bytes to encode its length? We would then test a string's length before 
writing it out, leave things as they are now for most strings, and use the 
new long string type for anything over 64K.
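
One way the two-type idea might look, as a sketch only. The tag values, class 
name, and method names are hypothetical and are not Pig's actual type codes. 
The length test counts bytes using the per-character rule from the 
DataOutput.writeUTF() Javadoc (modified UTF-8), so the string is not encoded 
just to be measured and the check matches what writeUTF() itself would compute.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class TaggedStringCodec {
    static final byte SHORT_STRING = 1; // hypothetical tag: writeUTF() format
    static final byte LONG_STRING = 2;  // hypothetical tag: int length + UTF-8 bytes

    // Byte count writeUTF() would need, per the DataOutput.writeUTF() Javadoc.
    static long modifiedUtf8Length(String s) {
        long n = 0;
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c >= 0x0001 && c <= 0x007F) n += 1;
            else if (c <= 0x07FF) n += 2; // includes U+0000
            else n += 3;
        }
        return n;
    }

    static void write(DataOutput out, String s) throws IOException {
        if (modifiedUtf8Length(s) <= 65535) {
            out.writeByte(SHORT_STRING);
            out.writeUTF(s); // unchanged path for most strings
        } else {
            byte[] utf8 = s.getBytes(StandardCharsets.UTF_8);
            out.writeByte(LONG_STRING);
            out.writeInt(utf8.length);
            out.write(utf8);
        }
    }

    static String read(DataInput in) throws IOException {
        byte tag = in.readByte();
        if (tag == SHORT_STRING) return in.readUTF();
        if (tag == LONG_STRING) {
            byte[] utf8 = new byte[in.readInt()];
            in.readFully(utf8);
            return new String(utf8, StandardCharsets.UTF_8);
        }
        throw new IOException("unknown string tag: " + tag);
    }

    // In-memory convenience wrappers; they rethrow IOException unchecked.
    static byte[] encode(String s) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            write(new DataOutputStream(buf), s);
            return buf.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static String decode(byte[] data) {
        try {
            return read(new DataInputStream(new ByteArrayInputStream(data)));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With this layout only strings over the writeUTF() limit pay for the 4-byte 
length; everything else keeps the existing 2-byte form.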
