Re: [protobuf] EnumValueDescriptor doesn't provide toString()?

2010-05-17 Thread Kenton Varda
Right, because none of the other field value types are descriptors.  I see
your point -- since getField() returns an Object, it would certainly be nice
to be able to call toString() on it without knowing the type.  But, it's
also important that EnumValueDescriptor be consistent with other descriptor
classes, so we want to be careful not to mess up that consistency.

Instead of calling toString(), you could call
TextFormat.printFieldToString() to get a string representation of the field,
although it will include the field name.

On Mon, May 10, 2010 at 9:17 PM, Christopher Smith cbsm...@gmail.com wrote:

 Actually, toString() seems to work for me for every other value I get from
 a dynamic message *except* enums.

 --Chris

 On May 10, 2010, at 8:32 PM, Kenton Varda ken...@google.com wrote:

 I don't think we should add toString() to any of the descriptor classes
 unless we are going to implement it for *all* of them in some consistent
 way.  If we fill them in ad-hoc then they may be inconsistent, and we may
 not be able to change them to make them consistent without breaking users.

 On Mon, May 10, 2010 at 9:49 AM, Christopher Smith cbsm...@gmail.com wrote:

 I noticed EnumValueDescriptor uses the default toString() method. Why not
 override it to call getFullName()?

 --Chris

 --
 You received this message because you are subscribed to the Google Groups
 Protocol Buffers group.
 To post to this group, send email to protobuf@googlegroups.com.
 To unsubscribe from this group, send email to
 protobuf+unsubscr...@googlegroups.com.
 For more options, visit this group at
 http://groups.google.com/group/protobuf?hl=en.







Re: [protobuf] Re: Java UTF-8 encoding/decoding: possible performance improvements

2010-05-17 Thread Kenton Varda
I see.  So in fact your code is quite possibly slower in non-ASCII cases?
 In fact, it sounds like having even one non-ASCII character would force
extra copies to occur, which I would guess would defeat the benefit, but
we'd need benchmarks to tell for sure.

On Fri, May 7, 2010 at 6:21 PM, Evan Jones ev...@mit.edu wrote:

 On May 7, 2010, at 18:54 , Kenton Varda wrote:

 I'd be very interested to hear why the JDK is not optimal here.


 I dug into this. I *think* the problem is that the JDK ends up allocating a
 huge temporary array for the UTF-8 data. Hence, the garbage collection cost
 is higher for the JDK's implementation than for mine.
 Basically the code does this:


 * allocate a new byte[] array that is string length * max bytes per
 character (= 4 for the UTF-8 encoder)
 * use the java.nio.charset.CharsetEncoder to encode the char[] into the
 byte[] (wrapped in CharBuffer / ByteBuffer).
 * copy the exact number of bytes out of the byte[] into a new byte[], and
 return that.

 The only trick the JDK gets to use that normal Java code can't is that
 they can access the string's char[] buffer directly, whereas I need to copy
 it out into a char[] array.


 Hence, I think what is happening is that the JDK allocates 4-5 times as
 much memory per encode as I do. In the cases where the data is ASCII, my
 code is faster, since it allocates exactly the right amount of space and
 doesn't need an extra copy. When the data is not ASCII, my code may still be
 faster, since it doesn't overallocate quite as much (in exchange, my code
 does many copies).
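
The over-allocate-then-copy pattern described above can be sketched as follows (a minimal Python model of the steps, not the actual JDK code; `str.encode` stands in for the CharsetEncoder):

```python
def jdk_style_encode(s: str) -> bytes:
    # Step 1: allocate a worst-case buffer, 4 bytes per character,
    # mirroring the JDK's "length * max bytes per char" allocation.
    worst_case = bytearray(4 * len(s))
    # Step 2: encode into the oversized buffer (stand-in for the
    # java.nio.charset.CharsetEncoder pass over a wrapped char[]).
    encoded = s.encode("utf-8")
    worst_case[: len(encoded)] = encoded
    # Step 3: copy the exact number of bytes into a right-sized buffer.
    return bytes(worst_case[: len(encoded)])
```

The point of the sketch is the double allocation: a worst-case buffer plus a right-sized copy, both garbage after the call.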


 Conclusion: there is a legitimate reason for this code to be faster than
 the JDK's code. But it still may not be worth including this patch in the
 main line protocol buffer code.


 Evan

 --
 Evan Jones
 http://evanjones.ca/






Re: [protobuf] EnumValueDescriptor doesn't provide toString()?

2010-05-17 Thread Christopher Smith
I grok the problem now. This is the only descriptor that is also a value.
There should probably be a method/visitor specifically for getting the
value string of an object, one that isn't implemented by descriptors.

--Chris

On May 17, 2010, at 12:35 PM, Kenton Varda ken...@google.com wrote:

 Right, because none of the other field value types are descriptors.  I see 
 your point -- since getField() returns an Object, it would certainly be nice 
 to be able to call toString() on it without knowing the type.  But, it's also 
 important that EnumValueDescriptor be consistent with other descriptor 
 classes, so we want to be careful not to mess up that consistency.
 
 Instead of calling toString(), you could call TextFormat.printFieldToString() 
 to get a string representation of the field, although it will include the 
 field name.
 
 On Mon, May 10, 2010 at 9:17 PM, Christopher Smith cbsm...@gmail.com wrote:
 Actually, toString() seems to work for me for every other value I get from a 
 dynamic message *except* enums.
 
 --Chris
 
 On May 10, 2010, at 8:32 PM, Kenton Varda ken...@google.com wrote:
 
 I don't think we should add toString() to any of the descriptor classes 
 unless we are going to implement it for *all* of them in some consistent 
 way.  If we fill them in ad-hoc then they may be inconsistent, and we may 
 not be able to change them to make them consistent without breaking users.
 
 On Mon, May 10, 2010 at 9:49 AM, Christopher Smith cbsm...@gmail.com wrote:
 I noticed EnumValueDescriptor uses the default toString() method. Why not 
 override it to call getFullName()?
 
 --Chris
 
 
 
 




[protobuf] Re: Issue 188 in protobuf: protobuf fails to link after compiling with LDFLAGS=-Wl,--as-needed because of missing -lpthread

2010-05-17 Thread protobuf

Updates:
Status: NeedPatchFromUser

Comment #2 on issue 188 by ken...@google.com: protobuf fails to link after  
compiling with LDFLAGS=-Wl,--as-needed because of missing -lpthread

http://code.google.com/p/protobuf/issues/detail?id=188

We can't just switch the order, because if -pthread exists as a compiler
flag, then it is essential that we use it.  Just -lpthread is not good
enough, because -pthread tells GCC to output thread-safe code.  It sounds
like acx_pthread.m4 may need to be refactored somewhat to get this right.
Please feel free to submit a patch.




[protobuf] Re: Issue 187 in protobuf: Command-line argument to override the optimize_for option

2010-05-17 Thread protobuf


Comment #4 on issue 187 by ken...@google.com: Command-line argument to  
override the optimize_for option

http://code.google.com/p/protobuf/issues/detail?id=187

I agree, we should be able to override options on the command-line.  The
only problem is that it's unclear how far this support needs to go.
Should you be able to override message-level and field-level options, or
just file-level?  Should an override apply to an individual file, or
should it apply to all the files it imports, too?  In the case of
optimize_for, we'd probably want the override to apply to imports, but
for something like java_package we probably don't.




[protobuf] Re: Issue 188 in protobuf: protobuf fails to link after compiling with LDFLAGS=-Wl,--as-needed because of missing -lpthread

2010-05-17 Thread protobuf


Comment #3 on issue 188 by xarthisius.kk: protobuf fails to link after  
compiling with LDFLAGS=-Wl,--as-needed because of missing -lpthread

http://code.google.com/p/protobuf/issues/detail?id=188

On an x86 Linux machine, -pthread does two things:
1. defines _REENTRANT for the preprocessor during compiling
2. adds -lpthread when passed during linking
It does no other magic; quoting man gcc: ... This option does not
affect the thread safety of object code produced by the compiler or that
of libraries supplied with it.

Best regards,
Kacper Kowalik




[protobuf] Re: Issue 59 in protobuf: Add another option to support java_implement_interface

2010-05-17 Thread protobuf


Comment #13 on issue 59 by aantono: Add another option to support  
java_implement_interface

http://code.google.com/p/protobuf/issues/detail?id=59

Just an FYI: as it's mentioned in issue 82, there is already a set
of formatters for JSON, XML, etc., as part of the
http://code.google.com/p/protobuf-java-format/ project.
I've been toying around with the idea of making a common interface that
they would all implement, so maybe then we could enhance the code
generation part to accept any formatter/codec class that would be coded
to a well-known interface.




[protobuf] systesting custom utf8 validation on remote c++ node using protocol buffers from python

2010-05-17 Thread JT Olds
Hello,

(I submitted this already via the protobuf google group web form, but
I think I screwed up. If not, sorry for the double post)

 I have a C++-based server using protocol buffers as the IDL, and I'm
trying to ensure that it rejects invalid UTF-8 strings. My systest
library is written in Python. The C++ protocol buffer library does not
seem to do any UTF-8 string checking on string types, whereas the
Python library does. So I added some UTF-8 validation testing to the
C++ server-side and I want to check that it works (in case a C++
client sends invalid UTF-8). Whenever I inject invalid UTF-8 into the
Python systests to make sure the server rejects the string, the Python
library complains.

Is there a way to override this behavior?

I don't want to change my protocol buffer definitions to be the bytes
type, because these really should be strings, and the Python library
is doing exactly what I want for the general case.

-JT




[protobuf] Protocol buffers and large data sets

2010-05-17 Thread sanikumbh
I wanted to get some opinions on large data sets and protocol buffers.
The Protocol Buffers project page by Google says that for data larger
than 1 megabyte, one should consider something different, but they don’t
mention what would happen if one crosses this limit. Are there any
known failure modes when it comes to large data sets?
What are your observations and recommendations from your experience on
this front?




[protobuf] systesting utf8 validation on remote node using protocol buffers from python

2010-05-17 Thread JT
Hello,
 I have a C++-based server using protocol buffers as the IDL, and I'm
trying to ensure that it rejects invalid UTF-8 strings. My systest
library is written in Python. The C++ protocol buffer library does not
seem to do any UTF-8 string checking on string types, whereas the
Python library does. Whenever I inject invalid UTF-8 into the Python
systests to make sure the server rejects the string, the Python
library complains.

Is there a way to override this behavior?

I don't want to change my protocol buffer definitions to be the bytes
type, because these really should be strings, and the Python library
is doing exactly what I want for the general case.

-JT




Re: [protobuf] systesting custom utf8 validation on remote c++ node using protocol buffers from python

2010-05-17 Thread Jason Hsueh
If you compile with the macro GOOGLE_PROTOBUF_UTF8_VALIDATION_ENABLED
defined, the C++ code will do UTF8 validation. However, it doesn't prevent
the data from serializing or parsing, it will simply log an error message.
How would you like it to fail?

On Mon, May 17, 2010 at 3:15 PM, JT Olds jto...@xnet5.com wrote:

 Hello,

 (I submitted this already via the protobuf google group web form, but
 I think I screwed up. If not, sorry for the double post)

  I have a C++-based server using protocol buffers as the IDL, and I'm
 trying to ensure that it rejects invalid UTF-8 strings. My systest
 library is written in Python. The C++ protocol buffer library does not
 seem to do any UTF-8 string checking on string types, whereas the
 Python library does. So I added some UTF-8 validation testing to the
 C++ server-side and I want to check that it works (in case a C++
 client sends invalid UTF-8). Whenever I inject invalid UTF-8 into the
 Python systests to make sure the server rejects the string, the Python
 library complains.

 Is there a way to override this behavior?

 I don't want to change my protocol buffer definitions to be the bytes
 type, because these really should be strings, and the Python library
 is doing exactly what I want for the general case.

 -JT







Re: [protobuf] Protocol buffers and large data sets

2010-05-17 Thread Jason Hsueh
There is a default byte size limit of 64MB when parsing protocol buffers -
if a message is larger than that, it will fail to parse. This can be
configured if you really need to parse larger messages, but it is generally
not recommended. Additionally, ByteSize() returns a 32-bit integer, so
there's an implicit limit on the size of data that can be serialized.

You can certainly use protocol buffers in large data sets, but it's not
recommended to have your entire data set be represented by a single message.
Instead, see if you can break it up into smaller messages.
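
One common way to do that is to stream many small messages with a length prefix rather than one huge message. Here is a minimal Python sketch of the framing (protobuf's Java API offers writeDelimitedTo()/parseDelimitedFrom() with a varint prefix; this sketch uses a fixed 4-byte length for simplicity, and raw bytes stand in for serialized messages):

```python
import struct

def write_delimited(stream, payload: bytes) -> None:
    # Prefix each record with its length so a large data set can be
    # stored as many independently parseable messages.
    stream.write(struct.pack("<I", len(payload)))
    stream.write(payload)

def read_delimited(stream):
    # Yield records one at a time; each stays well under any per-message
    # parsing limit.
    header = stream.read(4)
    while len(header) == 4:
        (length,) = struct.unpack("<I", header)
        yield stream.read(length)
        header = stream.read(4)
```

Each record can then be parsed as its own message, so no single parse ever sees the whole data set.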

On Mon, May 17, 2010 at 1:05 PM, sanikumbh saniku...@gmail.com wrote:

 I wanted to get some opinion on large data sets and protocol buffers.
 The Protocol Buffers project page by Google says that for data larger
 than 1 megabyte, one should consider something different, but they don’t
 mention what would happen if one crosses this limit. Are there any
 known failure modes when it comes to the large data sets?
 What are your observations, recommendations from your experience on
 this front?







Re: [protobuf] Re: Java UTF-8 encoding/decoding: possible performance improvements

2010-05-17 Thread Evan Jones

On May 17, 2010, at 15:38 , Kenton Varda wrote:
I see.  So in fact your code is quite possibly slower in non-ASCII  
cases?  In fact, it sounds like having even one non-ASCII character  
would force extra copies to occur, which I would guess would defeat  
the benefit, but we'd need benchmarks to tell for sure.


Yes. I've been playing with this a bit in my spare time since the last  
email, but I don't have any results I'm happy with yet. Rough notes:


* Encoding is (quite a bit?) faster than String.getBytes() if you  
assume one byte per character.
* If you guess the number of bytes per character poorly and have to do  
multiple allocations and copies, the regular Java version will win. If  
you get it right (even if you first guess 1 byte per character), it  
looks like it can be slightly faster than or on par with the Java version.
* Re-using a temporary byte[] for string encoding may be faster than  
String.getBytes(), which effectively allocates a temporary byte[] each  
time.



I'm going to try to rework my code with a slightly different policy:

a) Assume 1 byte per character and attempt the encode. If we run out  
of space:
b) Use a shared temporary buffer and continue the encode. If we run  
out of space:
c) Allocate a worst case 4 byte per character buffer and finish the  
encode.
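
A minimal Python model of this three-stage policy (illustrative only; the real Java version would continue a partial encode into the next buffer rather than restarting, and would reuse the shared buffer across calls):

```python
def staged_encode(s: str, shared_size: int = 64) -> bytes:
    """Model of the three-stage buffer policy:
    (a) optimistic 1-byte-per-character buffer,
    (b) a medium (notionally shared/reused) temporary buffer,
    (c) a worst-case 4-bytes-per-character buffer, which always fits."""
    for capacity in (len(s), len(s) + shared_size, 4 * len(s)):
        buf = bytearray(capacity)
        pos = 0
        fits = True
        for ch in s:
            encoded = ch.encode("utf-8")  # 1-4 bytes per code point
            if pos + len(encoded) > capacity:
                fits = False  # out of space: fall through to next stage
                break
            buf[pos : pos + len(encoded)] = encoded
            pos += len(encoded)
        if fits:
            return bytes(buf[:pos])
    raise AssertionError("unreachable: stage (c) always has enough space")
```

Pure-ASCII input never leaves stage (a), which is the case the policy is optimized for.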



This should be much better than the JDK version for ASCII, a bit  
better for short strings that fit in the shared temporary buffer,  
and not significantly worse for the rest, but I'll need to test it to  
be sure.


This is sort of just a fun experiment for me at this point, so who  
knows when I may get around to actually finishing this.


Evan

--
Evan Jones
http://evanjones.ca/




Re: [protobuf] systesting custom utf8 validation on remote c++ node using protocol buffers from python

2010-05-17 Thread JT Olds
Okay, well it's slightly more complicated. My C++ application needs to
actually accept the technically invalid code points U+ and U+FFFE.
Otherwise, I need my server application to know when invalid UTF-8 has
happened. That's all fine. I have that all implemented. That's good.

The problem is I want to exercise that behavior from my Python systest
framework. The problem is the Python libs are trying to be too
helpful. While I normally want them to do UTF-8 validation, I *don't*
want them to during the systests, because I want to send bad UTF-8 to
the server.

Make sense? I'm trying to do bad things to make sure stuff still works
in a systest environment.

-JT

On Mon, May 17, 2010 at 4:51 PM, Jason Hsueh jas...@google.com wrote:
 If you compile with the macro GOOGLE_PROTOBUF_UTF8_VALIDATION_ENABLED
 defined, the C++ code will do UTF8 validation. However, it doesn't prevent
 the data from serializing or parsing, it will simply log an error message.
 How would you like it to fail?

 On Mon, May 17, 2010 at 3:15 PM, JT Olds jto...@xnet5.com wrote:

 Hello,

 (I submitted this already via the protobuf google group web form, but
 I think I screwed up. If not, sorry for the double post)

  I have a C++-based server using protocol buffers as the IDL, and I'm
 trying to ensure that it rejects invalid UTF-8 strings. My systest
 library is written in Python. The C++ protocol buffer library does not
 seem to do any UTF-8 string checking on string types, whereas the
 Python library does. So I added some UTF-8 validation testing to the
 C++ server-side and I want to check that it works (in case a C++
 client sends invalid UTF-8). Whenever I inject invalid UTF-8 into the
 Python systests to make sure the server rejects the string, the Python
 library complains.

 Is there a way to override this behavior?

 I don't want to change my protocol buffer definitions to be the bytes
 type, because these really should be strings, and the Python library
 is doing exactly what I want for the general case.

 -JT








Re: [protobuf] systesting custom utf8 validation on remote c++ node using protocol buffers from python

2010-05-17 Thread JT Olds
It looks like I figured out a solution, though I'm not sure this is
the best way.

I have:

   pbuf = MyProtoBuf()
   pbuf.string_field = ""  # to make sure pbuf initialization stuff
                           # works (sets _has_string_field, etc.)
   pbuf._value_string_field = "bad utf8"
   f = pbuf.DESCRIPTOR.fields_by_number[pbuf.STRING_FIELD_NUMBER]
   f.type = f.TYPE_BYTES

On Mon, May 17, 2010 at 5:37 PM, JT Olds jto...@xnet5.com wrote:
 Okay, well it's slightly more complicated. My C++ application needs to
 actually accept the technically invalid code points U+ and U+FFFE.
 Otherwise, I need my server application to know when invalid UTF-8 has
 happened. That's all fine. I have that all implemented. That's good.

 The problem is I want to exercise that behavior from my Python systest
 framework. The problem is the Python libs are trying to be too
 helpful. While I normally want them to do UTF-8 validation, I *don't*
 want them to during the systests, because I want to send bad UTF-8 to
 the server.

 Make sense? I'm trying to do bad things to make sure stuff still works
 in a systest environment.

 -JT

 On Mon, May 17, 2010 at 4:51 PM, Jason Hsueh jas...@google.com wrote:
 If you compile with the macro GOOGLE_PROTOBUF_UTF8_VALIDATION_ENABLED
 defined, the C++ code will do UTF8 validation. However, it doesn't prevent
 the data from serializing or parsing, it will simply log an error message.
 How would you like it to fail?

 On Mon, May 17, 2010 at 3:15 PM, JT Olds jto...@xnet5.com wrote:

 Hello,

 (I submitted this already via the protobuf google group web form, but
 I think I screwed up. If not, sorry for the double post)

  I have a C++-based server using protocol buffers as the IDL, and I'm
 trying to ensure that it rejects invalid UTF-8 strings. My systest
 library is written in Python. The C++ protocol buffer library does not
 seem to do any UTF-8 string checking on string types, whereas the
 Python library does. So I added some UTF-8 validation testing to the
 C++ server-side and I want to check that it works (in case a C++
 client sends invalid UTF-8). Whenever I inject invalid UTF-8 into the
 Python systests to make sure the server rejects the string, the Python
 library complains.

 Is there a way to override this behavior?

 I don't want to change my protocol buffer definitions to be the bytes
 type, because these really should be strings, and the Python library
 is doing exactly what I want for the general case.

 -JT









Re: [protobuf] Java UTF-8 encoding/decoding: possible performance improvements

2010-05-17 Thread Christopher Smith
This does somewhat suggest that it might be worthwhile to specifically
tag a field as ASCII-only. There are enough cases of this that it
could be a huge win.


On 5/17/10, Evan Jones ev...@mit.edu wrote:
 On May 17, 2010, at 15:38 , Kenton Varda wrote:
 I see.  So in fact your code is quite possibly slower in non-ASCII
 cases?  In fact, it sounds like having even one non-ASCII character
 would force extra copies to occur, which I would guess would defeat
 the benefit, but we'd need benchmarks to tell for sure.

 Yes. I've been playing with this a bit in my spare time since the last
 email, but I don't have any results I'm happy with yet. Rough notes:

 * Encoding is (quite a bit?) faster than String.getBytes() if you
 assume one byte per character.
 * If you guess the number of bytes per character poorly and have to do
 multiple allocations and copies, the regular Java version will win. If
 you get it right (even if you first guess 1 byte per character), it
 looks like it can be slightly faster than or on par with the Java version.
 * Re-using a temporary byte[] for string encoding may be faster than
 String.getBytes(), which effectively allocates a temporary byte[] each
 time.


 I'm going to try to rework my code with a slightly different policy:

 a) Assume 1 byte per character and attempt the encode. If we run out
 of space:
 b) Use a shared temporary buffer and continue the encode. If we run
 out of space:
 c) Allocate a worst case 4 byte per character buffer and finish the
 encode.


 This should be much better than the JDK version for ASCII, a bit
 better for short strings that fit in the shared temporary buffer,
 and not significantly worse for the rest, but I'll need to test it to
 be sure.

 This is sort of just a fun experiment for me at this point, so who
 knows when I may get around to actually finishing this.

 Evan

 --
 Evan Jones
 http://evanjones.ca/




-- 
Sent from my mobile device

Chris




Re: [protobuf] Java UTF-8 encoding/decoding: possible performance improvements

2010-05-17 Thread Kenton Varda
What if you did a fast scan of the bytes first to see if any are non-ASCII?
 Maybe only do this fast scan if the data is short enough to fit in L1
cache?
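
A minimal sketch of the fast-scan idea, in Python for illustration (in the Java code this would be a pass over the String's chars before choosing an encode path):

```python
def encode_with_ascii_fast_path(s: str) -> bytes:
    # Fast pre-scan: a single pass checking for any non-ASCII character.
    # (As suggested above, one might only bother when the data is small
    # enough that the extra pass stays in L1 cache.)
    if all(ord(ch) < 128 for ch in s):
        # Pure ASCII: output length equals input length, so one
        # exactly-sized allocation suffices and no copy is needed.
        return s.encode("ascii")
    # Mixed content: fall back to the general UTF-8 path, whatever
    # buffer policy that path uses.
    return s.encode("utf-8")
```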

On Mon, May 17, 2010 at 7:59 PM, Christopher Smith cbsm...@gmail.com wrote:

 This does somewhat suggest that it might be worthwhile to specifically
 tag a field as ASCII-only. There are enough cases of this that it
 could be a huge win.


 On 5/17/10, Evan Jones ev...@mit.edu wrote:
  On May 17, 2010, at 15:38 , Kenton Varda wrote:
  I see.  So in fact your code is quite possibly slower in non-ASCII
  cases?  In fact, it sounds like having even one non-ASCII character
  would force extra copies to occur, which I would guess would defeat
  the benefit, but we'd need benchmarks to tell for sure.
 
  Yes. I've been playing with this a bit in my spare time since the last
  email, but I don't have any results I'm happy with yet. Rough notes:
 
  * Encoding is (quite a bit?) faster than String.getBytes() if you
  assume one byte per character.
  * If you guess the number of bytes per character poorly and have to do
  multiple allocations and copies, the regular Java version will win. If
  you get it right (even if you first guess 1 byte per character), it
  looks like it can be slightly faster than or on par with the Java version.
  * Re-using a temporary byte[] for string encoding may be faster than
  String.getBytes(), which effectively allocates a temporary byte[] each
  time.
 
 
  I'm going to try to rework my code with a slightly different policy:
 
  a) Assume 1 byte per character and attempt the encode. If we run out
  of space:
  b) Use a shared temporary buffer and continue the encode. If we run
  out of space:
  c) Allocate a worst case 4 byte per character buffer and finish the
  encode.
 
 
  This should be much better than the JDK version for ASCII, a bit
  better for short strings that fit in the shared temporary buffer,
  and not significantly worse for the rest, but I'll need to test it to
  be sure.
 
  This is sort of just a fun experiment for me at this point, so who
  knows when I may get around to actually finishing this.
 
  Evan
 
  --
  Evan Jones
  http://evanjones.ca/
 
 
 

 --
 Sent from my mobile device

 Chris

