Hi all,

Attached is a patch which implements approach #1 to eliding intermediate copies from a method overload, TProtocol::readBinary(vector<uint8_t>&).

Bruce Simpson wrote:
So it seems, to deal with my need for vector<uint8_t> in my existing application's RPC methods, without introducing unnecessary allocations or copies on the receive path. this boils down to two design choices:
1.  Just go ahead and implement it in libthrift itself.
2. Add a helper method in my application's IPC library, specific to TBinaryProtocol, to read the blob from the TTransport directly.

Approach #1:
   Thrift is already using <vector> anyway, but not for T = uint8_t.
The additional cost of vector<uint8_t> usage in the library is zero for static linkage, if the linker knows to discard the template expansions from the text segment [6]. However, shared linkage is going introduce weak symbols for vector<uint8_t> methods due to template expansion, and thus add a small amount of runtime text cost to libthrift.so. If a limit is placed on the payload size upfront, things are a little easier, if say we wanted to implement TProtocol::readBinary(uint8_t* data, size_t& len), without requiring vector<uint8_t>. It is a more generic interface, but still the problem looms of how to resize the buffer if input is bigger than we expect, assuming the caller manages the buffer. Probably not a good idea to be using C-style buffer management here. Even so, there is no easy way to convert a uint8_t array to a vector without a copy [5].

So it might be best just to add TProtocol::readBinary(vector<uint8_t>& buf) to the library.

I measure the .so additional code at just over 4KB with Thrift's default optimization settings (-O2). The static library is slightly better. Most of the additional code is in the ELF .text segment; the rest of the overhead is probably in the dynsyms, and for the new method signatures.

The template expansions are pretty minimal, although I didn't profile with -frepo; this appears to be broke in FreeBSD's base system compiler. -fno-implicit-templates has per-translation-unit granularity, so it's difficult to shuffle the vector<uint8_t> instantiation elsewhere, without additional hacking.

Given that the Thrift code generator itself will marshal a byte[] array quite differently vs the binary type, this is pretty specific to XORP's use of Thrift, and it could be argued it doesn't belong in the library. If so, is there a more elegant way to get data in and out of the TProtocol interface, without additional copies, and still sharing the implementation in libthrift.so as far as possible?

On the other hand, the Thrift generator will see the 'binary' type and produce calls to the current TProtocol::readString() implementations, and those are going to use intermediate copies, too. So it could also be argued that this has general performance benefits.

Comments? Suggestions?

thanks,
BMS

Index: lib/cpp/src/protocol/TProtocolTap.h
===================================================================
--- lib/cpp/src/protocol/TProtocolTap.h (revision 883543)
+++ lib/cpp/src/protocol/TProtocolTap.h (working copy)
@@ -177,6 +177,13 @@
     return rv;
   }
 
+  uint32_t readBinary(std::vector<uint8_t>& buf) {
+    uint32_t rv = source_->readBinary(buf);
+    // XXX Needs writeBinary() overload patch to elide a buffer copy.
+    sink_->writeBinary(std::string((const char *)&buf[0], buf.length()));
+    return rv;
+  }
+
  private:
   boost::shared_ptr<TProtocol> source_;
   boost::shared_ptr<TProtocol> sink_;
Index: lib/cpp/src/protocol/TBinaryProtocol.h
===================================================================
--- lib/cpp/src/protocol/TBinaryProtocol.h      (revision 883543)
+++ lib/cpp/src/protocol/TBinaryProtocol.h      (working copy)
@@ -188,9 +188,13 @@
 
   uint32_t readBinary(std::string& str);
 
+  uint32_t readBinary(std::vector<uint8_t>& buf);
+
  protected:
   uint32_t readStringBody(std::string& str, int32_t sz);
 
+  uint32_t readBinaryBlob(std::vector<uint8_t>& buf, int32_t rsize);
+
   int32_t string_limit_;
   int32_t container_limit_;
 
Index: lib/cpp/src/protocol/TCompactProtocol.h
===================================================================
--- lib/cpp/src/protocol/TCompactProtocol.h     (revision 883543)
+++ lib/cpp/src/protocol/TCompactProtocol.h     (working copy)
@@ -216,6 +216,8 @@
 
   uint32_t readBinary(std::string& str);
 
+  uint32_t readBinary(std::vector<uint8_t>& buf);
+
   /*
    *These methods are here for the struct to call, but don't have any wire
    * encoding.
Index: lib/cpp/src/protocol/TDenseProtocol.cpp
===================================================================
--- lib/cpp/src/protocol/TDenseProtocol.cpp     (revision 883543)
+++ lib/cpp/src/protocol/TDenseProtocol.cpp     (working copy)
@@ -739,6 +739,12 @@
   return TDenseProtocol::readString(str);
 }
 
+uint32_t TDenseProtocol::readBinary(std::vector<uint8_t>& buf) {
+  checkTType(T_STRING);
+  stateTransition();
+  return subReadBinary(buf);
+}
+
 uint32_t TDenseProtocol::subReadI32(int32_t& i32) {
   uint64_t u64;
   uint32_t rv = vlqRead(u64);
@@ -759,4 +765,11 @@
   return xfer + readStringBody(str, size);
 }
 
+uint32_t TDenseProtocol::subReadBinary(std::vector<uint8_t>& buf) {
+  uint32_t xfer;
+  int32_t size;
+  xfer = subReadI32(size);
+  return xfer + readBinaryBlob(buf, size);
+}
+
 }}} // apache::thrift::protocol
Index: lib/cpp/src/protocol/TDenseProtocol.h
===================================================================
--- lib/cpp/src/protocol/TDenseProtocol.h       (revision 883543)
+++ lib/cpp/src/protocol/TDenseProtocol.h       (working copy)
@@ -205,6 +205,8 @@
 
   uint32_t readBinary(std::string& str);
 
+  uint32_t readBinary(std::vector<uint8_t>& buf);
+
   /*
    * Helper reading functions (don't do state transitions).
    */
@@ -212,6 +214,8 @@
 
   inline uint32_t subReadString(std::string& str);
 
+  inline uint32_t subReadBinary(std::vector<uint8_t>& buf);
+
   uint32_t subReadBool(bool& value) {
     return TBinaryProtocol::readBool(value);
   }
Index: lib/cpp/src/protocol/TOneWayProtocol.h
===================================================================
--- lib/cpp/src/protocol/TOneWayProtocol.h      (revision 883543)
+++ lib/cpp/src/protocol/TOneWayProtocol.h      (working copy)
@@ -154,6 +154,11 @@
         subclass_ + " does not support reading (yet).");
   }
 
+  uint32_t readBinary(std::vector<uint8_t>& buf) {
+    throw TProtocolException(TProtocolException::NOT_IMPLEMENTED,
+        subclass_ + " does not support reading (yet).");
+  }
+
  private:
   std::string subclass_;
 };
Index: lib/cpp/src/protocol/TJSONProtocol.cpp
===================================================================
--- lib/cpp/src/protocol/TJSONProtocol.cpp      (revision 883543)
+++ lib/cpp/src/protocol/TJSONProtocol.cpp      (working copy)
@@ -995,4 +995,14 @@
   return readJSONBase64(str);
 }
 
+uint32_t TJSONProtocol::readBinary(std::vector<uint8_t>& buf) {
+  // XXX Not optimized in any way. One way to implement this without copies
+  // is to do base64 decode in-place in buf.
+  std::string s;
+  uint32_t rsize = readJSONBase64(s);
+  buf.resize(s.size());
+  std::memmove(&buf[0], s.data(), s.size());
+  return (rsize);
+}
+
 }}} // apache::thrift::protocol
Index: lib/cpp/src/protocol/TJSONProtocol.h
===================================================================
--- lib/cpp/src/protocol/TJSONProtocol.h        (revision 883543)
+++ lib/cpp/src/protocol/TJSONProtocol.h        (working copy)
@@ -257,6 +257,8 @@
 
   uint32_t readBinary(std::string& str);
 
+  uint32_t readBinary(std::vector<uint8_t>& buf);
+
   class LookaheadReader {
 
    public:
Index: lib/cpp/src/protocol/TProtocol.h
===================================================================
--- lib/cpp/src/protocol/TProtocol.h    (revision 883543)
+++ lib/cpp/src/protocol/TProtocol.h    (working copy)
@@ -30,6 +30,7 @@
 #include <sys/types.h>
 #include <string>
 #include <map>
+#include <vector>
 
 
 // Use this to get around strict aliasing rules.
@@ -290,6 +291,8 @@
 
   virtual uint32_t readBinary(std::string& str) = 0;
 
+  virtual uint32_t readBinary(std::vector<uint8_t>& buf) = 0;
+
   uint32_t readBool(std::vector<bool>::reference ref) {
     bool value;
     uint32_t rv = readBool(value);
Index: lib/cpp/src/protocol/TBinaryProtocol.cpp
===================================================================
--- lib/cpp/src/protocol/TBinaryProtocol.cpp    (revision 883543)
+++ lib/cpp/src/protocol/TBinaryProtocol.cpp    (working copy)
@@ -360,6 +360,10 @@
   return TBinaryProtocol::readString(str);
 }
 
+// Read into a caller-provided string. A per-instance buffer private
+// to TBinaryProtocol is used to avoid large stack temporaries, as
+// TTransport::readAll() wants to use raw buffers, and string cannot
+// manage this storage. This approach is non-reentrant.
 uint32_t TBinaryProtocol::readStringBody(string& str, int32_t size) {
   uint32_t result = 0;
 
@@ -391,4 +395,43 @@
   return (uint32_t)size;
 }
 
+// Read into a caller-provided buffer, managed as a vector<uint8_t>.
+// Use this interface to avoid intermediate copies. The transport will
+// read directly into the provided vector, resizing it as needed.
+// The containerLimit property is used to enforce an upper bound on
+// the size of the binary blob read.
+uint32_t TBinaryProtocol::readBinary(std::vector<uint8_t>& buf) {
+  uint32_t result;
+  int32_t rsize;
+  result = readI32(rsize);
+  return result + readBinaryBlob(buf, rsize);
+}
+
+uint32_t TBinaryProtocol::readBinaryBlob(std::vector<uint8_t>& buf,
+                                         int32_t rsize) {
+  uint32_t result = 0;
+
+  // Catch error cases
+  if (rsize < 0) {
+    throw TProtocolException(TProtocolException::NEGATIVE_SIZE);
+  }
+  if (container_limit_ > 0 && rsize > container_limit_) {
+    throw TProtocolException(TProtocolException::SIZE_LIMIT);
+  }
+
+  // Catch empty blob case
+  if (rsize == 0) {
+    buf.clear();
+    return result;
+  }
+
+  // Cheap with -O1, uint8_t is a POD type; avoids copy on resize.
+  // May throw std::bad_alloc; we do not turn this into a TProtocolException,
+  // to avoid the overhead of try {}.
+  buf.clear();
+  buf.resize(rsize);
+  trans_->readAll(&buf[0], rsize);
+  return (uint32_t)rsize;
+}
+
 }}} // apache::thrift::protocol
Index: lib/cpp/src/protocol/TCompactProtocol.cpp
===================================================================
--- lib/cpp/src/protocol/TCompactProtocol.cpp   (revision 883543)
+++ lib/cpp/src/protocol/TCompactProtocol.cpp   (working copy)
@@ -589,7 +589,7 @@
 }
 
 /**
- * Read a byte[] from the wire.
+ * Read a byte[] from the wire into a std::string.
  */
 uint32_t TCompactProtocol::readBinary(std::string& str) {
   int32_t rsize = 0;
@@ -626,6 +626,35 @@
 }
 
 /**
+ * Read a byte[] from the wire into a std::vector<uint8_t>.
+ */
+uint32_t TCompactProtocol::readBinary(std::vector<uint8_t>& buf) {
+  int32_t rsize = 0;
+  int32_t size;
+
+  rsize += readVarint32(size);
+  // Catch empty blob case
+  if (size == 0) {
+    buf.clear();
+    return rsize;
+  }
+
+  // Catch error cases
+  if (size < 0) {
+    throw TProtocolException(TProtocolException::NEGATIVE_SIZE);
+  }
+  if (container_limit_ > 0 && size > container_limit_) {
+    throw TProtocolException(TProtocolException::SIZE_LIMIT);
+  }
+
+  buf.clear();
+  buf.resize(size);
+  trans_->readAll(&buf[0], size);
+
+  return rsize + (uint32_t)size;
+}
+
+/**
  * Read an i32 from the wire as a varint. The MSB of each byte is set
  * if there is another byte to follow. This can read up to 5 bytes.
  */
Eliding template instantiations?

But what about std::allocator inlines?
 Dropping sections only works if they're named individually.
 ld can do this directly, collect2 need only be involved if using -frepo.

 * Implementation of templates is v light in thrift.
 * Use of templates may be higher, i.e. STL, boost.
 * With -O0 -fno-inlines -save-temps, the situation is tragic, but it lets
   you see expansions in the assembler output.
 * -O1 is the minimum optimization level needed to ensure that code is inlined,
   and unused sections (i.e. linkonce due to template expansion) are discarded.

-frepo may offer slight savings when building the ELF .so:
 * -frepo requires a hierarchical approach, esp with shared libraries.
 * -frepo is broken in FreeBSD due to unmaintained GNU collect2.

On FreeBSD 7.2-STABLE/amd64:

libthrift.so, baseline r883543, and with vector<uint8_t> read:
   text    data     bss     dec     hex filename
 594284   21320    1000  616604   9689c ./lib/cpp/.libs/libthrift.so
 599634   21336    1000  621970   97d92 ./lib/cpp/.libs/libthrift.so

=> +text 5350, +data 16
==> size(1) obscures the measurement slightly, the actual difference in
    ELF .text is 4240 bytes.

libthrift.a, baseline r883543, and with vector<uint8_t> read:
   text    data     bss     dec     hex filename
 487564     172     810  488546   77462 (TOTALS)
 491014     172     810  491996   781dc (TOTALS)

=> +text 3450

# ----
# baseline run
gmake clean
svn revert -R .
gmake

# get section sizes
size -t ./lib/cpp/.libs/libthrift.a > /tmp/ass
size ./lib/cpp/.libs/libthrift.so > /tmp/ads

# get symbols, sizes
nm -SDP ./lib/cpp/.libs/libthrift.so > /tmp/ad
nm -SP ./lib/cpp/.libs/libthrift.a > /tmp/as

# ----

# patch run
gmake clean
patch -p0 -i <path-to-my-patch>
gmake

# get section sizes
size -t ./lib/cpp/.libs/libthrift.a > /tmp/bss
size ./lib/cpp/.libs/libthrift.so > /tmp/bds

# get symbols, sizes
nm -SDP ./lib/cpp/.libs/libthrift.so > /tmp/bd
nm -SP ./lib/cpp/.libs/libthrift.a > /tmp/bs

# ----

# compare dynamic libs
cat /tmp/ad | sort | uniq > /tmp/adx
cat /tmp/bd | sort | uniq > /tmp/bdx
cut -f1 -d' ' /tmp/ad | sort | uniq > /tmp/ad1
cut -f1 -d' ' /tmp/bd | sort | uniq > /tmp/bd1
# check for any eliminated symbols (regression?)
diff /tmp/ad1 /tmp/bd1 | grep '^<' | sort
# check for new code
diff /tmp/ad1 /tmp/bd1 | grep '^>' | awk '{print $NF}' | sort | uniq > /tmp/bdn
# what's the cost of that code?
join /tmp/bdn /tmp/bdx | c++filt

2455 bytes accounted for.
Data point:
apache::thrift::protocol::TDenseProtocol::subReadI32(int&) W be10 
0000000000000271  <- That's big, why?  Probably due to coalescing of local 
symbols.

# compare static libs
cat /tmp/as | sort | uniq > /tmp/asx
cat /tmp/bs | sort | uniq > /tmp/bsx
cut -f1 -d' ' /tmp/as | sort | uniq > /tmp/as1
cut -f1 -d' ' /tmp/bs | sort | uniq > /tmp/bs1
# check for any eliminated symbols (regression?)
diff /tmp/as1 /tmp/bs1 | grep '^<' | sort
# check for new code
diff /tmp/as1 /tmp/bs1 | grep '^>' | awk '{print $NF}' | sort | uniq > /tmp/bsn
# what's the cost of that code?
join /tmp/bsn /tmp/bsx | c++filt

Reply via email to