Re: [HACKERS] Faster compression, again
Well, the patent argument, used like this, looks like a wild card, which can be freely interpreted as a mortal danger by some and a non-issue by others. A perfect scaremongering device. Quite frankly, I don't buy the idea that one implementation is safer because Google is backing it. I can't think of any reason why a byte-aligned LZ77 algorithm could face any such risk. And by the way, just look at the number of companies which had to pay protection money to Microsoft, or face litigation with Apple, because they were using Google's Android. It looks to me as though Google is more a magnet for such dangers than a protector.

Regarding test tools: yes, this is correct, Snappy C has more fuzzer tools provided within the package.

Regarding integration into btrfs, and therefore into Linux, both implementations look to be on equal terms. I haven't seen anything which says that one has a better chance than the other of being part of Linux 3.5. In fact, maybe both will be integrated at the same time. However, a little-publicized fact is that quite a few people have tried both implementations (Snappy C and LZ4), and there were more failures/difficulties reported with Snappy C. It doesn't mean that Snappy C is bad, just more complex to use. The LZ4 implementation seems more straightforward: fewer dependencies, fewer risks, less time spent to optimize it properly. In a word: simpler.

On 5 April 2012 at 01:11, Daniel Farina-4 [via PostgreSQL] <ml-node+s1045698n5619199...@n5.nabble.com> wrote:
> On Tue, Apr 3, 2012 at 7:29 AM, Huchev wrote:
>> For a C implementation, it could be interesting to consider the LZ4
>> algorithm, since it is written natively in this language. In contrast,
>> Snappy has been ported to C by Andi from the original C++ Google code,
>> which also translates into less extensive usage and testing.
>
> From what I can tell, the C implementation of snappy has more tests than
> this LZ4 implementation, including a fuzz tester. It's a maintained part
> of Linux as well, and used for btrfs --- this is why it was ported. The
> high-compression version of LZ4 is apparently LGPL.
>
> And, finally, there is the issue of patents: snappy has several
> multi-billion-dollar companies that can be held liable (originator
> Google, as well as anyone connected to Linux), and to the best of my
> knowledge, nobody has been held to extortion yet. Consider me
> unconvinced as to this line of argument.
>
> --
> fdr
Re: [HACKERS] Faster compression, again
On Tue, Apr 3, 2012 at 7:29 AM, Huchev <hugochevr...@gmail.com> wrote:
> For a C implementation, it could be interesting to consider the LZ4
> algorithm, since it is written natively in this language. In contrast,
> Snappy has been ported to C by Andi from the original C++ Google code,
> which also translates into less extensive usage and testing.

From what I can tell, the C implementation of snappy has more tests than this LZ4 implementation, including a fuzz tester. It's a maintained part of Linux as well, and used for btrfs --- this is why it was ported. The high-compression version of LZ4 is apparently LGPL.

And, finally, there is the issue of patents: snappy has several multi-billion-dollar companies that can be held liable (originator Google, as well as anyone connected to Linux), and to the best of my knowledge, nobody has been held to extortion yet. Consider me unconvinced as to this line of argument.

--
fdr
Re: [HACKERS] Faster compression, again
For a C implementation, it could be interesting to consider the LZ4 algorithm, since it is written natively in this language. In contrast, Snappy has been ported to C by Andi from the original C++ Google code, which also translates into less extensive usage and testing.

http://code.google.com/p/lz4/

Furthermore, the LZ4 license is BSD. And it has been reported in several tests as being faster than Snappy/LZO, especially on decompression speed.

http://article.gmane.org/gmane.comp.file-systems.btrfs/15744

And one last point: there is a high-compression mode, which could be useful for data rarely written/modified but often read.

http://code.google.com/p/lz4hc/
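For readers who want to try LZ4 from C, here is a minimal round-trip sketch against the lz4.h API. Hedged: it assumes a recent liblz4 exposing LZ4_compress_default and LZ4_decompress_safe; the releases current at the time of this thread exposed LZ4_compress/LZ4_uncompress instead.

    /* Minimal LZ4 round-trip sketch; not PostgreSQL code.
     * Assumes a recent liblz4 (link with -llz4). */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <lz4.h>

    int main(void)
    {
        const char *src = "a repetitive payload payload payload payload";
        int src_len = (int) strlen(src) + 1;
        int bound = LZ4_compressBound(src_len);   /* worst-case output size */
        char *comp = malloc(bound);
        char *back = malloc(src_len);

        if (!comp || !back)
            return 1;

        int comp_len = LZ4_compress_default(src, comp, src_len, bound);
        if (comp_len <= 0)
            return 1;                             /* compression failed */

        /* LZ4_decompress_safe bounds-checks malformed input */
        if (LZ4_decompress_safe(comp, back, comp_len, src_len) != src_len ||
            memcmp(src, back, src_len) != 0)
            return 1;                             /* round-trip mismatch */

        printf("%d -> %d bytes\n", src_len, comp_len);
        free(comp);
        free(back);
        return 0;
    }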
Re: [HACKERS] Faster compression, again
On Thu, Mar 15, 2012 at 10:34 PM, Daniel Farina <dan...@heroku.com> wrote:
> I'd really like to find a way to layer both message-oblivious and
> message-aware transport under FEBE with both backend and frontend
> support without committing the project to new code for-ever-and-ever.
> I guess I could investigate it in brief now, unless you've already
> thought about/done some work in that area.

Not done anything in that area myself. I think it's important that we have compression for the COPY protocol within libpq, so I'll add that to my must-do list - but I would be more than happy if you wanted to tackle that yourself.

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 11:06 AM, Daniel Farina <dan...@heroku.com> wrote:
> ...and it has been ported to C (recently, and with some quirks, like no
> LICENSE file...yet, although it is linked from the original Snappy
> project).

I poked the author about the license and he fixed it in a jiffy. Now under BSD, with Intel's copyright. He seems to be committing a few enhancements, but the snail's trail of the Internet suggests that this code has made its way into Linux as well, including btrfs. So now I guess we can have at it...

https://github.com/andikleen/snappy-c/

--
fdr
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 6:06 PM, Daniel Farina <dan...@heroku.com> wrote:
> If we're curious how it affects replication traffic, I could probably
> gather statistics on LZO-compressed WAL traffic, of which we have a
> pretty huge amount captured.

What's the compression like for shorter chunks of data? Is it worth considering using this for the libpq copy protocol, and therefore for streaming replication also?

--
Simon Riggs
http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Training & Services
Re: [HACKERS] Faster compression, again
On Thu, Mar 15, 2012 at 3:14 PM, Simon Riggs <si...@2ndquadrant.com> wrote:
> On Wed, Mar 14, 2012 at 6:06 PM, Daniel Farina <dan...@heroku.com> wrote:
>> If we're curious how it affects replication traffic, I could probably
>> gather statistics on LZO-compressed WAL traffic, of which we have a
>> pretty huge amount captured.
>
> What's the compression like for shorter chunks of data? Is it worth
> considering using this for the libpq copy protocol, and therefore for
> streaming replication also?

The overhead is between 1 and 5 bytes: length bytes that reserve the high bit as a continuation bit (so one byte for small data), and then straight into the data. So I think it could be applied to most payloads, even ones only a few bytes wide. Presumably that limit could be lifted, but the format description only allows for 2**32 - 1 as the uncompressed size.

I'd really like to find a way to layer both message-oblivious and message-aware transport under FEBE with both backend and frontend support without committing the project to new code for-ever-and-ever. I guess I could investigate it in brief now, unless you've already thought about/done some work in that area.

--
fdr
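For concreteness, the preamble Daniel describes is the snappy format's length varint: little-endian base-128, with the high bit of each byte as a continuation bit, so lengths up to 2**32 - 1 occupy 1 to 5 bytes. A sketch of the encoder in plain C, following the published format description:

    #include <stddef.h>
    #include <stdint.h>

    /* Encode v as the snappy preamble varint: 7 data bits per byte,
     * high bit set on every byte except the last. Returns the number
     * of bytes written, between 1 and 5. "out" needs room for 5 bytes. */
    static size_t
    varint32_encode(uint32_t v, unsigned char *out)
    {
        size_t n = 0;

        while (v >= 0x80)
        {
            out[n++] = (unsigned char) (v | 0x80);  /* continuation bit */
            v >>= 7;
        }
        out[n++] = (unsigned char) v;               /* final byte, high bit clear */
        return n;
    }

So a payload under 128 bytes costs a single preamble byte, and only lengths of 2**28 and up need all five.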
Re: [HACKERS] Faster compression, again
On Thu, Mar 15, 2012 at 10:14:12PM +0000, Simon Riggs wrote:
> On Wed, Mar 14, 2012 at 6:06 PM, Daniel Farina <dan...@heroku.com> wrote:
>> If we're curious how it affects replication traffic, I could probably
>> gather statistics on LZO-compressed WAL traffic, of which we have a
>> pretty huge amount captured.
>
> What's the compression like for shorter chunks of data? Is it worth
> considering using this for the libpq copy protocol, and therefore for
> streaming replication also?

Here is a pointer to some tests with Snappy+CouchDB:

https://github.com/fdmanana/couchdb/blob/b8f806e41727ba18ed6143cee31a3242e024ab2c/snappy-couch-tests.txt

They checked compression on smaller chunks of data. I have extracted the basic results. The first number is the original size in bytes, followed by the compressed size in bytes, the percent compressed, and the compression ratio:

    77   -> 60,   90% or 1.1:1
    120  -> 104,  87% or 1.15:1
    127  -> 80,   63% or 1.6:1
    5942 -> 2930, 49% or 2:1

It looks like a good candidate for both the libpq copy protocol and streaming replication. My two cents.

Regards,
Ken
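A sketch of how one might reproduce this sort of measurement using the snappy-c.h binding that ships with Google's snappy (assumed here; Andi Kleen's port uses different entry points, such as a snappy_env argument). Link with -lsnappy.

    /* Sketch: compress one small JSON-ish document and report the ratio. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <snappy-c.h>

    int main(void)
    {
        const char *doc = "{\"_id\":\"0001\",\"body\":\"a small document\"}";
        size_t in_len = strlen(doc);
        size_t out_len = snappy_max_compressed_length(in_len);
        char *out = malloc(out_len);

        if (!out)
            return 1;

        /* out_len is updated in place to the actual compressed size */
        if (snappy_compress(doc, in_len, out, &out_len) == SNAPPY_OK)
            printf("%zu -> %zu bytes (%.0f%% of original)\n",
                   in_len, out_len, 100.0 * (double) out_len / (double) in_len);

        free(out);
        return 0;
    }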
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 11:06:16AM -0700, Daniel Farina wrote:
> For 9.3 at a minimum. The topic of LZO became mired in doubts about:
>
> * Potential patents
> * The author's intention for the implementation to be GPL
>
> Since then, Google released Snappy, also an LZ77-class implementation,
> and it has been ported to C (recently, and with some quirks, like no
> LICENSE file...yet, although it is linked from the original Snappy
> project). The original Snappy (C++) has a BSD license and a patent
> grant (which shields you from Google, at least).
>
> Do we want to investigate a very-fast compression algorithm inclusion
> again in the 9.3 cycle?

+1 for Snappy and a very fast compression algorithm.

Regards,
Ken
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 1:06 PM, Daniel Farina <dan...@heroku.com> wrote:
> For 9.3 at a minimum. The topic of LZO became mired in doubts about:
>
> * Potential patents
> * The author's intention for the implementation to be GPL
>
> Since then, Google released Snappy, also an LZ77-class implementation,
> and it has been ported to C (recently, and with some quirks, like no
> LICENSE file...yet, although it is linked from the original Snappy
> project). The original Snappy (C++) has a BSD license and a patent
> grant (which shields you from Google, at least).
>
> Do we want to investigate a very-fast compression algorithm inclusion
> again in the 9.3 cycle?
>
> I've been using the similar implementation LZO for WAL archiving and it
> is a significant savings (not as much as pg_lesslog, but also less
> invasive). It is also fast enough that even if one were not to uproot
> TOAST's compression, it would probably be very close to a complete win
> for protocol traffic, whereas SSL's standardized zlib can definitely be
> a drag in some cases.
>
> This idea resurfaces often, but the reason why I wrote in about it is
> that I have a table which I had categorized as small but which was, in
> fact, 1.5MB, which made transferring it somewhat slow over a remote
> link. zlib compression takes it down to about 550K, and lzo (similar,
> but not identical) to 880K.
>
> If we're curious how it affects replication traffic, I could probably
> gather statistics on LZO-compressed WAL traffic, of which we have a
> pretty huge amount captured.

there are plenty of on gpl lz based libraries out there (for example: http://www.fastlz.org/) and always have been. they are all much faster than zlib. the main issue is patents...you have to be careful even though all the lz77/78 patents seem to have expired or apply to specifics not relevant to general use. see here for the last round of talks on this:

http://archives.postgresql.org/pgsql-performance/2009-08/msg00052.php

lzo is nearing its 20th birthday, so even if you are paranoid about patents (admittedly, there is good reason to be), the window is closing fast to have patent issues that aren't (a) expired or (b) covered by prior art on that or the various copycat implementations, at least in the US.

snappy looks amazing.

merlin
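As a point of reference for the zlib figure quoted above, the whole-buffer measurement is a short program against zlib's one-shot API. This is a sketch only: the zeroed 1.5MB buffer stands in for the real table data, and LZO is omitted since its API differs.

    /* Sketch: one-shot zlib compression of a 1.5MB buffer. Link with -lz. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <zlib.h>

    int main(void)
    {
        uLong src_len = 1500000;                   /* ~1.5MB stand-in buffer */
        Bytef *src = calloc(src_len, 1);
        uLongf dst_len = compressBound(src_len);   /* worst-case output size */
        Bytef *dst = malloc(dst_len);

        if (!src || !dst)
            return 1;

        /* one-shot deflate at the default compression level */
        if (compress2(dst, &dst_len, src, src_len, Z_DEFAULT_COMPRESSION) == Z_OK)
            printf("%lu -> %lu bytes\n", src_len, (uLong) dst_len);

        free(src);
        free(dst);
        return 0;
    }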
Re: [HACKERS] Faster compression, again
On 03/14/2012 04:10 PM, Merlin Moncure wrote:
> there are plenty of on gpl lz based libraries out there (for example:
> http://www.fastlz.org/) and always have been. they are all much faster
> than zlib. the main issue is patents...you have to be careful even
> though all the lz77/78 patents seem to have expired or apply to
> specifics not relevant to general use.

We're not going to include GPL code in the backend. We have enough trouble with readline, and that's only for psql. So the fact that there are GPL libraries can't help us, whether there are patent issues or not.

cheers

andrew
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 04:43:55PM -0400, Andrew Dunstan wrote:
> On 03/14/2012 04:10 PM, Merlin Moncure wrote:
>> there are plenty of on gpl lz based libraries out there (for example:
>> http://www.fastlz.org/) and always have been. they are all much faster
>> than zlib. the main issue is patents...you have to be careful even
>> though all the lz77/78 patents seem to have expired or apply to
>> specifics not relevant to general use.
>
> We're not going to include GPL code in the backend. We have enough
> trouble with readline, and that's only for psql. So the fact that there
> are GPL libraries can't help us, whether there are patent issues or not.

That is what makes Google's Snappy so appealing: a BSD license.

Regards,
Ken
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 3:43 PM, Andrew Dunstan <and...@dunslane.net> wrote:
> On 03/14/2012 04:10 PM, Merlin Moncure wrote:
>> there are plenty of on gpl lz based libraries out there (for example:
>> http://www.fastlz.org/) and always have been. they are all much faster
>> than zlib. the main issue is patents...you have to be careful even
>> though all the lz77/78 patents seem to have expired or apply to
>> specifics not relevant to general use.
>
> We're not going to include GPL code in the backend. We have enough
> trouble with readline, and that's only for psql. So the fact that there
> are GPL libraries can't help us, whether there are patent issues or not.

er, typo: I meant to say: *non-gpl* lz based... :-)

merlin
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 2:03 PM, Merlin Moncure <mmonc...@gmail.com> wrote:
> er, typo: I meant to say: *non-gpl* lz based... :-)

Given that, few I would say have had the traction that LZO and Snappy have had, even though in many respects they are interchangeable in the general trade-off spectrum. The question is: what burden of proof is required to convince the project that Snappy does not have exorbitant patent issues, in proportion to the utility of having a compression scheme of this type integrated?

One would think Google's lawyers did their homework to ensure they would not be trolled for hideous sums of money by disclosing and releasing the exact implementation of the compression used virtually everywhere. If anything, that may have been a more complicated issue than writing and releasing yet-another-LZ77 implementation.

--
fdr
Re: [HACKERS] Faster compression, again
Daniel Farina <dan...@heroku.com> writes:
> Given that, few I would say have had the traction that LZO and Snappy
> have had, even though in many respects they are interchangeable in the
> general trade-off spectrum. The question is: what burden of proof is
> required to convince the project that Snappy does not have exorbitant
> patent issues, in proportion to the utility of having a compression
> scheme of this type integrated?

Another not-exactly-trivial requirement is to figure out how to not break on-disk compatibility when installing an alternative compression scheme. In hindsight it might've been a good idea if pglz_compress had wasted a little bit of space on some sort of version identifier ... but it didn't.

			regards, tom lane
Re: [HACKERS] Faster compression, again
Tom Lane <t...@sss.pgh.pa.us> wrote:
> Another not-exactly-trivial requirement is to figure out how to not
> break on-disk compatibility when installing an alternative compression
> scheme. In hindsight it might've been a good idea if pglz_compress had
> wasted a little bit of space on some sort of version identifier ... but
> it didn't.

Doesn't it always start with a header of two int32 values where the first must be smaller than the second? That seems like enough to get traction for an identifiably different header for an alternative compression technique.

-Kevin
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 2:58 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Daniel Farina <dan...@heroku.com> writes:
>> Given that, few I would say have had the traction that LZO and Snappy
>> have had, even though in many respects they are interchangeable in the
>> general trade-off spectrum. The question is: what burden of proof is
>> required to convince the project that Snappy does not have exorbitant
>> patent issues, in proportion to the utility of having a compression
>> scheme of this type integrated?
>
> Another not-exactly-trivial requirement is to figure out how to not
> break on-disk compatibility when installing an alternative compression
> scheme. In hindsight it might've been a good idea if pglz_compress had
> wasted a little bit of space on some sort of version identifier ... but
> it didn't.

I was thinking more that the latency and throughput of LZ77 schemes may best be applied first to protocol compression. The downside is that this requires more protocol wrangling, but at least terabytes of on-disk format don't enter the picture, even though LZ77 on the data itself may be attractive. I'm interested in allowing transports to be layered below FEBE (similar to how SSL is below it, except without the complexity of being tied into auth) in a couple of respects, and this is one of them.

--
fdr
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 6:08 PM, Kevin Grittner <kevin.gritt...@wicourts.gov> wrote:
> Tom Lane <t...@sss.pgh.pa.us> wrote:
>> Another not-exactly-trivial requirement is to figure out how to not
>> break on-disk compatibility when installing an alternative compression
>> scheme. In hindsight it might've been a good idea if pglz_compress had
>> wasted a little bit of space on some sort of version identifier ... but
>> it didn't.
>
> Doesn't it always start with a header of two int32 values where the
> first must be smaller than the second? That seems like enough to get
> traction for an identifiably different header for an alternative
> compression technique.

The first of those words is vl_len_, which we can't fiddle with too much, but the second is rawsize, which we can definitely fiddle with. Right now, rawsize > vl_len_ means it's compressed, and rawsize == vl_len_ means it's uncompressed. As you point out, rawsize < vl_len_ is undefined; also, and maybe simpler, rawsize < 0 is undefined.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Re: [HACKERS] Faster compression, again
Robert Haas <robertmh...@gmail.com> writes:
> On Wed, Mar 14, 2012 at 6:08 PM, Kevin Grittner
> <kevin.gritt...@wicourts.gov> wrote:
>> Doesn't it always start with a header of two int32 values where the
>> first must be smaller than the second? That seems like enough to get
>> traction for an identifiably different header for an alternative
>> compression technique.
>
> The first of those words is vl_len_, which we can't fiddle with too
> much, but the second is rawsize, which we can definitely fiddle with.
> Right now, rawsize > vl_len_ means it's compressed, and rawsize ==
> vl_len_ means it's uncompressed. As you point out, rawsize < vl_len_ is
> undefined; also, and maybe simpler, rawsize < 0 is undefined.

Well, let's please not make the same mistake again of assuming that there will never again be any other ideas in this space. IOW, let's find a way to shoehorn in an actual compression-method ID value of some sort. I don't particularly care for trying to push that into rawsize, because you don't really have more than about one bit to work with there, unless you eat the entire word for ID purposes, which seems excessive.

After looking at pg_lzcompress.c for a bit, it appears to me that the LSB of the first byte of compressed data must always be zero, because the very first control bit has to say "copy a literal byte"; you can't have a back-reference until there's some data in the output buffer. So what I suggest is that we keep rawsize the same as it is, but peek at the first byte after that to decide what we have: an even value means the existing compression method, while an odd value is an ID byte selecting some new method. This gives us room for 128 new methods before we have trouble again, while consuming only one byte, which seems like acceptable overhead for the purpose.

			regards, tom lane
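A hypothetical sketch of the dispatch Tom describes; the struct layout is simplified for illustration and is not the actual varlena/PGLZ_Header definition:

    #include <stdint.h>

    /* Hypothetical, simplified layout. */
    typedef struct
    {
        uint32_t vl_len_;    /* total stored size */
        uint32_t rawsize;    /* uncompressed size */
        uint8_t  data[1];    /* compressed stream follows */
    } compressed_header;

    /* Returns 0 for the existing pglz stream (whose first control bit,
     * the LSB of data[0], is always zero) or the odd ID byte selecting
     * one of up to 128 new methods. */
    static int
    compression_method(const compressed_header *hdr)
    {
        if ((hdr->data[0] & 1) == 0)
            return 0;               /* legacy pglz */
        return hdr->data[0];        /* odd value: new-method ID */
    }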
Re: [HACKERS] Faster compression, again
On Wed, Mar 14, 2012 at 9:44 PM, Tom Lane <t...@sss.pgh.pa.us> wrote:
> Well, let's please not make the same mistake again of assuming that
> there will never again be any other ideas in this space. IOW, let's
> find a way to shoehorn in an actual compression-method ID value of some
> sort. I don't particularly care for trying to push that into rawsize,
> because you don't really have more than about one bit to work with
> there, unless you eat the entire word for ID purposes, which seems
> excessive.

Well, you don't have to go that far. For example, you could dictate that, when the value is negative, the most significant byte indicates which compression algorithm is in use (128 possible compression algorithms). The remaining 3 bytes indicate the compressed length; but when they're all zero, the compressed length is instead stored in the following 4-byte word. This consumes one additional 4-byte word for values whose compressed size is >= 16MB, but that's presumably a non-problem.

> After looking at pg_lzcompress.c for a bit, it appears to me that the
> LSB of the first byte of compressed data must always be zero, because
> the very first control bit has to say "copy a literal byte"; you can't
> have a back-reference until there's some data in the output buffer. So
> what I suggest is that we keep rawsize the same as it is, but peek at
> the first byte after that to decide what we have: an even value means
> the existing compression method, while an odd value is an ID byte
> selecting some new method.

That would work, too.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
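A hypothetical helper showing the encoding Robert outlines (again illustrative, not actual PostgreSQL code): sign bit set, 7-bit method ID in the top byte, compressed length in the low 24 bits, with zero there meaning the real length follows in the next word.

    #include <stdint.h>

    /* Pack a method ID and compressed length into one negative word.
     * When the length does not fit in 24 bits (>= 16MB), the low bits
     * are left zero and the caller must emit the real length in a
     * following 4-byte word. */
    static uint32_t
    pack_method_word(uint8_t method, uint32_t compressed_len, int *needs_extra_word)
    {
        uint32_t low = compressed_len;

        *needs_extra_word = (compressed_len >= (1u << 24));
        if (*needs_extra_word)
            low = 0;                /* real length goes in the next word */
        return 0x80000000u | ((uint32_t) (method & 0x7f) << 24) | low;
    }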