Re: [HACKERS] MD5 aggregate
On 27 June 2013 17:47, Peter Eisentraut wrote: > On 6/27/13 4:19 AM, Dean Rasheed wrote: >> I'd say there are clearly people who want it, and the nature of some >> of those answers suggests to me that we ought to have a better answer >> in core. > > It's not clear what these people wanted this functionality for. They > all wanted to analyze a table to compare with another table (or the same > table later). Either, they wanted this to detect data changes, in which > case the right tool is a checksum, not a cryptographic hash. We already > have several checksum implementations in core, so we could expose on of > them. Or they wanted this to protect their data from tampering, in > which case the right tool is a cryptographic hash, but Noah argues that > a sum of MD5 hashes is not cryptographically sound. (And in any case, > we don't put cryptographic functionality into the core.) > > The reason md5_agg is proposed here and in all those cited posts is > presumably because the md5() function was already there anyway. The the > md5() function is there because the md5 code was already there anyway > because of the authentication. Let's not add higher-order > already-there-anyway code. ;-) > OK fair enough. It's looking like there are more people who don't want md5_agg() in core, or want something different, than who do want it. Also, if we're taking the view that the existing md5() function is only for hashing passwords, then it's probably not worth trying to optimise it. At this stage it's probably best to mark this as returned with feedback, and I'll consider the options for rewriting it, but not during this commitfest. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 6/27/13 4:19 AM, Dean Rasheed wrote: > I'd say there are clearly people who want it, and the nature of some > of those answers suggests to me that we ought to have a better answer > in core. It's not clear what these people wanted this functionality for. They all wanted to analyze a table to compare with another table (or the same table later). Either, they wanted this to detect data changes, in which case the right tool is a checksum, not a cryptographic hash. We already have several checksum implementations in core, so we could expose on of them. Or they wanted this to protect their data from tampering, in which case the right tool is a cryptographic hash, but Noah argues that a sum of MD5 hashes is not cryptographically sound. (And in any case, we don't put cryptographic functionality into the core.) The reason md5_agg is proposed here and in all those cited posts is presumably because the md5() function was already there anyway. The the md5() function is there because the md5 code was already there anyway because of the authentication. Let's not add higher-order already-there-anyway code. ;-) -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On Thu, Jun 27, 2013 at 7:29 AM, Marko Kreen wrote: > On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed > wrote: >> On 26 June 2013 21:46, Peter Eisentraut wrote: >>> On 6/26/13 4:04 PM, Dean Rasheed wrote: A quick google search reveals several people asking for something like this, and people recommending md5(string_agg(...)) or md5(string_agg(md5(...))) based solutions, which are doomed to failure on larger tables. >>> >>> The thread discussed several other options of checksumming tables that >>> did not have the air of a crytographic offering, as Noah put it. >>> >> >> True but md5 has the advantage of being directly comparable with the >> output of Unix md5sum, which would be useful if you loaded data from >> external files and wanted to confirm that your import process didn't >> mangle it. > > The problem with md5_agg() is that it's only useful in toy scenarios. > > It's more useful give people script that does same sum(hash(row)) > on dump file than try to run MD5 on ordered rows. > > Also, I don't think anybody actually cares about MD5(table-as-bytes), instead > people want way to check if 2 tables or table and dump are same. I think you're trying to tell Dean to write the patch that you want instead of the patch that he wants. There are certainly other things that could be done that some people might sometimes prefer, but that doesn't mean what he did isn't useful. That having been said, I basically agree with Noah: I think this would be a useful extension (perhaps even in contrib?) but I don't think we need to install it by default. It's useful, but it's also narrow. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On Thu, Jun 27, 2013 at 11:28 AM, Dean Rasheed wrote: > On 26 June 2013 21:46, Peter Eisentraut wrote: >> On 6/26/13 4:04 PM, Dean Rasheed wrote: >>> A quick google search reveals several people asking for something like >>> this, and people recommending md5(string_agg(...)) or >>> md5(string_agg(md5(...))) based solutions, which are doomed to failure >>> on larger tables. >> >> The thread discussed several other options of checksumming tables that >> did not have the air of a crytographic offering, as Noah put it. >> > > True but md5 has the advantage of being directly comparable with the > output of Unix md5sum, which would be useful if you loaded data from > external files and wanted to confirm that your import process didn't > mangle it. The problem with md5_agg() is that it's only useful in toy scenarios. It's more useful give people script that does same sum(hash(row)) on dump file than try to run MD5 on ordered rows. Also, I don't think anybody actually cares about MD5(table-as-bytes), instead people want way to check if 2 tables or table and dump are same. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 26 June 2013 21:46, Peter Eisentraut wrote: > On 6/26/13 4:04 PM, Dean Rasheed wrote: >> A quick google search reveals several people asking for something like >> this, and people recommending md5(string_agg(...)) or >> md5(string_agg(md5(...))) based solutions, which are doomed to failure >> on larger tables. > > The thread discussed several other options of checksumming tables that > did not have the air of a crytographic offering, as Noah put it. > True but md5 has the advantage of being directly comparable with the output of Unix md5sum, which would be useful if you loaded data from external files and wanted to confirm that your import process didn't mangle it. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 26 June 2013 22:48, Noah Misch wrote: > On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote: >> On 26 June 2013 19:32, Noah Misch wrote: >> > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > >> > md5_agg() is well-defined and not cryptographically novel, and your use >> > case >> > is credible. However, not every useful-sometimes function belongs in >> > core; we >> > mostly stick to functions with ubiquitous value and functions that would be >> > onerous to implement externally. (We do carry legacy stuff that wouldn't >> > make >> > the cut today.) In my estimation, md5_agg() does not make that cut. The >> > variety of intuitive md5_agg() definitions attested upthread doesn't bode >> > well >> > for its broad applicability. It could just as well live in an extension >> > published on PGXN. Mine is just one opinion, though; I won't be horrified >> > if >> > the community wants an md5_agg() in core. >> >> I disagree with this though. I see md5_agg() as a natural extension of >> the already-in-core md5() functions and underlying code, satisfying a >> well-defined use-case, and providing a checksum comparable with >> externally generated md5 sums. > > All true, but I don't see those facts justifying core inclusion. > >> A quick google search reveals several people asking for something like >> this, and people recommending md5(string_agg(...)) or >> md5(string_agg(md5(...))) based solutions, which are doomed to failure >> on larger tables. So I think that there is a case for having md5_agg() >> in core as an alternative to such hacks, while having more >> sophisticated crypto functions available as extensions. > > My nine Google hits for "md5(string_agg" included one Stack Overflow answer, > several mirrors of that answer, and a few posts on this thread. > I found more than that using Google. It's not uncommon for people to use Google and then pick the first "accepted" answer on Stack Overflow, in which case they might well pick one of the answers here [1] or here [2]. Or they might find this [3]. Or if they came to our lists they might find this [4], or deduce from this [5] that it isn't possible. [1] http://stackoverflow.com/questions/4020033/how-can-i-get-a-hash-of-an-entire-table-in-postgresql [2] http://stackoverflow.com/questions/13554333/finding-out-the-hash-value-of-a-group-of-rows [3] https://ralphholz.de/?q=node/298 [4] http://www.postgresql.org/message-id/4e5f469c.5070...@compulab.co.il [5] http://www.postgresql.org/message-id/e94d85500801301153u6b976e31m89e311c7134a0...@mail.gmail.com I'd say there are clearly people who want it, and the nature of some of those answers suggests to me that we ought to have a better answer in core. >> I haven't looked at pgcrypto because, as I said, performance wasn't my >> primary criterion either, but removing the unnessary data copy just >> seemed like good sense. >> >> I'll take a look at it, and then as you and Peter suggest, look to >> split the patch into two. I think I will withdraw md5_total() because >> I was never entirely happy with that anyway. > > Understood; feel free to back off from any performance improvements in which > you didn't wish to involve yourself. > > If we do end up moving forward with md5_agg(), I note that the pgcrypto > version already has an initialize/accumulate/finalize API division. A patch > importing that code largely-unchanged would be easier to verify than a patch > restructuring the src/backend/libpq/md5.c implementation. The two patches > would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch. > I'll take a look at it, although I think there is still the potential for bugs either way. What I've done with libpq's md5.c is just to rearrange the internal pieces, without touching the core algorithm code or the higher level callers. So the most likely types of bugs are boundary/size errors. If I'd broken any of pg_md5_hash(), pg_md5_binary() or pg_md5_crypt(), then I'd have broken them all. It's possible to get a reasonable level of confidence in those changes using queries like this on HEAD and with the patch: SELECT md5(string_agg(md5(str) || repeat(' ', i), '')) FROM ( SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str FROM generate_series(0,1) i, generate_series(0,i) j GROUP BY i ) t; and these with the patch: SELECT md5_agg(md5(str) || repeat(' ', i)) FROM ( SELECT i, string_agg(chr(32+(i+j*31)%224), '') AS str FROM generate_series(0,1) i, generate_series(0,i) j GROUP BY i ) t; SELECT md5_agg(md5_agg || repeat(' ', i)) FROM ( SELECT i, md5_agg(chr(32+(i+j*31)%224)) FROM generate_series(0,1) i, generate_series(0,i) j GROUP BY i ) t; which all produce the same overall sum. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On Wed, Jun 26, 2013 at 09:04:34PM +0100, Dean Rasheed wrote: > On 26 June 2013 19:32, Noah Misch wrote: > > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > > md5_agg() is well-defined and not cryptographically novel, and your use case > > is credible. However, not every useful-sometimes function belongs in core; > > we > > mostly stick to functions with ubiquitous value and functions that would be > > onerous to implement externally. (We do carry legacy stuff that wouldn't > > make > > the cut today.) In my estimation, md5_agg() does not make that cut. The > > variety of intuitive md5_agg() definitions attested upthread doesn't bode > > well > > for its broad applicability. It could just as well live in an extension > > published on PGXN. Mine is just one opinion, though; I won't be horrified > > if > > the community wants an md5_agg() in core. > > I disagree with this though. I see md5_agg() as a natural extension of > the already-in-core md5() functions and underlying code, satisfying a > well-defined use-case, and providing a checksum comparable with > externally generated md5 sums. All true, but I don't see those facts justifying core inclusion. > A quick google search reveals several people asking for something like > this, and people recommending md5(string_agg(...)) or > md5(string_agg(md5(...))) based solutions, which are doomed to failure > on larger tables. So I think that there is a case for having md5_agg() > in core as an alternative to such hacks, while having more > sophisticated crypto functions available as extensions. My nine Google hits for "md5(string_agg" included one Stack Overflow answer, several mirrors of that answer, and a few posts on this thread. > I haven't looked at pgcrypto because, as I said, performance wasn't my > primary criterion either, but removing the unnessary data copy just > seemed like good sense. > > I'll take a look at it, and then as you and Peter suggest, look to > split the patch into two. I think I will withdraw md5_total() because > I was never entirely happy with that anyway. Understood; feel free to back off from any performance improvements in which you didn't wish to involve yourself. If we do end up moving forward with md5_agg(), I note that the pgcrypto version already has an initialize/accumulate/finalize API division. A patch importing that code largely-unchanged would be easier to verify than a patch restructuring the src/backend/libpq/md5.c implementation. The two patches would then be a "use pgcrypto md5.c in core" patch and an md5_agg() patch. Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 6/26/13 4:04 PM, Dean Rasheed wrote: > A quick google search reveals several people asking for something like > this, and people recommending md5(string_agg(...)) or > md5(string_agg(md5(...))) based solutions, which are doomed to failure > on larger tables. The thread discussed several other options of checksumming tables that did not have the air of a crytographic offering, as Noah put it. > So I think that there is a case for having md5_agg() > in core as an alternative to such hacks, while having more > sophisticated crypto functions available as extensions. Well, in general, I'd rather see the sophisticated stuff in core and the hacks in extensions. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 26 June 2013 19:32, Noah Misch wrote: > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: >> I've been playing around with the idea of an aggregate that computes >> the sum of the md5 hashes of each of its inputs, which I've called >> md5_total() for now, although I'm not particularly wedded to that >> name. Comparing it with md5_agg() on a 100M row table (see attached >> test script) produces interesting results: >> >> SELECT md5_agg(foo.*::text) >> FROM (SELECT * FROM foo ORDER BY id) foo; >> >> 50bc42127fb9b028c9708248f835ed8f >> >> Time: 92960.021 ms >> >> SELECT md5_total(foo.*::text) FROM foo; >> >> 02faea7fafee4d253fc94cfae031afc43c03479c >> >> Time: 96190.343 ms >> >> Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its >> result is longer) but it seems like it would be very useful for >> quickly comparing data in SQL, since its value is not dependent on the >> row-order making it easier to use and better performing if there is no >> usable index for ordering. > > I took a look at this patch. First, as Peter suggested upthread, it should > have been two patches: one to optimize calculateDigestFromBuffer() and > callees, another atop the first adding new SQL functions and adjusting the > infrastructure to meet their needs. > I didn't really set out with the aim of optimising the existing md5() functions at all, but merely to restructure the code in a way that would expose the necessary API for md5_agg(). As the timings up-thread show, the performance gain is really very modest, although the memory saving might be more of a factor in some setups. In fact, for the sorts of queries shown, the vast majority of the time is spent elsewhere, as can be seen by replacing md5() with a more trivial function such as length(). >> + md5_total >> + >> +md5_total(> class="parameter">expression) >> + >> + text or bytea >> + text >> + >> +sum of the MD5 hash values of the input values, independent of their >> +order > > This is not specific enough. We would need to provide either an algorithm > specification or a literature reference. However, that just leads to the > problem that we should not put a literature-unrecognized cryptographic > algorithm into core PostgreSQL. I realize that the use case inspiring this > patch wasn't concerned with cryptographic properties, but the association with > md5 immediately makes it a cryptography offering. > Yes, I can totally understand that point-of-view. I wasn't really intending to write md5_total() at the outset - it was more of an intellectual exercise following some of the previous points raised. So on reflection, I think I can also agree that that probably doesn't belong in core. > md5_agg() is well-defined and not cryptographically novel, and your use case > is credible. However, not every useful-sometimes function belongs in core; we > mostly stick to functions with ubiquitous value and functions that would be > onerous to implement externally. (We do carry legacy stuff that wouldn't make > the cut today.) In my estimation, md5_agg() does not make that cut. The > variety of intuitive md5_agg() definitions attested upthread doesn't bode well > for its broad applicability. It could just as well live in an extension > published on PGXN. Mine is just one opinion, though; I won't be horrified if > the community wants an md5_agg() in core. > I disagree with this though. I see md5_agg() as a natural extension of the already-in-core md5() functions and underlying code, satisfying a well-defined use-case, and providing a checksum comparable with externally generated md5 sums. A quick google search reveals several people asking for something like this, and people recommending md5(string_agg(...)) or md5(string_agg(md5(...))) based solutions, which are doomed to failure on larger tables. So I think that there is a case for having md5_agg() in core as an alternative to such hacks, while having more sophisticated crypto functions available as extensions. >> *** a/src/backend/libpq/md5.c >> --- b/src/backend/libpq/md5.c >> *** >> *** 1,14 >> /* >>* md5.c >>* >> ! * Implements the MD5 Message-Digest Algorithm as specified in >> ! * RFC 1321. This implementation is a simple one, in that it >> ! * needs every input byte to be buffered before doing any >> ! * calculations. I do not expect this file to be used for >> ! * general purpose MD5'ing of large amounts of data, only for >> ! * generating hashed passwords from limited input. > > In other words, performance wasn't a design criterion. I wonder if we would > be wasting our time with a narrow performance change like removing the data > copy. What's your opinion of copying pgcrypto's md5 implementation, which > already avoids the data copy? > I haven't looked at pgcrypto because, as I said, performance wasn't my primary criterion either, but r
Re: [HACKERS] MD5 aggregate
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > I've been playing around with the idea of an aggregate that computes > the sum of the md5 hashes of each of its inputs, which I've called > md5_total() for now, although I'm not particularly wedded to that > name. Comparing it with md5_agg() on a 100M row table (see attached > test script) produces interesting results: > > SELECT md5_agg(foo.*::text) > FROM (SELECT * FROM foo ORDER BY id) foo; > > 50bc42127fb9b028c9708248f835ed8f > > Time: 92960.021 ms > > SELECT md5_total(foo.*::text) FROM foo; > > 02faea7fafee4d253fc94cfae031afc43c03479c > > Time: 96190.343 ms > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > result is longer) but it seems like it would be very useful for > quickly comparing data in SQL, since its value is not dependent on the > row-order making it easier to use and better performing if there is no > usable index for ordering. I took a look at this patch. First, as Peter suggested upthread, it should have been two patches: one to optimize calculateDigestFromBuffer() and callees, another atop the first adding new SQL functions and adjusting the infrastructure to meet their needs. > + md5_total > + > +md5_total( class="parameter">expression) > + > + text or bytea > + text > + > +sum of the MD5 hash values of the input values, independent of their > +order This is not specific enough. We would need to provide either an algorithm specification or a literature reference. However, that just leads to the problem that we should not put a literature-unrecognized cryptographic algorithm into core PostgreSQL. I realize that the use case inspiring this patch wasn't concerned with cryptographic properties, but the association with md5 immediately makes it a cryptography offering. md5_agg() is well-defined and not cryptographically novel, and your use case is credible. However, not every useful-sometimes function belongs in core; we mostly stick to functions with ubiquitous value and functions that would be onerous to implement externally. (We do carry legacy stuff that wouldn't make the cut today.) In my estimation, md5_agg() does not make that cut. The variety of intuitive md5_agg() definitions attested upthread doesn't bode well for its broad applicability. It could just as well live in an extension published on PGXN. Mine is just one opinion, though; I won't be horrified if the community wants an md5_agg() in core. > *** a/src/backend/libpq/md5.c > --- b/src/backend/libpq/md5.c > *** > *** 1,14 > /* >* md5.c >* > ! * Implements the MD5 Message-Digest Algorithm as specified in > ! * RFC 1321. This implementation is a simple one, in that it > ! * needs every input byte to be buffered before doing any > ! * calculations. I do not expect this file to be used for > ! * general purpose MD5'ing of large amounts of data, only for > ! * generating hashed passwords from limited input. In other words, performance wasn't a design criterion. I wonder if we would be wasting our time with a narrow performance change like removing the data copy. What's your opinion of copying pgcrypto's md5 implementation, which already avoids the data copy? Thanks, nm -- Noah Misch EnterpriseDB http://www.enterprisedb.com -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: Review [was Re: [HACKERS] MD5 aggregate]
On 21 June 2013 21:04, David Fetter wrote: > On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote: >> On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: >> > On 15 June 2013 10:22, Dean Rasheed wrote: >> > > There seem to be 2 separate directions that this could go, which >> > > really meet different requirements: >> > > >> > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of >> > > the order in which they are scanned. A possible approach to this might >> > > be something like an aggregate >> > > >> > > md5_total(text/bytea) returns text >> > > >> > > that returns the sum of the md5 values of each input value, treating >> > > each md5 value as an unsigned 128-bit integer, and then producing the >> > > hexadecimal representation of the final sum. This should out-perform a >> > > solution based on numeric addition, and in typical cases, the result >> > > wouldn't be much longer than a regular md5 sum, and so would be easy >> > > to eyeball for differences. >> > > >> > >> > I've been playing around with the idea of an aggregate that computes >> > the sum of the md5 hashes of each of its inputs, which I've called >> > md5_total() for now, although I'm not particularly wedded to that >> > name. Comparing it with md5_agg() on a 100M row table (see attached >> > test script) produces interesting results: >> > >> > SELECT md5_agg(foo.*::text) >> > FROM (SELECT * FROM foo ORDER BY id) foo; >> > >> > 50bc42127fb9b028c9708248f835ed8f >> > >> > Time: 92960.021 ms >> > >> > SELECT md5_total(foo.*::text) FROM foo; >> > >> > 02faea7fafee4d253fc94cfae031afc43c03479c >> > >> > Time: 96190.343 ms >> > >> > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its >> > result is longer) but it seems like it would be very useful for >> > quickly comparing data in SQL, since its value is not dependent on the >> > row-order making it easier to use and better performing if there is no >> > usable index for ordering. >> > >> > Note, however, that if there is an index that can be used for >> > ordering, the performance is not necessarily better than md5_agg(), as >> > this example shows. There is a small additional overhead per row for >> > initialising the MD5 sums, and adding the results to the total, but I >> > think the biggest factor is that md5_total() is processing more data. >> > The reason is that MD5 works on 64-byte blocks, so the total amount of >> > data going through the core MD5 algorithm is each row's size is >> > rounded up to a multiple of 64. In this simple case it ends up >> > processing around 1.5 times as much data: >> > >> > SELECT sum(length(foo.*::text)) AS md5_agg, >> >sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; >> > >> > md5_agg | md5_total >> > +- >> > 8103815438 | 12799909248 >> > >> > although of course that overhead won't be as large on wider tables, >> > and even in this case the overall performance is still on a par with >> > md5_agg(). >> > >> > ISTM that both aggregates are potentially useful in different >> > situations. I would probably typically use md5_total() because of its >> > simplicity/order-independence and consistent performance, but >> > md5_agg() might also be useful when comparing with external data. >> > >> > Regards, >> > Dean > >> Performance review (skills needed: ability to time performance) >> >> Does the patch slow down simple tests? >> >> Yes. For an MD5 checksum of some random data, here are >> results from master: >> >> shackle@postgres:5493=# WITH t1 AS (SELECT >> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM >> generate_series(1,1)), >> postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN >> generate_series(1,1)) >> postgres-# select md5(a::text) FROM t2; >> shackle@postgres:5493=# \timing >> Timing is on. >> shackle@postgres:5493=# \g >> Time: 955.393 ms >> shackle@postgres:5493=# \g >> Time: 950.909 ms >> shackle@postgres:5493=# \g >> Time: 947.955 ms >> shackle@postgres:5493=# \g >> Time: 946.193 ms >> shackle@postgres:5493=# \g >> Time: 950.591 ms >> shackle@postgres:5493=# \g >> Time: 952.346 ms >> shackle@postgres:5493=# \g >> Time: 948.623 ms >> shackle@postgres:5493=# \g >> Time: 939.819 ms >> >> and here from master + the patch: >> >> WITH t1 AS (SELECT >> string_agg(chr(floor(255*random()+1)::int),'') AS a FROM >> generate_series(1,1)), >> t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1)) >> select md5(a::text) FROM t2; >> Time: 1129.178 ms >> shackle@postgres:5494=# \g >> Time: 1134.013 ms >> shackle@postgres:5494=# \g >> Time: 1124.387 ms >>
Re: Review [was Re: [HACKERS] MD5 aggregate]
On Fri, Jun 21, 2013 at 10:48:35AM -0700, David Fetter wrote: > On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > > On 15 June 2013 10:22, Dean Rasheed wrote: > > > There seem to be 2 separate directions that this could go, which > > > really meet different requirements: > > > > > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > > > the order in which they are scanned. A possible approach to this might > > > be something like an aggregate > > > > > > md5_total(text/bytea) returns text > > > > > > that returns the sum of the md5 values of each input value, treating > > > each md5 value as an unsigned 128-bit integer, and then producing the > > > hexadecimal representation of the final sum. This should out-perform a > > > solution based on numeric addition, and in typical cases, the result > > > wouldn't be much longer than a regular md5 sum, and so would be easy > > > to eyeball for differences. > > > > > > > I've been playing around with the idea of an aggregate that computes > > the sum of the md5 hashes of each of its inputs, which I've called > > md5_total() for now, although I'm not particularly wedded to that > > name. Comparing it with md5_agg() on a 100M row table (see attached > > test script) produces interesting results: > > > > SELECT md5_agg(foo.*::text) > > FROM (SELECT * FROM foo ORDER BY id) foo; > > > > 50bc42127fb9b028c9708248f835ed8f > > > > Time: 92960.021 ms > > > > SELECT md5_total(foo.*::text) FROM foo; > > > > 02faea7fafee4d253fc94cfae031afc43c03479c > > > > Time: 96190.343 ms > > > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > > result is longer) but it seems like it would be very useful for > > quickly comparing data in SQL, since its value is not dependent on the > > row-order making it easier to use and better performing if there is no > > usable index for ordering. > > > > Note, however, that if there is an index that can be used for > > ordering, the performance is not necessarily better than md5_agg(), as > > this example shows. There is a small additional overhead per row for > > initialising the MD5 sums, and adding the results to the total, but I > > think the biggest factor is that md5_total() is processing more data. > > The reason is that MD5 works on 64-byte blocks, so the total amount of > > data going through the core MD5 algorithm is each row's size is > > rounded up to a multiple of 64. In this simple case it ends up > > processing around 1.5 times as much data: > > > > SELECT sum(length(foo.*::text)) AS md5_agg, > >sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; > > > > md5_agg | md5_total > > +- > > 8103815438 | 12799909248 > > > > although of course that overhead won't be as large on wider tables, > > and even in this case the overall performance is still on a par with > > md5_agg(). > > > > ISTM that both aggregates are potentially useful in different > > situations. I would probably typically use md5_total() because of its > > simplicity/order-independence and consistent performance, but > > md5_agg() might also be useful when comparing with external data. > > > > Regards, > > Dean > Performance review (skills needed: ability to time performance) > > Does the patch slow down simple tests? > > Yes. For an MD5 checksum of some random data, here are > results from master: > > shackle@postgres:5493=# WITH t1 AS (SELECT > string_agg(chr(floor(255*random()+1)::int),'') AS a FROM > generate_series(1,1)), > postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN > generate_series(1,1)) > postgres-# select md5(a::text) FROM t2; > shackle@postgres:5493=# \timing > Timing is on. > shackle@postgres:5493=# \g > Time: 955.393 ms > shackle@postgres:5493=# \g > Time: 950.909 ms > shackle@postgres:5493=# \g > Time: 947.955 ms > shackle@postgres:5493=# \g > Time: 946.193 ms > shackle@postgres:5493=# \g > Time: 950.591 ms > shackle@postgres:5493=# \g > Time: 952.346 ms > shackle@postgres:5493=# \g > Time: 948.623 ms > shackle@postgres:5493=# \g > Time: 939.819 ms > > and here from master + the patch: > > WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') > AS a FROM generate_series(1,1)), > t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1)) > select md5(a::text) FROM t2; > Time: 1129.178 ms > shackle@postgres:5494=# \g > Time: 1134.013 ms > shackle@postgres:5494=# \g > Time: 1124.387 ms > shackle@postgres:5494=# \g > Time: 1119.733 ms > shackle@postgres:5494=# \g > Time: 1104.906 ms >
Review [was Re: [HACKERS] MD5 aggregate]
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > On 15 June 2013 10:22, Dean Rasheed wrote: > > There seem to be 2 separate directions that this could go, which > > really meet different requirements: > > > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > > the order in which they are scanned. A possible approach to this might > > be something like an aggregate > > > > md5_total(text/bytea) returns text > > > > that returns the sum of the md5 values of each input value, treating > > each md5 value as an unsigned 128-bit integer, and then producing the > > hexadecimal representation of the final sum. This should out-perform a > > solution based on numeric addition, and in typical cases, the result > > wouldn't be much longer than a regular md5 sum, and so would be easy > > to eyeball for differences. > > > > I've been playing around with the idea of an aggregate that computes > the sum of the md5 hashes of each of its inputs, which I've called > md5_total() for now, although I'm not particularly wedded to that > name. Comparing it with md5_agg() on a 100M row table (see attached > test script) produces interesting results: > > SELECT md5_agg(foo.*::text) > FROM (SELECT * FROM foo ORDER BY id) foo; > > 50bc42127fb9b028c9708248f835ed8f > > Time: 92960.021 ms > > SELECT md5_total(foo.*::text) FROM foo; > > 02faea7fafee4d253fc94cfae031afc43c03479c > > Time: 96190.343 ms > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > result is longer) but it seems like it would be very useful for > quickly comparing data in SQL, since its value is not dependent on the > row-order making it easier to use and better performing if there is no > usable index for ordering. > > Note, however, that if there is an index that can be used for > ordering, the performance is not necessarily better than md5_agg(), as > this example shows. There is a small additional overhead per row for > initialising the MD5 sums, and adding the results to the total, but I > think the biggest factor is that md5_total() is processing more data. > The reason is that MD5 works on 64-byte blocks, so the total amount of > data going through the core MD5 algorithm is each row's size is > rounded up to a multiple of 64. In this simple case it ends up > processing around 1.5 times as much data: > > SELECT sum(length(foo.*::text)) AS md5_agg, >sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; > > md5_agg | md5_total > +- > 8103815438 | 12799909248 > > although of course that overhead won't be as large on wider tables, > and even in this case the overall performance is still on a par with > md5_agg(). > > ISTM that both aggregates are potentially useful in different > situations. I would probably typically use md5_total() because of its > simplicity/order-independence and consistent performance, but > md5_agg() might also be useful when comparing with external data. > > Regards, > Dean Submission review: Is the patch in a patch format which has context? (eg: context diff format) Yes. Does it apply cleanly to the current git master? Yes. Does it include reasonable tests, necessary doc patches, etc? Yes. Usability review: Does the patch actually implement that? Yes. Do we want that? Enough do. Do we already have it? No. Does it follow SQL spec, or the community-agreed behavior? No, and yes, respectively. Does it include pg_dump support (if applicable)? N/A Are there dangers? Patch changes the MD5 implementation, which could theoretically result in backward incompatibility if the changes are not 100% backward-compatible. Have all the bases been covered? Yes. Feature test: Does the feature work as advertised? Yes. Are there corner cases the author has failed to consider? Not that I've found so far. Are there any assertion failures or crashes? No. Performance review (skills needed: ability to time performance) Does the patch slow down simple tests? Yes. For an MD5 checksum of some random data, here are results from master: shackle@postgres:5493=# WITH t1 AS (SELECT string_agg(chr(floor(255*random()+1)::int),'') AS a FROM generate_series(1,1)), postgres-# t2 AS (SELECT a FROM t1 CROSS JOIN generate_series(1,1)) postgres-# select md5(a::text) FROM t2; shackle@postgres:5493=# \timing Timing is on. shackle@postgres:5493=# \g Time: 955.393 ms shackle@postgres:5493=# \g Time: 950.909 ms shackle@postgres:5493=# \g Time: 947.955 ms shackle@postgres:5493=# \g Time: 946.193 ms shackle@postgres:5493=# \g Time: 950.591 ms
Re: [HACKERS] MD5 aggregate
On Mon, Jun 17, 2013 at 11:34:52AM +0100, Dean Rasheed wrote: > On 15 June 2013 10:22, Dean Rasheed wrote: > > There seem to be 2 separate directions that this could go, which > > really meet different requirements: > > > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > > the order in which they are scanned. A possible approach to this might > > be something like an aggregate > > > > md5_total(text/bytea) returns text > > > > that returns the sum of the md5 values of each input value, treating > > each md5 value as an unsigned 128-bit integer, and then producing the > > hexadecimal representation of the final sum. This should out-perform a > > solution based on numeric addition, and in typical cases, the result > > wouldn't be much longer than a regular md5 sum, and so would be easy > > to eyeball for differences. > > > > I've been playing around with the idea of an aggregate that computes > the sum of the md5 hashes of each of its inputs, which I've called > md5_total() for now, although I'm not particularly wedded to that > name. Comparing it with md5_agg() on a 100M row table (see attached > test script) produces interesting results: > > SELECT md5_agg(foo.*::text) > FROM (SELECT * FROM foo ORDER BY id) foo; > > 50bc42127fb9b028c9708248f835ed8f > > Time: 92960.021 ms > > SELECT md5_total(foo.*::text) FROM foo; > > 02faea7fafee4d253fc94cfae031afc43c03479c > > Time: 96190.343 ms > > Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its > result is longer) but it seems like it would be very useful for > quickly comparing data in SQL, since its value is not dependent on the > row-order making it easier to use and better performing if there is no > usable index for ordering. > > Note, however, that if there is an index that can be used for > ordering, the performance is not necessarily better than md5_agg(), as > this example shows. There is a small additional overhead per row for > initialising the MD5 sums, and adding the results to the total, but I > think the biggest factor is that md5_total() is processing more data. > The reason is that MD5 works on 64-byte blocks, so the total amount of > data going through the core MD5 algorithm is each row's size is > rounded up to a multiple of 64. In this simple case it ends up > processing around 1.5 times as much data: > > SELECT sum(length(foo.*::text)) AS md5_agg, >sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; > > md5_agg | md5_total > +- > 8103815438 | 12799909248 > > although of course that overhead won't be as large on wider tables, > and even in this case the overall performance is still on a par with > md5_agg(). > > ISTM that both aggregates are potentially useful in different > situations. I would probably typically use md5_total() because of its > simplicity/order-independence and consistent performance, but > md5_agg() might also be useful when comparing with external data. Few notes: - Index-scan over whole table is *very* bad for larger tables (few times bigger than available RAM). If you want to promote such use you should also warn against use on loaded server. - It's pointless to worry about overflow on 128-bit ints. Just let it happen. Adding complexity for that does not bring any advantage. - Using some faster 128-bit hash may be useful - check out CityHash or SpookyHash. You can get C implementation from pghashlib. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 15 June 2013 10:22, Dean Rasheed wrote: > There seem to be 2 separate directions that this could go, which > really meet different requirements: > > 1). Produce an unordered sum for SQL to compare 2 tables regardless of > the order in which they are scanned. A possible approach to this might > be something like an aggregate > > md5_total(text/bytea) returns text > > that returns the sum of the md5 values of each input value, treating > each md5 value as an unsigned 128-bit integer, and then producing the > hexadecimal representation of the final sum. This should out-perform a > solution based on numeric addition, and in typical cases, the result > wouldn't be much longer than a regular md5 sum, and so would be easy > to eyeball for differences. > I've been playing around with the idea of an aggregate that computes the sum of the md5 hashes of each of its inputs, which I've called md5_total() for now, although I'm not particularly wedded to that name. Comparing it with md5_agg() on a 100M row table (see attached test script) produces interesting results: SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; 50bc42127fb9b028c9708248f835ed8f Time: 92960.021 ms SELECT md5_total(foo.*::text) FROM foo; 02faea7fafee4d253fc94cfae031afc43c03479c Time: 96190.343 ms Unlike md5_agg(), it is no longer a true MD5 sum (for one thing, its result is longer) but it seems like it would be very useful for quickly comparing data in SQL, since its value is not dependent on the row-order making it easier to use and better performing if there is no usable index for ordering. Note, however, that if there is an index that can be used for ordering, the performance is not necessarily better than md5_agg(), as this example shows. There is a small additional overhead per row for initialising the MD5 sums, and adding the results to the total, but I think the biggest factor is that md5_total() is processing more data. The reason is that MD5 works on 64-byte blocks, so the total amount of data going through the core MD5 algorithm is each row's size is rounded up to a multiple of 64. In this simple case it ends up processing around 1.5 times as much data: SELECT sum(length(foo.*::text)) AS md5_agg, sum(((length(foo.*::text)+63)/64)*64) AS md5_total FROM foo; md5_agg | md5_total +- 8103815438 | 12799909248 although of course that overhead won't be as large on wider tables, and even in this case the overall performance is still on a par with md5_agg(). ISTM that both aggregates are potentially useful in different situations. I would probably typically use md5_total() because of its simplicity/order-independence and consistent performance, but md5_agg() might also be useful when comparing with external data. Regards, Dean md5_agg_v2.patch Description: Binary data md5-100m-row-test.sql Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 13 June 2013 10:35, Dean Rasheed wrote: > Hi, > > Attached is a patch implementing a new aggregate function md5_agg() to > compute the aggregate MD5 sum across a number of rows. This is > something I've wished for a number of times. I think the primary use > case is to do a quick check that 2 tables, possibly on different > servers, contain the same data, using a query like > > SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; > > or > > SELECT md5_agg(foo.*::text ORDER BY id) FROM foo; There seem to be 2 separate directions that this could go, which really meet different requirements: 1). Produce an unordered sum for SQL to compare 2 tables regardless of the order in which they are scanned. A possible approach to this might be something like an aggregate md5_total(text/bytea) returns text that returns the sum of the md5 values of each input value, treating each md5 value as an unsigned 128-bit integer, and then producing the hexadecimal representation of the final sum. This should out-perform a solution based on numeric addition, and in typical cases, the result wouldn't be much longer than a regular md5 sum, and so would be easy to eyeball for differences. 2). Produce an ordered MD5 sum compatible with COPY, whose result would match that of running unix md5sum on the COPY output. Given all the possible COPY options that would affect the result (format, delimiters, headers, quoting, escaping, ...), I think that such a thing would only reasonably be possible as an extension to the COPY command itself. I guess in its simplest form this would just be a new option "MD5" to COPY that would cause it to pipe its output to the md5 aggregator and then send the final sum to the COPY destination at the end (e.g., "COPY foo TO STDOUT MD5" would produce the ordered MD5 sum of the data in foo). I still think my original md5_agg() has its uses, since what it produces is comparable with external md5 sums, and is directly available to SQL, but (1) is probably the most useful for quickly comparing 2 tables. I'm much less convinced about the value of (2), but on the face of it, it doesn't seem like it would be hard to implement. Thoughts? Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 06/13/2013 05:35 PM, Dean Rasheed wrote: > Hi, > > Attached is a patch implementing a new aggregate function md5_agg() to > compute the aggregate MD5 sum across a number of rows. This is > something I've wished for a number of times. I think the primary use > case is to do a quick check that 2 tables, possibly on different > servers, contain the same data, using a query like > > SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; > > or > > SELECT md5_agg(foo.*::text ORDER BY id) FROM foo; That's a very useful thing to be able to do, but I'm hesitant to make the fact that it uses md5 too prominent in the name if it doesn't produce a result that an external user could reasonably expect from md5'ing the same data. I imagine having an md5_agg(text) and md5(bytea) that was the more efficient, streaming equivalent of: md5(string_agg(the_col,'')) would be rather handy. It'd be less useful for other types (floats, integers, etc) unless we had a way to get the binary representations of those in a well defined form, like int8le(1) . Casting to 'text' would be sufficient for most of the purposes I can imagine, though, and for those that it wouldn't things would quickly get so complicated that you'd want to be using a PL/something function anyway. -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
-BEGIN PGP SIGNED MESSAGE- Hash: SHA1 On 06/14/2013 09:40 PM, Stephen Frost wrote: > Where I'd take this is actually in a completely different direction.. > I'd like the aggregate to be able to match the results of running the > 'md5sum' unix utility on a file that's been COPY'd out. Until I started looking at the follow-up discussion I didn't realise it wasn't supposed to. If it isn't the md5sum of the ordered rows, it shouldn't be called 'md5'. It might still be useful, but please don't call it md5. The proposals to make it produce the same result with different row orderings sound useful in the context of SQL; if it produced a different result with different orderings I'd want a way to force an explicit ORDER BY clause on the aggregate and error out if one wasn't present. Making it ignore order means that it's no longer md5, though. row_sums(...) ? - -- Craig Ringer http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.13 (GNU/Linux) Comment: Using GnuPG with Thunderbird - http://www.enigmail.net/ iQEcBAEBAgAGBQJRu+XMAAoJELBXNkqjr+S2mkkH/j8gi8d07dI6+G742f0U+v0J u8DGhtDQuuWHalqlaUDOssmi4fRDg99OzLlR+Mid0yGL/UfFMoL47H+kNRoMkuzV stUz3vf5rp8TbqEnikT3EwEKIuzaWrae0Fn3TKIYXVSRVvWjGzRSZsvJZsdfcS7T 7lZ9sf6QGekT9bAi6BIFsG7Z1bFLb6Q6AeTsX04++dLBCrjm96CSyisBswY5J2qg zD0WrK6IOsSn9ljlIZRGSTtP+tdM5mOi/DHdeEd+glGx5YKQ9t9yq++oayoqb9mp hPtsBo6UwcMaylPA2vQnhVi0q2bl9FMa+QGpaWBe6YfXLPF4PWhET2OixkRat1w= =ncT/ -END PGP SIGNATURE- -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 14 June 2013 16:09, Hannu Krosing wrote: > What skytools/pgq/londiste uses for comparing tables on master > and slave is query like this > > select sum(hashtext(t.*::text)) from t; > > This is non-modulo sum and does not use md5 but relies on > whatever the hashtext() du jour is :) > > So it is not comparable to anything external (like the md5sum > compatible idea above) but is usually good enough for fast > checks of compatible tables. > > As tables are unordered by definition anyway, this should be > good enough for most SQL. > > The speed comes from both fast(er) hashtext() function and > avoiding the sort. > That sounds like a pretty good approach. We could do that if we had a version of md5() that returned numeric. My impression is that numeric computations are pretty fast compared to the sorting overhead. On the other hand, if there is a usable index, select md5_agg(..) from (sub-query) will do and index scan rather than a sort, making it much faster than using an ORDER BY in the aggregate. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 06/14/2013 04:47 PM, Tom Lane wrote: > Dean Rasheed writes: >> On 14 June 2013 14:14, Tom Lane wrote: >>> Personally I'd be a bit inclined to xor the per-row md5's rather than >>> sum them, but that's a small matter. >> But this would be a much riskier thing to do with a single column, >> because if you updated multiple rows in the same way (e.g., UPDATE t >> SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if >> there were an even number of matches. > I was implicitly thinking that the sum would be a modulo sum so that the > final result is still the size of an md5 signature. If that's true, > then leaking bits via carry out is just as bad as xor's deficiencies. > Now, you could certainly make it a non-modulo sum and not lose any > information to carries, if you're willing to do the arithmetic in > NUMERIC and have a variable-width result. Sounds a bit slow though. What skytools/pgq/londiste uses for comparing tables on master and slave is query like this select sum(hashtext(t.*::text)) from t; This is non-modulo sum and does not use md5 but relies on whatever the hashtext() du jour is :) So it is not comparable to anything external (like the md5sum compatible idea above) but is usually good enough for fast checks of compatible tables. As tables are unordered by definition anyway, this should be good enough for most SQL. The speed comes from both fast(er) hashtext() function and avoiding the sort. -- Hannu Krosing PostgreSQL Consultant Performance, Scalability and High Availability 2ndQuadrant Nordic OÜ -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 2013-06-14 15:49:31 +0100, Dean Rasheed wrote: > On 14 June 2013 15:19, Stephen Frost wrote: > > * Andrew Dunstan (and...@dunslane.net) wrote: > >> I'd rather go the other way, processing the records without having > >> to process them otherwise at all. Turning things into text must slow > >> things down, surely. > > > > That's certainly an interesting idea also.. > > > > md5_agg(record) ? > > Yes, I like it. It's more complex than just memcmp()ing HeapTupleData though. At least if the Datum contains varlena columns there's so many different representations (short, long, compressed, external, external compressed) of the same data that a md5 without normalizing that wouldn't be very interesting. So you would at least need a normalizing version of toast_flatten_tuple() that also deals with short/long varlenas. But even after that, you would still need to deal with Datums that can have different representation (like short numerics, old style hstore, ...). It might be more realistic to use the binary output functions, but I am not sure whether all of those are sufficiently reproduceable. Greetings, Andres Freund -- Andres Freund http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 14 June 2013 15:19, Stephen Frost wrote: > * Andrew Dunstan (and...@dunslane.net) wrote: >> I'd rather go the other way, processing the records without having >> to process them otherwise at all. Turning things into text must slow >> things down, surely. > > That's certainly an interesting idea also.. > md5_agg(record) ? Yes, I like it. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
Dean Rasheed writes: > On 14 June 2013 14:14, Tom Lane wrote: >> Personally I'd be a bit inclined to xor the per-row md5's rather than >> sum them, but that's a small matter. > But this would be a much riskier thing to do with a single column, > because if you updated multiple rows in the same way (e.g., UPDATE t > SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if > there were an even number of matches. I was implicitly thinking that the sum would be a modulo sum so that the final result is still the size of an md5 signature. If that's true, then leaking bits via carry out is just as bad as xor's deficiencies. Now, you could certainly make it a non-modulo sum and not lose any information to carries, if you're willing to do the arithmetic in NUMERIC and have a variable-width result. Sounds a bit slow though. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 14 June 2013 14:14, Tom Lane wrote: > Marko Kreen writes: >> On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed >> wrote: >>> Attached is a patch implementing a new aggregate function md5_agg() to >>> compute the aggregate MD5 sum across a number of rows. > >> It's more efficient to calculate per-row md5, and then sum() them. >> This avoids the need for ORDER BY. > > Good point. The aggregate md5 function also fails to distinguish the > case where we have 'xyzzy' followed by 'xyz' in two adjacent rows > from the case where they contain 'xyz' followed by 'zyxyz'. > Well, if you aggregated foo.*::text as in my original example, then the textual representation of the row would protect you from that. But yes, if you were just doing it with a single text column that might be a risk. > Now, as against that, you lose any sensitivity to the ordering of the > values. > > Personally I'd be a bit inclined to xor the per-row md5's rather than > sum them, but that's a small matter. > But this would be a much riskier thing to do with a single column, because if you updated multiple rows in the same way (e.g., UPDATE t SET x='foo' WHERE x='bar') then xor'ing the md5's would cancel out if there were an even number of matches. Regards, Dean -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
* Andrew Dunstan (and...@dunslane.net) wrote: > I'd rather go the other way, processing the records without having > to process them otherwise at all. Turning things into text must slow > things down, surely. That's certainly an interesting idea also.. Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] MD5 aggregate
On 06/14/2013 09:40 AM, Stephen Frost wrote: * Tom Lane (t...@sss.pgh.pa.us) wrote: Marko Kreen writes: On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed wrote: Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. Where I'd take this is actually in a completely different direction.. I'd like the aggregate to be able to match the results of running the 'md5sum' unix utility on a file that's been COPY'd out. Yes, that means we'd need a way to get back "what would this row look like if it was sent through COPY with these parameters", but I've long wanted that also. No, no clue about how to put all that together. Yes, having this would be better than nothing, so I'm still for adding this even if we can't make it match COPY output. :) I'd rather go the other way, processing the records without having to process them otherwise at all. Turning things into text must slow things down, surely. cheers andrew -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
* Tom Lane (t...@sss.pgh.pa.us) wrote: > Marko Kreen writes: > > On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed > > wrote: > >> Attached is a patch implementing a new aggregate function md5_agg() to > >> compute the aggregate MD5 sum across a number of rows. > > > It's more efficient to calculate per-row md5, and then sum() them. > > This avoids the need for ORDER BY. > > Good point. The aggregate md5 function also fails to distinguish the > case where we have 'xyzzy' followed by 'xyz' in two adjacent rows > from the case where they contain 'xyz' followed by 'zyxyz'. > > Now, as against that, you lose any sensitivity to the ordering of the > values. > > Personally I'd be a bit inclined to xor the per-row md5's rather than > sum them, but that's a small matter. Where I'd take this is actually in a completely different direction.. I'd like the aggregate to be able to match the results of running the 'md5sum' unix utility on a file that's been COPY'd out. Yes, that means we'd need a way to get back "what would this row look like if it was sent through COPY with these parameters", but I've long wanted that also. No, no clue about how to put all that together. Yes, having this would be better than nothing, so I'm still for adding this even if we can't make it match COPY output. :) Thanks, Stephen signature.asc Description: Digital signature
Re: [HACKERS] MD5 aggregate
On Fri, Jun 14, 2013 at 2:14 PM, Tom Lane wrote: > Marko Kreen writes: > > On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed > wrote: > >> Attached is a patch implementing a new aggregate function md5_agg() to > >> compute the aggregate MD5 sum across a number of rows. > > > It's more efficient to calculate per-row md5, and then sum() them. > > This avoids the need for ORDER BY. > > Good point. The aggregate md5 function also fails to distinguish the > case where we have 'xyzzy' followed by 'xyz' in two adjacent rows > from the case where they contain 'xyz' followed by 'zyxyz'. > > Now, as against that, you lose any sensitivity to the ordering of the > values. > > Personally I'd be a bit inclined to xor the per-row md5's rather than > sum them, but that's a small matter. > > regards, tom lane > > xor works but only if each row is different (e.g. at the very least all columns together make a unique key). > > -- > Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) > To make changes to your subscription: > http://www.postgresql.org/mailpref/pgsql-hackers >
Re: [HACKERS] MD5 aggregate
Marko Kreen writes: > On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed > wrote: >> Attached is a patch implementing a new aggregate function md5_agg() to >> compute the aggregate MD5 sum across a number of rows. > It's more efficient to calculate per-row md5, and then sum() them. > This avoids the need for ORDER BY. Good point. The aggregate md5 function also fails to distinguish the case where we have 'xyzzy' followed by 'xyz' in two adjacent rows from the case where they contain 'xyz' followed by 'zyxyz'. Now, as against that, you lose any sensitivity to the ordering of the values. Personally I'd be a bit inclined to xor the per-row md5's rather than sum them, but that's a small matter. regards, tom lane -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On Thu, Jun 13, 2013 at 12:35 PM, Dean Rasheed wrote: > Attached is a patch implementing a new aggregate function md5_agg() to > compute the aggregate MD5 sum across a number of rows. This is > something I've wished for a number of times. I think the primary use > case is to do a quick check that 2 tables, possibly on different > servers, contain the same data, using a query like > > SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; > > or > > SELECT md5_agg(foo.*::text ORDER BY id) FROM foo; > > these would be equivalent to > > SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo; > > but without the excessive memory consumption for the intermediate > concatenated string, and the resulting 1GB table size limit. It's more efficient to calculate per-row md5, and then sum() them. This avoids the need for ORDER BY. -- marko -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
Re: [HACKERS] MD5 aggregate
On 6/13/13 5:35 AM, Dean Rasheed wrote: > Attached is a patch implementing a new aggregate function md5_agg() to > compute the aggregate MD5 sum across a number of rows. That seems somewhat useful. > In passing, I've tidied up and optimised the code in md5.c a bit --- > specifically I've removed the malloc()/memcpy()/free() code that was > unnecessarily making a copy of the entire input data just to pad it > and append the bit count. This reduces the memory consumption of the > existing md5() functions for large inputs, and gives a modest > performance boost. As a result, the md5() function can no longer throw > an out-of-memory error. I think it would be better if you split this into two separate patches. -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers
[HACKERS] MD5 aggregate
Hi, Attached is a patch implementing a new aggregate function md5_agg() to compute the aggregate MD5 sum across a number of rows. This is something I've wished for a number of times. I think the primary use case is to do a quick check that 2 tables, possibly on different servers, contain the same data, using a query like SELECT md5_agg(foo.*::text) FROM (SELECT * FROM foo ORDER BY id) foo; or SELECT md5_agg(foo.*::text ORDER BY id) FROM foo; these would be equivalent to SELECT md5(string_agg(foo.*::text, '' ORDER BY id)) FROM foo; but without the excessive memory consumption for the intermediate concatenated string, and the resulting 1GB table size limit. I've added 2 variants: md5_agg(text) and md5_agg(bytea) to match the 2 variants of md5(), so pure binary data can also be checksummed. In passing, I've tidied up and optimised the code in md5.c a bit --- specifically I've removed the malloc()/memcpy()/free() code that was unnecessarily making a copy of the entire input data just to pad it and append the bit count. This reduces the memory consumption of the existing md5() functions for large inputs, and gives a modest performance boost. As a result, the md5() function can no longer throw an out-of-memory error. Regards, Dean md5_agg.patch Description: Binary data -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers