Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jul 2, 2012, at 11:21 PM, Alexey Tourbin wrote: > On Mon, Jul 2, 2012 at 9:17 AM, Jeffrey Johnson wrote: >>> All RPMv4.4+ packages, that is, but not RPMv4.0. I find this "file >>> coloring" business very annoying, by the way, and it took me some time >>> to realize that "fc" actually stands for "file coloring". :-) >> >> RPMv4.0 was a l-o-n-g time ago. > > It was a long time ago, but it was not that bad. In ALT Linux we still > use rpm-4.0 code base (with many important backports etc.). Since I've > recently broken with ALT Linux, I cannot make more claims, you see… > In all cases: whether rpm-4.0.x had file classification is irrelevant to whether a set:versions "container" can be used for keyword storage. >> I find multilib quite annoying: I was asked for an implementation, >> and did so. 'Twas a job mon: already 7y since leaving @redhat, and file >> colors for multi lib were several years before that. > > I don't like how "multilib" works either. You can no longer identify > packages by their names, and then there are special rules to resolve > file conflicts, which basically say that the license is that you can > swamp files as much as you want, provided that you got the first hand > in that strange "x86-64" relationship. There is than that recent "x32" > stuff where you can run in 64-bit mode using only 32-bit pointers. How > can you address THAT? To me, the world is declining, like the Roman > Empire. ;-) > Dependency graphs based solely on package names are overrated, particularly when there are multiple naming schemes around, and packages are continuously renamed, and so the benefits of consistent naming are already a pleasant lie. With file colors, the test for x32 elf "magic" sets a file color bit and nothing else in rpm needs changing. The existing preference configuration to resolve executable (any non-elf executable file conflict is a different issue) automates the resolution in 2 locations. Might need a tweak to handle secondary chjoices (i.e. when neither of the conflicting files is the preferred arch) … *shrug* >>> I see no reason why rpm(1) should ever consider any preferences. To >>> me, rpm(1) is exactly black-and-white thing, a watchdog which checks >>> logical assertions. Things are either consistent enough to proceed >>> (and e.g. to upgrade the packages), or not - and then you get e.g. >>> non-zero exit code and you are forced to bail out. Higher-level logic >>> of finding an upgrade plan anyways belongs to something like apt(1) or >>> yum(1), although it is executed in some basic librpm terms which we >>> must make efficient enough. I see no reason to discuss closest metrics >>> or largest overlaps - this is as interesting as irrelevant to our >>> basic tasks. >> >> Everyone has an opinion: yours is particularly brittle but >> entirely logical. Now go persuade everyone else that your >> opinion is The One True Opinion and RPM will surely change too. >> There's a similar application with dual/triple/... licensed software and computing per-file, not per-package, license affinity precisely where set:versions (or Bloom filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses unlike file(1) magic keywords will require name space administration. SUrely LSB and LFF are seeking something useful to do for RPM packaging these days, and might be convinced to make some set of license tokens "standard" so that license affinity can be precisely computed in distributed software. >>> >>> You then discuss more applications which are largely irrelevant to our >>> basic tasks. (I realize that I'm revisiting and older discussion, >>> which might not be completely fair because our understanding might >>> have evolved since.) >> >> Um … what are "… our basic tasks."? > > Our basic task, is that when we feed packages to rpm(1), it must > quickly decide whether the upgrade is feasible or not. Of course this > involves hard and sometimes speculative considerations whether things > are going to work. But if things are definitely not going to work, rpm > should upgrade never! Not without a special flag which reads > '--upgrade-as-i-wish'. > We disagree on several points here: 1) upgrade decisions based solely on version information do not handle rebuilds. 2) FULLSTOP "never!" behavior when something breaks makes no forward progress when there's some minor issue. A "best effort" is generally a better approach even in the face of occasionally making matters worse. 3) adding Yet Another Disabler that noone knows how to use solves no meaningful problem. > What is not part of "our basic tasks" is to find an upgrade plan. How > they do that - it's another business, and completely another story. We > only check if the upgrade is consistent or not. > Assuming transitive closure on dependency assertions isn't all that is needed. >> I cannot determine what is "fair" wit
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Mon, Jul 2, 2012 at 9:17 AM, Jeffrey Johnson wrote: >> All RPMv4.4+ packages, that is, but not RPMv4.0. I find this "file >> coloring" business very annoying, by the way, and it took me some time >> to realize that "fc" actually stands for "file coloring". :-) > > RPMv4.0 was a l-o-n-g time ago. It was a long time ago, but it was not that bad. In ALT Linux we still use rpm-4.0 code base (with many important backports etc.). Since I've recently broken with ALT Linux, I cannot make more claims, you see... > I find multilib quite annoying: I was asked for an implementation, > and did so. 'Twas a job mon: already 7y since leaving @redhat, and file > colors for multi lib were several years before that. I don't like how "multilib" works either. You can no longer identify packages by their names, and then there are special rules to resolve file conflicts, which basically say that the license is that you can swamp files as much as you want, provided that you got the first hand in that strange "x86-64" relationship. There is than that recent "x32" stuff where you can run in 64-bit mode using only 32-bit pointers. How can you address THAT? To me, the world is declining, like the Roman Empire. ;-) >> I see no reason why rpm(1) should ever consider any preferences. To >> me, rpm(1) is exactly black-and-white thing, a watchdog which checks >> logical assertions. Things are either consistent enough to proceed >> (and e.g. to upgrade the packages), or not - and then you get e.g. >> non-zero exit code and you are forced to bail out. Higher-level logic >> of finding an upgrade plan anyways belongs to something like apt(1) or >> yum(1), although it is executed in some basic librpm terms which we >> must make efficient enough. I see no reason to discuss closest metrics >> or largest overlaps - this is as interesting as irrelevant to our >> basic tasks. > > Everyone has an opinion: yours is particularly brittle but > entirely logical. Now go persuade everyone else that your > opinion is The One True Opinion and RPM will surely change too. > >>> There's a similar application with dual/triple/... licensed software and >>> computing >>> per-file, not per-package, license affinity precisely where set:versions (or >>> Bloom >>> filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses >>> unlike >>> file(1) magic keywords will require name space administration. SUrely LSB >>> and LFF >>> are seeking something useful to do for RPM packaging these days, and might >>> be convinced to make some set of license tokens "standard" so that license >>> affinity can be precisely computed in distributed software. >> >> You then discuss more applications which are largely irrelevant to our >> basic tasks. (I realize that I'm revisiting and older discussion, >> which might not be completely fair because our understanding might >> have evolved since.) > > Um … what are "… our basic tasks."? Our basic task, is that when we feed packages to rpm(1), it must quickly decide whether the upgrade is feasible or not. Of course this involves hard and sometimes speculative considerations whether things are going to work. But if things are definitely not going to work, rpm should upgrade never! Not without a special flag which reads '--upgrade-as-i-wish'. What is not part of "our basic tasks" is to find an upgrade plan. How they do that - it's another business, and completely another story. We only check if the upgrade is consistent or not. > I cannot determine what is "fair" without knowing what is being compared … > >> Anyway, set-versions are not the "next big thing" with plenty of >> applications. It's rather a very boring stuff which nevertheless >> answers the question "how we can possibly enhance ABI compatibility >> control beyond sonames". The answer is that we must involve into >> set/subset testing - that's the model, that it is very expensive, and >> that the only reasonable and possibly the best way to go is to replace >> symbols with numbers, and to treat sets of numbers as special kind of >> versions. Now why is that? But that's a much better perspective for >> discussion. > > We differ in usage cases. I see a "container", you see a "version". I realize that the "version" is somehow a contrived concept behind really a containter. Like: - Whats' that? - It's a version. - Oh my gosh! But then again, if you try to organize your thinking beyond sets, bytes, and characters, the concept of "version" pops up pretty naturally. If we are going to satisfy some requirement, that's good for "version". > There are many usage cases for subset operations in "package management" > no matter what we think or how the operations are implemented and represented. Not all subset operations are equally important, or complex. As I said, sometimes you should make easy things easy, and simply use which is Rb-tree or something like this. Sometimes things go off, though. Like this: - How many symbols do you want to encode? - 10M. - That'
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jul 1, 2012, at 11:43 PM, Alexey Tourbin wrote: > On Sat, Jun 23, 2012 at 10:29 PM, Jeffrey Johnson wrote: >> In the interest of getting off negative nerdy obscure discussions, let's >> try a positive alternative application for Golob-Rice subset operations. >> >> All RPMv4 packages attach (a lightly filtered) file(1) magic string to >> every file. > > All RPMv4.4+ packages, that is, but not RPMv4.0. I find this "file > coloring" business very annoying, by the way, and it took me some time > to realize that "fc" actually stands for "file coloring". :-) > RPMv4.0 was a l-o-n-g time ago. I find multilib quite annoying: I was asked for an implementation, and did so. 'Twas a job mon: already 7y since leaving @redhat, and file colors for multi lib were several years before that. And the "fc" stands for "file classifier". But feel free to claim whatever you want just like everyone else. >> The file(1) data is mostly usable as a "keyword" namespace exactly as is. >> Yes there are flaws: however magic strings are from file(1) is about >> as good as any other de facto keyword tagging of file content. >> >> keywords are strings just like elf symbols are, and set:versions (or Bloom >> filters) >> are a compact representation from which its rather easy to do subset >> computations. >> >> One extension that would be needed is a "closest" metric in order to >> "prefer" >> the largest subset overlap: with set:versions any contained subset will >> satisfy the >> logical assertions, and there's no easy way to prefer the larger sub-set. > > I see no reason why rpm(1) should ever consider any preferences. To > me, rpm(1) is exactly black-and-white thing, a watchdog which checks > logical assertions. Things are either consistent enough to proceed > (and e.g. to upgrade the packages), or not - and then you get e.g. > non-zero exit code and you are forced to bail out. Higher-level logic > of finding an upgrade plan anyways belongs to something like apt(1) or > yum(1), although it is executed in some basic librpm terms which we > must make efficient enough. I see no reason to discuss closest metrics > or largest overlaps - this is as interesting as irrelevant to our > basic tasks. Everyone has an opinion: yours is particularly brittle but entirely logical. Now go persuade everyone else that your opinion is The One True Opinion and RPM will surely change too. > >> There's a similar application with dual/triple/... licensed software and >> computing >> per-file, not per-package, license affinity precisely where set:versions (or >> Bloom >> filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses >> unlike >> file(1) magic keywords will require name space administration. SUrely LSB >> and LFF >> are seeking something useful to do for RPM packaging these days, and might >> be convinced to make some set of license tokens "standard" so that license >> affinity can be precisely computed in distributed software. > > You then discuss more applications which are largely irrelevant to our > basic tasks. (I realize that I'm revisiting and older discussion, > which might not be completely fair because our understanding might > have evolved since.) > Um … what are "… our basic tasks."? I cannot determine what is "fair" without knowing what is being compared … > Anyway, set-versions are not the "next big thing" with plenty of > applications. It's rather a very boring stuff which nevertheless > answers the question "how we can possibly enhance ABI compatibility > control beyond sonames". The answer is that we must involve into > set/subset testing - that's the model, that it is very expensive, and > that the only reasonable and possibly the best way to go is to replace > symbols with numbers, and to treat sets of numbers as special kind of > versions. Now why is that? But that's a much better perspective for > discussion. We differ in usage cases. I see a "container", you see a "version". There are many usage cases for subset operations in "package management" no matter what we think or how the operations are implemented and represented. 73 de Jeff > __ > RPM Package Managerhttp://rpm5.org > Developer Communication Listrpm-devel@rpm5.org __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 10:29 PM, Jeffrey Johnson wrote: > In the interest of getting off negative nerdy obscure discussions, let's > try a positive alternative application for Golob-Rice subset operations. > > All RPMv4 packages attach (a lightly filtered) file(1) magic string to > every file. All RPMv4.4+ packages, that is, but not RPMv4.0. I find this "file coloring" business very annoying, by the way, and it took me some time to realize that "fc" actually stands for "file coloring". :-) > The file(1) data is mostly usable as a "keyword" namespace exactly as is. > Yes there are flaws: however magic strings are from file(1) is about > as good as any other de facto keyword tagging of file content. > > keywords are strings just like elf symbols are, and set:versions (or Bloom > filters) > are a compact representation from which its rather easy to do subset > computations. > > One extension that would be needed is a "closest" metric in order to > "prefer" > the largest subset overlap: with set:versions any contained subset will > satisfy the > logical assertions, and there's no easy way to prefer the larger sub-set. I see no reason why rpm(1) should ever consider any preferences. To me, rpm(1) is exactly black-and-white thing, a watchdog which checks logical assertions. Things are either consistent enough to proceed (and e.g. to upgrade the packages), or not - and then you get e.g. non-zero exit code and you are forced to bail out. Higher-level logic of finding an upgrade plan anyways belongs to something like apt(1) or yum(1), although it is executed in some basic librpm terms which we must make efficient enough. I see no reason to discuss closest metrics or largest overlaps - this is as interesting as irrelevant to our basic tasks. > There's a similar application with dual/triple/... licensed software and > computing > per-file, not per-package, license affinity precisely where set:versions (or > Bloom > filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses > unlike > file(1) magic keywords will require name space administration. SUrely LSB > and LFF > are seeking something useful to do for RPM packaging these days, and might > be convinced to make some set of license tokens "standard" so that license > affinity can be precisely computed in distributed software. You then discuss more applications which are largely irrelevant to our basic tasks. (I realize that I'm revisiting and older discussion, which might not be completely fair because our understanding might have evolved since.) Anyway, set-versions are not the "next big thing" with plenty of applications. It's rather a very boring stuff which nevertheless answers the question "how we can possibly enhance ABI compatibility control beyond sonames". The answer is that we must involve into set/subset testing - that's the model, that it is very expensive, and that the only reasonable and possibly the best way to go is to replace symbols with numbers, and to treat sets of numbers as special kind of versions. Now why is that? But that's a much better perspective for discussion. __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 10:10 PM, Jeffrey Johnson wrote: >> So I suggest that set-strings must cover all usages of set-subset >> computations, where static data structures are appropriate (so that >> you compute it, write into RPM header, and do not expect to modify >> it). it is still "efficient", despite the fact that we try hard to >> achieve a better string representation. The fact that it can be >> packaged nicely does not make it any worse in other respects, except >> that obviously we can stuff somewhat less bits into alnum characters. >> But I also wanted it to look "nice" (this is why base62 was devised >> over much simpler base64 encoding). But so, it looks nice, and it is >> efficient. Why are you unhappy? Do you necessarily require that you >> have to do bitmask tests yourself? > > I'm not "unhappy". Rather I'm pointing out that -- if your primary > goal is bandwidth reduction in *.rpm package metadata -- that there are > equivalent savings that aren't from "pretty" encodings. Okay, the primary goal was not so much to minimize bandwidth, but just to make things "nice" overall. But again, realize that set-versions are O(N) in size, where N is the number of symbols. For example, here is the Provides of perl-base package in ALT Linux: $ rpm -q --provides perl-base perl = 1:5.14.2 perl-version = 0.88 perl-PerlIO = 1:5.14.2 perl-Storable = 1:5.14.2 perl-Digest-MD5 perl-Time-HiRes perl-MIME-Base64 libperl-5.14.so()(64bit) = set:odiA4ZlKzevK5y39eQsWZckcCoxsZqg4k4MIETgeNn0RxqjzEscOAdkenBZ7a373IGkMzlRZcFtq8vCvMeJMgzVyfiFR61GhMIUZ81a3aDJtilWwCksMTw3vQxUgc4x1QMFnA2h8lmdZpJ4jHU4wlFn4W2wHpFabuLlqZiIyviaTYTPl1PjrcIm4iZdvp5fOXR9PKnYQyfo6w1AOS7V5qO5ND9AGGby7LRkpBYZuOEZhvKfx9ZEGiSTLG9vx8ka0dtcUm5WXZtFUvjG92kZwnUjHVWc9NZapw9ocaec37K7rO7N4rl3mGGxh21z3bE4FdZ6RI6k52BrCvogEqZqeVZLcBZ9tcAbHq6ObepE2vXR9TQODoNkoeyaOQGbrDchblKzS8wBPTGtnv5Ze8K9FAsQN2dsQxAJ2hD2NOIL1ROhhFydp27xpqCYssiowuiYfJtCdofpKW9qohptrbV1pmkjVb4kI6lhKHZb5ae7p4u65NodjOSHmOw81L7fjwZjoK2uwH5rACj7NxXW9kk10uOsM6EJdRY4nrZqYbkKN2G1gQvUtZzqxPMjWgkh3P9LmcluzYzgCfZoexmZxQP4nRLYuemEWzHXkX1Ys5wqbfMZDxBbi7PtkohqPCqcB4f6zurrib4y7aUHs6aqCiVRHGDdR4thBgZFjpXDxIDN4X2QjAOjRQZhNzilkpBXmCgy75JDndDdw1DCMEZkmfgAnLU4rUEDIZ2pUvC5q3S2lMtq2Zj17f2xDL80cyg11BFQ8thh9iR2SIXCB69Jrda3E7ArKzyXYci6KlHo1jfvWP6Q2lre8VQCGQZoH8qDo7vaq2MGlNFgtgJGQgTGVx0CKZAAhlijag1ZKUGmgqFOdprCVtuZc8IoZBsuiY1DKZyS6e1HPHGYrZcOcd4KNcminIr9spwghADXzujggTmVsCutsEh6LLbjD2boZan6ycEnBMoFsbyIoz5ggnxFKIjuJyZvM6voYIs2qurMGOCoeB2csGGtTV6b82ItcZ2kj7mbamrZsQscXQ8EUT7lflT6IamZAZLZx7GTKUaZjwPyUltZ6gHWKIvw5wbNQMZGvOU6F43xdP7M2lJ0R9g3cVd4oiQig2F8K0AUdnwXz7LVPbjJMk8iBytXGntBKoPI6G3ahWHgSvmhGVVdNJeNXZl0AkcxhvhFqRujtasuOxCEz8WJCLZaO642OhFjtOVZliv4b4Y34Uh2vftwCo1vXxEJxtVVek42wG9La5uiLoz9o77UfQz4pu0yvZmTR8rs1X05ykr61bK73IOzGuMFjfKxQX2uUUL7zkl661NxWK8VR04bP7p1GVLn7SWPfXyfm2MbTZG8ytPzT2vDJYjmKTi6hRJ9RFPpZegntTk7uuL7ahsSej3aiBbEWVKYXWhkntIZ3YfDPOzRxosdgSKMNRY1llmBvZk5JlN3ETltMlAI7v7bq24bCvG9iGyCPyoUoKVLYpL8cycfepxYcgm4ioyz8zt3ZkAMQi4iZfs2zbd34VnocYTB6cyEZxR67x0i7wQRP7VrwDilZfniZgmdH5pAsWxXX0auKdd1nSN93XGZECEQP4nKKWCLwjYVjjAB1iRBohqEbRBstuQLCBMEL8OLapxqEYozeGlFMoSbo6eV6lCFuOKbYwvN6mjofKRtjavsU4hYm2ArDYgZ8alF8tZCw1BFvmhI0zL4w2un5m3S3G2tcZpWdVODLSi3IB4pQyR755EB7zUaPCF8LafHzDKHBSKEZohdNEooVN3lY5DnGpz25rW38dU8TXRG2pckDnGb5LVBVYpxK7VsybdZGsy5JP8Alk1Mf5O4zFmGcpZLs172YECGj3f8JzUcpknwF2PWxqRZgnZapssvh3CIOl8fA1bfZiUGMFg1eu62GXYlWiqWaA9g627YjeQkGdz9AhZCVECGOqNEZFX2Jy6O23lHvgJ2ylDSwi2RJGZdl1ckGMVwZ5z3uKJ207KVuQWyeCv7Ne4wxyWj9JgP0Z2PEmyXsan5QBFEZq323vZmlRyhhmPoPDle2zyxxT7AZ3vL42k3MLOzNPtGpZcTgIJfhy2H38fUU0YIFNpi9ygZGg5NBZdu1XL0NWesCbk2PapUsltirZ7qo6Jwbg2UkwsgEaVsclZ2EPrD5brplmPl9DHlrgLcAZCaNIG9HIFSfZErD7U0tJob9JvoHKZogXZgtaH0UkZjrpma2lCxPpj9qdZdh4gli6C9mqCKElC2NjZKh9T0Uw1ddLDYNmvPB971Xyzx9qSNQ99acW1uP36D7qgjyPsf9qO1ZDqA9fZuJ5oEkNIpIoatW0f157Sp3bTZ7o58XhKir4N8PnJIfj3P4NQjtrjW85vAEHU9B2dZACzNfNACZn7TO2AyRBmnPn6XWYKoAEW9UiH8Z12Z5OBMa2ovftCZJ3Zpqoc0GG9kpGjXEZCN0ypGqk4lnYTCL4X8i1B2AabsfEG100r2HQWbjoIrM1LtGBAqSuPAPwIEVnaSqAvQFfCi9xCHS3EKRSZa9EsDZktcKHply2cMb5ldXl06eDg3i1xHM9uPTvm7laFyWorkUMcCZiV5d0tlXzFuiDjA3pOsOjUFmgAzPNJK17LfQSHA4avpaLbZ4lGSPRdQWOOKBxGDgOYMnK2Z11iom0XUBCvybJ2POnO81yDKHCEtVHJE4ZKTMiruLDjAe876HiwstanPxJ0m6ZeYmWgraCbEF7iaQZoZkgkwHS8ONi1OXSBF7crddfeYDwWLZk9w4lRr5lj5b3bZx2DFkDwk1IwczPHksFPHGgM8hSnBhfBXpcRLalykEkN9p0LGi6u5Zzs7q7dLLOPRrRLNtaz83XPT1ytZAbLKy7dKfRzTBQ314Zbe7tf8HZwplZuD7ZJZi2Juyxq8JdDPBRCJYgD8ZoDjoNbfrnxv7oJ60T8OyKLvkuLjFnWnHZd1M3OwsOM4uDRX3hk8OsAzsQSehIcdLYzRjYAloGknyZGmDCQLMlfpzzz18aIP9BMhLr364JPk6sSFK4qDU8EMcHGgYUWRUaJE9HnLTs9TzLxTveQnBch7PRnniDwgh2hUF0sDUbQmR9wRcDHtZD5p4QZBHaJYKWpdRtiL1nSsNEzvsMZ9PI05mLLNmZIfluzk3mopgeSQWtQGoiHzmRlNFnrhy5AxC9hHegoVllHAHc8VZrOZyNnBZD9b3Vaqrybs1uveBSfZG3mohKqO2WbTUjg3FXD1kkQPIWA0Z2Z3d6Q77wriiou1ZldJgnkru5WtLsY5YAvwTMdFzoa82XjjDRQwCXnzyRRuXqLF6TGDcrEZIAlAtgs8qgZr10JIaHcp7ZiG4oZpCP8nBCCWGxkBOrue2foa1QoJ30LXN7tMFc5U3wuIoXZmSIyBApsgPkgk5 perl(AutoLoader.pm) = 5.710 perl(B.pm) = 1.290 perl(Benchmark.pm) = 1.120 perl(Carp.pm) = 1.200 perl(Carp/Heavy.pm) perl(Class/Struct.pm) = 0.630 perl(Config
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 28, 2012, at 1:27 PM, Alexey Tourbin wrote: > > The downside is, of course, that when a dependency R subset P is > broken, it is not easy to find out which P symbols were deleted or > renamed (or which R symbols are missing). But this is largely a > developer's, or should I say a hacker's, problem. > Calculating set union/disjunction (for Bloom filters, surely almost as easy with set:versions) for P/R "sets" union P | R intersectionP & R is natural/rapid/easy even if the symbols are unknown. FWIW: the "arbitrary tags" implemented @rpm5.org have a similar one-way lossy mapping where the string representation of the tag name cannot be unmapped from the 30 bit (truncated SHA1) tag number. The lack of mapping invertibility matters little to "batch oriented" installer usage cases in RPM libraries even if "enquiring minds" wish concise/complete/informative error messages for diagnosis of problems. Error messages (and symbols) are overrated and less useful with moderately sized token dictionaries imho. I hope to get around to generating set:versions this weekend. I was almost there 10 days ago, but got side-tracked by 1 week's worth of cable modem issues which has now led to postponed backup/upgrade chores … almost finished. 73 de Jeff__ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, Jun 28, 2012 at 8:18 PM, Alexey Tourbin wrote: > There is also a philosophical consideration which somehow accompanies > this practical consideration. There is a short story, I believe by > Borges, where a clever scientist devises a 1-1 map of reality. A 1-1 > map of reality turns out to be a very true picture of reality. What's > wrong with this approach? Well, a 1-1 map of the world turns out to be > an exact copy of the world, which is of no use in terms of being "a > map". Somehow, the reality must be "construed" and "reduced" to a > simpler (and somewhat coarser) description to become a useful model. > This is also why we don't plug ELF sections into RPM headers: we > believe that much simpler (or at least much shorter) dependencies must > be used to represent ELF binaries in terms of their > interconnectedness, and must also omit other less important details. This philosophical argument applies to set-versions in a (not-so-)obvious manner, which I will now clarify. It goes like this: although the ultimate goal is to check that R-set of symbols is a subset of P-set of symbols, you do not necessarily have to store the full names of the symbols in order to perform a somewhat stripped-down check itself. When it only matters if R is subset of P, the names themselves become largely irrelevant, provided that you can devise a very clever substitution/encoding scheme. You can make "much simpler (or at least shorter) dependencies" by getting rid of the names in a manner which does not destroy the check. The downside is, of course, that when a dependency R subset P is broken, it is not easy to find out which P symbols were deleted or renamed (or which R symbols are missing). But this is largely a developer's, or should I say a hacker's, problem. On the other hand, from the user point of view, and also from rpm(1) perspective, this approach simply promotes synchronous or rather "transactional" upgrades. It says like, guys, I will not apply half-baked updates before you fix it all - so that apps and libraries match. Which totally makes sense! __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 28, 2012, at 12:18 PM, Alexey Tourbin wrote: > > There is also a philosophical consideration which somehow accompanies > this practical consideration. There is a short story, I believe by > Borges, where a clever scientist devises a 1-1 map of reality. A 1-1 > map of reality turns out to be a very true picture of reality. What's > wrong with this approach? Well, a 1-1 map of the world turns out to be > an exact copy of the world, which is of no use in terms of being "a > map". Somehow, the reality must be "construed" and "reduced" to a > simpler (and somewhat coarser) description to become a useful model. > This is also why we don't plug ELF sections into RPM headers: we > believe that much simpler (or at least much shorter) dependencies must > be used to represent ELF binaries in terms of their > interconnectedness, and must also omit other less important details. > (aside) Nice: I'll dig out the Borges story … meanwhile (iirc) there is a story (and surely some sort of literary response to Borges) called How Much Shall We Bet in "Cosmicomics" that I pass along for your amusement. Seriously yes: carrying redundant metadata _EXACTLY_ as is in the interest of accurate data -- while well intended -- isn't needed. And both Bloom filters and set:versions are useful/compact data stores for existence tests without a need to also carry the _EXACT_ values (if/when the false positive rate can be tolerated, which is often true for the moderate sized sets needed in package management). The rest I agree with. 73 de Jeff__ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 7:10 AM, Jeffrey Johnson wrote: >> Why is it any wrong to minimize bandwidth, or, in other words, why it >> is bad to spend less money? Your answer is like, because the meaning >> of life is not to spend less money, which is a wrong perspective. >> Okay, but what's a better perspective? Spending more money for less >> good is not at all a better perspective. > > Nothing I said implies that bandwidth reduction is "wrong". > > If you want to minimize bandwidth downloading packages compress > the metadata and deal with the "legacy compatible" issues however > you want/need. > > My rule of thumb is that metadata is ~15% of the size of a > typical *.rpm package: assuming 50% compression one might save > 7.5% of the package download cost (0.50 * 0.15 = 7.5%). An interesting practical matter behind set-versions is that we can simply represent the exact set of symbols, which can be encoded and pictured like this: Provides: libfoo.so.0 = set:sym1,sym2,sym3,sym4 Requires: libfoo.so.0 = set:sym1,sym3 (where symN stands for direct ELF symbol name - e.g. strcmp). You simply name the symbols which you require or provide! In fact, this is exactly how early alpha set-versions were implemented - before I had some time to ponder over approximate set/subset encoding problem (it is then how sym1,sym2... sets where converted into numbers, per symbol, and compressed). The point here is that the price of the exact set representation may, or may not, be prohibitive. If the price is not prohibitive, it's a no-brainer: you don't have to involve into approximate subset business at all. Sometimes, you simply should not use bloom filters, despite the fact that they might seem appealing. However, if the price is prohibitive, which it was, the reason for going into approximate subset business is also a no-brainer: you should cut down heavily and optimize for size first. If you simply introduced probabilities without making things less prohibitive, did you do anything useful at all? You only spoiled things a bit! There is also a philosophical consideration which somehow accompanies this practical consideration. There is a short story, I believe by Borges, where a clever scientist devises a 1-1 map of reality. A 1-1 map of reality turns out to be a very true picture of reality. What's wrong with this approach? Well, a 1-1 map of the world turns out to be an exact copy of the world, which is of no use in terms of being "a map". Somehow, the reality must be "construed" and "reduced" to a simpler (and somewhat coarser) description to become a useful model. This is also why we don't plug ELF sections into RPM headers: we believe that much simpler (or at least much shorter) dependencies must be used to represent ELF binaries in terms of their interconnectedness, and must also omit other less important details. Back to the story of set-versions, with the "original" implementation (which introduced full Golomb coding), it was estimated that the size of architecture-dependent pkglist.classic.bz2 metadata is going to go up from about 3M to about 12M, four-fold! This still was almost prohibitive to soar up like this. It was considered non-prohibitive only by the virtue of information-theoretical considerations: since we are going to encode that many symbols, we must not fool ourselves into thinking that we could somehow pay a smaller price - that is, without violating fundamental laws. By the way, what's the information-theoretical minimum? Say, we want to encode 1024 20-bit hash values (which yields the false positive rate at about 0.1%). Well, the first mathematical intuition is that we need to cut 20-bit range into 1024 smaller stripes, which gives 10 bits per stripe, on average. It is a little bit more complicated than that, though, exactly because of this "on average" business: we must also take some bits to encode stripe boundaries. But this is only a mathematical intuition. The exact formula, in R, is: > lchoose(2**20,2**10)/log(2)/2**10 [1] 11.43581 (so, on the other hand, and somewhat unexpectedly, it is that old good "n choose k" business.) Current implementation lines up at about 11.6 bits per symbol. This is why sometimes I say that set-versions currently take about 2 alnum character per Provides symbol - this is because each alnum character can stuff about log2(62)=5.9+ bits. But compare this to "early alpha" set-versions which represented exact sets in form of "set:sym1,sym2,sym3,sym4", that is, in terms of enumerating symbols. With exact sets, can you think of going to anywhere near 2 characters per symbol, especially that you also need a separator? This leaves you 1 character per symbol! :-) There's a saying that Perl makes easy things easy and hard things possible. Set-versions were designed to make hard things possible. __ RPM Package Managerhttp://rpm5.org Developer Communication List
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 23, 2012, at 2:29 PM, Jeffrey Johnson wrote: > > On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote: > >> >> On Jun 23, 2012, at 1:49 PM, Alexey Tourbin wrote: >> > > <…> > >> Perhaps making base61 encoding MANDATORY in rpm would displease >> everyone equally: choosing a prime is as "pretty" as all other >> encoding criteria. >> > > In the interest of getting off negative nerdy obscure discussions, let's > try a positive alternative application for Golob-Rice subset operations. > > All RPMv4 packages attach (a lightly filtered) file(1) magic string to > every file. > > The file(1) data is mostly usable as a "keyword" namespace exactly as is. > Yes there are flaws: however magic strings are from file(1) is about > as good as any other de facto keyword tagging of file content. > > keywords are strings just like elf symbols are, and set:versions (or Bloom > filters) > are a compact representation from which its rather easy to do subset > computations. > > One extension that would be needed is a "closest" metric in order to "prefer" > the largest subset overlap: with set:versions any contained subset will > satisfy the > logical assertions, and there's no easy way to prefer the larger sub-set. > > There's a similar application with dual/triple/... licensed software and > computing > per-file, not per-package, license affinity precisely where set:versions (or > Bloom > filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses > unlike > file(1) magic keywords will require name space administration. SUrely LSB and > LFF > are seeking something useful to do for RPM packaging these days, and might > be convinced to make some set of license tokens "standard" so that license > affinity can be precisely computed in distributed software. > The above suggestions were an attempt to generalize assertions to "feature vectors" (which is the standard technique by which SAT solvers can start to handle "preference"). Meanwhile here's another application of existing set:versions "containers" Query-by-ELF-symbol that's perfectly obvious as well. 1) Generate tables for Provideversion/Requireversion by mentioning in the usual macro that configures Berkeley DB. The table will be automagically generated lazily on 1st access (or do --rebuilddb if you insist on being a Luddite). 2) Generate a set:versions set with the symbols of interest. POPT already has a data type for #define POPT_ARG_BITSET 16U+14U /*!< arg ==> bit set */ which results in a Bloom filter (set:versions can be added to popt just like rpmbf was by swiping code from RPM: just as easy is to add some vectors that RPM fills in on startup so that option processing can generate a set:versions set). 3) Retrieve the set:versions values from the PV/RV table; the retrieval will be blazingly fast because there is no need to do secondary loading of headers, and there is already means to retrieve by a *RE pattern like "^set:.*$" that essentially do a partial retrieve on "set:" 4) Filter the retrieved set:versions keys down to a set of headers to load; do the usual --queryformat display of NEVRA (or whatever other --queryformat is useful). None of the above is very hard to do (with @rpm5.org code), and is sure to be lots faster than the equivalent scripted operations because of Berkeley DB and btree indexing. 73 de Jeff
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 23, 2012, at 2:10 PM, Jeffrey Johnson wrote: > > On Jun 23, 2012, at 1:49 PM, Alexey Tourbin wrote: > <…> > Perhaps making base61 encoding MANDATORY in rpm would displease > everyone equally: choosing a prime is as "pretty" as all other > encoding criteria. > In the interest of getting off negative nerdy obscure discussions, let's try a positive alternative application for Golob-Rice subset operations. All RPMv4 packages attach (a lightly filtered) file(1) magic string to every file. The file(1) data is mostly usable as a "keyword" namespace exactly as is. Yes there are flaws: however magic strings are from file(1) is about as good as any other de facto keyword tagging of file content. keywords are strings just like elf symbols are, and set:versions (or Bloom filters) are a compact representation from which its rather easy to do subset computations. One extension that would be needed is a "closest" metric in order to "prefer" the largest subset overlap: with set:versions any contained subset will satisfy the logical assertions, and there's no easy way to prefer the larger sub-set. There's a similar application with dual/triple/... licensed software and computing per-file, not per-package, license affinity precisely where set:versions (or Bloom filters) will represent keywords (like "LGPLv2" or "BSD") easily. Licenses unlike file(1) magic keywords will require name space administration. SUrely LSB and LFF are seeking something useful to do for RPM packaging these days, and might be convinced to make some set of license tokens "standard" so that license affinity can be precisely computed in distributed software. 73 de Jeff
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 23, 2012, at 1:49 PM, Alexey Tourbin wrote: > On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson wrote: >> There are lots of usage cases for efficient sub-set computations >> in package management, not just as a de facto API/ABI check >> using ELF symbols. Most of the other usage cases for efficient >> sub-set computations are not subject to an ultimately optimal >> string encoding. > > The possibility of optimal string encoding should not be > underestimated. If you can't "write it down with a pencil", which by > the way refers to Alan Turing's style of reasoning, it becomes very > problematic anyway. Your intuition is probably that raw bytes are > cheap because you can bitwise-AND them in terms of direct CPU > instructions, and any "encoding" is expensive because you have to > crunch bits a lot. This is not very true in practice, though. But you > should arm yourself with valgrind(1) and spend a few days with it > before you understand how you waste you CPU powers (and bandwidth, for > that matter). Set-string encoding has reasonable, and affordable, > cost. It can be all put together and presented in a rather less > pessimistic manner. (For example, there is a cache_decode_set routine > that boosts things by a factor of about o 5. This is simply due to the > fact that you do not always have to decode the same Provides version > all over again. This is part of that "fancy-schmancy" stuff which you > must ignore on the first reading.) > > So I suggest that set-strings must cover all usages of set-subset > computations, where static data structures are appropriate (so that > you compute it, write into RPM header, and do not expect to modify > it). it is still "efficient", despite the fact that we try hard to > achieve a better string representation. The fact that it can be > packaged nicely does not make it any worse in other respects, except > that obviously we can stuff somewhat less bits into alnum characters. > But I also wanted it to look "nice" (this is why base62 was devised > over much simpler base64 encoding). But so, it looks nice, and it is > efficient. Why are you unhappy? Do you necessarily require that you > have to do bitmask tests yourself? I'm not "unhappy". Rather I'm pointing out that -- if your primary goal is bandwidth reduction in *.rpm package metadata -- that there are equivalent savings that aren't from "pretty" encodings. The other reason is this: missing values in the {E,V,R} triple (and attaching some reasonable DWIM semantic) lead to huge amounts of pointless discussion. Adding all of twiddle-in-version set:versions dpkg-like is all underway atm. Running digest/encoding strings through "legacy compatible" (i.e. what most rpm tool chains will continue to be using for years) rpmvercmp(3) is an accident waiting to be reported again again again. Its just as easy to attach a tag with binary data as any other approach to "pretty" encodings, all of which involve some aesthetic choice of permitted/excluded ASCII characters, and so the encodings proliferate to meet all possible opinions. Perhaps making base61 encoding MANDATORY in rpm would displease everyone equally: choosing a prime is as "pretty" as all other encoding criteria. 73 de Jeff __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson wrote: > There are lots of usage cases for efficient sub-set computations > in package management, not just as a de facto API/ABI check > using ELF symbols. Most of the other usage cases for efficient > sub-set computations are not subject to an ultimately optimal > string encoding. The possibility of optimal string encoding should not be underestimated. If you can't "write it down with a pencil", which by the way refers to Alan Turing's style of reasoning, it becomes very problematic anyway. Your intuition is probably that raw bytes are cheap because you can bitwise-AND them in terms of direct CPU instructions, and any "encoding" is expensive because you have to crunch bits a lot. This is not very true in practice, though. But you should arm yourself with valgrind(1) and spend a few days with it before you understand how you waste you CPU powers (and bandwidth, for that matter). Set-string encoding has reasonable, and affordable, cost. It can be all put together and presented in a rather less pessimistic manner. (For example, there is a cache_decode_set routine that boosts things by a factor of about o 5. This is simply due to the fact that you do not always have to decode the same Provides version all over again. This is part of that "fancy-schmancy" stuff which you must ignore on the first reading.) So I suggest that set-strings must cover all usages of set-subset computations, where static data structures are appropriate (so that you compute it, write into RPM header, and do not expect to modify it). it is still "efficient", despite the fact that we try hard to achieve a better string representation. The fact that it can be packaged nicely does not make it any worse in other respects, except that obviously we can stuff somewhat less bits into alnum characters. But I also wanted it to look "nice" (this is why base62 was devised over much simpler base64 encoding). But so, it looks nice, and it is efficient. Why are you unhappy? Do you necessarily require that you have to do bitmask tests yourself? __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 22, 2012, at 8:02 PM, Alexey Tourbin wrote: > On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson wrote: >> I would state that compression (of any sort) to minimize >> bandwidth is entirely the wrong problem to solve. > > So what kind of a problem are we trying to solve? Why are you making > all these small puns? Are you ready to go out and pronounce that "I am > the Jefferey Johnson is entirely confident that whichever problem we > face we must address, but to minimize bandwidth is entirely wrong and > shall be addressed never!" :-) > For the record: there is no intentional pun in that quote: I tried to state a misdirection that cannot (and SHOULD not) be solved in the context of "package management", nothing more. And I do feel for PTT's in Kzhakistan extracting every ruble for bandwidth. > Why is it any wrong to minimize bandwidth, or, in other words, why it > is bad to spend less money? Your answer is like, because the meaning > of life is not to spend less money, which is a wrong perspective. > Okay, but what's a better perspective? Spending more money for less > good is not at all a better perspective. > Nothing I said implies that bandwidth reduction is "wrong". If you want to minimize bandwidth downloading packages compress the metadata and deal with the "legacy compatible" issues however you want/need. My rule of thumb is that metadata is ~15% of the size of a typical *.rpm package: assuming 50% compression one might save 7.5% of the package download cost (0.50 * 0.15 = 7.5%). The "real" problem is the perceived need to be up-to-date as of this second, and the duplicated/redundant/necessary monolithic repository metadata download is an expensive way to determine Nothing to upgrade: relax. An incremental download of changed metadata is simple engineering (not widely practiced). > Okay, but what do we actually try to do? Um, we try to bring some > binary compatibility by using the limited amount of information with > some contrived data structures which test set-subset relation. If > space were not a problem, we could simply plug ELF sections into RPM > headers, couldn't we? Why not? And why is there a distinction between > the header and the payload? :-) > Hmm .. we have different POV here: I have the luxury of being able to change anything in RPM that makes sound engineering sense, which perhaps warps my opinions. Does it make engineering sense to include ELF headers in their entirety in *.rpm metadata? No. Does it make sense to attempt set:versions for de facto ABI subsets (if done carefully)? Yes. I *like* set:versions or we would not be having this conversation. >> My contrarian POV should not be taken as opposed to compression >> or elegance or anything else. Just that minimal size in the >> representation of bit fields (or bit positions as numbers) >> overly limits the applicability of set:versions. >> >> There are lots of usage cases for efficient sub-set computations >> in package management, not just as a de facto API/ABI check >> using ELF symbols. Most of the other usage cases for efficient >> sub-set computations are not subject to an ultimately optimal >> string encoding. > > My POV is that I ask "how the best I can do what I need to do, is it > doable, what is the price, can it be reduced, etc". Of course, these > "best to do" and "need to do" terms are not exactly mathematical, and > I already hear some laughs in the audience. Nevertheless, the only > thing which you can oppose to this is simply a better implementation. > Anything else does not count. Why? That's because! Go tell people they > should spend more money. ;-) > again: I *like* set:versions and see many usage cases, not just in dependencies. >> E.g. ripping off the base62 encoding and distributing binary >> assertion fields saves at least as much bandwidth as your >> guesstimate that a Golomb-Rice code is ~20% more efficient >> than a Bloom filter. > > I see no point in criticizing base62 encoding being suboptimal as > compared to raw bytes, because it does just that: squeezes bits into > alnum characters. It is pretty clear and pointless that raw bytes > stuff more bits. The goal was just that - to "dump" bits into > "readable" and "usable" form. > I did not say sub-optimal: I said that the encoding into a string entails a cost that is largely the same as the difference between "numbers" in set:versions and bits in Bloom filters. I can tie binary set:versions into "traditional" EVR tags if I choose to attempt entirely and entirely remove the need for base62 encoding if necessary. So could you. There is no desire to break "legacy compatibility" needlessly however; base63 is as good as its gonna get if one MUST encode into a string. >> Just in case: yes acidic sarcasm was fully intended. > > There is no point in your sarcasm, except that probably that you are > very clever (which I totally agree). > You miss the point of the sarcasm: I don't give a h00t whether
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Sat, Jun 23, 2012 at 1:55 AM, Jeffrey Johnson wrote: > I would state that compression (of any sort) to minimize > bandwidth is entirely the wrong problem to solve. So what kind of a problem are we trying to solve? Why are you making all these small puns? Are you ready to go out and pronounce that "I am the Jefferey Johnson is entirely confident that whichever problem we face we must address, but to minimize bandwidth is entirely wrong and shall be addressed never!" :-) Why is it any wrong to minimize bandwidth, or, in other words, why it is bad to spend less money? Your answer is like, because the meaning of life is not to spend less money, which is a wrong perspective. Okay, but what's a better perspective? Spending more money for less good is not at all a better perspective. Okay, but what do we actually try to do? Um, we try to bring some binary compatibility by using the limited amount of information with some contrived data structures which test set-subset relation. If space were not a problem, we could simply plug ELF sections into RPM headers, couldn't we? Why not? And why is there a distinction between the header and the payload? :-) > My contrarian POV should not be taken as opposed to compression > or elegance or anything else. Just that minimal size in the > representation of bit fields (or bit positions as numbers) > overly limits the applicability of set:versions. > > There are lots of usage cases for efficient sub-set computations > in package management, not just as a de facto API/ABI check > using ELF symbols. Most of the other usage cases for efficient > sub-set computations are not subject to an ultimately optimal > string encoding. My POV is that I ask "how the best I can do what I need to do, is it doable, what is the price, can it be reduced, etc". Of course, these "best to do" and "need to do" terms are not exactly mathematical, and I already hear some laughs in the audience. Nevertheless, the only thing which you can oppose to this is simply a better implementation. Anything else does not count. Why? That's because! Go tell people they should spend more money. ;-) > E.g. ripping off the base62 encoding and distributing binary > assertion fields saves at least as much bandwidth as your > guesstimate that a Golomb-Rice code is ~20% more efficient > than a Bloom filter. I see no point in criticizing base62 encoding being suboptimal as compared to raw bytes, because it does just that: squeezes bits into alnum characters. It is pretty clear and pointless that raw bytes stuff more bits. The goal was just that - to "dump" bits into "readable" and "usable" form. > Just in case: yes acidic sarcasm was fully intended. There is no point in your sarcasm, except that probably that you are very clever (which I totally agree). > You already made a valid point asking what probability means wrto > false positives. In fact your 2**10/2**20 is a very different > estimate of false positive probability than the approx. currently > in use in rpmbf.c (which is based on a statistical model for > Bloom filter parameters). The 2**10/2**20 is the best estimate and you probably cannot further improve it. The whole business of set-versions, which I have been pondering recently, boils down to the questions, can you improve the ratio just a little bit? Or can you pay just a little bit less for a little bit more? Is there a better data structure? The answer is basically turns out to be "NO". The "set of numbers" is good to go, and the Galois fields have some discrepancies which you do not want to know, and the benefit is very marginal. > Below is what is in use by RPM which chooses {m, k} given > an estimated population n and a desired probability of false > positives e (I forget where I found the actual computation, > can likely find it again if/when necessary). > > What I would like to be able to do with set:versions > in RPM is to be able to use either of these 2 "container" > representations interchangeably. > > I'd also like to be able to pre-compute either form > in package tags for whatever purpose without the > dreadful tyranny of being "legacy compatible" with > older versions of RPM. Adding a new tag that isn't > used by older RPM is exactly as compatible as following > existing string representations for NEVR dependencies > in existing tags, arguably more compatible because > older versions of RPM are incapable of interpreting > unknown tags. > > And again: the best solution for the download bandwidth > problem is > Update the metadata less often. > if you think a bit. The best solution for "don't be stupid" is just don't be stupid. Of course, downloading metadata is not the only problem, and not exactly the one which we exactly try to solve. It all combines into a complex system. The question is then, to me, can we do any better? How much does this whole business cost? if it's still expensive, which it is, how can we cut down on the costs? These kind of questions I am willing to
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 22, 2012, at 5:26 PM, Alexey Tourbin wrote: > On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson wrote: >> Sure numbers "make sense". >> >> But God invented 0 and 1 and who needs steenkin carries to do >> arithmetic in Galois fields?!? > > Jeffery, I understand that you are ironic and sarcastic, but I can't > see the reason why, as per our discussion. If you ask honest > questions, like "what this means", I try to do my best to answer the > question, possibly involving considerations like "numbers comes from > God", which are questionable but not irrelevant. They may help to > understand, or may not. > Apologies for the sarcasm. My comments were inline with your observations regarding undergraduate difficulties with Galois fields and matrices imho. But math jokes are bone dry … >>> P: 01010010010101 >>> R: 010001 >> >> Yes … but this is "premature optimization" … > > There is no premature optimization here, and it is fair to ask what we > may possibly want to try to optimize. My answer is: first, it must > take less space (given the probability) because we have to download > the repo metadata every time we run the "update"; second, set-versions > must compare quickly (we must be able to compare them all within a > second). What's your suggestion? Do you want them to take more space > and compare slowly, so as to avoid premature optimizations? :-) Or do > you dislike them exactly because they are not Bloom filters? Go on > then, make shorter strings, given the same probability, which compare > faster! But this is largely impossible, exactly because of these > laughable topics like "what is a number" or "what is a probability" > which I'm trying to present. > I would state that compression (of any sort) to minimize bandwidth is entirely the wrong problem to solve. Even in Kazakistan, one can minimize bandwidth by Upgrading less often! My contrarian POV should not be taken as opposed to compression or elegance or anything else. Just that minimal size in the representation of bit fields (or bit positions as numbers) overly limits the applicability of set:versions. There are lots of usage cases for efficient sub-set computations in package management, not just as a de facto API/ABI check using ELF symbols. Most of the other usage cases for efficient sub-set computations are not subject to an ultimately optimal string encoding. E.g. ripping off the base62 encoding and distributing binary assertion fields saves at least as much bandwidth as your guesstimate that a Golomb-Rice code is ~20% more efficient than a Bloom filter. So don't take my flip comments as being opposed to what you have implemented. >> How big the repo is determines how important the >> distro is and nothing else. Just look at how impo'tent Debian is … > > Your considerations are probably true but are largely irrelevant to > the discussion. > Besides that, I can't understand them. Are you sarcastic? I am > sarcastic too, that's not at all a problem. Shall we talk? :-) > Just in case: yes acidic sarcasm was fully intended. >>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10 >>> set:dbUnz4 >>> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20 >>> set:nl2ZEALdS >>> >>> In the first run, the probability is modest ("print -l" is zsh syntax >>> to print an arg per line). In the second run, the probability is much >>> higher (and it takes more space). There is also a function in scripts >>> to sustain probabilities at about o 0.1%. >> >> Hmmm … I've been defaulting Bloom filters ultra conservatively >> at 10**-4 and mobing to 10**-6 at the first hint of a possible problem. > > The number does not matter here, the only consideration was that > 2^{-10} was kind of cute number, and it works well in practice - that > is, we catch all or most of the bugs where a library symbol has been > deleted. But it could also be 2^{-16} or even 2^{-20}. It is a > trade-off, and I see no ground for criticizing it for just that. Or > again it boils down to the question how many megabytes you expect to > download when you run the "update". You already made a valid point asking what probability means wrto false positives. In fact your 2**10/2**20 is a very different estimate of false positive probability than the approx. currently in use in rpmbf.c (which is based on a statistical model for Bloom filter parameters). Below is what is in use by RPM which chooses {m, k} given an estimated population n and a desired probability of false positives e (I forget where I found the actual computation, can likely find it again if/when necessary). What I would like to be able to do with set:versions in RPM is to be able to use either of these 2 "container" representations interchangeably. I'd also like to be able to pre-compute either form in package tags for whatever purpose without the dreadful tyranny of being "legacy compatible" with older versions of RPM. Adding a new tag that isn't used by older RPM is exactly as compatible as followin
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Fri, Jun 22, 2012 at 5:30 AM, Jeffrey Johnson wrote: > Sure numbers "make sense". > > But God invented 0 and 1 and who needs steenkin carries to do > arithmetic in Galois fields?!? Jeffery, I understand that you are ironic and sarcastic, but I can't see the reason why, as per our discussion. If you ask honest questions, like "what this means", I try to do my best to answer the question, possibly involving considerations like "numbers comes from God", which are questionable but not irrelevant. They may help to understand, or may not. >> P: 01010010010101 >> R: 010001 > > Yes … but this is "premature optimization" … There is no premature optimization here, and it is fair to ask what we may possibly want to try to optimize. My answer is: first, it must take less space (given the probability) because we have to download the repo metadata every time we run the "update"; second, set-versions must compare quickly (we must be able to compare them all within a second). What's your suggestion? Do you want them to take more space and compare slowly, so as to avoid premature optimizations? :-) Or do you dislike them exactly because they are not Bloom filters? Go on then, make shorter strings, given the same probability, which compare faster! But this is largely impossible, exactly because of these laughable topics like "what is a number" or "what is a probability" which I'm trying to present. > How big the repo is determines how important the > distro is and nothing else. Just look at how impo'tent Debian is … Your considerations are probably true but are largely irrelevant to the discussion. Besides that, I can't understand them. Are you sarcastic? I am sarcastic too, that's not at all a problem. Shall we talk? :-) >> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 10 >> set:dbUnz4 >> $ print -l sym1 sym2 |/usr/lib/rpm/mkset 20 >> set:nl2ZEALdS >> >> In the first run, the probability is modest ("print -l" is zsh syntax >> to print an arg per line). In the second run, the probability is much >> higher (and it takes more space). There is also a function in scripts >> to sustain probabilities at about o 0.1%. > > Hmmm … I've been defaulting Bloom filters ultra conservatively > at 10**-4 and mobing to 10**-6 at the first hint of a possible problem. The number does not matter here, the only consideration was that 2^{-10} was kind of cute number, and it works well in practice - that is, we catch all or most of the bugs where a library symbol has been deleted. But it could also be 2^{-16} or even 2^{-20}. It is a trade-off, and I see no ground for criticizing it for just that. Or again it boils down to the question how many megabytes you expect to download when you run the "update". __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Fri, Jun 22, 2012 at 09:39:22PM +0400, Alexey Tourbin wrote: > What is a probability? (That's the second stunning questions after > what is a number.) If numbers were unlimited, they could have > represented symbols exactly. The only reason the numbers "clash" is > because they "sit" within a limited range. What you have to do to > control probabilities is to select the appropriate range, which is a > trade-off between how many clashes are possible and how much bits per > number you are allowed to take. The basic idea is that, to encode a > Provides version, you first need to know how many symbols you are > going to encode, say 1024. You then ask yourself, how many bits per > symbol should I take. If you take e.g. only 10 bits, that's plain > stupid, because 10-bit numbers can address only 1024 values (so you > end up with most of the bits set, or most of the numbers taken, that > is to say). If you take 20 bits per symbol, though, things get more > interesting. 20-bit numbers can address 2^{20} different symbols. > Since the probability is due to range limitations, you get > 2^{10}/2^{20}=0.1% false positive rate, which is basically what you > want. When encoding a Requires version, it's another matter: you > should use the same number of bits as the corresponding Requires > versions have. You mean the corresponding Provides, i.e. the symbols in the "target" library, right? Basically you should never need to downsample when doing the set comparison, as downsampling kills your probablilities. Cheers, Michael. -- Michael Schroeder m...@suse.de SUSE LINUX Products GmbH, GF Jeff Hawn, HRB 16746 AG Nuernberg main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);} __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, Jun 21, 2012 at 7:51 PM, Jeffrey Johnson wrote: > On Jun 18, 2012, at 2:32 PM, Jeffrey Johnson wrote: >> >> The "contained in" or subset semantic that applies to the operations "<" and >> ">=" >> is rather easy to do as well. E.g. if (assuming on;y existence, not >> versioned inequality ranges) >> P == Bloom filter of Provides: tokens >> R == Bloom filter of a (possible) subset of tokens >> then the subset computation is nothing more than >> (P & R) == R That's right. Set-versions do the same thing, only reframed in terms of set of numbers (as opposed to explicit bit masks). The rpmsetcmp routine basically does just that: (P & R) == R check, written in a slightly different manner, with a bunch of fancy stuff. :-) > I should point out the implicit (and tricky) assumption regarding > using Bloom filters to easily compute set and union intersection: > Both P and R MUST have exactly the same parameterization. > Choosing an a priori parametrization is hard because it MUST > deal with the worst possible case of the largest estimated population > which increases the sparseness of all other Bloom filters. There are a few kinds of "parametrization" which we must ponder. First, you should use the same hash function for R and P sets, so that it makes the same number per symbol (or sets the same bits). With different hash functions, there is no chance to compare sets meaningfully. Second, you should also ponder what a "number" actually is (or how high a bit you may want to be able to set). The thing is, the numbers are not unlimited (or otherwise the whole world can be represented with just a big number). It helps to think of a number as a tiny bit of information within a limited range. The range is another parameter. > This also applies to "tuning" to reduce the probability of false positives. > > There are no obvious ways to rescale a Bloom filter meaningfully either. There IS a (not-so-)obvious way to rescale Bloom filters, and numbers, for that matter. Big surprise, big surprise! It boils down to the question what a number is. If your number is just a few bits in the range modulo power of 2, you can "rescale" the number down to a smaller range by simply stripping its higher bits. The equivalent operation can be devised to "downsample" a bloom filter into a lesser precision, provided that some (not-so-)obvious conditions are met. Basically you need to split the filter into two halves and bitwise-OR them. > Do set: versions have the same difficulty (even if the parameters are hidden > as implementation/design constraints somehow)? There is some difficulty that we always need to use the same hash function, which is hardwired, and cannot be easily changed (it is Jenkins on-at-a-time hash with a fancy initial constant). There is no difficulty to compare set-versions per se, though. Because a set-versions is just a set of numbers within a limited range modulo power of two. If ranges are different, you first need to "downsample" the higher-ranged set by stripping its higher bits and sorting the numbers again, but then you proceed normally. Note that the range, or "bpp", has to be encoded explicitly, and is part of a set-version. Again, it helps to think in terms what a number is. A number is a thing within the range, modulo power of two. If you don't know the range, you don't know what the number is. > Or does the Golomb-Rice encoding "scale" more naturally than Bloom filter > parameters? > If the set:versions implementation "scales" better than Bloom filters (which > seems to be the case), > then there are lots of usage cases (like packages in a repository, or files > in a package) where > an efficient means to compute set membership is quite useful. > > (aside) > Apologies for "scales" imprecision. I'm merely trying to ask is > To what extent does set:versions depend on a priori assumptions? > > BTW, what is the current false positive failure probability for set:versions? What is a probability? (That's the second stunning questions after what is a number.) If numbers were unlimited, they could have represented symbols exactly. The only reason the numbers "clash" is because they "sit" within a limited range. What you have to do to control probabilities is to select the appropriate range, which is a trade-off between how many clashes are possible and how much bits per number you are allowed to take. The basic idea is that, to encode a Provides version, you first need to know how many symbols you are going to encode, say 1024. You then ask yourself, how many bits per symbol should I take. If you take e.g. only 10 bits, that's plain stupid, because 10-bit numbers can address only 1024 values (so you end up with most of the bits set, or most of the numbers taken, that is to say). If you take 20 bits per symbol, though, things get more interesting. 20-bit numbers can address 2^{20} different symbols. Since the probability is due to range limitations, you get 2^{10}/2^{20
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 21, 2012, at 8:27 PM, Alexey Tourbin wrote: > On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson wrote: >>> More precisely, a set-version can be (in principle) converted to a >>> Bloom filter which uses only one hash function. The idea is that such >>> a filter will set bits in a highly sparse set of bits, one by one. >>> Instead, a set-version remembers the bits simply by their indices. >>> Setting the bit becomes taking the number, and there is a >>> straightforward correspondence. It also turns out that a set of >>> numbers can be easier to deal with, as opposed to a sparse set of >>> bits. >> >> You've made an unsupported claim here re "numbers can be easier". >> Presumably you are talking about means of encoding, where clever >> mathematics derived from numerical transforms can be used to >> remove redundancy. With a pile of bits in a Bloom filter all one >> has to work with is a pile of bits. > > Numbers are easier to deal with because you can use 'unsigned v[]' > array to represent them, which can be seen throughout the code, > accompanied by 'int c' (which is argc/argv[] style). Bits are somewhat > more complicated: you need to use something like > macros, and you cannot use pointer arithmetic to implement traversal, > etc. > Well different strokes here. The #include is complicated only because of effete Linux kernel debutantes with their precious kernel stack advocacy issues re select(2). These minor defines (swiped directly from mind you) are hugely portable: #define __PBM_NBITS /*@-sizeoftype@*/(8 * sizeof(__pbm_bits))/*@=sizeoftype@*/ #define __PBM_IX(d) ((d) / __PBM_NBITS) #define __PBM_MASK(d) ((__pbm_bits) 1 << (((unsigned)(d)) % __PBM_NBITS)) #define __PBM_BITS(set) ((__pbm_bits *)(set)->bits) #define PBM_FREE(s) _free(s); #define PBM_SET(d, s) (__PBM_BITS (s)[__PBM_IX (d)] |= __PBM_MASK (d)) #define PBM_CLR(d, s) (__PBM_BITS (s)[__PBM_IX (d)] &= ~__PBM_MASK (d)) #define PBM_ISSET(d, s) ((__PBM_BITS (s)[__PBM_IX (d)] & __PBM_MASK (d)) != 0) #define PBM_ALLOC(d)xcalloc(__PBM_IX (d) + 1, __PBM_NBITS/8) See . > Also, to me, a set of numbers just "makes sense", and is good to go. > If you wonder how much complicated things can get, there's been some > recent papers out there which use matrix solving for approximate set > representation. They are basically unreadable (given the undergraduate > skills), and you can be completely lost because there is apparently no > connection between the Galois fields and what you actually try to do. > After you try to read these papers, the "set of numbers" becomes a > wonderful salvation which comes directly from God and the Holy Spirit > and brings peace to your soul. This wonderful salvation, which comes > from God, turns out to be a good thing. You should use it, when you > have some. :-) > Sure numbers "make sense". But God invented 0 and 1 and who needs steenkin carries to do arithmetic in Galois fields?!? >> The counter argument is this: testing bits (in a Bloom filter) is >> rather KISS. Other compression/redundancy removal schemes >> end up trading storage for implementation complexity cost. >> In a running installer like RPM, there will always be a need >> for memory dealing with payload contents. Since large amounts >> of memory are eventually going to be needed/used, savings >> by using Golomb-Reid codes are mostly irrelevant for the >> 100K -> 1M set sizes used by RPM no matter whether bits or >> numbers are used to represent. > > I do not agree that testing bits is more KISS than a set of numbers. > It you think that testing bits is the only and natural thing to do, I > disagree. The reason is: Requires versions are much more sparse than > Provides versions. If the Provides version is optimal, in terms of bit > usage (50% set), the Requires version must be necessarily suboptimal > (too few bits set). It can be pictured like this: > > P: 01010010010101 > R: 010001 > Yes … but this is "premature optimization" … > The conclusion is, if you want to stick to bitmasks and reduce subset > comparison to bitmask tests, there will be a great deal of > inefficiency because of sparse R-sets. On the other hand, sets of > numbers can be pictured like this: > > P: 1,3,7,13,15 > R: 1,13 > > Note that R-set does not take extra space, and there is a simple > merge-like algorithm which advances either R or P (or both - which is > how rpmsetcmp routine is implemented). > … if you are serious abtout minimal size, compressing headers in *.rpm packages will save more bandwidth than any single other optimization one might choose to implement. >> But don't take my argument as being opposed to set:versions whatsoever. >> Just (for RPM and even for *.rpm) that compression size isn't as important >> to me as it is to you as the "primary goal" of set: versions. > > The size of *.rpm is very important when you want to update the > information about a repo, like in "apt-get update" - you have to > download it
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, Jun 21, 2012 at 7:28 PM, Jeffrey Johnson wrote: >> More precisely, a set-version can be (in principle) converted to a >> Bloom filter which uses only one hash function. The idea is that such >> a filter will set bits in a highly sparse set of bits, one by one. >> Instead, a set-version remembers the bits simply by their indices. >> Setting the bit becomes taking the number, and there is a >> straightforward correspondence. It also turns out that a set of >> numbers can be easier to deal with, as opposed to a sparse set of >> bits. > > You've made an unsupported claim here re "numbers can be easier". > Presumably you are talking about means of encoding, where clever > mathematics derived from numerical transforms can be used to > remove redundancy. With a pile of bits in a Bloom filter all one > has to work with is a pile of bits. Numbers are easier to deal with because you can use 'unsigned v[]' array to represent them, which can be seen throughout the code, accompanied by 'int c' (which is argc/argv[] style). Bits are somewhat more complicated: you need to use something like macros, and you cannot use pointer arithmetic to implement traversal, etc. Also, to me, a set of numbers just "makes sense", and is good to go. If you wonder how much complicated things can get, there's been some recent papers out there which use matrix solving for approximate set representation. They are basically unreadable (given the undergraduate skills), and you can be completely lost because there is apparently no connection between the Galois fields and what you actually try to do. After you try to read these papers, the "set of numbers" becomes a wonderful salvation which comes directly from God and the Holy Spirit and brings peace to your soul. This wonderful salvation, which comes from God, turns out to be a good thing. You should use it, when you have some. :-) > The counter argument is this: testing bits (in a Bloom filter) is > rather KISS. Other compression/redundancy removal schemes > end up trading storage for implementation complexity cost. > In a running installer like RPM, there will always be a need > for memory dealing with payload contents. Since large amounts > of memory are eventually going to be needed/used, savings > by using Golomb-Reid codes are mostly irrelevant for the > 100K -> 1M set sizes used by RPM no matter whether bits or > numbers are used to represent. I do not agree that testing bits is more KISS than a set of numbers. It you think that testing bits is the only and natural thing to do, I disagree. The reason is: Requires versions are much more sparse than Provides versions. If the Provides version is optimal, in terms of bit usage (50% set), the Requires version must be necessarily suboptimal (too few bits set). It can be pictured like this: P: 01010010010101 R: 010001 The conclusion is, if you want to stick to bitmasks and reduce subset comparison to bitmask tests, there will be a great deal of inefficiency because of sparse R-sets. On the other hand, sets of numbers can be pictured like this: P: 1,3,7,13,15 R: 1,13 Note that R-set does not take extra space, and there is a simple merge-like algorithm which advances either R or P (or both - which is how rpmsetcmp routine is implemented). > But don't take my argument as being opposed to set:versions whatsoever. > Just (for RPM and even for *.rpm) that compression size isn't as important > to me as it is to you as the "primary goal" of set: versions. The size of *.rpm is very important when you want to update the information about a repo, like in "apt-get update" - you have to download it all. The question is then how big the repo is, and how many set-versions you can afford to download. :-) The "primary goal" was not to make things much worse. In other words, the price must not be prohibitive even for the repo of 10k or 15k packages strong. (Please also realize that the Internet was not a given until very recent times - some shoddy ISPs in Kamchatka still want to charge something like $0.1 per 1Mb.) > AFAICT set:versions also has a finite probability of failure due to > (relatively rare) > hash collisions converting strings to numbers. > > Is there any hope that set:versions (or the Golomb-Rice code) can supply a > parameterization > to "tune" the probability of false positives? > > Or am I mis-understanding what is implemented? You can tune the probability by selecting the appropriate "bpp" paramter, and passing it to "mkset" (which is how scripts work). The "bpp" indicates how much bits must be used per hash value (the range is 10..32). For example, if you want to encode Provides versions which has 1024 symbols, and the error rate has to be fixed at about 0.1%, you need to use bpp=20 - that is, after each symbol is hashed, its 20 lower bits will be taken into account and encoded. The probability then is simply the number of symbols over the capacity of the universe, which is 2^{10}/2^{20}=0.1%. On the other hand,
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 18, 2012, at 2:32 PM, Jeffrey Johnson wrote: > > The "contained in" or subset semantic that applies to the operations "<" and > ">=" > is rather easy to do as well. E.g. if (assuming on;y existence, not versioned > inequality ranges) > P == Bloom filter of Provides: tokens > R == Bloom filter of a (possible) subset of tokens > then the subset computation is nothing more than > (P & R) == R > I should point out the implicit (and tricky) assumption regarding using Bloom filters to easily compute set and union intersection: Both P and R MUST have exactly the same parameterization. Choosing an a priori parametrization is hard because it MUST deal with the worst possible case of the largest estimated population which increases the sparseness of all other Bloom filters. This also applies to "tuning" to reduce the probability of false positives. There are no obvious ways to rescale a Bloom filter meaningfully either. Do set: versions have the same difficulty (even if the parameters are hidden as implementation/design constraints somehow)? Or does the Golomb-Rice encoding "scale" more naturally than Bloom filter parameters? If the set:versions implementation "scales" better than Bloom filters (which seems to be the case), then there are lots of usage cases (like packages in a repository, or files in a package) where an efficient means to compute set membership is quite useful. (aside) Apologies for "scales" imprecision. I'm merely trying to ask is To what extent does set:versions depend on a priori assumptions? BTW, what is the current false positive failure probability for set:versions? 73 de Jeff __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 21, 2012, at 12:51 AM, Alexey Tourbin wrote: > On Thu, Jun 21, 2012 at 12:15 AM, Alexey Tourbin > wrote: >> On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson wrote: >>> Good: the above confirmation of the characteristics allows a set:versions >>> implementation to proceed. >> >> Hello, there's been some speculation about Bloom filters below, which >> I cannot address right now, offhand. Nevertheless, I can say that, in >> some highly mathematical sense, set-versions are exactly equivalent to >> Bloom filters. They do just the same thing, if you will. The only >> difference is that set-versions are more compact: they take somewhat >> less space, which was, if you remember, the number one goal of the >> original implementation. > > More precisely, a set-version can be (in principle) converted to a > Bloom filter which uses only one hash function. The idea is that such > a filter will set bits in a highly sparse set of bits, one by one. > Instead, a set-version remembers the bits simply by their indices. > Setting the bit becomes taking the number, and there is a > straightforward correspondence. It also turns out that a set of > numbers can be easier to deal with, as opposed to a sparse set of > bits. > You've made an unsupported claim here re "numbers can be easier". Presumably you are talking about means of encoding, where clever mathematics derived from numerical transforms can be used to remove redundancy. With a pile of bits in a Bloom filter all one has to work with is a pile of bits. The counter argument is this: testing bits (in a Bloom filter) is rather KISS. Other compression/redundancy removal schemes end up trading storage for implementation complexity cost. In a running installer like RPM, there will always be a need for memory dealing with payload contents. Since large amounts of memory are eventually going to be needed/used, savings by using Golomb-Reid codes are mostly irrelevant for the 100K -> 1M set sizes used by RPM no matter whether bits or numbers are used to represent. But don't take my argument as being opposed to set:versions whatsoever. Just (for RPM and even for *.rpm) that compression size isn't as important to me as it is to you as the "primary goal" of set: versions. > If you want to know more why things have to work like this, and e.g. > where constants pop up, there is a good starting point at "Cache-, > Hash-, and Space-efficient Bloom filters" paper by Felix Putze, Peter > Sanders, and Johannes Singler. Actually this paper helped me a lot to > put things together and to produce the original implementation. > Reading this paper requires some working mathematical knowledge, > though. This requirement must not be underestimated, but also should > not be overestimated. The paper is very readable: it tells you what > you may want to do and what you have to do. > http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf Thanks for the pointer. I actually read and studied this paper a few years back trying to decide whether there was a need to attempt cache-line optimization (but I totally ignored the paragraph that links Golomb codes to Bloom filters). At the time I decided that the complexity cost to attempt cache-line optimization wasn't a priority in RPM, but rather KISS simplicity while users considered changes that involve a finite probability of false positive failures. AFAICT set:versions also has a finite probability of failure due to (relatively rare) hash collisions converting strings to numbers. Is there any hope that set:versions (or the Golomb-Rice code) can supply a parameterization to "tune" the probability of false positives? Or am I mis-understanding what is implemented? 73 de Jeff > __ > RPM Package Managerhttp://rpm5.org > Developer Communication Listrpm-devel@rpm5.org __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, Jun 21, 2012 at 12:15 AM, Alexey Tourbin wrote: > On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson wrote: >> Good: the above confirmation of the characteristics allows a set:versions >> implementation to proceed. > > Hello, there's been some speculation about Bloom filters below, which > I cannot address right now, offhand. Nevertheless, I can say that, in > some highly mathematical sense, set-versions are exactly equivalent to > Bloom filters. They do just the same thing, if you will. The only > difference is that set-versions are more compact: they take somewhat > less space, which was, if you remember, the number one goal of the > original implementation. More precisely, a set-version can be (in principle) converted to a Bloom filter which uses only one hash function. The idea is that such a filter will set bits in a highly sparse set of bits, one by one. Instead, a set-version remembers the bits simply by their indices. Setting the bit becomes taking the number, and there is a straightforward correspondence. It also turns out that a set of numbers can be easier to deal with, as opposed to a sparse set of bits. If you want to know more why things have to work like this, and e.g. where constants pop up, there is a good starting point at "Cache-, Hash-, and Space-efficient Bloom filters" paper by Felix Putze, Peter Sanders, and Johannes Singler. Actually this paper helped me a lot to put things together and to produce the original implementation. Reading this paper requires some working mathematical knowledge, though. This requirement must not be underestimated, but also should not be overestimated. The paper is very readable: it tells you what you may want to do and what you have to do. http://algo2.iti.kit.edu/singler/publications/cacheefficientbloomfilters-wea2007.pdf __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Mon, Jun 18, 2012 at 10:32 PM, Jeffrey Johnson wrote: > Good: the above confirmation of the characteristics allows a set:versions > implementation to proceed. Hello, there's been some speculation about Bloom filters below, which I cannot address right now, offhand. Nevertheless, I can say that, in some highly mathematical sense, set-versions are exactly equivalent to Bloom filters. They do just the same thing, if you will. The only difference is that set-versions are more compact: they take somewhat less space, which was, if you remember, the number one goal of the original implementation. Very informal, if you want to encode 1K symbols out of 1M symbols using a bloom filter, you need at least 1024 * 1.44 * 10 bits of information (which is a factor of 1.44 per symbol) while with a set-version you need at least 1024 * (10 + 1.44) bits of information (which is an additive constant of 1.44 per symbol). It is basically fair to say that set-versions are 20% shorter than the equivalent bloom filters, which is not unimportant. By the way, the information-theoretical minimum is 1024 * 10 bits of information - you cannot go beyond that without violating fundamental laws. __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Jun 15, 2012, at 11:58 PM, Alexey Tourbin wrote: > On Fri, Jun 1, 2012 at 6:07 AM, Jeffrey Johnson wrote: >> I asked 2 very specific questions … the rest is quite important also, >> but I need to understand precisely what properties set:versions have in order >> to implement correctly (and I don't fully understand your reply). >> >> Specifically: >> >>1) Is the set:versions VERSION independent of the order of the >>calls to rpmsetAdd()? (you know the routine as set_add()) > > Completely independent - you can add symbols in any order. The symbols > are then hashed and sorted by their numeric values. The underlying > idea is that a set-version is just a (sorted) set of numbers. You can > add whatever symbol to it, possibly twice, the symbol will be hashed > to a number, in a unique manner, and finally you can get the string > representation of the set of numbers. This involves much fuss under > the hood, but basically, you should think of the set of symbols, which > is just the set of numbers, after each symbol has been hashed > individually. > >>2) Can the set:versions encoding be compared for more than equality? >>What set/arithmetic property is the basis for the comparison? What >>circumstances/constraints are there related to >>… You cannot always compare >>set-versions in terms of "greater or equal" (but when you >> can, it's >>important). > > Set-versions compare as sets. There are Euler diagrams to visaulise > set comparsion, which is an undergraduate matetrial. The idea is that, > real numbers are linear order: you can always tell either V1 V1>=V2. Sets are quite another matter: you cannot always apply for > "tertium non datur" (either lt or ge). Which is to say that sets can > be quite different and do not compare easily. The order can be imposed > on the sets, though, by requiring the "greater" sets to have at least > the same elements they compare against (perhaps I'm starting to retell > the undergraduate material, which is not going to last). To sum up, > there IS a mathematical basis behind "Requires: foo >= set:asdf" > dependencies. > >> I can of course answer my own questions with try-and-see test cases. Good: the above confirmation of the characteristics allows a set:versions implementation to proceed. What I am hearing is that set:versions is not fundamentally different from these transforms Create a Bloom filter (with acceptable false positive rate) Compress the Bloom filter parameters and bits. Encode the compression in base{16,32,62,64} for transport. The "contained in" or subset semantic that applies to the operations "<" and ">=" is rather easy to do as well. E.g. if (assuming on;y existence, not versioned inequality ranges) P == Bloom filter of Provides: tokens R == Bloom filter of a (possible) subset of tokens then the subset computation is nothing more than (P & R) == R The important difference is that Bloom filters have a false positive failure probability which is likely not present in the Golomb-Rice coding in the current implementation. I'm expecting that Bloom+Compression+Encoding is comparable in size to the Golomb-Rice encoding. The only way to tell (that I know of) is to try-and-see, which I will likely also do in order to tell how good the compression is with Golomb-Rice coding. Meanwhile there's lots of usage cases for the set:versions implementation in RPM, not just as a container of Elf symbols that is a de facto ABI sub-set. 73 de Jeff > __ > RPM Package Managerhttp://rpm5.org > Developer Communication Listrpm-devel@rpm5.org __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Fri, Jun 1, 2012 at 6:07 AM, Jeffrey Johnson wrote: > I asked 2 very specific questions … the rest is quite important also, > but I need to understand precisely what properties set:versions have in order > to implement correctly (and I don't fully understand your reply). > > Specifically: > > 1) Is the set:versions VERSION independent of the order of the > calls to rpmsetAdd()? (you know the routine as set_add()) Completely independent - you can add symbols in any order. The symbols are then hashed and sorted by their numeric values. The underlying idea is that a set-version is just a (sorted) set of numbers. You can add whatever symbol to it, possibly twice, the symbol will be hashed to a number, in a unique manner, and finally you can get the string representation of the set of numbers. This involves much fuss under the hood, but basically, you should think of the set of symbols, which is just the set of numbers, after each symbol has been hashed individually. > 2) Can the set:versions encoding be compared for more than equality? > What set/arithmetic property is the basis for the comparison? What > circumstances/constraints are there related to > … You cannot always compare > set-versions in terms of "greater or equal" (but when you can, > it's > important). Set-versions compare as sets. There are Euler diagrams to visaulise set comparsion, which is an undergraduate matetrial. The idea is that, real numbers are linear order: you can always tell either V1=V2. Sets are quite another matter: you cannot always apply for "tertium non datur" (either lt or ge). Which is to say that sets can be quite different and do not compare easily. The order can be imposed on the sets, though, by requiring the "greater" sets to have at least the same elements they compare against (perhaps I'm starting to retell the undergraduate material, which is not going to last). To sum up, there IS a mathematical basis behind "Requires: foo >= set:asdf" dependencies. > I can of course answer my own questions with try-and-see test cases. __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On May 31, 2012, at 7:24 PM, Alexey Tourbin wrote: >>> >> >> I'm trying to understand rpmsetcmp() as a "black box" independent >> of all the gory implementation details of ELF symbols, base62 encoding, >> and RPM dependencies. >> >> I believe that set:versions are much like Bloom filters: >>1) strings can be added to a "set" in any order >>2) the comparison operation implied by >>Requires: foo >= set:…. >>is identical to "contained in" or "is subset of" >> Is that the case? I asked 2 very specific questions … the rest is quite important also, but I need to understand precisely what properties set:versions have in order to implement correctly (and I don't fully understand your reply). Specifically: 1) Is the set:versions VERSION independent of the order of the calls to rpmsetAdd()? (you know the routine as set_add()) 2) Can the set:versions encoding be compared for more than equality? What set/arithmetic property is the basis for the comparison? What circumstances/constraints are there related to … You cannot always compare set-versions in terms of "greater or equal" (but when you can, it's important). I can of course answer my own questions with try-and-see test cases. 73 de Jeff__ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, May 31, 2012 at 9:03 AM, Jeffrey Johnson wrote: >>> The mixed code case is interesting: what happens >>> if a set:version encoding contains the literal string "0:V-R" >> >> I can't understand you question. A version is either a set-version, or >> not a set-version. If a version is a set-version, it has to be >> prefixed by "set:". Regular RPM versions cannot be prefixed by "set:", >> because "set" cannot be decoded as a valid serial number, which has to >> be an integer. There might be some implementation sloppiness out >> there, but in principle, I believe the encoding scheme is sane, and >> makes sense. >> >> Now the question is, what if a set-version cannot be decoded? But that >> can be perplexed by a question, what if a regular rpm version cannot >> be decoded? Or can you decode any junk as a valid RPM version? > > I'm trying to understand rpmsetcmp() as a "black box" independent > of all the gory implementation details of ELF symbols, base62 encoding, > and RPM dependencies. > > I believe that set:versions are much like Bloom filters: > 1) strings can be added to a "set" in any order > 2) the comparison operation implied by > Requires: foo >= set:…. > is identical to "contained in" or "is subset of" > Is that the case? To me, a set-version is just a VERSION. I can't stress that enough. When you need a library, it is a legitimate question to ask which version you need. If you answer that you need at least version 1.0, a conventional version, you must be kidding. Because there is no connection between what you actually need and a god-damn number. So a plausible answer must be "Well, we need at least a version which provides symbols which we need to use". Let's take this approach to the extreme, and say that the VERSION which we need is simply the one which provides at least symbols. To me, this is much better an approach to library versioning, and possibly the only viable approach. How else can you express your expectations about a library? Suppose someone is talking to you at a conference, and says "Mr. Johnson, we are very proud, blah-blah-blah, because blah-blah-blah". What you want to tell them is basically "Go out, folks, it works". Now, with set-versions, things really work. :-) Back to implementation, 1) set-strings should be considered opaque, static, and unmodifiable. Once they are formed, there is no useful way to alter them. They express an idea of a VERSION in a manner which cannot be easily comprehended, but that's not a problem (or otherwise there is perplexing questions of what regular rpm versions must mean, and whether any piece of junk can be decoded as a regular rpm version). 2) They must compare as sets, in terms of elements unique to the first set, common elements, and elements unique to the second set. The implementation does not quite much yet, because it already tries to mimic regular versions. Regular versions are linear order. Set-versions are partial order. You cannot always compare set-versions in terms of "greater or equal" (but when you can, it's important). __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On May 31, 2012, at 5:07 AM, Michael Schroeder wrote: > On Wed, May 30, 2012 at 02:56:20PM -0400, Jeffrey Johnson wrote: >> >> On Apr 23, 2012, at 10:32 AM, Jeff Johnson wrote: >> >>> I should point out that writing the attached >>> message (and sending from the wrong e-mail address) has instantly >>> led to a different -- and perhaps more natural -- syntax like >>> >>> Requires: set(libfoo.so.1) >= whatever >>> >> >> After a month of quietly muddling, I like this namespace like syntax >> better because it doesn't lead to magic tokens like "set:" in the EVR >> string and is more consistent with existing namespaces in RPM dependencies. > > Is that just syntax? I.e., will rpm -q --whatprovides libfoo.so.1 > still find the package? > Depends on what I choose to implement. It's quite easy to use an additional per-namspace table, or to add two keys to the existing table, or to treat "set(…)" as a probe method. Because of the length/encoding in set:version data, --whatprovides is rather awkward and uninformative for humans. The namespace convention that exists in RPM is that in a statement like Dependency: foo(N) = E:V-R that E:V-R applies to the token N, not the wrapped name foo(N). All other behaviors aren't specified. 73 de Jeff__ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Wed, May 30, 2012 at 02:56:20PM -0400, Jeffrey Johnson wrote: > > On Apr 23, 2012, at 10:32 AM, Jeff Johnson wrote: > > > I should point out that writing the attached > > message (and sending from the wrong e-mail address) has instantly > > led to a different -- and perhaps more natural -- syntax like > > > > Requires: set(libfoo.so.1) >= whatever > > > > After a month of quietly muddling, I like this namespace like syntax > better because it doesn't lead to magic tokens like "set:" in the EVR > string and is more consistent with existing namespaces in RPM dependencies. Is that just syntax? I.e., will rpm -q --whatprovides libfoo.so.1 still find the package? Cheers, Michael. -- Michael Schroeder m...@suse.de SUSE LINUX Products GmbH, GF Jeff Hawn, HRB 16746 AG Nuernberg main(_){while(_=~getchar())putchar(~_-1/(~(_|32)/13*2-11)*13);} __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On May 30, 2012, at 11:24 PM, Alexey Tourbin wrote: > On Thu, May 31, 2012 at 6:51 AM, Jeffrey Johnson wrote: >> We are in violent agreement here over a minor issue >> of implementation/representation. > > By the way, actual problems that will arise are rarely what you expect > them to be. In 2010, I was naive and I thought that "char bitv[]" was > a pretty good representation of bit sequence (which can be still seen > in set.c). It then took many days to devise a sophisticated decoding > routine which avoids bitv[] altogether and makes things smooth. So, in > a violent agreement, don't take things for granted. :-) > In 2001, I thought that typedef'd enum's were the best implementation for named bit fields because the names were displayed in gdb backtraces. There is some speshul painfulness not only from the typedef's in both C/C++ but also from the inability to extend the no. of bits beyond whatever and "int" contains. But these are implementation issues which I choose to call "minor" ;-) >> The mixed code case is interesting: what happens >> if a set:version encoding contains the literal string "0:V-R" > > I can't understand you question. A version is either a set-version, or > not a set-version. If a version is a set-version, it has to be > prefixed by "set:". Regular RPM versions cannot be prefixed by "set:", > because "set" cannot be decoded as a valid serial number, which has to > be an integer. There might be some implementation sloppiness out > there, but in principle, I believe the encoding scheme is sane, and > makes sense. > > Now the question is, what if a set-version cannot be decoded? But that > can be perplexed by a question, what if a regular rpm version cannot > be decoded? Or can you decode any junk as a valid RPM version? > I'm trying to understand rpmsetcmp() as a "black box" independent of all the gory implementation details of ELF symbols, base62 encoding, and RPM dependencies. I believe that set:versions are much like Bloom filters: 1) strings can be added to a "set" in any order 2) the comparison operation implied by Requires: foo >= set:…. is identical to "contained in" or "is subset of" Is that the case? 73 de Jeff >> and a match is attempted against a traditional dependency like >>Provides: foo = 0:V-R >> If the literal string in the Provides: is encoded on the >> fly, will setcmp(…) match or not? > __ > RPM Package Managerhttp://rpm5.org > Developer Communication Listrpm-devel@rpm5.org __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, May 31, 2012 at 6:51 AM, Jeffrey Johnson wrote: > We are in violent agreement here over a minor issue > of implementation/representation. By the way, actual problems that will arise are rarely what you expect them to be. In 2010, I was naive and I thought that "char bitv[]" was a pretty good representation of bit sequence (which can be still seen in set.c). It then took many days to devise a sophisticated decoding routine which avoids bitv[] altogether and makes things smooth. So, in a violent agreement, don't take things for granted. :-) > The mixed code case is interesting: what happens > if a set:version encoding contains the literal string "0:V-R" I can't understand you question. A version is either a set-version, or not a set-version. If a version is a set-version, it has to be prefixed by "set:". Regular RPM versions cannot be prefixed by "set:", because "set" cannot be decoded as a valid serial number, which has to be an integer. There might be some implementation sloppiness out there, but in principle, I believe the encoding scheme is sane, and makes sense. Now the question is, what if a set-version cannot be decoded? But that can be perplexed by a question, what if a regular rpm version cannot be decoded? Or can you decode any junk as a valid RPM version? > and a match is attempted against a traditional dependency like > Provides: foo = 0:V-R > If the literal string in the Provides: is encoded on the > fly, will setcmp(…) match or not? __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On May 30, 2012, at 10:44 PM, Alexey Tourbin wrote: > On Thu, May 31, 2012 at 6:21 AM, Jeffrey Johnson wrote: >>> On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson wrote: I should point out that writing the attached message (and sending from the wrong e-mail address) has instantly led to a different -- and perhaps more natural -- syntax like Requires: set(libfoo.so.1) >= whatever >>> >>> Hello, >>> Set-versions are just that - versions. One must arguably think of them >>> in terms of VERSIONS. If you need a library, it is a legitimate >>> question to ask which version you need. If you think you need the >>> version at least 1.0, there's a good question: why the heck you think >>> you need the version at least 1.0 (and whether 2.0 would still fit). >>> With set-versions, things get straightforward: you need at least a >>> version which provides API symbols - that's much >>> better a description of a library than a god-damn arbitrary number. >> >> Yes VERSIONS. >> The issue for RPM is how to represent/attach a different VERSION comparison. > > If those are only VERSIONS, they must apply to the same NAME. Look, > there are two-fold way of dependency resolution in rpm. Set-versions > were designed to fit into the scheme. First, you look up the name in > rpmdb/Providename, and fetch the headers. Second, you decide whether > the versions match. > > The real question is what to do if we are forced to match a > set-version against a non-set/another kind of version. Well, the > answer is that you should return as if the NAMEs were different. And > unless you're in a deeply theoretical mood, I believe the approach is > perfectly valid. If two kinds of versions cannot be matched, pretend > they apply to different names. We are in violent agreement here over a minor issue of implementation/representation. The mixed code case is interesting: what happens if a set:version encoding contains the literal string "0:V-R" and a match is attempted against a traditional dependency like Provides: foo = 0:V-R If the literal string in the Provides: is encoded on the fly, will setcmp(…) match or not? 73 de Jeff __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Thu, May 31, 2012 at 6:21 AM, Jeffrey Johnson wrote: >> On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson wrote: >>> I should point out that writing the attached >>> message (and sending from the wrong e-mail address) has instantly >>> led to a different -- and perhaps more natural -- syntax like >>> >>> Requires: set(libfoo.so.1) >= whatever >> >> Hello, >> Set-versions are just that - versions. One must arguably think of them >> in terms of VERSIONS. If you need a library, it is a legitimate >> question to ask which version you need. If you think you need the >> version at least 1.0, there's a good question: why the heck you think >> you need the version at least 1.0 (and whether 2.0 would still fit). >> With set-versions, things get straightforward: you need at least a >> version which provides API symbols - that's much >> better a description of a library than a god-damn arbitrary number. > > Yes VERSIONS. > The issue for RPM is how to represent/attach a different VERSION comparison. If those are only VERSIONS, they must apply to the same NAME. Look, there are two-fold way of dependency resolution in rpm. Set-versions were designed to fit into the scheme. First, you look up the name in rpmdb/Providename, and fetch the headers. Second, you decide whether the versions match. The real question is what to do if we are forced to match a set-version against a non-set/another kind of version. Well, the answer is that you should return as if the NAMEs were different. And unless you're in a deeply theoretical mood, I believe the approach is perfectly valid. If two kinds of versions cannot be matched, pretend they apply to different names. __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On May 30, 2012, at 9:46 PM, Alexey Tourbin wrote: > On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson wrote: >> I should point out that writing the attached >> message (and sending from the wrong e-mail address) has instantly >> led to a different -- and perhaps more natural -- syntax like >> >> Requires: set(libfoo.so.1) >= whatever > > Hello, > Set-versions are just that - versions. One must arguably think of them > in terms of VERSIONS. If you need a library, it is a legitimate > question to ask which version you need. If you think you need the > version at least 1.0, there's a good question: why the heck you think > you need the version at least 1.0 (and whether 2.0 would still fit). > With set-versions, things get straightforward: you need at least a > version which provides API symbols - that's much > better a description of a library than a god-damn arbitrary number. > Yes VERSIONS. The issue for RPM is how to represent/attach a different VERSION comparison. A prefix marker with set: is mostly sane (until there are multiple meanings for ':' and '~' and '-' separators parsing the serialized VERSION representation which is increasingly the state of affairs @rpm5.org). So its perhaps a more natural syntax and a cleaner representation to use the existing name spacing convention with the form Requires: set(bar) = VERSION to distinguish in a namespace the VERSION that needs a different comparison method attached. But these are mostly details (until one has to write the parsing regex and watch the combinatorial explosion of possibilities with "optional" fields). But I'm biased at the moment, having just today added twiddle-in-version ... > There are some philosophical implications of introducing set-versions, > in particular, whether it can be extended to describe prototypes and > calling conventions (e.g. the number of arguments which must be passed > to a function). This is why I might seem reluctant to participate in > discussions. I'm thinking! (And perhaps I'm arrogant.) > Thinking is good and arrogance matters little (to me ;-): set:versions is the right thing to do, collapsing multiple redundant "just in case" assertions into a single dependency representation. hth 73 de jeff >> for set:versions, and for the generalization (for writing strict regression >> tests, >> its mostly useless in packaging because there is no mapping that specifies >> how the mixed DEB <-> RPM version comparison might be done "naturally") >> >> Requires: deb(foo) >= E:V-R >> Requires: rpm(foo) >= E:V-R >> >> The precedent for foo(bar) name spacing in RPM dependencies with >> the above syntax is already widely deployed although entirely >> de facto. >> >> Sure would be nice _NOT_ to have to consider "Have it your own way!" >> competing syntaxes like >> Requires: libfoo.so.1 >= set:whatever >> and >> Requires: set(libfoo.so.1) >= whatever > __ > RPM Package Managerhttp://rpm5.org > Developer Communication Listrpm-devel@rpm5.org __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Mon, Apr 23, 2012 at 6:32 PM, Jeff Johnson wrote: > I should point out that writing the attached > message (and sending from the wrong e-mail address) has instantly > led to a different -- and perhaps more natural -- syntax like > > Requires: set(libfoo.so.1) >= whatever Hello, Set-versions are just that - versions. One must arguably think of them in terms of VERSIONS. If you need a library, it is a legitimate question to ask which version you need. If you think you need the version at least 1.0, there's a good question: why the heck you think you need the version at least 1.0 (and whether 2.0 would still fit). With set-versions, things get straightforward: you need at least a version which provides API symbols - that's much better a description of a library than a god-damn arbitrary number. There are some philosophical implications of introducing set-versions, in particular, whether it can be extended to describe prototypes and calling conventions (e.g. the number of arguments which must be passed to a function). This is why I might seem reluctant to participate in discussions. I'm thinking! (And perhaps I'm arrogant.) > for set:versions, and for the generalization (for writing strict regression > tests, > its mostly useless in packaging because there is no mapping that specifies > how the mixed DEB <-> RPM version comparison might be done "naturally") > > Requires: deb(foo) >= E:V-R > Requires: rpm(foo) >= E:V-R > > The precedent for foo(bar) name spacing in RPM dependencies with > the above syntax is already widely deployed although entirely > de facto. > > Sure would be nice _NOT_ to have to consider "Have it your own way!" > competing syntaxes like > Requires: libfoo.so.1 >= set:whatever > and > Requires: set(libfoo.so.1) >= whatever __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On May 30, 2012, at 3:09 PM, Robert Xu wrote: > On Wed, May 30, 2012 at 2:56 PM, Jeff Johnson wrote: >> >> OK, I'm in the last stages of adding twiddle-in-version so its time to >> consider other changes: >> >>1) Epoch as a string? The only lossage here is when digit strings >> differ in length >>(and strcmp isn't identical to arithmetic comparison). Alternatively, >> should Epoch's >>be forced to a [0-9]+ digit string? > > I don't see the usefulness of Epoch as a string. I would rather have > it just integers, higher takes precedence. > Much less mind boggling. > And Epoch: as integer requires special casing of an integer data type everywhere. strings are easier for the implementation and for metadata data types. I'd rather have a means to eliminate Epoch: … Epoch: is easily dropped out everywhere by adding this to RPM configuration %evr_tuple_order VR and all epochs will be parsed in dependency strings and ignored. >> >>2) Is there a need for >>twiddle-in-epoch(assuming a string) >>twiddle-in-release >>twiddle-in-distepoch >>There are/were several ways to implement twiddle-in-version; so far >> I've >>chosen to do only twiddle-in-version because that involved little more >>than rewriting some regex patterns. OTOH, handling twiddle-in-version >>as a special case of a more general string comparison (aka rpmvercmp) >>is at least as much of a problem as just permitting the full >> generalization. >> >>3) revisiting dependency ranges with syntax like this >>Provides: foo = [1.2-3|4.5-6) >>which would match only Requires in the range, including 1.2-3 because >> of the '[' >>and excluding 4.5-6 because of the ')' with the usual mathematical >> conventions (in the US). >> >>The problematic character would be choosing a separator (I used '|' >> in the example): almost >>all the usual characters have pre-existing usages, but perhaps >> something could still be twitched together. >> > > Can you explain twiddle-in-* to me? I'm slightly confused. Go read the Debian documentation or ask SuSE where the need for twiddle-in-version is. Perhaps the most complete description of already implemented alternatives @rpm5.org is here: https://bugzilla.yoctoproject.org/show_bug.cgi?id=256 I am forced to "compatibility" (with dpkg in Yocto) and feature parity (released in rpm-4.10.0) on an unnecessary (imho) implementation complexity adding twiddle-in-version @rpm5.org. The original RFE (from SuSE) is here http://www.rpm.org/ticket/56 > (Shows how much I haven't kept up with anything within like the > past... year >.>) > The twiddle-in-version RFE started in 2008, was proposed in the "OpenSUSE RPM Summit" ROADMAP on September 18th, 2009, and has just been released. That SHOULD tell you a lot about the urgency and necessity of twiddle-in-version in RPM. 73 de Jeff
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Wed, May 30, 2012 at 2:56 PM, Jeff Johnson wrote: > > OK, I'm in the last stages of adding twiddle-in-version so its time to > consider other changes: > > 1) Epoch as a string? The only lossage here is when digit strings > differ in length > (and strcmp isn't identical to arithmetic comparison). Alternatively, > should Epoch's > be forced to a [0-9]+ digit string? I don't see the usefulness of Epoch as a string. I would rather have it just integers, higher takes precedence. Much less mind boggling. > > 2) Is there a need for > twiddle-in-epoch(assuming a string) > twiddle-in-release > twiddle-in-distepoch > There are/were several ways to implement twiddle-in-version; so far > I've > chosen to do only twiddle-in-version because that involved little more > than rewriting some regex patterns. OTOH, handling twiddle-in-version > as a special case of a more general string comparison (aka rpmvercmp) > is at least as much of a problem as just permitting the full > generalization. > > 3) revisiting dependency ranges with syntax like this > Provides: foo = [1.2-3|4.5-6) > which would match only Requires in the range, including 1.2-3 because > of the '[' > and excluding 4.5-6 because of the ')' with the usual mathematical > conventions (in the US). > > The problematic character would be choosing a separator (I used '|' in > the example): almost > all the usual characters have pre-existing usages, but perhaps > something could still be twitched together. > Can you explain twiddle-in-* to me? I'm slightly confused. (Shows how much I haven't kept up with anything within like the past... year >.>) -- later daze. :: Robert Xu :: protocol.by/rxu __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Re: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
On Apr 23, 2012, at 10:32 AM, Jeff Johnson wrote: > I should point out that writing the attached > message (and sending from the wrong e-mail address) has instantly > led to a different -- and perhaps more natural -- syntax like > > Requires: set(libfoo.so.1) >= whatever > After a month of quietly muddling, I like this namespace like syntax better because it doesn't lead to magic tokens like "set:" in the EVR string and is more consistent with existing namespaces in RPM dependencies. I'll leave the existing "set:" prefix as is until I implement the set(…) syntax. > for set:versions, and for the generalization (for writing strict regression > tests, > its mostly useless in packaging because there is no mapping that specifies > how the mixed DEB <-> RPM version comparison might be done "naturally") > > Requires: deb(foo) >= E:V-R > Requires: rpm(foo) >= E:V-R > (aside) And this is a rather easy/natural extension to choose between alternative version comparisons (at least for development/testing). OTOH, its mostly useless for dependency assertion evaluation because there is no clearly identified "mixed" comparison when Provides: deb(foo) >= E:V-R Requires: rpm(foo) >= E:V-R I suppose I can attempt a resolution like this Prefer an explicit comparison in the Provides: if there is a conflict. Prefer an explicit comparison over an implicit/default comparison Use the default comparison if no explicit comparison hint is present. > The precedent for foo(bar) name spacing in RPM dependencies with > the above syntax is already widely deployed although entirely > de facto. > > Sure would be nice _NOT_ to have to consider "Have it your own way!" > competing syntaxes like > Requires: libfoo.so.1 >= set:whatever > and > Requires: set(libfoo.so.1) >= whatever > (and I'll ignore the gawdafulness of existing multilib dependencies like > Requires: libfoo.ao.1()(64bit) > coming along rapidly as a Newer! Better! Bestest! ABI standard now > that there is support in kernel >= 3.2. > > *shrug* > > Its all implementable, just not very KISS when aesthetic/innate opinions get > involved. > OK, I'm in the last stages of adding twiddle-in-version so its time to consider other changes: 1) Epoch as a string? The only lossage here is when digit strings differ in length (and strcmp isn't identical to arithmetic comparison). Alternatively, should Epoch's be forced to a [0-9]+ digit string? 2) Is there a need for twiddle-in-epoch(assuming a string) twiddle-in-release twiddle-in-distepoch There are/were several ways to implement twiddle-in-version; so far I've chosen to do only twiddle-in-version because that involved little more than rewriting some regex patterns. OTOH, handling twiddle-in-version as a special case of a more general string comparison (aka rpmvercmp) is at least as much of a problem as just permitting the full generalization. 3) revisiting dependency ranges with syntax like this Provides: foo = [1.2-3|4.5-6) which would match only Requires in the range, including 1.2-3 because of the '[' and excluding 4.5-6 because of the ')' with the usual mathematical conventions (in the US). The problematic character would be choosing a separator (I used '|' in the example): almost all the usual characters have pre-existing usages, but perhaps something could still be twitched together. 73 de Jeff __ RPM Package Managerhttp://rpm5.org Developer Communication Listrpm-devel@rpm5.org
Fwd: EVR issues: set:versions, epoch-as-string, now twiddle-in-version
I should point out that writing the attached message (and sending from the wrong e-mail address) has instantly led to a different -- and perhaps more natural -- syntax like Requires: set(libfoo.so.1) >= whatever for set:versions, and for the generalization (for writing strict regression tests, its mostly useless in packaging because there is no mapping that specifies how the mixed DEB <-> RPM version comparison might be done "naturally") Requires: deb(foo) >= E:V-R Requires: rpm(foo) >= E:V-R The precedent for foo(bar) name spacing in RPM dependencies with the above syntax is already widely deployed although entirely de facto. Sure would be nice _NOT_ to have to consider "Have it your own way!" competing syntaxes like Requires: libfoo.so.1 >= set:whatever and Requires: set(libfoo.so.1) >= whatever (and I'll ignore the gawdafulness of existing multilib dependencies like Requires: libfoo.ao.1()(64bit) coming along rapidly as a Newer! Better! Bestest! ABI standard now that there is support in kernel >= 3.2. *shrug* Its all implementable, just not very KISS when aesthetic/innate opinions get involved. hth 73 de Jeff Begin forwarded message: > From: Jeff Johnson > Subject: EVR issues: set:versions, epoch-as-string, now twiddle-in-version > Date: April 23, 2012 10:16:28 AM EDT > To: rpm-devel@rpm5.org > > OK, adoption of the ancient SuSE "twiddle-in-version" patch by > @rpm.org has now added yet a 3rd incompatibility that needs > to be addressed within what is commonly known as "rpmvercmp" > > (aside) > The designator "rpmvercmp" is already horrendously misleading > because its the name of the comparison routine for each of the > V and R members of the tuple {E,V,R}. There are many many > many false starts at package version comparisons that naively > attempt > rc = rpmvercmp("E:V-R", whatever); > when what is actually needed is a loop of each member of the > tuple like >static char * EVRprecedence = "EVR"; >char ptype; >int i; > >for (i = 0; (ptype = EVRprecedence[i]) != '\0; i++) { > switch (ptype) { > case 'E': /* coerce and compare Epoch: digit strings */ break; > case 'V': > case 'R': > case 'D': /* Distepoch/Disttag comparison is similar to V and R > iirc */ > rc = rpmvercmp(somestring, anotherstring); break; > } > if (rc) break; >} > > I've written out the algorithm so that the asymmetry of the full > comparison with the existing convention of Epoch: as digit string > can be more easily seen. > > With 3 "incompatibilities" in existing EVR comparison looming, > I am tipped in favor of treating Eoch: more generally as a string, not > just a digit sequence, in order to > Handle all the incompatibilities (to disappoint every one equally) at > the same time. > > The issue of whether Epoch: SHOULD be treated as a [0-9]+: pattern is > merely a stricter policy convention on the general case. There are other > possible policy conventions with V and R that could and should be > imposed more strictly than > Neither V nor R can contain a '-' character. > which has always been seen as some sort of deficiency in RPM compared > to APT/DPKG. > > I can also see some usage case if "set" is described as a namespace marker > for a type of comparison, particularly since '>=" is being overloaded as a > subset "contained within" semantic instead of "rpmvercmp" comparison. > > The next most useful namespace marker is likely going to be "deb" (or > "dpkg" or "borg" to taste ;-) to at least attempt some means to also > write regression tests for the corner cases that are surely going to be > recognized soon. Because of pattern parsing ambiguity with ':' (which > has already been overused as a separator of Disttag/Distepoch), this > would mean that adding a "deb" item to the (being proposed here) EVR > name space based on a generalization of Epoch: as a string marker > for dpkg comparison (where Yet Another meaning for ':' is encountered). > > An example is easier than words: Suppose that there is a EVR string like > 0:1.2-3 > that then needs a "deb" or "rpm" name space marker. The parser pattern > could/should/would then return a tuple like > {"deb:0", "1.2", "3" } > or > {"deb", "0:1.2", "3" } > Either parsing is deterministic and implementable. I point out
EVR issues: set:versions, epoch-as-string, now twiddle-in-version
OK, adoption of the ancient SuSE "twiddle-in-version" patch by @rpm.org has now added yet a 3rd incompatibility that needs to be addressed within what is commonly known as "rpmvercmp" (aside) The designator "rpmvercmp" is already horrendously misleading because its the name of the comparison routine for each of the V and R members of the tuple {E,V,R}. There are many many many false starts at package version comparisons that naively attempt rc = rpmvercmp("E:V-R", whatever); when what is actually needed is a loop of each member of the tuple like static char * EVRprecedence = "EVR"; char ptype; int i; for (i = 0; (ptype = EVRprecedence[i]) != '\0; i++) { switch (ptype) { case 'E': /* coerce and compare Epoch: digit strings */ break; case 'V': case 'R': case 'D': /* Distepoch/Disttag comparison is similar to V and R iirc */ rc = rpmvercmp(somestring, anotherstring); break; } if (rc) break; } I've written out the algorithm so that the asymmetry of the full comparison with the existing convention of Epoch: as digit string can be more easily seen. With 3 "incompatibilities" in existing EVR comparison looming, I am tipped in favor of treating Eoch: more generally as a string, not just a digit sequence, in order to Handle all the incompatibilities (to disappoint every one equally) at the same time. The issue of whether Epoch: SHOULD be treated as a [0-9]+: pattern is merely a stricter policy convention on the general case. There are other possible policy conventions with V and R that could and should be imposed more strictly than Neither V nor R can contain a '-' character. which has always been seen as some sort of deficiency in RPM compared to APT/DPKG. I can also see some usage case if "set" is described as a namespace marker for a type of comparison, particularly since '>=" is being overloaded as a subset "contained within" semantic instead of "rpmvercmp" comparison. The next most useful namespace marker is likely going to be "deb" (or "dpkg" or "borg" to taste ;-) to at least attempt some means to also write regression tests for the corner cases that are surely going to be recognized soon. Because of pattern parsing ambiguity with ':' (which has already been overused as a separator of Disttag/Distepoch), this would mean that adding a "deb" item to the (being proposed here) EVR name space based on a generalization of Epoch: as a string marker for dpkg comparison (where Yet Another meaning for ':' is encountered). An example is easier than words: Suppose that there is a EVR string like 0:1.2-3 that then needs a "deb" or "rpm" name space marker. The parser pattern could/should/would then return a tuple like {"deb:0", "1.2", "3" } or {"deb", "0:1.2", "3" } Either parsing is deterministic and implementable. I point out the details solely to indicate some of the implementation issues that will need a conventionally defined answer. So the generalization to epoch-as-string (with an enforcement pattern of [0-9]+ for the Luddites) use now +1 from me. And its time to open the discussion for other comments. Note that you likely have 2-3 weeks to respond before I get around to attempting the epoch-as-string (and twiddle-in-version and set:versions generation) implementations. Other opinions? (aside) A simple +1, 0, or -1 reply suffices and will be tallied. Talking about package manager version comparison in public is known to have carcinogenic side effects inducing brain tumors. 73 de Jeff smime.p7s Description: S/MIME cryptographic signature