Re: [gentoo-dev] New distfile mirror layout
Hi All, On 2019/10/21 18:42, Richard Yao wrote: If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory. I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the best when all files are accessed with equal frequency, which would be an incorrect assumption. A more human friendly approach might still be better. I doubt that we have the data to determine that though. Also, another idea is to use a cheap hash function (e.g. fletcher) and just have the mirrors do the hashing behind the scenes. Then we would have the best of both worlds. Experience: ext4 sucks at targeting name lookups without dir_index feature (O(n) lookups - scans all entries in the folder). With dir_index readdir performance is crap. Pick your poison I guess. Most of our larger filesystems (2TB+, but especially the 80TB+ ones) we've reverted to disabling dir_index as the benefit is outweighed by the crappy readdir() and glob() performance. There doesn't seem to be a real specific tip-over point, and it seems to depend a lot on RAM availability and harddrive speed (obviously). So if dentries gets cached, disk speeds becomes less of an issue. However, on large folders (where I typically use 10k as a value for large based on "gut feeling" and "unquantifiable experience" and "nothing scientific at all") I find that even with lots of RAM two consecutive ls commands remains terribly slow. Switch off dir_index and that becomes an order of magnitude faster. I don't have a great deal of experience with XFS, but on those systems where we do it's generally on a VM, and perceivably (again, not scientific) our experience has been that it feels slower. Again, not scientific, just perception. I'm in support for the change. This will bucket to 256 folders and should have a reasonably even split between folders. If required a second layer could be introduced by using the 3rd and 4th digits of the hash for a second layer. Any hash should be fine, it really doesn't need to be cryptographically strong, it just needs to provide a good spread and be really fast. Generally a hash table should have a prime number of buckets to assist with hash bias, but frankly, that's over complicating the situation here. I also agree with others that it used to be easy to get distfiles as and when needed, so an alternative structure could mirror that of the portage tree itself, in other words "cat/pkg/distfile". This perhaps just shifts the issue: jkroon@plastiekpoot /usr/portage $ find . -maxdepth 1 -type d -name "*-*" | wc -l 167 jkroon@plastiekpoot /usr/portage $ find *-* -maxdepth 1 -type d | wc -l 19412 jkroon@plastiekpoot /usr/portage $ for i in *-*; do echo $(find $i -maxdepth 1 -type d | wc -l) $i; done | sort -g | tail -n10 347 net-misc 373 media-sound 395 media-libs 399 dev-util 505 dev-libs 528 dev-java 684 dev-haskell 690 dev-ruby 1601 dev-perl 1889 dev-python So that's average 116 sub folders under the top layer (only two over 1000), and then presumably less than 100 distfiles maximum per package? Probably overkill but would (should) solve both the too many files per folder as well as the easy lookup by hand issue. I don't have a preference on either solution though but do agree that "easy finding of distfiles" are handy. The INDEX mechanism is fine for me. Kind Regards, Jaco
Re: [gentoo-dev] Proposal: change to default policy of doing changes to packages that are maintained by other developers
On Mon, 21 Oct 2019 17:58:51 -0700 Matt Turner wrote: > I'm not sure what this is in reference to so it seems to be a > non-sequitur, but I like the policy of at least waiting a day for > review of non-critical fixes. Phrased another way, let people in every > timezone have a chance. Its not aimed so much at the person writing this proposal, but more so, something that has happened before and was annoying and I had to have annoying conversations about why this was not-ok, where the agent in question didn't attempt to reach out to any team member of a very critical package before "just doing it". The nature of the change *was* simple enough that had we seen it, we'd have ACKed it easily. But the need to get lucky and spot the commit in the history and review it after-the-fact hoping the agent didn't break anything is where this is "not-ok". Yes, I hate to have to re-iterate in policy behaviour that I consider to be a sensible reasonable default... but it turns out, not everyone has that same sensibility. I'm fine with a policy that allows for non-maintainer contributions, just the stipulations of "try contact the maintainer, wait a reasonable amount of time based on the overall combination of that packages importance and the effective availability of people who maintain it to even respond to a ping". Because the fact is, non-maintainers have substantially less understanding of the total net of complexities, both in portage and outside of portage (by way of project specific tracking projects), and they're less equipped to make a judgement call as to wether or not a change that looks trivial, is trivial in the grand scheme of things. ( For instance, tweaking the package-keywords entries on bulk keyword request I filed riles me up, because I need a lot of tooling to maintain that stuff, and currently, other people tweaking stuff that is outside their responsibilities messes with my ability to do this effectively. And this isn't even visible 'in portage', its just more of the same pattern of 'touching stuff that's not your responsibility to touch without contacting the people whos responsibility it is' ) pgpBUYJcP1lcr.pgp Description: OpenPGP digital signature
Re: [gentoo-dev] Proposal: change to default policy of doing changes to packages that are maintained by other developers
On Mon, Oct 21, 2019 at 5:37 PM Kent Fredric wrote: > > On Mon, 21 Oct 2019 19:37:28 +0200 > Piotr Karbowski wrote: > > > This is a bit unhealthy, especially when some developers that maintain > > packages are out of reach, or the patches to update ebuild just rot on > > the bugzilla and are not taken in by maintainers. > > IME this is far from the norm and should not be used as the > justification here. I think exceptional cases are the reason for many, many policies in Gentoo. That probably doesn't surprise you. > If you don't have the patience to even wait _one_ day for a response, > maybe you shouldn't be doing opensource. I'm not sure what this is in reference to so it seems to be a non-sequitur, but I like the policy of at least waiting a day for review of non-critical fixes. Phrased another way, let people in every timezone have a chance.
Re: [gentoo-dev] Proposal: change to default policy of doing changes to packages that are maintained by other developers
On Mon, Oct 21, 2019 at 10:37 AM Piotr Karbowski wrote: > > Hi, > > I'd like to bring the topic of defining default policy to do changes to > packages within ::gentoo that one does not maintain. > > This topic goes back from time to time on #gentoo-dev, and as I was > told, it was originally sent to gentoo-dev mailing list by robbat2 (I > failed to find this in archive, so if anyone have copy of it, please share). > > Current policy is to never touch ebuild that one did not claim as > maintainer unless maintainer of said package allowed you to do so. > > This is a bit unhealthy, especially when some developers that maintain > packages are out of reach, or the patches to update ebuild just rot on > the bugzilla and are not taken in by maintainers. > > What I'd like to end with would be to set a policy that allows any > developer with write access to ebuilds tree do changes that are small in > scope, like a minor bug fixes, adding missing flags, version bumps, > anything, that does not require complete overhaul of ebuild, with the > option to set in metadata.xml that policy for specified package is to > deny anyone but maintainers from doing changes. > > The packages that would require a flag to prohibit non-maintainers from > doing changes would of course be those of toolchain, or other big in > user base packages that are in very good shape, as in gnome packages, GNOME is not in good shape. > kde packages, X11 packages and so on. These are in good shape. > Of course, the policy would also define, that if there are any bug > introduced by changes that non-maintainer made, it's responsibility of > those who did the change in first place to fix it and clean any mess > that it has created. > > I personally am fine with others doing changes to packages I own, as > long as they won't break anything and I do know from the discussion on > #gentoo-dev, that there are others who have similar opinion about it. > > Those who feel territorial and those who believe only maintainers should > maintain specified packages can just set the flag in metadata.xml and > continue with the current state of things for their packages. > > The reason why I would like to get default policy to allow-all is that I > do not believe most of developers would want to go around all the > packages they own and set it manually to allow others doing changes even > if they're fine with others touching those packages. > > What do you think folks? I typically operate this way: If the package maintainer is active (on IRC, mailing lists, bugzilla) I file a bug. If no response after a week or two, depending on the importance of the change I commit it myself. If the package is not actively maintained (maintainer-needed@ or the maintainer has not touched the package in a long time while there are open bugs without response, etc) and the change is trivial (missing dependency, simple version bump for non-critical package, etc), then I'll just commit it directly. If the package is not actively maintained and the change is not trivial, I file a bug and try to get the maintainer to review. If they don't respond, I try to get others that may be interested to review and then commit after 2 weeks. For tree-wide stuff (package rename, removing amd64-fbsd keywords, $Header: ...$, etc) I think a mailing list post announcing what you're doing is a good idea. Wait a couple days for feedback and then do it. No need to get an ack from individual maintainers. I feel like these are pretty reasonable guidelines and have rarely gotten me yelled at. I know that a lot of the details of my personal behavior are subjective, but maybe that's good enough? Not sure. We seem to always have some disagreeable person force us to codify common sense. Am I missing any cases? The only thing I can think of is maintainers that are so territorial that try to prevent others from committing to their packages and also are not responsive to bug reports. Fortunately that is uncommon, but unfortunately it does still happen today. I don't really know how to solve that, but /that/ seems like the only real problem case to me.
Re: [gentoo-dev] New distfile mirror layout
> "RY" == Richard Yao writes: RY> ext4 is probably okay, but don’t quote me on that. Ext4 works fine here for a local distfiles mirror. -JimC -- James Cloos OpenPGP: 0x997A9F17ED7DAEA6
Re: [gentoo-dev] Proposal: change to default policy of doing changes to packages that are maintained by other developers
On Mon, 21 Oct 2019 19:37:28 +0200 Piotr Karbowski wrote: > This is a bit unhealthy, especially when some developers that maintain > packages are out of reach, or the patches to update ebuild just rot on > the bugzilla and are not taken in by maintainers. IME this is far from the norm and should not be used as the justification here. I would argue some *honest* attempt be made to contact the people officially responsible for the package. If they can't be contacted in a reasonable time frame, sure, by all means. But I cannot support a policy where it creates a conjecture of "I think this patch doesn't matter, so I'll just do it". Because in practice, no change, no matter how apparently insignificant, is immune from the risk of creating a quality reduction. And no change, is immune from potentially affecting the package maintainers workflow. ( For example, if you drop in a sed, or a patch, when the package uses a carefully curated tar-ball of patches which are also carefully tested against upstream sources on travis or something, by dropping the patch in unconsulted, you run the risk of pissing somebody off by adding a patch in ways that by-passes and potentially confounds these efforts. ) It really sucks having to review somebodies changes after-the-fact where you discover the change long after it was done, because nobody even tried to communicate. Reasonable attempts should be made, _especially_ if there are multiple maintainers for a package, or a project with multiple members. If you don't have the patience to even wait _one_ day for a response, maybe you shouldn't be doing opensource. pgp87nO9m3NId.pgp Description: OpenPGP digital signature
Re: [gentoo-dev] New distfile mirror layout
On Mon, Oct 21, 2019 at 9:42 AM Richard Yao wrote: > Also, another idea is to use a cheap hash function (e.g. fletcher) and just > have the mirrors do the hashing behind the scenes. Then we would have the > best of both worlds. It probably would have been better to make these suggestions when the GLEP was discussed close to two years ago. I'm glad that we have ideas for improvements but I worry that we're just backseat driving at this point given that the GLEP's now implemented.
Re: [gentoo-dev] New distfile mirror layout
On 21.10.2019 3:05, Joshua Kinard wrote: > So looking at texlive-latexextra-2019-r2.ebuild, it defines three variables: > > - TEXLIVE_MODULE_CONTENTS, with 1,241 space-delimited module names > - TEXLIVE_MODULE_DOC_CONTENTS, with 1,227 space-delimited doc names > - TEXLIVE_MODULE_SRC_CONTENTS, with 745 space-delimited src names > > Then, in texlive-module.eclass, there's these loops: > > for i in ${TEXLIVE_MODULE_CONTENTS}; do > SRC_URI="${SRC_URI} > mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}" > done > > # Forge doc SRC_URI > [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} doc? (" > for i in ${TEXLIVE_MODULE_DOC_CONTENTS}; do > SRC_URI="${SRC_URI} > mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}" > done > [ -n "${TEXLIVE_MODULE_DOC_CONTENTS}" ] && SRC_URI="${SRC_URI} )" > > # Forge source SRC_URI > if [ -n "${TEXLIVE_MODULE_SRC_CONTENTS}" ] ; then > SRC_URI="${SRC_URI} source? (" > for i in ${TEXLIVE_MODULE_SRC_CONTENTS}; do > SRC_URI="${SRC_URI} > mirror://gentoo/texlive-module-${i}-${PV}.${PKGEXT}" > done > SRC_URI="${SRC_URI} )" > fi > > I think this is definitely an issue with how this package is laying out its > needed distfiles. It really should be leveraging CTAN system at a minimum > to fetch all of the needed distfiles so we can get them off of our > distfiles mirror. Then it would be interesting to re-run the math on > the distfiles distribution using the different schemes highlighted in the > GLEP-75 paper. TexLive distributes collections of macros, not packages separately, they make their packaging based on CTAN. In the meantime CTAN packages are not versioned, they only have internal release number, no tags, releases and so on, see [1]. I also fail to see what problem you try to solve when suggest fetching macros from CTAN, you are going to have the same amount of data mirrored as a result. [1] - https://ctan.org/tex-archive/systems/texlive/tlnet/archive signature.asc Description: OpenPGP digital signature
[gentoo-dev] Last Rites: dev-python/gnome-keyring-python
# Matt Turner (2019-10-21) # Replaced by introspection bindings. Bug #628938 # Removal in 30 days dev-python/gnome-keyring-python signature.asc Description: PGP signature
Re: [gentoo-dev] Proposal: change to default policy of doing changes to packages that are maintained by other developers
On 19-10-21 19:37:28, Piotr Karbowski wrote: > Hi, > > I'd like to bring the topic of defining default policy to do changes to > packages within ::gentoo that one does not maintain. > > This topic goes back from time to time on #gentoo-dev, and as I was > told, it was originally sent to gentoo-dev mailing list by robbat2 (I > failed to find this in archive, so if anyone have copy of it, please share). > > Current policy is to never touch ebuild that one did not claim as > maintainer unless maintainer of said package allowed you to do so. > > This is a bit unhealthy, especially when some developers that maintain > packages are out of reach, or the patches to update ebuild just rot on > the bugzilla and are not taken in by maintainers. > > What I'd like to end with would be to set a policy that allows any > developer with write access to ebuilds tree do changes that are small in > scope, like a minor bug fixes, adding missing flags, version bumps, > anything, that does not require complete overhaul of ebuild, with the > option to set in metadata.xml that policy for specified package is to > deny anyone but maintainers from doing changes. > > The packages that would require a flag to prohibit non-maintainers from > doing changes would of course be those of toolchain, or other big in > user base packages that are in very good shape, as in gnome packages, > kde packages, X11 packages and so on. > > Of course, the policy would also define, that if there are any bug > introduced by changes that non-maintainer made, it's responsibility of > those who did the change in first place to fix it and clean any mess > that it has created. > > I personally am fine with others doing changes to packages I own, as > long as they won't break anything and I do know from the discussion on > #gentoo-dev, that there are others who have similar opinion about it. > > Those who feel territorial and those who believe only maintainers should > maintain specified packages can just set the flag in metadata.xml and > continue with the current state of things for their packages. > > The reason why I would like to get default policy to allow-all is that I > do not believe most of developers would want to go around all the > packages they own and set it manually to allow others doing changes even > if they're fine with others touching those packages. > > What do you think folks? > > -- Piotr. > I like the idea of setting metadata.xml options so repoman can help enforce things. Not sure if we talked about it earlier but it was an option that popped up in the last couple of weeks in the -dev channel. -- Matthew Thode (prometheanfire) signature.asc Description: PGP signature
Re: [gentoo-dev] Last rites: ruby24-only packages
On Sun, Oct 20, 2019 at 10:31 PM Hans de Graaff wrote: > On Sun, 2019-10-20 at 12:15 -0700, Alec Warner wrote: > > Infra uses thin a lot, is there a replacement? > > > www-servers/puma would be a good replacement. > > Feel free to unmask it for now if that helps infra to transition. Upstream > EOL for ruby 2.4 is March 2020, so we could wait until then if needed. > Nope, keep it masked in ::gentoo please. > > Hans > >
[gentoo-dev] Proposal: change to default policy of doing changes to packages that are maintained by other developers
Hi, I'd like to bring the topic of defining default policy to do changes to packages within ::gentoo that one does not maintain. This topic goes back from time to time on #gentoo-dev, and as I was told, it was originally sent to gentoo-dev mailing list by robbat2 (I failed to find this in archive, so if anyone have copy of it, please share). Current policy is to never touch ebuild that one did not claim as maintainer unless maintainer of said package allowed you to do so. This is a bit unhealthy, especially when some developers that maintain packages are out of reach, or the patches to update ebuild just rot on the bugzilla and are not taken in by maintainers. What I'd like to end with would be to set a policy that allows any developer with write access to ebuilds tree do changes that are small in scope, like a minor bug fixes, adding missing flags, version bumps, anything, that does not require complete overhaul of ebuild, with the option to set in metadata.xml that policy for specified package is to deny anyone but maintainers from doing changes. The packages that would require a flag to prohibit non-maintainers from doing changes would of course be those of toolchain, or other big in user base packages that are in very good shape, as in gnome packages, kde packages, X11 packages and so on. Of course, the policy would also define, that if there are any bug introduced by changes that non-maintainer made, it's responsibility of those who did the change in first place to fix it and clean any mess that it has created. I personally am fine with others doing changes to packages I own, as long as they won't break anything and I do know from the discussion on #gentoo-dev, that there are others who have similar opinion about it. Those who feel territorial and those who believe only maintainers should maintain specified packages can just set the flag in metadata.xml and continue with the current state of things for their packages. The reason why I would like to get default policy to allow-all is that I do not believe most of developers would want to go around all the packages they own and set it manually to allow others doing changes even if they're fine with others touching those packages. What do you think folks? -- Piotr. signature.asc Description: OpenPGP digital signature
Re: [gentoo-dev] New distfile mirror layout
> On Oct 20, 2019, at 2:51 AM, Michał Górny wrote: > > On Sat, 2019-10-19 at 19:24 -0400, Joshua Kinard wrote: >>> On 10/18/2019 09:41, Michał Górny wrote: >>> Hi, everybody. >>> >>> It is my pleasure to announce that yesterday (EU) evening we've switched >>> to a new distfile mirror layout. Users will be switching to the new >>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded >>> already -- as their caches expire (24hrs). >>> >>> The new layout is mostly a bow towards mirror admins, for some of whom >>> having a 6+ files in a single directory have been a problem. >>> However, I suppose some of you also found e.g. the directory index >>> hardly usable due to its size. >>> >>> Throughout a transitional period (whose exact length hasn't been decided >>> yet), both layouts will be available. Afterwards, the old layout will >>> be removed from mirrors. This has a few implications: >>> >>> 1. Users who don't upgrade their package managers in time will lose >>> the ability of fetching from Gentoo mirrors. This shouldn't be that >>> much of a problem given that the core software needed to upgrade Portage >>> should all have reliable upstream SRC_URIs. >>> >>> 2. mirror://gentoo/file URIs will stop working. While technically you >>> could use mirror://gentoo/XX/file, I'd rather recommend finally >>> discarding its usage and moving distfiles to devspace. >>> >>> 3. Directly fetching files from distfiles.gentoo.org will become >>> a little harder. To fetch a distfile named 'foo-1.tar.gz', you'd have >>> to use something like: >>> >>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2 >>> 1b >>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz >>> ... >>> >>> >>> Alternatively, you can: >>> >>> $ wget http://distfiles.gentoo.org/distfiles/INDEX >>> >>> and grep for the right path there. This INDEX is also a more >>> lightweight alternative to HTML indexes generated by the servers. >>> >>> >>> If you're interested in more background details and some plots, see [1]. >>> >>> [1] >>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html >>> >> >> So the answer I didn't really see directly stated here is, where do new >> distfiles need to go //now//? E.g., if on woodpecker, I currently cp a >> distfile to /space/distfiles-local. What is the new directory I need to >> use? And if mirror://gentoo/${FOO} is going away, for the new distfiles >> target, what would be the applicable prefix to use? >> >> Directly using devspace seems like a bad idea, IMHO. Once long ago, we all >> got chastised for doing exactly that. Too much possibility of fragmentation >> as devs retire or package maintainership changes hands. > > Today you get chastised for using /space/distfiles-local and not > following policy changes. The devmanual states that it's deprecated > since at least 2011, and talks of using d.g.o [1]. > >> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a >> hash-based naming scheme on the new distfiles layout. I really kind prefer >> breaking the directories up based on the first letter of the distfiles in >> question, factoring case-sensitivity in (so you'd have 52 top-level >> directories for A-Z and a-z, plus 10 more for 0-9). Under each of those >> directories, additional subdirectories for the next few letters (say, >> letters 2-3). Yes, this leads to some orphan cases where a distfile might >> live on its own, but from a direct navigation standpoint, it's easy to find >> for someone browsing the distfiles server and easy to predict where a >> distfile is at. >> >> No math, statistical analysis, or deep-rooted knowledge of filesystems >> behind that paragraph. Just a plain old unfiltered opinion. Sometimes, I >> need to go get a distfile off the Gentoo mirrors, and being able to quickly >> find it in the mirror root is great. Having to do hash calculations to work >> out the file path will be *really* annoying. > > Your solution still doesn't solve the problem of having 8k-24k files > in a single directory, even if you use 7 letters of prefix. So it just > creates a lot of tiny directory noise for no practical gain. > > [1] > https://devmanual.gentoo.org/general-concepts/mirrors/index.html#suitable-download-hosts If we consider the access frequency, it might actually not be that bad. Consider a simple example with 500 files and two directory buckets. If we have 250 in each, then the size of the directory is always 250. However, if 50 files are accessed 90% of the time, then putting 450 into one directory and that 50 into another directory, we end up with the performance of the O(n) directory lookup being consistent with there being only 90 files in each directory. I am not sure if we should be discarding all other considerations to make changes to benefit O(n) directory lookup filesystems, but if we are, then the hashing approach is not necessarily the best one. It is only the
Re: [gentoo-dev] New distfile mirror layout
On Sun, 20 Oct 2019 20:05:40 -0400 Joshua Kinard wrote: > Longer-term, I think this entire approach should be revisited by the TeX > team to make it behave more like Perl or Python packages by having discrete > ebuilds for these modules. That's not exactly a small undertaking, but > this current approach feels very kludgy in its design and is probably > asking for trouble. I looked at several of the modules on CTAN, and they > each have their own version and even have different licenses. With the current state of the portage dependency resolver, and with regards to the constant problems end users face with it, I really can't advise this unless you need to. Currently working on vendoring rust in an overlay, and 128 ebuilds just to satisfy the dependencies enough to test *one* package is a bit of a piss-take. I'd suggest waiting a few years for portage to see some improvements here before taking on something that ambitious when the current approach works well enough. pgpt_T_YZWasM.pgp Description: OpenPGP digital signature
Re: [gentoo-dev] New distfile mirror layout
On Sun, 20 Oct 2019 16:57:54 -0400 Joshua Kinard wrote: > I know we've got a ton of Perl packages for the core set of Perl modules, > but doesn't the CPAN eclass also have the capability to auto-generate an > ebuild package for virtually any Perl package distributed via CPAN? Can > that logic be used with the CTAN system in its own eclass and then we remove > the 16k+ texlive modules off of our mirrors completely? Or at the worst, we > might just have to generate ebuilds for texlive modules and treat them as > discrete, installed packages. - Perl packages never have more than 1:1 source archives per ebuild - Perl upstream naming doesn't habitually use "perl-" as an archive prefix - Everything that is packaged for Perl in Gentoo is mirrored to the Gentoo distfiles mirror, and this causes no issues. So I don't think any comparison here makes sense. pgpS7vStGzQ3x.pgp Description: OpenPGP digital signature