[gentoo-portage-dev] [PATCH] fetch: add force parameter (bug 697566)

2019-10-19 Thread Zac Medico
Add a force parameter which forces download even when a file already
exists in DISTDIR (and no digests are available to verify it). This
avoids the need to remove the existing file in advance, which makes
it possible to atomically replace the file and avoid interference
with concurrent processes. This is useful when using FETCHCOMMAND to
fetch a mirror's layout.conf file, for the purposes of bug 697566.

Bug: https://bugs.gentoo.org/697566
Signed-off-by: Zac Medico 
---
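For reviewers who want the atomic-replacement argument spelled out, here is a minimal sketch of the general pattern. It is not Portage's actual code: the helper name `atomic_write` and the `.tmp` suffix are made up for illustration, and it assumes the temporary file sits in the same directory (hence the same filesystem) as the target so that `os.replace()` acts as an atomic rename on POSIX.

```python
import os
import tempfile


def atomic_write(path, data):
    """Write data to path without ever exposing a partial file.

    Hypothetical helper for illustration; not part of Portage.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temporary file next to the target so the final rename
    # stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
        # Readers see either the old file or the new one, never a mix.
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong.
        try:
            os.unlink(tmp_path)
        except OSError:
            pass
        raise
```

Removing the old file first would leave a window in which a concurrent process finds nothing at all, which is the interference the commit message refers to.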
 lib/portage/package/ebuild/fetch.py| 17 ---
 lib/portage/tests/ebuild/test_fetch.py | 40 --
 2 files changed, 50 insertions(+), 7 deletions(-)

diff --git a/lib/portage/package/ebuild/fetch.py b/lib/portage/package/ebuild/fetch.py
index 76e4636c2..e01b4df02 100644
--- a/lib/portage/package/ebuild/fetch.py
+++ b/lib/portage/package/ebuild/fetch.py
@@ -432,7 +432,7 @@ def get_mirror_url(mirror_url, filename, cache_path=None):
 
 def fetch(myuris, mysettings, listonly=0, fetchonly=0,
locks_in_subdir=".locks", use_locks=1, try_mirrors=1, digests=None,
-   allow_missing_digests=True):
+   allow_missing_digests=True, force=False):
"""
Fetch files to DISTDIR and also verify digests if they are available.
 
@@ -455,6 +455,14 @@ def fetch(myuris, mysettings, listonly=0, fetchonly=0,
@param allow_missing_digests: Enable fetch even if there are no digests
available for verification.
@type allow_missing_digests: bool
+   @param force: Force download, even when a file already exists in
+   DISTDIR. This is most useful when there are no digests available,
+   since otherwise download will be automatically forced if the
+   existing file does not match the available digests. Also, this
+   avoids the need to remove the existing file in advance, which
+   makes it possible to atomically replace the file and avoid
+   interference with concurrent processes.
+   @type force: bool
@rtype: int
@return: 1 if successful, 0 otherwise.
"""
@@ -878,7 +886,7 @@ def fetch(myuris, mysettings, listonly=0, fetchonly=0,
eout.quiet = mysettings.get("PORTAGE_QUIET") == "1"
match, mystat = _check_distfile(
myfile_path, pruned_digests, eout, hash_filter=hash_filter)
-   if match:
+   if match and not (force and not orig_digests):
# Skip permission adjustment for symlinks, since we don't
# want to modify anything outside of the primary DISTDIR,
# and symlinks typically point to PORTAGE_RO_DISTDIRS.
@@ -1042,10 +1050,11 @@ def fetch(myuris, mysettings, listonly=0, fetchonly=0,

os.unlink(download_path)
except EnvironmentError:
pass
-   elif myfile not in mydigests:
+   elif not orig_digests:
# We don't have a digest, but the file exists.  We must
# assume that it is fully downloaded.
-   continue
+   if not force:
+   continue
else:
if (mydigests[myfile].get("size") is not None
and mystat.st_size < mydigests[myfile]["size"]
diff --git a/lib/portage/tests/ebuild/test_fetch.py b/lib/portage/tests/ebuild/test_fetch.py
index f50fea0dd..f4eb0404b 100644
--- a/lib/portage/tests/ebuild/test_fetch.py
+++ b/lib/portage/tests/ebuild/test_fetch.py
@@ -119,10 +119,44 @@ class EbuildFetchTestCase(TestCase):
with open(foo_path, 'rb') as f:
self.assertNotEqual(f.read(), distfiles['foo'])
 
-   # Remove the stale file in order to forcefully update it.
-   os.unlink(foo_path)
+   # Use force=True to update the stale file.
+   self.assertTrue(bool(run_async(fetch, foo_uri, settings, try_mirrors=False, force=True)))
 
-   self.assertTrue(bool(run_async(fetch, foo_uri, settings, try_mirrors=False)))
+   with open(foo_path, 'rb') as f:
+   self.assertEqual(f.read(), distfiles['foo'])
+
+ 
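For anyone skimming the hunks, the decision the patch changes can be modelled roughly as below. This is a simplified reading of the diff, not the real control flow: the function `should_skip_download` and its boolean arguments are invented for illustration, and the actual fetch() also deals with resuming, size checks, locking and read-only DISTDIRs.

```python
def should_skip_download(file_exists, have_digests, digests_match, force=False):
    """Return True when an existing file makes a new download unnecessary.

    Simplified, hypothetical model of the patched logic in fetch().
    """
    if not file_exists:
        return False
    if have_digests:
        # With digests, a mismatch already forces a re-download, so
        # force changes nothing here.
        return digests_match
    # No digests: the file is assumed complete unless force is set.
    return not force


print(should_skip_download(True, False, False))              # True
print(should_skip_download(True, False, False, force=True))  # False
```

In this model force only matters in the no-digest case, which matches the docstring's remark that it is most useful when no digests are available.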

Re: [gentoo-dev] Packages up for grabs due to cardoe being MIA

2019-10-19 Thread Yixun Lan
Hello

It's sad to see cardoe gone.
I hope he will come back someday.

On 16:44 Fri 13 Sep , Michał Górny wrote:
> Hello,
> 
> The following packages are now up for grabs since Undertakers have not
> received any reply nor seen any activity from cardoe:
> 
> dev-util/crash [b,v]
I'd like to take this one,
but feel free to co-maintain if anyone is interested.

> 
> The packages marked [v] have version bump requests open.  The packages
> marked [b] have other bugs reported.
> 
> -- 
> Best regards,
> Michał Górny
> 



-- 
Yixun Lan (dlan)
Gentoo Linux Developer
GPG Key ID AABEFD55




Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Joshua Kinard
On 10/19/2019 19:57, Alec Warner wrote:
> On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard  wrote:
> 
>> On 10/18/2019 09:41, Michał Górny wrote:
>>> Hi, everybody.
>>>
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>>
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 6+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>
>>> Throughout a transitional period (whose exact length hasn't been decided
>>> yet), both layouts will be available.  Afterwards, the old layout will
>>> be removed from mirrors.  This has a few implications:
>>>
>>> 1. Users who don't upgrade their package managers in time will lose
>>> the ability of fetching from Gentoo mirrors.  This shouldn't be that
>>> much of a problem given that the core software needed to upgrade Portage
>>> should all have reliable upstream SRC_URIs.
>>>
>>> 2. mirror://gentoo/file URIs will stop working.  While technically you
>>> could use mirror://gentoo/XX/file, I'd rather recommend finally
>>> discarding its usage and moving distfiles to devspace.
>>>
>>> 3. Directly fetching files from distfiles.gentoo.org will become
>>> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
>>> to use something like:
>>>
>>> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
>>> 1b
>>> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
>>> ...
>>>
>>>
>>> Alternatively, you can:
>>>
>>> $ wget http://distfiles.gentoo.org/distfiles/INDEX
>>>
>>> and grep for the right path there.  This INDEX is also a more
>>> lightweight alternative to HTML indexes generated by the servers.
>>>
>>>
>>> If you're interested in more background details and some plots, see [1].
>>>
>>> [1]
>> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
>>>
>>
>> So the answer I didn't really see directly stated here is, where do new
>> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
>> distfile to /space/distfiles-local.  What is the new directory I need to
>> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
>> target, what would be the applicable prefix to use?
>>
> 
> 
> 
> 
>>
>> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
>> got chastised for doing exactly that.  Too much possibility of
>> fragmentation
>> as devs retire or package maintainership changes hands.
>>
>> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
>> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
>> breaking the directories up based on the first letter of the distfiles in
>> question, factoring case-sensitivity in (so you'd have 52 top-level
>> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
>> directories, additional subdirectories for the next few letters (say,
>> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
>> live on its own, but from a direct navigation standpoint, it's easy to find
>> for someone browsing the distfiles server and easy to predict where a
>> distfile is at.
>>
>> No math, statistical analysis, or deep-rooted knowledge of filesystems
>> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
>> need to go get a distfile off the Gentoo mirrors, and being able to quickly
>> find it in the mirror root is great.  Having to do hash calculations to
>> work
>> out the file path will be *really* annoying.
>>
> 
> So if you want a tool that "downloads a distfile off of the mirrors" we
> should be able to build such a utility.
> 
> I'm not really sure why that tool needs to be:
> *copy DISTFILENAME*
> wget distfiles.gentoo.org/$PASTE
> 
> It could just be `ebuild portageq download $DISTFILENAME` or similar.
> 
> -A

Sometimes, I'm not on a Gentoo system, or even a Linux/Unix platform, when I
go to fetch a distfile.  I could fetch (and have fetched) from Debian's
mirrors before, but Gentoo is what I know, and fetching a distfile off of
those mirrors manually was generally very straightforward.

Not a common case, and certainly not a blocker.  I was just pointing out
that hash-based naming is decidedly a lot less human-friendly.  But
that's been the general trend for all things technology these last few years.
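For what it's worth, the hash step does not strictly need `b2sum` or a Gentoo box. Below is a minimal Python sketch of the same calculation as the quoted shell pipeline. The function name `hashed_distfile_url` and the `base` default are illustrative only; the two-hex-digit directory comes from the `cut -c1-2` in the quoted announcement, and the sketch assumes `b2sum` is run with its default 512-bit BLAKE2b digest.

```python
import hashlib


def hashed_distfile_url(filename, base="http://distfiles.gentoo.org/distfiles"):
    """Build the URL for a distfile under the hashed mirror layout.

    Mirrors the quoted shell pipeline: BLAKE2b of the file name, first
    two hex digits as the directory. Illustrative helper only.
    """
    # hashlib.blake2b() defaults to a 64-byte (512-bit) digest.
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return "{}/{}/{}".format(base, digest[:2], filename)


print(hashed_distfile_url("foo-1.tar.gz"))
```

That is still more work than typing a URL by hand, but it runs anywhere Python 3 does.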

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Alec Warner
On Sat, Oct 19, 2019 at 4:24 PM Joshua Kinard  wrote:

> On 10/18/2019 09:41, Michał Górny wrote:
> > Hi, everybody.
> >
> > It is my pleasure to announce that yesterday (EU) evening we've switched
> > to a new distfile mirror layout.  Users will be switching to the new
> > layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> > already -- as their caches expire (24hrs).
> >
> > The new layout is mostly a bow towards mirror admins, for some of whom
> > having a 6+ files in a single directory have been a problem.
> > However, I suppose some of you also found e.g. the directory index
> > hardly usable due to its size.
> >
> > Throughout a transitional period (whose exact length hasn't been decided
> > yet), both layouts will be available.  Afterwards, the old layout will
> > be removed from mirrors.  This has a few implications:
> >
> > 1. Users who don't upgrade their package managers in time will lose
> > the ability of fetching from Gentoo mirrors.  This shouldn't be that
> > much of a problem given that the core software needed to upgrade Portage
> > should all have reliable upstream SRC_URIs.
> >
> > 2. mirror://gentoo/file URIs will stop working.  While technically you
> > could use mirror://gentoo/XX/file, I'd rather recommend finally
> > discarding its usage and moving distfiles to devspace.
> >
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> >
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> >
> >
> > Alternatively, you can:
> >
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> >
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
> >
> >
> > If you're interested in more background details and some plots, see [1].
> >
> > [1]
> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> >
>
> So the answer I didn't really see directly stated here is, where do new
> distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
> distfile to /space/distfiles-local.  What is the new directory I need to
> use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
> target, what would be the applicable prefix to use?
>




>
> Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
> got chastised for doing exactly that.  Too much possibility of
> fragmentation
> as devs retire or package maintainership changes hands.
>
> I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
> hash-based naming scheme on the new distfiles layout.  I really kind of prefer
> breaking the directories up based on the first letter of the distfiles in
> question, factoring case-sensitivity in (so you'd have 52 top-level
> directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
> directories, additional subdirectories for the next few letters (say,
> letters 2-3).  Yes, this leads to some orphan cases where a distfile might
> live on its own, but from a direct navigation standpoint, it's easy to find
> for someone browsing the distfiles server and easy to predict where a
> distfile is at.
>
> No math, statistical analysis, or deep-rooted knowledge of filesystems
> behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
> need to go get a distfile off the Gentoo mirrors, and being able to quickly
> find it in the mirror root is great.  Having to do hash calculations to
> work
> out the file path will be *really* annoying.
>

So if you want a tool that "downloads a distfile off of the mirrors" we
should be able to build such a utility.

I'm not really sure why that tool needs to be:
*copy DISTFILENAME*
wget distfiles.gentoo.org/$PASTE

It could just be `ebuild portageq download $DISTFILENAME` or similar.
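As a rough idea of how small such a helper could be, here is a sketch of the "grep the INDEX" route from the announcement. It assumes the INDEX file lists one relative path per line (for example "1b/foo-1.tar.gz"); the real format is not described in this thread, and `find_in_index` is a made-up name.

```python
def find_in_index(index_text, distfile):
    """Return INDEX entries whose final path component equals distfile.

    Assumes one relative path per line; hypothetical helper.
    """
    matches = []
    for line in index_text.splitlines():
        entry = line.strip()
        if entry and entry.rsplit("/", 1)[-1] == distfile:
            matches.append(entry)
    return matches


# Made-up sample data standing in for a downloaded INDEX file.
sample = "1b/foo-1.tar.gz\n7c/bar-2.tar.xz\n"
print(find_in_index(sample, "foo-1.tar.gz"))  # ['1b/foo-1.tar.gz']
```

A real tool would fetch the INDEX first and then download the matching path; that part is left out here.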

-A






>
> --
> Joshua Kinard
> Gentoo/MIPS
> ku...@gentoo.org
> rsa6144/5C63F4E3F5C6C943 2015-04-27
> 177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943
>
> "The past tempts us, the present confuses us, the future frightens us.  And
> our lives slip away, moment by moment, lost in that vast, terrible
> in-between."
>
> --Emperor Turhan, Centauri Republic
>
>


Re: [gentoo-dev] [PATCH] use.desc: add global USE flag 'split-sbin'

2019-10-19 Thread Joshua Kinard
On 10/16/2019 14:19, William Hubbs wrote:
> On Wed, Oct 16, 2019 at 07:17:09PM +0200, Ulrich Mueller wrote:
>>> On Wed, 16 Oct 2019, William Hubbs wrote:
>>
>>> Back in the day, the s in /sbin and /usr/sbin meant static, not super
>>> user. All binaries in those directories were statically linked.
>>

[snip]

> 
> Please read the links I posted before --specifically the comments
> from Rob.
> 
> Also, there is this.
> 
> https://news.ycombinator.com/item?id=3519952
> 
> Tl;dr the bin sbin separation is a historical separation that doesn't
> make sense any longer.

This is just your opinion.  Why does it not make sense?  Please back that
up.  Especially the "historical separation" bit.  Why is it historical?
Who is the authority on that?  Is this strictly a Gentoo thing?  Is RedHat
doing this?  Is someone else?  Etc...

FWIW, my opinion is I //like// the separation of /sbin and /bin.  In fact,
I'm that old codger who //still// likes keeping /usr/bin and /usr/sbin
separate (yes, on separate partitions).  Maybe it's because I'm really poor
at organizing (and staying organized), so dumping everything into one spot
-- which is something I do at home WAY too much -- just strikes me as a bad
idea.  Binning stuff into different buckets offers SOME degree of
organization.  It also means 'ls -l /bin' is still somewhat readable on a
system with a full desktop installed.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Joshua Kinard
On 10/18/2019 09:41, Michał Górny wrote:
> Hi, everybody.
> 
> It is my pleasure to announce that yesterday (EU) evening we've switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
> 
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 6+ files in a single directory have been a problem. 
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
> 
> Throughout a transitional period (whose exact length hasn't been decided
> yet), both layouts will be available.  Afterwards, the old layout will
> be removed from mirrors.  This has a few implications:
> 
> 1. Users who don't upgrade their package managers in time will lose
> the ability of fetching from Gentoo mirrors.  This shouldn't be that
> much of a problem given that the core software needed to upgrade Portage
> should all have reliable upstream SRC_URIs.
> 
> 2. mirror://gentoo/file URIs will stop working.  While technically you
> could use mirror://gentoo/XX/file, I'd rather recommend finally
> discarding its usage and moving distfiles to devspace.
> 
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
> 
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
> 
> 
> Alternatively, you can:
> 
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
> 
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.
> 
> 
> If you're interested in more background details and some plots, see [1].
> 
> [1] 
> https://dev.gentoo.org/~mgorny/articles/improving-distfile-mirror-structure.html
> 

So the answer I didn't really see directly stated here is, where do new
distfiles need to go //now//?  E.g., if on woodpecker, I currently cp a
distfile to /space/distfiles-local.  What is the new directory I need to
use?  And if mirror://gentoo/${FOO} is going away, for the new distfiles
target, what would be the applicable prefix to use?

Directly using devspace seems like a bad idea, IMHO.  Once long ago, we all
got chastised for doing exactly that.  Too much possibility of fragmentation
as devs retire or package maintainership changes hands.

I looked at the whitepaper'ish-like writeup, and I kinda don't like using a
hash-based naming scheme on the new distfiles layout.  I really kind of prefer
breaking the directories up based on the first letter of the distfiles in
question, factoring case-sensitivity in (so you'd have 52 top-level
directories for A-Z and a-z, plus 10 more for 0-9).  Under each of those
directories, additional subdirectories for the next few letters (say,
letters 2-3).  Yes, this leads to some orphan cases where a distfile might
live on its own, but from a direct navigation standpoint, it's easy to find
for someone browsing the distfiles server and easy to predict where a
distfile is at.
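To make the idea concrete, a prefix-based mapping could look roughly like the sketch below. The function name `prefix_path` is invented, and the handling of very short names is my own guess at a sensible default rather than part of any agreed scheme.

```python
def prefix_path(filename):
    """Map a distfile name to <first char>/<chars 2-3>/<name>.

    Sketch of the prefix proposal; names too short for a second level
    simply skip it.
    """
    top = filename[:1]
    sub = filename[1:3]
    parts = [top, sub, filename] if sub else [top, filename]
    return "/".join(parts)


print(prefix_path("foo-1.tar.gz"))         # f/oo/foo-1.tar.gz
print(prefix_path("Python-3.7.4.tar.xz"))  # P/yt/Python-3.7.4.tar.xz
```

The path can be worked out by eye from the file name alone, which is the whole appeal.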

No math, statistical analysis, or deep-rooted knowledge of filesystems
behind that paragraph.  Just a plain old unfiltered opinion.  Sometimes, I
need to go get a distfile off the Gentoo mirrors, and being able to quickly
find it in the mirror root is great.  Having to do hash calculations to work
out the file path will be *really* annoying.

-- 
Joshua Kinard
Gentoo/MIPS
ku...@gentoo.org
rsa6144/5C63F4E3F5C6C943 2015-04-27
177C 1972 1FB8 F254 BAD0 3E72 5C63 F4E3 F5C6 C943

"The past tempts us, the present confuses us, the future frightens us.  And
our lives slip away, moment by moment, lost in that vast, terrible in-between."

--Emperor Turhan, Centauri Republic



Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Richard Yao



> On Oct 19, 2019, at 4:03 PM, Michał Górny  wrote:
> 
> On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
 On Oct 18, 2019, at 9:10 PM, Richard Yao  wrote:
>>> 
>>> 
> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
 On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
> Hi, everybody.
> It is my pleasure to announce that yesterday (EU) evening we've 
> switched
> to a new distfile mirror layout.  Users will be switching to the new
> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
> already -- as their caches expire (24hrs).
> The new layout is mostly a bow towards mirror admins, for some of whom
> having a 6+ files in a single directory have been a problem.
> However, I suppose some of you also found e.g. the directory index
> hardly usable due to its size.
> This sounds like a filesystem issue. Do we know which filesystems are 
> suffering?
> ZFS should be fine. I believe ext2/ext3 have problems with this many 
> files. ext4 is probably okay, but don’t quote me on that.
 Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
 may apply only to older ntfs versions.  NFS has been mentioned too.
>>> 
>>> ext2 and vfat are not surprises to me (outside of the idea that anyone 
>>> would use them for a mirror). NTFS and NFS are though.
 However, just because modern filesystems can handle them efficiently, it
 doesn't mean having directories that huge comes with zero cost.
>>> While I am okay with the change, what do you mean when you say that having 
>>> huge directories does not come with zero cost?
>>> 
>>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
>>> this, but the impact should be negligible. Filesystems with O(log n) 
>>> directory lookups would see faster directory lookups.
>>> 
>>> Outside of directory lookups, this could speed up searches and sort 
>>> operations when listing everything with just about any filesystem 
>>> benefiting from the improvement.
>>> 
>>> Listing directories on such filesystems should not benefit from this unless 
>>> you are using ls where the default behavior is to sort the directory 
>>> contents (which is where the improvement when sorting comes into play). The 
>>> need to sort the directory contents by default keeps ls from displaying 
>>> anything until it has scanned the entire directory. The asymptotic 
>>> complexity of a fast comparison based sort improves in this situation from 
>>> O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
>>> independently. A further speed up could be obtained by doing multithreading 
>>> to parallelize the sort operations.
>> I read your original email late at night and I misread the description of 
>> how this works.
>> 
>> At an initial glance, I thought we were doing a prefix approach (with the 
>> caveat that buckets are unbalanced). In reality, we are doing a 
>> cryptographic hash of the filenames.
>> 
>> That would keep all buckets balanced, which gives the best directory lookup 
>> times on O(log n) lookup filesystems, but I think there is something to be 
>> gained from using the less optimal approach of using filename prefixes:
>> 
>> * some regex searches on distfiles can be accelerated
>> * generating a sorted list of all distfiles becomes asymptotically faster
>> * it is easy for a user to find all versions of a given distfile
>> * no need to calculate a cryptographic hash
>> 
>> I realize that I am late to propose it, but could we consider a switch to 
>> this alternative arrangement?
> 
> No, we can't.  Please read either the original discussion on the bug, or
> the linked article.  It's explained in detail why this won't work.
Alright. I am convinced. Thanks.
> 
> -- 
> Best regards,
> Michał Górny
> 




Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Michał Górny
On Sat, 2019-10-19 at 15:26 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 9:10 PM, Richard Yao  wrote:
> > 
> > 
> > > > On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
> > > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny  
> > > > > > > > > wrote:
> > > > > > > > Hi, everybody.
> > > > > > > > It is my pleasure to announce that yesterday (EU) evening we've 
> > > > > > > > switched
> > > > > > > > to a new distfile mirror layout.  Users will be switching to 
> > > > > > > > the new
> > > > > > > > layout either as they upgrade Portage to 2.3.77 or -- if they 
> > > > > > > > upgraded
> > > > > > > > already -- as their caches expire (24hrs).
> > > > > > > > The new layout is mostly a bow towards mirror admins, for some 
> > > > > > > > of whom
> > > > > > > > having a 6+ files in a single directory have been a problem.
> > > > > > > > However, I suppose some of you also found e.g. the directory 
> > > > > > > > index
> > > > > > > > hardly usable due to its size.
> > > > This sounds like a filesystem issue. Do we know which filesystems are 
> > > > suffering?
> > > > ZFS should be fine. I believe ext2/ext3 have problems with this many 
> > > > files. ext4 is probably okay, but don’t quote me on that.
> > > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > > may apply only to older ntfs versions.  NFS has been mentioned too.
> > 
> > ext2 and vfat are not surprises to me (outside of the idea that anyone 
> > would use them for a mirror). NTFS and NFS are though.
> > > However, just because modern filesystems can handle them efficiently, it
> > > doesn't mean having directories that huge comes with zero cost.
> > While I am okay with the change, what do you mean when you say that having 
> > huge directories does not come with zero cost?
> > 
> > Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> > this, but the impact should be negligible. Filesystems with O(log n) 
> > directory lookups would see faster directory lookups.
> > 
> > Outside of directory lookups, this could speed up searches and sort 
> > operations when listing everything with just about any filesystem 
> > benefiting from the improvement.
> > 
> > Listing directories on such filesystems should not benefit from this unless 
> > you are using ls where the default behavior is to sort the directory 
> > contents (which is where the improvement when sorting comes into play). The 
> > need to sort the directory contents by default keeps ls from displaying 
> > anything until it has scanned the entire directory. The asymptotic 
> > complexity of a fast comparison based sort improves in this situation from 
> > O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
> > independently. A further speed up could be obtained by doing multithreading 
> > to parallelize the sort operations.
> I read your original email late at night and I misread the description of how 
> this works.
> 
> At an initial glance, I thought we were doing a prefix approach (with the 
> caveat that buckets are unbalanced). In reality, we are doing a cryptographic 
> hash of the filenames.
> 
> That would keep all buckets balanced, which gives the best directory lookup 
> times on O(log n) lookup filesystems, but I think there is something to be 
> gained from using the less optimal approach of using filename prefixes:
> 
> * some regex searches on distfiles can be accelerated
> * generating a sorted list of all distfiles becomes asymptotically faster
> * it is easy for a user to find all versions of a given distfile
> * no need to calculate a cryptographic hash
> 
> I realize that I am late to propose it, but could we consider a switch to 
> this alternative arrangement?

No, we can't.  Please read either the original discussion on the bug, or
the linked article.  It's explained in detail why this won't work.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Richard Yao



> On Oct 18, 2019, at 9:10 PM, Richard Yao  wrote:
> 
> 
>>> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
 On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
>>> Hi, everybody.
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 6+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>> This sounds like a filesystem issue. Do we know which filesystems are 
>>> suffering?
>>> ZFS should be fine. I believe ext2/ext3 have problems with this many files. 
>>> ext4 is probably okay, but don’t quote me on that.
>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>> may apply only to older ntfs versions.  NFS has been mentioned too.
> 
> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
> use them for a mirror). NTFS and NFS are though.
>> However, just because modern filesystems can handle them efficiently, it
>> doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having 
> huge directories does not come with zero cost?
> 
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> this, but the impact should be negligible. Filesystems with O(log n) 
> directory lookups would see faster directory lookups.
> 
> Outside of directory lookups, this could speed up searches and sort 
> operations when listing everything with just about any filesystem benefiting 
> from the improvement.
> 
> Listing directories on such filesystems should not benefit from this unless 
> you are using ls where the default behavior is to sort the directory contents 
> (which is where the improvement when sorting comes into play). The need to 
> sort the directory contents by default keeps ls from displaying anything 
> until it has scanned the entire directory. The asymptotic complexity of a 
> fast comparison based sort improves in this situation from O(nlogn) to 
> O(nlog(n/b)) provided that you sort each subdirectory independently. A 
> further speed up could be obtained by doing multithreading to parallelize the 
> sort operations.
I read your original email late at night and I misread the description of how 
this works.

At an initial glance, I thought we were doing a prefix approach (with the 
caveat that buckets are unbalanced). In reality, we are doing a cryptographic 
hash of the filenames.

That would keep all buckets balanced, which gives the best directory lookup 
times on O(log n) lookup filesystems, but I think there is something to be 
gained from using the less optimal approach of using filename prefixes:

* some regex searches on distfiles can be accelerated
* generating a sorted list of all distfiles becomes asymptotically faster
* it is easy for a user to find all versions of a given distfile
* no need to calculate a cryptographic hash

I realize that I am late to propose it, but could we consider a switch to this 
alternative arrangement?

The bulk of the performance gain should be realized with either approach.

> Since I know someone will call me out on that comment, I will explain. Each 
> bucket has roughly n/b items in it where n is the total number and b is the 
> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each 
> of the b buckets. The buckets are pre-sorted by prefix, so the result is now 
> sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) 
> comparison sort on this very special case where you call it multiple times on 
> data that has been presorted by prefix into buckets.
> 
> Is there any other benefit to this or did I get everything?
> 
> By the way, it is offtopic for the thread, but it occurs to me that a hybrid 
> of radix sort and a comparison-based sort could give us a general sorting 
> algorithm that is asymptotically faster than O(nlogn).
>> [1] https://bugs.gentoo.org/534528
>> --
>> Best regards,
>> Michał Górny




[gentoo-dev] Last rites: sci-chemistry/aria, sci-chemistry/ccpn, sci-chemistry/ccpn-data, sci-chemistry/cns

2019-10-19 Thread Michał Górny
# Michał Górny  (2019-10-19)
# sci-chemistry/ccpn is unfetchable and mirror-restricted.
# sci-chemistry/aria is its reverse dependency which can't be installed
# as a result.
# sci-chemistry/cns is fetch-restricted and the package request form
# is dead.  Also, its only dependency is aria.
# Removal in 30 days.  Bug #695784.
sci-chemistry/aria
sci-chemistry/ccpn
sci-chemistry/ccpn-data
sci-chemistry/cns

-- 
Best regards,
Michał Górny





[gentoo-dev] Last rites: games-fps/postal2

2019-10-19 Thread Michał Górny
# Michał Górny  (2019-10-19)
# The Linux installer/update is unfetchable, and can't be redistributed.
# Removal in 30 days.  Bug #695778.
games-fps/postal2

-- 
Best regards,
Michał Górny





[gentoo-dev] Last rites: sci-biology/ariadne

2019-10-19 Thread Michał Górny
# Michał Górny  (2019-10-19)
# Upstream homepage and sources are gone.  The license raises doubt
# as to whether we can redistribute it.  No new releases since being
# added in 2005.
# Removal in 30 days.  Bug #694926.
sci-biology/ariadne

-- 
Best regards,
Michał Górny





[gentoo-dev] Last rites: sys-auth/sakcl

2019-10-19 Thread Michał Górny
# Michał Górny  (2019-10-19)
# Unmaintained package with incorrect LICENSE and a failing build
# (#679204).
# Removal in 30 days.  Bug #694450.
sys-auth/sakcl

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Michał Górny
On Sat, 2019-10-19 at 15:31 +0200, Fabian Groffen wrote:
> Hi,
> 
> On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> > 3. Directly fetching files from distfiles.gentoo.org will become
> > a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> > to use something like:
> > 
> > $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> > 1b
> > $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> > ...
> > 
> > 
> > Alternatively, you can:
> > 
> > $ wget http://distfiles.gentoo.org/distfiles/INDEX
> > 
> > and grep for the right path there.  This INDEX is also a more
> > lightweight alternative to HTML indexes generated by the servers.
> 
> Would it be possible to run a service that sends a 302 for
> distfiles/foo-1.tar.gz to the appropriate bucket, so that manual
> fetching doesn't require calculating the hash?
> 
> I prototyped this myself for distfiles.prefix, and it seems like a nice
> gesture for at least the transition period?
> 

That would only work for servers whose admins explicitly install
the service, i.e. not for anyone using GENTOO_MIRRORS.  If you're
talking purely about distfiles.gentoo.org, we may add something like
that by the end of the transitional period.
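
For reference, the manual lookup quoted above (and the mapping a redirect service would need) can be sketched in Python. This is only an illustration: it assumes the bucket is the first two hex characters of the BLAKE2b-512 digest of the filename, as in the quoted `b2sum | cut -c1-2` command. The function names and the base URL default are hypothetical. The layout a given mirror really uses is defined by that mirror, not by this sketch.

```python
import hashlib


def hashed_bucket(filename: str, hex_chars: int = 2) -> str:
    """First `hex_chars` hex digits of the BLAKE2b-512 digest of the name."""
    digest = hashlib.blake2b(filename.encode("utf-8")).hexdigest()
    return digest[:hex_chars]


def distfile_url(filename: str,
                 base: str = "http://distfiles.gentoo.org/distfiles") -> str:
    """Build the bucketed URL for a distfile (assumed layout, see above)."""
    return f"{base}/{hashed_bucket(filename)}/{filename}"


print(distfile_url("foo-1.tar.gz"))
```

A redirect service of the kind discussed here would answer a request for distfiles/<name> with a 302 pointing at the URL computed this way.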

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Fabian Groffen
Hi,

On 18-10-2019 15:41:32 +0200, Michał Górny wrote:
> 3. Directly fetching files from distfiles.gentoo.org will become
> a little harder.  To fetch a distfile named 'foo-1.tar.gz', you'd have
> to use something like:
> 
> $ printf '%s' foo-1.tar.gz | b2sum | cut -c1-2
> 1b
> $ wget http://distfiles.gentoo.org/distfiles/1b/foo-1.tar.gz
> ...
> 
> 
> Alternatively, you can:
> 
> $ wget http://distfiles.gentoo.org/distfiles/INDEX
> 
> and grep for the right path there.  This INDEX is also a more
> lightweight alternative to HTML indexes generated by the servers.

Would it be possible to run a service that sends a 302 for
distfiles/foo-1.tar.gz to the appropriate bucket, so that manual
fetching doesn't require calculating the hash?

I prototyped this myself for distfiles.prefix, and it seems like a nice
gesture for at least the transition period?

Thanks,
Fabian


-- 
Fabian Groffen
Gentoo on a different level




Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Richard Yao



> On Oct 19, 2019, at 2:17 AM, Michał Górny  wrote:
> 
> On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
>>> 
>>> On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
>>>> On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
>>> Hi, everybody.
>>> It is my pleasure to announce that yesterday (EU) evening we've switched
>>> to a new distfile mirror layout.  Users will be switching to the new
>>> layout either as they upgrade Portage to 2.3.77 or -- if they upgraded
>>> already -- as their caches expire (24hrs).
>>> The new layout is mostly a bow towards mirror admins, for some of whom
>>> having a 6+ files in a single directory have been a problem.
>>> However, I suppose some of you also found e.g. the directory index
>>> hardly usable due to its size.
>>>> This sounds like a filesystem issue. Do we know which filesystems are 
>>>> suffering?
>>>> ZFS should be fine. I believe ext2/ext3 have problems with this many 
>>>> files. ext4 is probably okay, but don’t quote me on that.
>>> 
>>> Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
>>> may apply only to older ntfs versions.  NFS has been mentioned too.
>> 
>> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
>> use them for a mirror). NTFS and NFS are though.
> 
> Are you surprised that people use NTFS on Windows?  Or that they use
> local mirrors over NFS?  The latter still needs to be addressed
> separately, provided that they mount it on DISTDIR.
I am surprised that it was an issue on NTFS because it uses B-trees. As for 
NFS, I had expected that to be more dependent on the local filesystem than on 
NFS itself. If it has a slowdown when used on a filesystem that had fast 
directory operations, that might be a bug.
> 
>>> However, just because modern filesystems can handle them efficiently, it
>>> doesn't mean having directories that huge comes with zero cost.
>> While I am okay with the change, what do you mean when you say that having 
>> huge directories does not come with zero cost?
>> 
>> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
>> this
> 
> O(1) or O(n)?
ZFS uses extendible hashing for its directories, so the data structure used is 
amortized O(1). You might consider it O(log n) due to the indirect tree 
traversal needed to find the direct block containing the hash table entry. With 
caching of indirect blocks, it should be amortized O(1) to find the direct 
block in practice as far as read IOs are considered. In addition, the base of 
the logarithm is 128 or 1024 depending on the pool feature flags.
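
To give a feel for what a logarithm with base 128 or 1024 means in practice, here is a small Python sketch. It is plain arithmetic, not a model of the ZFS on-disk format: tree_levels is a hypothetical helper that counts how many levels a tree with a given fanout needs to address a given number of entries.

```python
def tree_levels(entries: int, fanout: int) -> int:
    """Smallest k such that fanout**k >= entries (integer arithmetic only)."""
    levels = 0
    capacity = 1
    while capacity < entries:
        capacity *= fanout
        levels += 1
    return levels


for fanout in (128, 1024):
    print(fanout, tree_levels(1_000_000, fanout))
```

With a fanout this large the level count grows very slowly, which is why the lookup cost is close to constant for realistic directory sizes.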
> 
>> , but the impact should be negligible. Filesystems with O(log n) directory 
>> lookups would see faster directory lookups.
>> 
>> Outside of directory lookups, this could speed up searches and sort 
>> operations when listing everything with just about any filesystem benefiting 
>> from the improvement.
>> 
>> Listing directories on such filesystems should not benefit from this unless 
>> you are using ls where the default behavior is to sort the directory 
>> contents (which is where the improvement when sorting comes into play). The 
>> need to sort the directory contents by default keeps ls from displaying 
>> anything until it has scanned the entire directory. The asymptotic 
>> complexity of a fast comparison based sort improves in this situation from 
>> O(nlogn) to O(nlog(n/b)) provided that you sort each subdirectory 
>> independently. A further speed up could be obtained by doing multithreading 
>> to parallelize the sort operations.
>> 
>> Since I know someone will call me out on that comment, I will explain. Each 
>> bucket has roughly n/b items in it where n is the total number and b is the 
>> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort 
>> each of the b buckets. The buckets are pre-sorted by prefix, so the result 
>> is now sorted. You therefore get O(nlog(n/b)) time complexity out of an 
>> O(nlogn) comparison sort on this very special case where you call it 
>> multiple times on data that has been presorted by prefix into buckets.
>> 
>> Is there any other benefit to this or did I get everything?
> 
> Listings for individual directories won't cause major pain to browsers
> anymore.  Not that there's much reason to do them.
That makes sense.
> 
> All kinds of per-directory operations will consume less memory
> and be potentially faster.
Userland would save memory when sorting or grepping a directory listing by 
virtue of having to process less data for grep and less data at a time for 
sorting (if it takes advantage of this). That would have performance benefits 
in userland.
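
The sorting bound quoted earlier in this message can be written out as follows, assuming the n names are spread evenly over b buckets (an idealization; real buckets are only roughly balanced):

```latex
\[
\sum_{i=1}^{b} O\!\left(\frac{n}{b}\log\frac{n}{b}\right)
  = O\!\left(b \cdot \frac{n}{b}\log\frac{n}{b}\right)
  = O\!\left(n\log\frac{n}{b}\right)
  = O\!\left(n\log n - n\log b\right)
\]
```

The saving over a single O(n log n) sort is therefore the n log b term. For a fixed number of buckets that is a constant-factor improvement rather than a change of complexity class.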

The kernel would have little memory savings and in some cases might be slightly 
worse. It is negligible. Performance in the kernel ought to be slightly better 
on filesystems with O(log n) directory operations, but I would only 

Re: [gentoo-dev] New distfile mirror layout

2019-10-19 Thread Michał Górny
On Fri, 2019-10-18 at 21:09 -0400, Richard Yao wrote:
> > On Oct 18, 2019, at 4:49 PM, Michał Górny  wrote:
> > 
> > On Fri, 2019-10-18 at 15:53 -0400, Richard Yao wrote:
> > > > > > > On Oct 18, 2019, at 9:42 AM, Michał Górny  wrote:
> > > > > > Hi, everybody.
> > > > > > It is my pleasure to announce that yesterday (EU) evening we've
> > > > > > switched to a new distfile mirror layout.  Users will be switching
> > > > > > to the new layout either as they upgrade Portage to 2.3.77 or --
> > > > > > if they upgraded already -- as their caches expire (24hrs).
> > > > > > The new layout is mostly a bow towards mirror admins, for some of
> > > > > > whom having a 6+ files in a single directory have been a problem.
> > > > > > However, I suppose some of you also found e.g. the directory index
> > > > > > hardly usable due to its size.
> > > This sounds like a filesystem issue. Do we know which filesystems are 
> > > suffering?
> > > ZFS should be fine. I believe ext2/ext3 have problems with this many 
> > > files. ext4 is probably okay, but don’t quote me on that.
> > 
> > Ext2, VFAT and NTFS were mentioned on the bug [1], though I suppose this
> > may apply only to older ntfs versions.  NFS has been mentioned too.
> 
> ext2 and vfat are not surprises to me (outside of the idea that anyone would 
> use them for a mirror). NTFS and NFS are though.

Are you surprised that people use NTFS on Windows?  Or that they use
local mirrors over NFS?  The latter still needs to be addressed
separately, provided that they mount it on DISTDIR.

> > However, just because modern filesystems can handle them efficiently, it
> > doesn't mean having directories that huge comes with zero cost.
> While I am okay with the change, what do you mean when you say that having 
> huge directories does not come with zero cost?
> 
> Filesystems with O(1) directory lookups like ZFS would probably be hurt by 
> this

O(1) or O(n)?

> , but the impact should be negligible. Filesystems with O(log n) directory 
> lookups would see faster directory lookups.
> 
> Outside of directory lookups, this could speed up searches and sort 
> operations when listing everything with just about any filesystem benefiting 
> from the improvement.
> 
> Listing directories on such filesystems should not benefit from this unless 
> you are using ls where the default behavior is to sort the directory contents 
> (which is where the improvement when sorting comes into play). The need to 
> sort the directory contents by default keeps ls from displaying anything 
> until it has scanned the entire directory. The asymptotic complexity of a 
> fast comparison based sort improves in this situation from O(nlogn) to 
> O(nlog(n/b)) provided that you sort each subdirectory independently. A 
> further speed up could be obtained by doing multithreading to parallelize the 
> sort operations.
> 
> Since I know someone will call me out on that comment, I will explain. Each 
> bucket has roughly n/b items in it where n is the total number and b is the 
> number of buckets. Sorting one bucket is O(n/b * log(n/b)). Loop to sort each 
> of the b buckets. The buckets are pre-sorted by prefix, so the result is now 
> sorted. You therefore get O(nlog(n/b)) time complexity out of an O(nlogn) 
> comparison sort on this very special case where you call it multiple times on 
> data that has been presorted by prefix into buckets.
> 
> Is there any other benefit to this or did I get everything?

Listings for individual directories won't cause major pain to browsers
anymore.  Not that there's much reason to do them.

All kinds of per-directory operations will consume less memory
and be potentially faster.

-- 
Best regards,
Michał Górny


