Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Emma Strubell
I completely forgot about Google's Summer of Code! Thanks for reminding me.
Hopefully I won't forget again by the time summer rolls around; obviously I
wouldn't mind getting a little extra money for doing something I'd do for
free anyway.

On a more related note: What, exactly, does porttree.py do? And am I correct
in thinking that my suffix tree(s) should somewhat replace porttree.py? Or,
should I be using porttree.py in order to populate my tree? I think I have
the suffix tree sufficiently figured out, I'm just trying to determine
where, exactly, the tree will fit into the portage code, and what the best
way to populate it (with package names and some corresponding metadata)
would be.

On Mon, Dec 1, 2008 at 2:34 AM, Duncan [EMAIL PROTECTED] wrote:

 Emma Strubell [EMAIL PROTECTED] posted
 [EMAIL PROTECTED], excerpted
 below, on  Sun, 30 Nov 2008 18:42:11 -0500:

  i am really
  interested in contributing to Gentoo and portage in the future, though.
  I'm thinking this summer I'll have a chance...

 FWIW, Gentoo usually participates in the Google Summer of Code.  Assuming
 they have it again next year, if you're already considering spending some
 time on Gentoo code this summer, might as well try to get paid a little
 something for it.  It could/should be a nice resume booster, too. =:^)

 --
 Duncan - List replies preferred.   No HTML msgs.
 Every nonfree program has a lord, a master --
 and if you use the program, he is your master.  Richard Stallman





Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Zac Medico

Emma Strubell wrote:
 I completely forgot about Google's Summer of Code! Thanks for reminding me.
 Hopefully I won't forget again by the time summer rolls around, obviously I
 wouldn't mind getting a little extra money for doing something I'd do for
 free anyway.
 
 On a more related note: What, exactly, does porttree.py do? And am I correct
 in thinking that my suffix tree(s) should somewhat replace porttree.py? Or,
 should I be using porttree.py in order to populate my tree?

You should use porttree.py to populate it. Specifically, you should
use portdbapi.aux_get() calls to access the package metadata that
you'll need, similar to how the code in the existing search class
accesses it.
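
A rough sketch of what that population step might look like. cp_all(), cp_list() and aux_get() are the portdbapi calls; build_index and FakePortdb are names made up here so the snippet runs without portage installed, and the "one description per package" layout is only an assumption about what the index needs:

```python
def build_index(portdb):
    """Map each category/package name to the DESCRIPTION of its last listed version."""
    index = {}
    for cp in portdb.cp_all():
        versions = portdb.cp_list(cp)
        if not versions:
            continue
        # aux_get() returns a list of values, one per requested metadata key.
        (description,) = portdb.aux_get(versions[-1], ["DESCRIPTION"])
        index[cp] = description
    return index


class FakePortdb:
    """Hypothetical stand-in that mimics the three portdbapi calls used above."""

    def __init__(self, packages):
        # packages: {"category/package": {"category/package-version": description}}
        self._packages = packages

    def cp_all(self):
        return sorted(self._packages)

    def cp_list(self, cp):
        return sorted(self._packages[cp])

    def aux_get(self, cpv, keys):
        # Good enough for the fake data below (no "-r1" style revisions).
        cp = cpv.rsplit("-", 1)[0]
        metadata = {"DESCRIPTION": self._packages[cp][cpv]}
        return [metadata[key] for key in keys]


fake = FakePortdb({
    "app-editors/vim": {
        "app-editors/vim-7.2": "Vim, an improved vi-style text editor",
    },
    "sys-apps/portage": {
        "sys-apps/portage-2.1.6": "The package management system for Gentoo",
    },
})
index = build_index(fake)
```

With a real tree, the same build_index() would be handed the portdbapi instance instead of the fake.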

 I think I have
 the suffix tree sufficiently figured out, I'm just trying to determine
 where, exactly, the tree will fit in to the portage code, and what the best
 way to populate it (with package names and some corresponding metadata)
 would be.

There are three possible times that I imagine a person might want to
populate it:

1) Automatically after emerge --sync. This should not be mandatory
since it will be somewhat time consuming and some users are very
sensitive about --sync time. Note that FEATURES=metadata-transfer is
disabled by default in the latest versions of portage, specifically
to reduce --sync time.

2) On demand, when emerge --search is invoked. The calling user will
need appropriate file system permissions in order to update the
search index.

3) On request, by calling a command that is specifically designed to
generate the search index. This could be a subcommand of emaint.

For the index file format, it would be simplest to use a python
pickle file, but you might choose another format if you'd like the
index to be accessible without python and the portage API (probably
not necessary).
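
A minimal sketch of the pickle option, assuming the index is an ordinary Python object such as a dict. save_index, load_index and the file name are hypothetical; the write-then-rename step is an extra precaution, not something required by pickle:

```python
import os
import pickle
import tempfile


def save_index(index, path):
    # Write to a temporary file in the same directory, then rename it into
    # place, so a concurrent reader never sees a half-written index.
    directory = os.path.dirname(path) or "."
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(index, f, protocol=pickle.HIGHEST_PROTOCOL)
    os.replace(tmp_path, path)


def load_index(path):
    """Return the unpickled index, or None if the file is missing or truncated."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except (OSError, EOFError, pickle.UnpicklingError):
        return None


demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "search_index.pickle")
save_index({"app-editors/vim": "Vim, an improved vi-style text editor"}, demo_path)
restored = load_index(demo_path)
```

Returning None for a missing index lets the caller fall back to the existing, unindexed search.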
--
Thanks,
Zac



Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Emma Strubell
Thanks for the clarification. I was planning on forcing an update of the
index as a part of emerge --sync, and implementing a command that would
update the search index (leaving it up to the user to update after making
any manual changes to the portage tree). That way the search index should
always be up-to-date when emerge -s is called. It does make sense for the
update upon --sync to be optional, but I guess I don't see why the update
should always be SO slow. Of course the first population of the tree will
take quite a while, but assuming regular (daily?) --syncs (and therefore
updates to the index), subsequent updates shouldn't take very long, since
there will only be a few (hundred?) changes to be made to the tree.
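
To make the "only a few changes" idea concrete, here is a sketch of an incremental update that touches only the entries that appeared or disappeared since the last run. It assumes the index is a dict keyed by category/package-version strings; diff_packages, update_index and fetch_description are hypothetical names:

```python
def diff_packages(old_cpvs, new_cpvs):
    """Return (added, removed) package versions between two snapshots."""
    old, new = set(old_cpvs), set(new_cpvs)
    return sorted(new - old), sorted(old - new)


def update_index(index, current_cpvs, fetch_description):
    """Bring `index` in line with `current_cpvs`, fetching metadata only for new entries."""
    added, removed = diff_packages(index, current_cpvs)
    for cpv in removed:
        del index[cpv]
    for cpv in added:
        index[cpv] = fetch_description(cpv)
    return added, removed


demo_index = {"a/b-1": "old b", "c/d-1": "d"}
demo_added, demo_removed = update_index(
    demo_index, ["c/d-1", "a/b-2"], lambda cpv: "desc of " + cpv
)
```

This does not notice an ebuild whose metadata changed without a version bump; catching that would need a checksum or mtime comparison on top.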

And I do plan on pickling the search tree :]

Emma




Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Tambet
I would suggest a different way of handling updates. If you change the portage
tree manually, you have to do it in an overlay. An overlay, being updated and
managed by a human, will always be small (unless someone writes a script that
creates a million overlay updates, but I don't think that would be an efficient
way to do anything). So, when you search, you can search the Portage tree using
the index, which is updated on --sync, and then search the overlay, which is
small and fast to search anyway. The overlay should not have an index in that
case. If anyone changes the portage tree itself by hand, those changes will be
lost with the next --sync, so no one should do it anyway - this case should not
be considered at all.

Tambet - technique evolves to art, art evolves to magic, magic evolves to
just doing.





Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Emma Strubell
Good point. I may just ignore overlays completely because 1) I don't use
them and 2) does anyone really need to search an overlay anyway? Aren't any
packages added via an overlay added deliberately?



Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread René 'Necoro' Neumann

Emma Strubell wrote:
 2) does anyone really need to search an overlay anyway?

Of course. Take large (semi-)official overlays like sunrise. They can
easily be seen as a second portage tree.



Re: [gentoo-portage-dev] Time to say goodbye

2008-12-01 Thread Ned Ludd

On Sun, 2008-11-30 at 16:19 +0100, Marius Mauch wrote:
 So, time has come for me to realize that my time with Gentoo is over. I
 haven't actually been doing much Gentoo work over the last months due
 to personal reasons (nothing Gentoo related), and I don't see that
 situation changing in the near future. In fact I've already reassigned
 or dropped most of my responsibilities in Gentoo a while ago, so there
 are just a few pet projects left to give away:
 - my gentoo-stats project (in the portage/gentoo-stats svn repository).
 I know quite a few people are interested in the idea of collecting
 various statistic data from gentoo user systems, and I'd encourage
 everyone who wants to implement such a system to at least look at it (I
 may have even finished it if I wouldn't have wasted my time focusing on
 the wrong problems). There is quite a bit of documentation also that
 should help to get you started
 - a graphical security update tool (see bug #190397)
 
 So if anyone wants to adopt those, complete or just parts, just take
 them. As for Portage, Zac has practically already filled my role.
 
 So I guess that wraps it up. It's been a nice ride most of the time,
 but now it's time for me to leave the Gentoo train.
 
 Marius

I will always remember you as the guy who provided us with the much
needed glsa*.py (thank you again)
Take care and I wish you the best in all your future endeavors.



-- 
Ned Ludd [EMAIL PROTECTED]
Gentoo Linux




Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Tambet
2008/12/2 Emma Strubell [EMAIL PROTECTED]

 True, true. Like I said, I don't really use overlays, so excuse my
 ignorance.


Do you know the order of doing things?

Rules of Optimization:

   - Rule 1: Don't do it.
   - Rule 2 (for experts only): Don't do it yet.

What this actually means: functionality comes first, readability comes
next, and optimization comes last. Unless you are creating a fancy 3D engine
for a kung fu game.

If you exclude overlays, you are removing functionality - and, indeed,
absolutely has-to-be-there functionality, because no one would intuitively
expect a search function to search only a subset of packages, however
reasonable that subset might be. So you can't, just can't, add this to the
portage base that way - you could only write yet another external search
package for portage.

I looked at the code a bit:
Portage's __init__.py contains the comment "# search functionality". After
this comment there is a nice and simple search class.
It also contains the method def action_sync(...), which contains the
synchronization stuff.

The search class is initialized by setting up 3 databases - porttree,
bintree and vartree, whatever those are. Those go into the self._dbs list,
and porttree goes into self._portdb.

It contains some more methods:
_findname(...) returns the result of self._portdb.findname(...) called with
the same parameters, or None if it does not exist.
Other methods do similar things - each maps to one method or another.
execute does the real search...
Now, "for package in self.portdb.cp_all()" is the important part here... it
currently loops over the whole portage tree. All kinds of matching are done
inside that loop.
self.portdb obviously points to porttree.py (unless it points to a fake tree).
cp_all takes all porttrees and does a simple file search inside each. This
method should contain the optional index search.

    self.porttrees = [self.porttree_root] + \
        [os.path.realpath(t) for t in
         self.mysettings["PORTDIR_OVERLAY"].split()]

So, self.porttrees contains a list of trees - the first of them is the root,
the others are overlays.

Now, what you have to do will not be any harder just because of having overlay
search, too.

You have to create a method def cp_index(self), which will return a dictionary
with package names as keys. The "for oroot..." loop will then run over
self.porttrees[1:] instead of self.porttrees, so it only searches the overlays,
and "d = {}" will be replaced with "d = self.cp_index()". If the index is not
there, the old version will be used (so you have to make an internal porttrees
variable, which contains either all trees or all except the first).
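
A sketch of that cp_all change, under loud assumptions: TreeLister is a hypothetical, stripped-down stand-in for the relevant part of portdbapi, scan_tree stands in for the existing per-tree directory walk, and the index is assumed to cover the main tree only:

```python
class TreeLister:
    """Hypothetical stand-in for the part of portdbapi that lists packages."""

    def __init__(self, porttrees, scan_tree, index=None):
        self.porttrees = porttrees    # main tree first, overlays after it
        self._scan_tree = scan_tree   # callable: tree path -> iterable of "category/package"
        self._index = index           # dict keyed by "category/package", or None

    def cp_index(self):
        """Return a copy of the main-tree index, or None when no index exists."""
        if self._index is None:
            return None
        return dict(self._index)

    def cp_all(self):
        d = self.cp_index()
        if d is None:
            # No index: fall back to scanning every tree, as the old code does.
            d = {}
            trees = self.porttrees
        else:
            # The index already covers the main tree; only overlays need a scan.
            trees = self.porttrees[1:]
        for tree in trees:
            for cp in self._scan_tree(tree):
                d.setdefault(cp, None)
        return sorted(d)


scanned = []
fake_contents = {
    "/usr/portage": ["app-editors/vim"],
    "/usr/local/overlay": ["app-misc/foo"],
}


def fake_scan(tree):
    scanned.append(tree)
    return fake_contents[tree]


trees = ["/usr/portage", "/usr/local/overlay"]
indexed = TreeLister(trees, fake_scan, index={"app-editors/vim": None})
indexed_result = indexed.cp_all()
scanned_with_index = list(scanned)
```

Because the overlays are still scanned on every call, overlay packages show up in searches whether or not an index exists.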

The other methods used by search are xmatch and aux_get - the first is used
several times and the last is used to get the description. You have to cache
the results of those specific queries and make them use your cache - as you can
see, those parts of portage are already able to use overlays. So, again, you
have to put your code at the beginning of those functions: create index_xmatch
and index_aux_get methods, then make xmatch and aux_get call them and return
their results unless those are None (or some other marker, in case None is
already a legal result) - if they return None, the old code runs and does its
job. If the index has not been created, the result is None. In the index_**
methods, just check whether the query is one you can answer and, if it is,
answer it.
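
The "try the index first, otherwise run the old code" pattern could look roughly like this. IndexedMetadata, slow_aux_get and the (cpv, key) index layout are hypothetical; a private sentinel is used instead of None so that None could still be a cached answer:

```python
_MISS = object()  # sentinel meaning "the index cannot answer this query"


class IndexedMetadata:
    """Hypothetical wrapper: answer aux_get from an index when possible."""

    def __init__(self, slow_aux_get, index=None):
        self._slow_aux_get = slow_aux_get  # the existing, uncached lookup
        self._index = index                # {(cpv, key): value}, or None if not built

    def index_aux_get(self, cpv, keys):
        if self._index is None:
            return _MISS
        values = [self._index.get((cpv, key), _MISS) for key in keys]
        if any(value is _MISS for value in values):
            # Partial answers are not returned; the old code handles the query.
            return _MISS
        return values

    def aux_get(self, cpv, keys):
        result = self.index_aux_get(cpv, keys)
        if result is not _MISS:
            return result
        return self._slow_aux_get(cpv, keys)


slow_calls = []


def slow_lookup(cpv, keys):
    slow_calls.append(cpv)
    return ["slow:" + key for key in keys]


meta = IndexedMetadata(slow_lookup, {("a/b-1", "DESCRIPTION"): "fast"})
fast_answer = meta.aux_get("a/b-1", ["DESCRIPTION"])
```

An index_xmatch method would follow the same shape, keyed on whatever arguments xmatch is called with.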

Obviously, the simplest way to create your index is to delete the index, then
use those same methods to query for all the necessary information - and the
fastest way would be to add index updating directly into sync, which you could
do later.

Please also add commands to turn the index on and off (the latter should
also delete it to save disk space). The default should be off until it's fast,
small and reliable. Also note that if the index is kept on the hard drive, it
might be faster if it's compressed (gz, for example) - decompressing takes
less time, at the cost of more processing power, than reading it out in full.
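
For the compression part, a small sketch using Python's standard gzip and pickle modules. dumps_compressed, loads_compressed and the sample data are made up for illustration; whether the compressed form actually loads faster depends on the disk and CPU, so it would need measuring:

```python
import gzip
import pickle


def dumps_compressed(index):
    """Pickle the index and gzip the resulting bytes."""
    return gzip.compress(pickle.dumps(index, protocol=pickle.HIGHEST_PROTOCOL))


def loads_compressed(blob):
    """Inverse of dumps_compressed()."""
    return pickle.loads(gzip.decompress(blob))


# Repetitive sample data, loosely shaped like package names and descriptions.
sample_index = {
    "category/package-%d" % i: "A description of package %d" % i
    for i in range(500)
}
raw = pickle.dumps(sample_index, protocol=pickle.HIGHEST_PROTOCOL)
blob = dumps_compressed(sample_index)
```

Writing blob to disk in binary mode and reading it back gives the on-disk variant; gzip.open() would do the same in a streaming fashion.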

Good luck!





Re: [gentoo-portage-dev] Re: search functionality in emerge

2008-12-01 Thread Emma Strubell
Yes, yes, I know, you're right :]

And thanks a bunch for the outline! About the compression, I agree that it
would be a good idea, but I don't know how to implement it. Not that it
would be difficult... I'm guessing there's a gzip module for python that
would make it pretty straightforward? I think I'm getting ahead of myself,
though. I haven't even implemented the suffix tree yet!
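
As a placeholder until a real suffix tree exists, substring search over package names can be sketched with a sorted list of suffixes and a binary search. This is a naive stand-in, not a suffix tree: it stores every suffix of every name, so it uses far more memory than a proper tree would. build_suffix_array and find_substring are hypothetical names:

```python
import bisect


def build_suffix_array(names):
    """Return a sorted list of (suffix, full name) pairs for all names."""
    entries = []
    for name in names:
        lowered = name.lower()
        for start in range(len(lowered)):
            entries.append((lowered[start:], name))
    entries.sort()
    return entries


def find_substring(entries, pattern):
    """Return the names that contain `pattern`, case-insensitively."""
    pattern = pattern.lower()
    # All suffixes that start with `pattern` form one contiguous run in the
    # sorted list; bisect finds where that run begins.
    position = bisect.bisect_left(entries, (pattern, ""))
    matches = set()
    for suffix, name in entries[position:]:
        if not suffix.startswith(pattern):
            break
        matches.add(name)
    return sorted(matches)


names = ["app-editors/vim", "app-editors/gvim", "sys-apps/portage"]
entries = build_suffix_array(names)
vim_hits = find_substring(entries, "vim")
```

A name contains the pattern exactly when one of its suffixes starts with it, which is why a prefix lookup over suffixes is enough.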

Emma

 On Mon, Dec 1, 2008 at 5:17 PM, René 'Necoro' Neumann [EMAIL