Re: Fwd: [gentoo-portage-dev] search functionality in emerge

2009-02-13 Thread Marijn Schouten (hkBst)
Emma Strubell wrote:
> Hi!
> 
> If I can find an unpickler that can unpickle at a reasonable speed, my
> search implementation would be significantly faster than the one currently
> in use. I'd show you my code, but I have to admit I'm intimidated by Alec's
> recent picking apart of Doug's code! For example, I don't even know how to
> use docstrings... The code probably could be cleaned up a lot in general
> since I was eventually just trying to get it to work before it was due.

Please don't be intimidated by it. Code review is one of the best ways to
improve your skills. We all sucked at programming at one time, and perhaps we
still suck at anything but our favorite language. If we are to improve, we
need to spend a lot of time reading and writing code, and even then we will
not always get it right. Furthermore, other people learn from our code reviews,
just as you learnt about docstrings. Here[1] is a quick explanation of them.

Have fun,

Marijn

[1]:http://epydoc.sourceforge.net/docstrings.html
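
As a small illustration of what a docstring looks like, here is a minimal
sketch. The function name find_packages and its behaviour are made up for
this example and are not taken from portage or from anyone's code in this
thread.

```python
def find_packages(names, query):
    """Return the package names that contain ``query``.

    names -- an iterable of package name strings
    query -- the substring to look for (matched case-insensitively)
    """
    query = query.lower()
    return [name for name in names if query in name.lower()]


# The docstring is stored on the function object; help(find_packages)
# displays it, and tools such as epydoc read it to generate documentation.
print(find_packages.__doc__.splitlines()[0])
print(find_packages(["app-editors/vim", "app-editors/emacs"], "VIM"))
```

The string literal directly below the "def" line is the docstring; nothing
else is needed to "use" one.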

-- 
Sarcasm puts the iron in irony, cynicism the steel.

Marijn Schouten (hkBst), Gentoo Lisp project, Gentoo ML,
#gentoo-{lisp,ml} on FreeNode



Re: Fwd: [gentoo-portage-dev] search functionality in emerge

2009-02-12 Thread Emma Strubell
Oh, I meant to say, I was indeed using cPickle. That was my first thought as
well, that for some reason pickle was being loaded instead of cPickle, but
no.

On Thu, Feb 12, 2009 at 4:05 PM, Mike Auty  wrote:

> Emma Strubell wrote:
> > There must be a faster python pickler out there, though, any
> > recommendations?
>
> Right off the bat, you might try cPickle.  It should work identically,
> just faster.  Also you can try importing psyco, if it's present it will
> try semi-compiling some bits and pieces and *might* offer some speed-ups
> (as in, it won't always, for small projects it might actually slow it
> down).
>
> Mike  5:)


Re: Fwd: [gentoo-portage-dev] search functionality in emerge

2009-02-12 Thread Mike Auty
Emma Strubell wrote:
> There must be a faster python pickler out there, though, any
> recommendations?

Right off the bat, you might try cPickle.  It should work identically,
just faster.  Also you can try importing psyco, if it's present it will
try semi-compiling some bits and pieces and *might* offer some speed-ups
(as in, it won't always, for small projects it might actually slow it down).
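
A minimal sketch of the two suggestions above, assuming nothing about the
actual project code: the "index" dictionary is a hypothetical stand-in for
whatever data is being pickled. On Python 3 there is no separate cPickle
module (the standard pickle module uses its C accelerator automatically), so
the fallback import keeps the snippet working on both.

```python
# Prefer the C implementation where it exists as a separate module
# (Python 2); otherwise fall back to the standard pickle module.
try:
    import cPickle as pickle
except ImportError:
    import pickle

# psyco is optional: use it only if it happens to be installed.
try:
    import psyco
    psyco.full()
except ImportError:
    pass

# Hypothetical stand-in for a search index: package name -> description.
index = {"app-editors/vim": "Vim, an improved vi-style text editor"}

# A binary protocol is typically faster to load than the text protocol 0.
blob = pickle.dumps(index, pickle.HIGHEST_PROTOCOL)
restored = pickle.loads(blob)
print(restored == index)
```

Whether this helps depends on the data: it is worth timing both the dump
and the load before drawing conclusions.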

Mike  5:)



Fwd: [gentoo-portage-dev] search functionality in emerge

2009-02-12 Thread Emma Strubell
Hi!

So, my project did result in... something. Nothing too impressive, though.
My implementation of search ended up being about the same speed as the
current implementation because of the pickle module. I finished my project
at the very last minute before it was due, so I didn't have time to find and
implement an alternative pickling/serialization module. There must be a
faster python pickler out there, though, any recommendations? After I turned
in my project I had final exams and then winter break, and I basically
didn't want to look at my code at all during that time. Now that you've
brought it up, though, I wouldn't mind working on it, perhaps polishing it
(okay, it needs more than polishing) so that it might actually be a nice
addition to portage? I'm not doing any coding for any of my classes this
semester (except for some assembler) so I definitely wouldn't mind working
on this on the side.

The reason why it will need (significantly) more work is because I basically
had no idea what I was getting into with the regex search. I implemented $
and *, if I remember correctly, and for anything else the search just
defaults to the current portage search. I don't know whether implementing
regex search with the suffix tree that I used to implement the search would
make sense... I'll have to think about it some more. In fact, I have nothing
else to do this rainy afternoon :]
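
The project code itself is not shown in this thread, so the following is
only a sketch of the general idea under stated assumptions: a naive suffix
trie over package names that answers substring queries, with an end-of-name
terminator so that a "$"-style anchor can be expressed. The class name
SuffixTrie, the END marker and the at_end parameter are invented for this
illustration.

```python
END = "\0"  # terminator marking the end of a name; assumed absent from names


class SuffixTrie:
    """Naive suffix trie: every suffix of every name is inserted."""

    def __init__(self):
        self.children = {}  # character -> SuffixTrie
        self.names = set()  # names with a suffix passing through this node

    def add(self, name):
        text = name + END
        for start in range(len(text)):
            node = self
            for ch in text[start:]:
                if ch not in node.children:
                    node.children[ch] = SuffixTrie()
                node = node.children[ch]
                node.names.add(name)

    def search(self, query, at_end=False):
        """Return names containing ``query``; with at_end, names ending in it."""
        if at_end:
            query += END
        node = self
        for ch in query:
            node = node.children.get(ch)
            if node is None:
                return set()
        return set(node.names)


trie = SuffixTrie()
for name in ["vim", "gvim", "emacs"]:
    trie.add(name)
print(sorted(trie.search("vim")))               # substring match
print(sorted(trie.search("vi", at_end=True)))   # like the regex "vi$"
```

A lookup costs time proportional to the query length, but this naive
structure uses memory quadratic in the length of each name, which is one
reason loading a serialized copy of it can dominate the run time.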

If I can find an unpickler that can unpickle at a reasonable speed, my
search implementation would be significantly faster than the one currently
in use. I'd show you my code, but I have to admit I'm intimidated by Alec's
recent picking apart of Doug's code! For example, I don't even know how to
use docstrings... The code probably could be cleaned up a lot in general
since I was eventually just trying to get it to work before it was due.

Thanks for asking, let me know what you think! (Also, sorry, René, for
sending this to you twice.)

Emma

On Thu, Feb 12, 2009 at 2:16 PM, René 'Necoro' Neumann wrote:

> Hey,
>
> has your project resulted in anything? :)
>
> Just curious about potentially nice portage additions ;)
>
> Regards,
> Necoro


Re: [gentoo-portage-dev] search functionality in emerge

2009-02-12 Thread René 'Necoro' Neumann
Hey,

has your project resulted in anything? :)

Just curious about potentially nice portage additions ;)

Regards,
Necoro

Emma Strubell wrote:
> Hi everyone. My name is Emma, and I am completely new to this list. I've
> been using Gentoo since 2004, including Portage of course, and before I say
> anything else I'd like to say thanks to everyone for such a kickass package
> management system!!
> 
> Anyway, for my final project in my Data Structures & Algorithms class this
> semester, I would like to modify the search functionality in emerge.
> Something I've always noticed about 'emerge -s' or '-S' is that, in general,
> it takes a very long time to perform the searches. (Although, lately it does
> seem to be running faster, specifically on my laptop as opposed to my
> desktop. Strangely, though, it seems that when I do a simple 'emerge -av
> whatever' on my laptop it takes a very long time for emerge to find the
> package and/or determine the dependencies - whatever it's doing behind that
> spinner. I can definitely go into more detail about this if anyone's
> interested. It's really been puzzling me!) So, as my final project I've
> proposed to improve the time it takes to perform a search using emerge. My
> professor suggested that I look into implementing indexing.
> 
> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone on here
> could give me a quick overview of how the search function currently works,
> an idea as to what could be modified or implemented in order to improve the
> running time of this code, or any tip really as to where I should start or
> what I should start looking at. I'd really appreciate any help or advice!!
> 
> Thanks a lot, and keep on making my Debian-using professor jealous :]
> Emma
> 




Re: [gentoo-portage-dev] search functionality in emerge

2008-11-30 Thread Emma Strubell
You guys all have some great ideas, but I don't think I'd have enough time
to be able to implement them before my project is due... especially because
they appear to be a bit beyond my current programming skills. I would love
to devote a lot more time to this project, but I just can't right now
because I already have a lot of other things on my plate. I am really
interested in contributing to Gentoo and portage in the future, though. I'm
thinking this summer I'll have a chance... Anyway, I'm going to try to keep
it simple and just implement a suffix trie, and hope that that provides some
measurable speed improvement :] Thanks again for everyone's help, though,
and I'll definitely share the (amateur and minimal, sorry!) results of my
project if you're interested.

Emma



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread tvali
Let me summarize briefly, since René didn't catch everything, so I must have
been unclear:

The portage tree has automatically updated parts, which should not be changed
by the user, and overlays, which will be. Thus, the index of the automatic
part needs to be updated only after "emerge --sync".

The speedup should include a custom filesystem, which could be called
PortageFS, for example. In the initial version, PortageFS uses the current
portage tree and generates additional indexes.

So, when you boot up, you have the portage tree in /usr/portage. At some
point, PortageFS is mounted onto the same directory, /usr/portage. It maps
the real /usr/portage directory into the /usr/portage mount point and creates
some additional folders like /usr/portage/search, which contains files used
to do the actual searches. /usr/portage/handler would be a file to which you
can write a query and from which you can read the result. The filesystem also
contains virtual files for checking dependencies and such - many things you
could use from your scripts.

While it is mounted, every change is noticed and the indexes are
automagically updated (sometimes after communication with portage - for
example, updates during "emerge --sync" should perhaps not happen
automagically, as that makes things slower). When it is not mounted, you can
change user files, but afterwards you may have to run some notification
script to rebuild the indexes.

The indexes are built into the FS.

If PortageFS is not mounted, for example because of some emergency reboot,
portage can work without indexes, using the real directory instead of this
mount point.


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread tvali
2008/11/24 René 'Necoro' Neumann <[EMAIL PROTECTED]>

> What you mentioned for the filesystem might be a nice thing (actually I
> started something like this some time ago [1] , though it is now dead
> ;)), but it does not help in the index/determine changes thing. It is
> just another API :).
>

My line of thought is that when this FS is mounted, it is the *only* portage
dir - so while this FS is mounted, all changes are noticed, because you make
all changes through that FS. Anyway, when you unmount it and remount, some
things might go wrong, and that's what I'm thinking about... but that's not a
big problem.


-- 
tvali

From some forum: http://www.cooltests.com - if you know English.
By the way, above 120 you are very smart, above 140 you are a genius, and at
around 170 your head is totally like a trash can...


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread René 'Necoro' Neumann
tvali wrote:
> But about filesystem...
> 
> [... snip lots of stuff ...]

What you mentioned for the filesystem might be a nice thing (actually I
started something like this some time ago [1] , though it is now dead
;)), but it does not help in the index/determine changes thing. It is
just another API :).

Perhaps the "index after sync" is sufficient for most of the userbase - but
especially those who often deal with their own local overlays (like me) do
not want to have to re-index manually, especially if re-indexing takes a long
time. The best solution would be to have portage find out a) THAT something
has changed and b) WHAT has changed, so that it only has to update those
parts of the index and does not become something that gets on the users'
nerves (remember the "Generate Metadata" stuff (or whatever it was called) in
older portage versions, which alone seemed to take longer than the rest of
the sync process).

Regards,
René

[1] https://launchpad.net/catapultfs



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread tvali
There is one clear problem:

   1. Some other app opens some portage file.
   2. Tree is mounted and indexed.
   3. Other app changes this file.
   4. Index is out-of-date.

To prevent this, it should first be recommended that all scripts change the
portage tree only after it is mounted. As a defence against those that don't
follow that recommendation, portage should simply not use the altered data -
portage should rely entirely on its internal index, and when you change some
file and the index is not updated, your change should be as good as lost.
Does this make the portage tree twice as big as it is?

I guess not, because:

   - USE flags can be indexed and referred to by number.
   - License, homepage and similar data do not need to be duplicated.

Also, since overlay directories are the recommended way anyway, is it needed
at all to check *all* files for updates? I think that when someone does
something wrong, it's OK if everything goes boom, and if someone has update
scripts that don't use overlays and the other recommended ways of doing
things, then adding one more thing that breaks is not bad. Hashing those few
files isn't a bad idea, and keeping an internal duplicate of the overlay
directory is not so bad either - then you need to run
"emerge --commithandmadeupdates" and that's all.

Some things that could be used to speed things up:

   - Dependency searches are saved - so that "emerge -p pck1 pck2 pck3"
   saves data about the deps of those 3 packages.
   - The package name list is saved.
   - All packages are given an integer ID.
   - A list of all words in the package descriptions is saved, with each
   word connected to the internal IDs of the packages using it. This could
   be used to make a smaller index file. So when I search for "al", all words
   containing those characters, like "all", are considered and the -S search
   runs only on those packages.
   - A hash of the whole portage tree is saved to detect whether it has
   changed since the last remount.
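
The word-list idea from the list above can be sketched roughly as follows.
This is an illustration only: the function names, the sample package
descriptions and the choice of the list position as the integer ID are all
assumptions made for the example.

```python
import re
from collections import defaultdict


def build_word_index(descriptions):
    """Map each lowercase word to the set of integer package IDs using it.

    descriptions -- list of (package_name, description) pairs; the position
    in the list serves as the package's integer ID.
    """
    index = defaultdict(set)
    for pkg_id, (_name, text) in enumerate(descriptions):
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(pkg_id)
    return index


def candidate_ids(index, fragment):
    """Return IDs of packages with at least one word containing ``fragment``."""
    fragment = fragment.lower()
    ids = set()
    for word, pkg_ids in index.items():
        if fragment in word:
            ids |= pkg_ids
    return ids


# Made-up sample data, not real ebuild descriptions.
packages = [
    ("app-editors/vim", "Vim, an improved vi-style text editor"),
    ("media-sound/alsa-utils", "ALSA command line utilities"),
    ("sys-apps/portage", "Package manager for all of Gentoo"),
]
index = build_word_index(packages)
print(sorted(candidate_ids(index, "al")))
```

Only the candidate packages would then need a full description search.
A query spanning several words would still have to be checked against the
complete description text, which this sketch does not do.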

Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread tvali
So, mornings are smarter than evenings (it's an Estonian saying)... at night,
I thought more about this filesystem thing and found that it actually answers
all needs. Now I have read some messages here and thought about how it could
be made really simple, at least as I understand this word. Yesterday I
searched for whether custom filesystems could have custom functionality and
did not find anything, so I wrote that list with a big bunch of classes, which
I now think might be overkill.

First, about indexing - if you create neither a daemon nor a filesystem, you
can create the commands "emerge --indexon", "emerge --indexoff" and
"emerge --indexrenew". Then the index is renewed on "emerge --sync" and the
like, but when the user changes files manually, she has to renew the index
manually - that's not much to ask, is it? If someone is going to open the
cover of her computer, she takes on the responsibility of knowing some basic
things about electricity, and that she should change something in the BIOS
after adding or removing some parts of the computer. Maybe it should even be
"emerge --commithandmadechanges", which would index or do whatever else is
needed after handmade changes. More such things might emerge in the future,
I guess.

But about filesystem...

Consider that when you have a filesystem, you might have some directory that
you cannot list, but in which you can read files. Imagine some function that
is able to encode queries into filesystem paths and decode them again.

If you have a function such as search(packagename, "dependencies"), you can
write it as a file path:
/cgi-bin/search/packagename/dependencies - and packagename can be encoded by
replacing some characters with some codes and separating long strings with
/. Also, you could have an API with one file in a directory from which you
can read some tmp filename, then write your query to that file and read the
result from the same file, or from a similarly named file with a different
extension. So, an FS provides several ways to create custom queries -
actually, that idea came up because the LUFS page mentioned the idea of
creating an FS as a CGI server, hence the "cgi-bin" prefix here, to keep
things simple. I think it's similar to how files in the /dev/ directory
behave - you open some file and start writing and reading, but the file
itself is zero-sized and contains nothing.
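
The path encoding described above might look like the following sketch. The
function names are hypothetical, percent-encoding is just one possible
choice of "replacing some characters with some codes", and the splitting of
long strings with "/" is not implemented here.

```python
from urllib.parse import quote, unquote


def encode_query(package, field, root="/cgi-bin/search"):
    """Encode a (package, field) query as a single file path."""
    # safe="" also escapes "/", so "category/name" stays one path component.
    return "/".join([root, quote(package, safe=""), quote(field, safe="")])


def decode_query(path, root="/cgi-bin/search"):
    """Inverse of encode_query; returns (package, field)."""
    package, field = path[len(root) + 1:].split("/")
    return unquote(package), unquote(field)


path = encode_query("app-editors/vim", "dependencies")
print(path)
print(decode_query(path))
```

A filesystem implementing this scheme would decode the path on open() and
generate the file contents on the fly.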

In that case, an API could be written that provides this filesystem and
nothing more. If the custom-mapped filesystem is mounted, it provides search
and similar directories, which can be used by portage and others. If not,
everything works as it used to.

So, we have a filesystem that contains things like these (I call this subdir
"dev" here):

   - /dev/search - write your query here and read the result.
   - /dev/search/searchstring - another way for the user to just read some
   listings with her custom script.
   - /portage/directory/category/packagename/depslist.dev - contains a
   dynamic list of package dependencies.
   - /dev/version - some integer, which grows every time any change to the
   portage tree is made.

Other functions would be added eventually.

Now, to keep things simple:

   - Create a standard filesystem that can be used to contain the portage
   tree.
   - Add all necessary notifications for changed and updated files.
   - *Mount this filesystem on the same dir where the actual files are placed -
   if it's not mounted, portage will hardly notice (so in an emergency,
   things are just slower). You can navigate to a directory, then mount a new
   one on it - I am not on a Linux box right now, but if I remember correctly,
   you can still use the files in the real directory after mounting something
   else there in this way.*
   - Create indexes and other stuff.

-- 
tvali

From some forum: http://www.cooltests.com - if you know English.
By the way, above 120 you are very smart, above 140 you are a genius, and at
around 170 your head is totally like a trash can...


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread Fabian Groffen
On 24-11-2008 10:34:28 +0100, René 'Necoro' Neumann wrote:
> tvali wrote:
> > There is daemon, which notices about filesystem changes -
> > http://pyinotify.sourceforge.net/ would be a good choice.
> 
> Disadvantage: Has to run all the time (I see already some people crying:
> "oh noez. not yet another daemon...").

... and it is Linux only, which spoils the fun.


-- 
Fabian Groffen
Gentoo on a different level



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-24 Thread René 'Necoro' Neumann
tvali wrote:
> There is daemon, which notices about filesystem changes -
> http://pyinotify.sourceforge.net/ would be a good choice.

Disadvantage: It has to run all the time (I can already see some people
crying: "oh noez. not yet another daemon..."). There is also a problem with
offline changes (which might be overcome by a one-time check on daemon
startup... but this would really increase the startup time).

I have built an algorithm which does something like:
for overlay in OVERLAYS + [PORTDIR]:
    db[overlay] = md5("".join(str(f.st_mtime) for f in files(overlay)))

and then compares the MD5 values on later runs.
This is fast if the portage stuff is already cached - otherwise it is quite
slow ;). Another disadvantage is that it does not know WHAT changes have
occurred and thus has to re-read the complete overlay.
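
A runnable sketch of this kind of check, not the original implementation:
the function names are invented, and unlike the pseudocode above the
fingerprint also includes the file paths, so that renames are detected as
well.

```python
import hashlib
import os


def tree_fingerprint(root):
    """Return an MD5 hex digest over paths and mtimes of all files in ``root``."""
    digest = hashlib.md5()
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # make the traversal order deterministic
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # lstat does not follow symlinks, so broken links do not raise.
            mtime = os.lstat(path).st_mtime
            digest.update(os.fsencode(path))
            digest.update(repr(mtime).encode("ascii"))
    return digest.hexdigest()


def changed_trees(trees, db):
    """Return the trees whose fingerprint differs from ``db``, updating ``db``."""
    changed = []
    for tree in trees:
        fingerprint = tree_fingerprint(tree)
        if db.get(tree) != fingerprint:
            changed.append(tree)
            db[tree] = fingerprint
    return changed
```

Here "db" is any dict-like store that persists between runs. As noted above,
the result only says THAT a tree changed, so a changed tree still has to be
re-read completely.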

I like the filesystem idea more than the one with the daemon :). Write
a new FS (using FUSE, for example (LUFS is deprecated)) which provides a
logfile. This logfile can either just contain the time of the latest
change in the complete subtree, or even some kind of log stating WHICH
files have been changed.

I think this should even be possible if the tree is not on its own
partition.

Of course, this should clearly be an opt-in solution: If the user does
not modify the trees by hand, or does so only seldom, the "create index
after sync" approach (similar to 'eix-sync') is sufficient.

Regards,
René



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Marius Mauch
On Sun, 23 Nov 2008 21:01:40 -0800 (PST)
devsk <[EMAIL PROTECTED]> wrote:

> > not relying on custom system daemons running in the background.
> 
> Why is a portage daemon such a bad thing? Or hard to do? I would very
> much like a daemon running on my system which I can configure to sync
> the portage tree once a week (or month if I am lazy), give me a
> summary of hot fixes, security fixes in a nice email, push important
> announcements and of course, sync caches on detecting changes (which
> should be trivial with notify daemons all over the place) etc. Why is
> it such a bad thing?

Well, as an opt-in solution it might work (though most of what you
described is IMO just stuff for cron, no need to reinvent the wheel).

What I was saying is that _relying_ on custom system
daemons/filesystems for a _core subsystem_ of portage is the wrong
way, simply because it adds a substantial amount of complexity to the
whole package management architecture. It's one more thing that can
(and will) break, one more layer to take into account for any design
decisions, one more component that has to be secured, one more obstacle
to overcome when you want to analyze/debug things.
And special care must be taken if it requires special kernel support
and/or external packages. Do you want to make inotify support mandatory
to use portage efficiently? (btw, it looks like inotify doesn't really
work with NFS mounts, which would already make such a daemon completely
useless for people using an NFS-shared repository)

And finally, if you look at the use cases, a daemon is simply overkill
for most cases, as the vast majority of people only use emerge
--sync (or wrappers) and maybe layman to change the tree, usually once
per day or less often. Do you really want to push another system daemon
on users that isn't of use to them?

> It's crazy to think that security updates need to be pulled in Linux.

IMO that is better handled via an applet (bug #190397 has some code),
or by just checking for updates after a sync (as syncing is the only
way for updates to become available at this time). Maybe a message
could be added after sync if there are pending GLSAs, now that the glsa
support code is in portage.

Marius



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread devsk
> not relying on custom system daemons running in the background.

Why is a portage daemon such a bad thing? Or hard to do? I would very much like 
a daemon running on my system which I can configure to sync the portage tree 
once a week (or month if I am lazy), give me a summary of hot fixes, security 
fixes in a nice email, push important announcements and of course, sync caches 
on detecting changes (which should be trivial with notify daemons all over the 
place) etc. Why is it such a bad thing?

It's crazy to think that security updates need to be pulled in Linux.

-devsk



----- Original Message -----
From: Marius Mauch <[EMAIL PROTECTED]>
To: gentoo-portage-dev@lists.gentoo.org
Sent: Sunday, November 23, 2008 7:12:57 PM
Subject: Re: [gentoo-portage-dev] search functionality in emerge

On Sun, 23 Nov 2008 07:17:40 -0500
"Emma Strubell" <[EMAIL PROTECTED]> wrote:

> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone
> on here could give me a quick overview of how the search function
> currently works, an idea as to what could be modified or implemented
> in order to improve the running time of this code, or any tip really
> as to where I should start or what I should start looking at. I'd
> really appreciate any help or advice!!

Well, it depends how much effort you want to put into this. The current
interface doesn't actually provide a "search" interface, but merely
functions to
1) list all package names - dbapi.cp_all()
2) list all package names and versions - dbapi.cpv_all()
3) list all versions for a given package name - dbapi.cp_list()
4) read metadata (like DESCRIPTION) for a given package name and
version - dbapi.aux_get()

One of the main performance problems of --search is that there is no
persistent cache for functions 1, 2 and 3, so if you're "just"
interested in performance aspects you might want to look into that.
The issue with implementing a persistent cache is that you have to
consider both cold and hot filesystem cache cases: Loading an index
file with package names and versions might improve the cold-cache case,
but slow things down when the filesystem cache is populated.
As has been mentioned, keeping the index updated is the other major
issue, especially as it has to be portable and should require little or
no configuration/setup for the user (so no extra daemons or special
filesystems running permanently in the background). The obvious
solution would be to generate the cache after `emerge --sync` (and other
sync implementations) and hope that people don't modify their tree and
search for the changes in between (that's what all the external tools
do). I don't know if there is actually a way to do online updates while
still improving performance and not relying on custom system daemons
running in the background.

As for --searchdesc, one problem is that dbapi.aux_get() can only
operate on a single package-version on each call (though it can read
multiple metadata variables). So for description searches the control
flow is like this (obviously simplified):

result = []
# iterate over all packages
for package in dbapi.cp_all():
# determine the current version of each package, this is 
# another performance issue.
version = get_current_version(package)
# read package description from metadata cache
description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
# check if the description matches
if matches(description, searchkey):
result.append(package)

There you see the three bottlenecks: the lack of a pregenerated package
list, the version lookup for *each* package and the actual metadata
read. I've already talked about the first, so lets look at the other
two. The core problem there is that DESCRIPTION (like all standard
metadata variables) is version specific, so to access it you need to
determine a version to use, even though in almost all cases the
description is the same (or very similar) for all versions. So the
proper solution would be to make the description a property of the
package name instead of the package version, but that's a _huge_ task
you're probably not interested in. What _might_ work here is to add
support for an optional package-name->description cache that can be
generated offline and includes those packages where all versions have
the same description, and fall back to the current method if the
package is not included in the cache. (Don't think about caching the
version lookup, that's system dependent and therefore not suitable for
caching).

Hope it has become clear that while the actual search algorithm might
be simple and not very efficient, the real problem lies in getting the
data to operate on. 

That and the somewhat limited dbapi interface.

Disclaimer: The stuff below involves extend

Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Marius Mauch
On Sun, 23 Nov 2008 07:17:40 -0500
"Emma Strubell" <[EMAIL PROTECTED]> wrote:

> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone
> on here could give me a quick overview of how the search function
> currently works, an idea as to what could be modified or implemented
> in order to improve the running time of this code, or any tip really
> as to where I should start or what I should start looking at. I'd
> really appreciate any help or advice!!

Well, it depends how much effort you want to put into this. The current
interface doesn't actually provide a "search" interface, but merely
functions to
1) list all package names - dbapi.cp_all()
2) list all package names and versions - dbapi.cpv_all()
3) list all versions for a given package name - dbapi.cp_list()
4) read metadata (like DESCRIPTION) for a given package name and
version - dbapi.aux_get()
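
To make the four calls concrete, here is a minimal sketch of a name search built only on them. `FakeDbapi` is a hypothetical in-memory stand-in, not the real portage class; only the method names `cp_all()`, `cpv_all()`, `cp_list()` and `aux_get()` come from the list above. Version handling is deliberately simplified (no revisions such as `-r1`, versions sorted as plain strings).

```python
import re


class FakeDbapi:
    """Hypothetical stand-in exposing the four calls listed above."""

    def __init__(self, tree):
        # tree maps "category/name" -> {version: {metadata key: value}}
        self._tree = tree

    def cp_all(self):
        # 1) all package names
        return sorted(self._tree)

    def cp_list(self, cp):
        # 3) all versions of one package name, as "category/name-version"
        return [f"{cp}-{version}" for version in sorted(self._tree[cp])]

    def cpv_all(self):
        # 2) all package names and versions
        return [cpv for cp in self.cp_all() for cpv in self.cp_list(cp)]

    def aux_get(self, cpv, keys):
        # 4) metadata for a single package-version (simplified cpv parsing)
        cp, version = cpv.rsplit("-", 1)
        return [self._tree[cp][version][key] for key in keys]


def search_names(dbapi, pattern):
    """Return every package name matching the regular expression."""
    regex = re.compile(pattern)
    return [cp for cp in dbapi.cp_all() if regex.search(cp)]
```

Even this trivial search has to enumerate every package name first, which is why the cost of `cp_all()` dominates.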

One of the main performance problems of --search is that there is no
persistent cache for functions 1, 2 and 3, so if you're "just"
interested in performance aspects you might want to look into that.
The issue with implementing a persistent cache is that you have to
consider both cold and hot filesystem cache cases: Loading an index
file with package names and versions might improve the cold-cache case,
but slow things down when the filesystem cache is populated.
As has been mentioned, keeping the index updated is the other major
issue, especially as it has to be portable and should require little or
no configuration/setup for the user (so no extra daemons or special
filesystems running permanently in the background). The obvious
solution would be to generate the cache after `emerge --sync` (and other
sync implementations) and hope that people don't modify their tree and
search for the changes in between (that's what all the external tools
do). I don't know if there is actually a way to do online updates while
still improving performance and not relying on custom system daemons
running in the background.
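
A minimal sketch of the "generate the cache after sync" idea, assuming a JSON file as the index format. The file format and the function names are illustrative assumptions, not anything portage provides. The index is written once after a sync and read on search, falling back to the slow enumeration when the file is missing or unreadable.

```python
import json


def write_name_index(package_names, index_path):
    """Write the sorted package names to index_path (run after a sync)."""
    with open(index_path, "w", encoding="utf-8") as index_file:
        json.dump(sorted(package_names), index_file)


def load_name_index(index_path, fallback):
    """Return the cached names, or call fallback() if no usable index exists."""
    try:
        with open(index_path, encoding="utf-8") as index_file:
            return json.load(index_file)
    except (OSError, ValueError):
        # Missing, unreadable or corrupt index: use the uncached code path.
        return fallback()
```

Here `fallback` would be something like `dbapi.cp_all`. Whether reading one index file actually beats the directory walk when the filesystem cache is hot would have to be measured.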

As for --searchdesc, one problem is that dbapi.aux_get() can only
operate on a single package-version on each call (though it can read
multiple metadata variables). So for description searches the control
flow is like this (obviously simplified):

result = []
# iterate over all packages
for package in dbapi.cp_all():
    # determine the current version of each package, this is
    # another performance issue.
    version = get_current_version(package)
    # read package description from metadata cache
    description = dbapi.aux_get(version, ["DESCRIPTION"])[0]
    # check if the description matches
    if matches(description, searchkey):
        result.append(package)

There you see the three bottlenecks: the lack of a pregenerated package
list, the version lookup for *each* package and the actual metadata
read. I've already talked about the first, so let's look at the other
two. The core problem there is that DESCRIPTION (like all standard
metadata variables) is version specific, so to access it you need to
determine a version to use, even though in almost all cases the
description is the same (or very similar) for all versions. So the
proper solution would be to make the description a property of the
package name instead of the package version, but that's a _huge_ task
you're probably not interested in. What _might_ work here is to add
support for an optional package-name->description cache that can be
generated offline and includes those packages where all versions have
the same description, and fall back to the current method if the
package is not included in the cache. (Don't think about caching the
version lookup, that's system dependent and therefore not suitable for
caching).
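
The optional package-name->description cache could be sketched roughly as follows. All function names are hypothetical; `cp_all`, `cp_list` and `get_description` are passed in as callables so the sketch does not depend on the real dbapi. Only packages whose versions all share one description go into the cache, and everything else falls back to the current per-version lookup.

```python
def build_description_cache(cp_all, cp_list, get_description):
    """Map package name -> description where every version agrees."""
    cache = {}
    for cp in cp_all():
        descriptions = {get_description(cpv) for cpv in cp_list(cp)}
        if len(descriptions) == 1:
            cache[cp] = descriptions.pop()
    return cache


def lookup_description(cp, cache, slow_lookup):
    """Use the cache when possible, else fall back to the current method."""
    if cp in cache:
        return cache[cp]
    return slow_lookup(cp)
```

Building the cache is as slow as a full description search, so it only pays off when generated offline (for example after a sync) and stored.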

Hope it has become clear that while the actual search algorithm might
be simple and not very efficient, the real problem lies in getting the
data to operate on. 

That and the somewhat limited dbapi interface.

Disclaimer: The stuff below involves extending and redesigning some
core portage APIs. This isn't something you can do on a weekend, only
work on this if you want to commit yourself to portage development
for a long time.

The functions listed above are the bare minimum to
perform queries on the package repositories, but they're very
low-level. That means that whenever you want to select packages by
name, description, license, dependencies or other variables you need
quite a bit of custom code, more if you want to combine multiple
searches, and much more if you want to do it efficiently and flexibly.
See http://dev.gentoo.org/~genone/scripts/metalib.py and
http://dev.gentoo.org/~genone/scripts/metascan for a somewhat flexible,
but very inefficient search tool (might not work anymore due to old
age).

Ideally repository searches could be done without writing any
application code using some kind of query language, similar to how SQL
works for generic database searches (obviously not that 

Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread tvali
There is a daemon which notifies about filesystem changes -
http://pyinotify.sourceforge.net/ would be a good choice.

In case many different applications use portage tree directly without using
any portage API (which is a bad choice, I think, and should be deprecated),
then there is a kind of "hack" - using
http://www.freenet.org.nz/python/lufs-python/ to create a new filesystem
(damn, now I would like to have some time to join this game). I hope it's
possible to build it everywhere Gentoo should work, but it's no
problem if it's not - you can implement it in such a way that it's not needed.
I totally agree that the filesystem is a bottleneck, but this suffix trie would
check for directories first, I guess. Now, having this custom filesystem,
which actually serves the portage tree like some odd API, you can have backwards
compatibility and still create your own thing.

Having such classes (numbers show implementation order; whether the proxies
are abstract classes, base classes or something else is not specified here,
this just shows some relations between some imaginary objects):

   - *1. PortageTreeApi* - Proxy for different portage trees on FS or SQL or
   other.
   - *2. PortageTreeCachedApi* - same as the previous, but contains a boosted
   memory cache. It should be able to save its state, which is simply writing
   its inner variables into a file.
   - *3. PortageTreeDaemon* - has an interface compatible with PortageTreeApi;
   this daemon serves portage tree to PortageTreeFS and portage tree itself. In
   reality, it should be the base class of *PortageTreeApi* and
   *PortageTreeCachedApi* so that they could be directly used as daemons.
   When the cached API is used as a daemon, it should be able to check filesystem
   changes - thus, implementations should contain change trigger callbacks.
   - *4. PortageTreeFS* - filesystem, which can be used to map any of those
   to a filesystem. Connectable with PortageTreeApi or PortageTreeDaemon. This
   creates filesystems, which can be used for backwards compatibility. This
   cannot be used on architectures which don't implement lufs-python or an analog.
   - *6. PortageTreeServer* - server, which serves data from
   PortageTreeDaemon, PortageTreeCachedApi or PortageTreeApi to some other
   computer.
   - Implementations can be proxied through *PortageTreeApi*,
   *PortageTreeCachedApi* or *PortageTreeDaemon*.
      - *5. PortageTreeImplementationAsSqlDb*
      - *1. PortageTreeImplementationAsFilesystem*
      - *3. PortageTreeImplementationAsDaemon* - client, actually.
      - *6. PortageTreeImplementationAsServer* - client, too.

So, *1* - creating PortageTreeApi and PortageTreeImplementationAsFilesystem
is a pure refactoring task, at first. Then, adding more advanced functions to
PortageTreeApi is basically refactoring, too. PortageTreeApi should not
become too complex or contain any advanced tasks which are not purely
db-specific, so some common base class could implement more high-level
things.
Then, *2* - this is finishing your schoolwork, but not yet in the most powerful
way, as we only have an index then, and the first search is still slow. At
the beginning this cache is unable to provide data about changes in the portage
tree (which could be implemented by some versioning once this new API is the only
place to update it), so it should have an index update command and be used only
in search.
Then, *3* - having a portage tree daemon means that things can really be
cached now and this cache can be kept in memory; it also means updates on
filesystem changes.
Then, *4* - having PortageTreeFS means that now you can easily implement the
portage tree on a faster medium without losing backwards compatibility.
Now, *5* - implementation as an SQL DB is logical, as SQL is a standardized and
common language for creating fast databases.
Eventually, *6* - this really has nothing to do with boosting search, but on a
fast network it could still boost emerge by removing the need for emerge --sync
on local networks.

I think it would then be worth considering having synchronization in those
classes, too - CachedApi almost needs it to be faster with server-client
connections. After that, ImplementationAsSync and ImplementationAsWebRsSync
could be added and a sync server built onto this daemon. As I really suspect
that emerge --sync is currently also ultraslow - I see no point in waiting
a long time to get a few new items, as currently seems to happen -, it would
boost another life-critical part of portage.

So, hope that helps a bit - good luck!


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread René 'Necoro' Neumann
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Mike Auty wrote:
> Finally there are overlays, and since these can change outside of an
> "emerge --sync" (as indeed can the main tree), you'll have to reindex
> these before each search request, or give the user stale data until they
> manually reindex.

Determining whether there has been a change to the ebuild system is a
major point in the whole thing. What good is a great index if
it does not notice the changes the user made in his own local overlay?
:) Manually re-indexing is not a good choice I think...

If somebody comes up here with a good (and fast) solution, this would be
a nice thing ;) (need it myself).
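
One simple (though not necessarily fast) way to notice such changes without a daemon is to store a snapshot of file modification times next to the index and compare it before trusting the index. This is only a sketch under that assumption; `snapshot` and `is_stale` are hypothetical names. It still has to stat every file in the overlay, so it trades the full reindex for a tree walk.

```python
import os


def snapshot(root):
    """Return {relative path: (mtime_ns, size)} for every file under root."""
    state = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            info = os.stat(path)
            state[os.path.relpath(path, root)] = (info.st_mtime_ns, info.st_size)
    return state


def is_stale(root, saved_snapshot):
    """True if any file was added, removed or modified since the snapshot."""
    return snapshot(root) != saved_snapshot
```

For a small local overlay the walk is cheap. For the main tree it is roughly the cold-cache cost the index was supposed to avoid.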

Regards,
René
-BEGIN PGP SIGNATURE-
Version: GnuPG v2.0.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkkp0kAACgkQ4UOg/zhYFuAhTACfYDxNeQQG6dysgU5TrNEZGOiH
3CoAn2wV6g8/8uj+T99cxJGdQBxTtZjI
=2I2j
-END PGP SIGNATURE-



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Mike Auty

Hiya Emma,
	Good luck on your project.  A couple of things to be wary of are disk 
I/O, metadata cache backends and overlays.
	Disk I/O can be a significant bottleneck.  Loading up a lot of files 
from disk (be it the metadata cache or whatever) can take a long time 
initially, but then be cached in RAM and so be much faster to access in 
the future.
	Portage allows for its internal metadata cache to be stored in a 
variety of formats, as long as there's a backend to support it.  This 
means simple speedups can be achieved using cdb or sqlite (if you google 
these and portage you'll get gentoo-wiki tips, which unfortunately 
you'll have to read from google's cache at the moment).  It also means 
that if you want to make use of this metadata from within portage, 
you'll have to rely on the API to tell the backend to get you all the 
data (and it may be difficult to speed up without writing your own backend).
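
To illustrate why an sqlite-style store helps, here is a small self-contained sketch using Python's standard `sqlite3` module. It is not portage's sqlite cache backend; the table layout and function names are assumptions made up for the example.

```python
import sqlite3


def build_index(conn, packages):
    """Store (name, description) pairs in a single table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS packages "
        "(name TEXT PRIMARY KEY, description TEXT)"
    )
    conn.executemany("INSERT OR REPLACE INTO packages VALUES (?, ?)", packages)
    conn.commit()


def search(conn, term):
    """Return names whose name or description contains term."""
    # Note: '%' and '_' inside term act as LIKE wildcards in this sketch.
    like = f"%{term}%"
    rows = conn.execute(
        "SELECT name FROM packages "
        "WHERE name LIKE ? OR description LIKE ? ORDER BY name",
        (like, like),
    )
    return [row[0] for row in rows]
```

The point is that all data sits in one file, so a search reads one file instead of opening one metadata file per package.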
	Finally there are overlays, and since these can change outside of an 
"emerge --sync" (as indeed can the main tree), you'll have to reindex 
these before each search request, or give the user stale data until they 
manually reindex.
	If you're interested in implementing this in python, you may be 
interested in another package manager that can handle the main tree, 
also implemented in python, called pkgcore.  From what I understand, 
it's a similar code-base to portage, but its internal architecture may 
have changed a lot.
	I hope some of that helps, and isn't off-putting.  I look forward to 
seeing the results!  5:)

Mike  5:)



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread tvali
Yes, if it were a low-level implementation in portage, speeding up its
native code for searching and using indexes, then it would make everything
faster, including emerge (because emerge first searches for package
relations). I have actually wanted to do it myself for several years, so
I'm replying here to have my ideas discussed, too.

Douglas Anderson's 16:46 reply is about locks, and I think portage's locking
methods would need rethinking - what it locks, when and why. This is
probably quite a hard task by itself. Anyway, as portage actually lets the user
run two emerges at the same time, locks might be OK, too.

I think that the best thing would be bottom-up refactoring - first to make a
list of lowest-level functions, which have to do with reading data from
portage tree or writing into it; then making indexer class, which will be
used by all of those low-level functions.

To have it OOP, it should be implemented in such a way:

   - Low-level portage tree handler does everything with portage tree, no
   function in portage uses it directly.
   - Tree handler has several needed and several optional methods - so that
   implementing new handler would be easy, but creating things like native
   regex search would be possible.
   - One could implement a new tree handler with SQLite or other interface
   instead of filesystem and do other tricks through this interface; for
   example, boost it.

So, nice way to go would be:

   1. Implementing the portage tree handler and its proxy, which uses the current
   portage tree in a non-indexed way and simply provides methods for the same kind
   of access as the currently implemented one.
   2. Refactoring portage to rely only on the portage tree handler and to use the
   portage tree directly nowhere. To test that this is so, the portage tree should
   be moved to another directory, which only this handler knows about, and portage
   checked to still work well. Marking all places where portage uses its tree
   handler (by a standard comment, for example) and making clear which methods
   would contain all of its boostable code.
   3. Implementing those methods in the proxy, which could simulate fast regex
   search and other stuff using the simplest possible interface of the portage tree
   handler (something like four methods: add, remove, get, list). The proxy should
   be able to use the handler's methods if they are implemented.
   4. Refactoring portage to use the advanced methods in the proxy.
   5. Now, having gathered all the code into one place and looking at this
   nice and readable code, real optimizations could be discussed here, for
   example. Ideally, I think, portage would have such tree handlers:
      - Filesystem handler - fast searches over the current portage tree
      structure.
      - SQL handler - rewrite of tree functions into SQL queries.
      - Network-based handler - maybe sometimes it would be nice to have the
      portage tree on only one machine of a cluster, for example if I want to
      have 100 really small computers with a fast connection to a mother
      computer and the portage tree is too big to be wholly copied to all of them.
      - Memory-buffered handler with daemon, which is actually a proxy to some
      other handler - a daemon which reads the whole tree (from filesystem or
      SQL) into memory on boot or first use, creates a really fast index
      (because now it does matter to have better indexing), optionally deletes
      some [less needed] parts of its index from memory when it's becoming full,
      and behaves as a really simple proxy if it stays full. This should be
      implemented after the critical parts of the filesystem or SQL handler.
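
The handler/proxy split from steps 1 and 3 could be sketched like this. Both class names are hypothetical and the handler is a plain in-memory dict rather than a real tree; the only thing taken from the text is the four-method interface (add, remove, get, list) and the rule that the proxy prefers a handler's native method when one exists.

```python
import re


class DictTreeHandler:
    """Hypothetical minimal handler: only add, remove, get and list."""

    def __init__(self):
        self._entries = {}

    def add(self, name, metadata):
        self._entries[name] = metadata

    def remove(self, name):
        del self._entries[name]

    def get(self, name):
        return self._entries[name]

    def list(self):
        return sorted(self._entries)


class TreeProxy:
    """Offers advanced methods on top of any handler."""

    def __init__(self, handler):
        self._handler = handler

    def search(self, pattern):
        # Prefer the handler's own (possibly indexed) search if it has one.
        native_search = getattr(self._handler, "search", None)
        if callable(native_search):
            return native_search(pattern)
        # Otherwise simulate it with the four basic methods.
        regex = re.compile(pattern)
        return [name for name in self._handler.list() if regex.search(name)]
```

A SQL or memory-buffered handler would then only need to add its own `search` method to speed things up, without any change to the calling code.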


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Emma Strubell
Wow, that's extremely helpful!! I happen to particularly enjoy tries, so the
suffix trie sounds like a great idea. The trie class example is really
helpful too, because this will be my first time programming in Python, and
it's a bit easier to figure out what's going on syntax-wise in that simple
trie class than in the middle of the portage source code!

Seriously, thanks again :]



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Lucian Poston
> Thanks for the replies! I know there are a couple programs out there that
> basically already do what I'm looking to do... Unfortunately I wasn't aware
> of these pre-existing utilities until after I submitted my project proposal
> to my professor. So, I'm looking to implement a better search myself.
> Preferably by editing the existing portage code, not writing a separate
> program. So if anyone can offer any help regarding the actual implementation
> of search in portage, I would greatly appreciate it!

Most of the search implementation is in
/usr/lib/portage/pym/_emerge/__init__.py in class search.  The class's
execute() method simply iterates over all packages (and descriptions
and package sets) and matches against the searchkey.  You might need
to look into pym/portage/dbapi/porttree.py for portdbapi as well.

If you intend to index and support fast regex lookup, then you need to
do some fancy indexing, which I'm not terribly familiar with.  You
could follow in the steps of eix[1] or other indexed search utilities
and design some sort of index layout, which is easier than the
following thought.  You might consider implementing a suffix trie or
similar that has sublinear regexp lookup and marshalling the structure
for the index.  I couldn't find a python implementation for something
like this, but here is a general trie class[2] that you might start
with if you go that route.  There is a perl module[3],
Tie::Hash::Regex, that does that, but properly implementing that in
python would be a chore. :)
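
For what it's worth, a naive suffix trie for substring lookup over package names fits in a few lines of Python. This sketch is an illustration only: it handles plain substrings, not regular expressions, it is not the trie class from [2], and it stores every suffix of every name, so memory grows quadratically with name length.

```python
class _Node:
    __slots__ = ("children", "names")

    def __init__(self):
        self.children = {}
        self.names = set()


class SuffixTrie:
    """Naive suffix trie: find all names containing a given substring."""

    def __init__(self):
        self._root = _Node()

    def add(self, name):
        # The empty substring matches every name.
        self._root.names.add(name)
        # Insert every suffix; each node remembers the names passing through.
        for start in range(len(name)):
            node = self._root
            for char in name[start:]:
                node = node.children.setdefault(char, _Node())
                node.names.add(name)

    def find(self, substring):
        # Lookup cost depends on len(substring), not on the number of names.
        node = self._root
        for char in substring:
            node = node.children.get(char)
            if node is None:
                return []
        return sorted(node.names)
```

Marshalling such a structure for the index (for example with cPickle) is a separate cost that would need measuring.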

That project sounds interesting and fun. Good luck!

Lucian Poston

[1] https://projects.gentooexperimental.org/eix/wiki/IndexFileLayout
[2] 
http://www.koders.com/python/fid7B6BC1651A9E8BBA547552FE3F039479A4DECC45.aspx
[3] http://search.cpan.org/~davecross/Tie-Hash-Regex-1.02/lib/Tie/Hash/Regex.pm



Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Douglas Anderson
Emma,

It would be great if you could speed search up a bit!

As these other guys have pointed out, we do have some indexing tools in
Gentoo already. Most users don't understand why that kind of functionality
isn't built directly into Portage, but IIRC it has something to do with the
fact that these fast search indexes aren't able to be written to by more
than one process at the same time, so for example if you had two emerges
finishing at the same time, Portage's current flat hash file can handle
that, but the faster db-based indexes can't.
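
One common way to keep readers safe while a single-file index is being rewritten is to write a temporary file and rename it into place. The sketch below assumes a JSON index, and the function name is made up. It does not merge updates from two writers (the last rename wins), so it only addresses half-written files, not the concurrent-update problem itself.

```python
import json
import os
import tempfile


def write_index_atomically(index_path, data):
    """Replace index_path in one step so readers never see a partial file."""
    directory = os.path.dirname(os.path.abspath(index_path))
    # The temporary file must live on the same filesystem as the target.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp_file:
            json.dump(data, tmp_file)
            tmp_file.flush()
            os.fsync(tmp_file.fileno())
        os.replace(tmp_path, index_path)
    except BaseException:
        # Clean up the temporary file if writing or renaming failed.
        os.unlink(tmp_path)
        raise
```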

Anyways, that's the way I, as a curious user, understood the problem.

You might be interested in reading this very old forum thread about a
previous attempt:
http://forums.gentoo.org/viewtopic-t-261580-postdays-0-postorder-asc-start-0.html

On Sun, Nov 23, 2008 at 11:33 PM, Pacho Ramos <
[EMAIL PROTECTED]> wrote:

> On Sun, 23-11-2008 at 16:01 +0200, tvali wrote:
> > Try esearch.
> >
> > emerge esearch
> > esearch ...
> >
> > 2008/11/23 Emma Strubell <[EMAIL PROTECTED]>
> > Hi everyone. My name is Emma, and I am completely new to this
> > list. I've been using Gentoo since 2004, including Portage of
> > course, and before I say anything else I'd like to say thanks
> > to everyone for such a kickass package management system!!
> >
> > Anyway, for my final project in my Data Structures &
> > Algorithms class this semester, I would like to modify the
> > search functionality in emerge. Something I've always noticed
> > about 'emerge -s' or '-S' is that, in general, it takes a very
> > long time to perform the searches. (Although, lately it does
> > seem to be running faster, specifically on my laptop as
> > opposed to my desktop. Strangely, though, it seems that when I
> > do a simple 'emerge -av whatever' on my laptop it takes a very
> > long time for emerge to find the package and/or determine the
> > dependencies -  whatever it's doing behind that spinner. I can
> > definitely go into more detail about this if anyone's
> > interested. It's really been puzzling me!) So, as my final
> > project I've proposed to improve the time it takes to perform
> > a search using emerge. My professor suggested that I look into
> > implementing indexing.
> >
> > However, I've started looking at the code, and I must admit
> > I'm pretty overwhelmed! I don't know where to start. I was
> > wondering if anyone on here could give me a quick overview of
> > how the search function currently works, an idea as to what
> > could be modified or implemented in order to improve the
> > running time of this code, or any tip really as to where I
> > should start or what I should start looking at. I'd really
> > appreciate any help or advice!!
> >
> > Thanks a lot, and keep on making my Debian-using professor
> > jealous :]
> > Emma
> >
> >
> >
> > --
> > tvali
> >
> > From some forum: http://www.cooltests.com - if you know English.
> > By the way, over 120 you're very smart, over 140 you're a genius, and at
> > around 170 your head is pretty much like a trash can...
>
> I use eix:
> emerge eix
>
> ;-)
>
>
>


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Emma Strubell
Thanks for the replies! I know there are a couple programs out there that
basically already do what I'm looking to do... Unfortunately I wasn't aware
of these pre-existing utilities until after I submitted my project proposal
to my professor. So, I'm looking to implement a better search myself.
Preferably by editing the existing portage code, not writing a separate
program. So if anyone can offer any help regarding the actual implementation
of search in portage, I would greatly appreciate it!

Or, if anyone has an idea for a more productive/useful project I could work
on relating to portage (about the same difficulty, preferably at least a
little bit data structure related), please, let me know! Thanks again guys,

Emma

On Sun, Nov 23, 2008 at 9:33 AM, Pacho Ramos <
[EMAIL PROTECTED]> wrote:

> El dom, 23-11-2008 a las 16:01 +0200, tvali escribió:
> > Try esearch.
> >
> > emerge esearch
> > esearch ...
> >
> > 2008/11/23 Emma Strubell <[EMAIL PROTECTED]>
> > Hi everyone. My name is Emma, and I am completely new to this
> > list. I've been using Gentoo since 2004, including Portage of
> > course, and before I say anything else I'd like to say thanks
> > to everyone for such a kickass package management system!!
> >
> > Anyway, for my final project in my Data Structures &
> > Algorithms class this semester, I would like to modify the
> > search functionality in emerge. Something I've always noticed
> > about 'emerge -s' or '-S' is that, in general, it takes a very
> > long time to perform the searches. (Although, lately it does
> > seem to be running faster, specifically on my laptop as
> > opposed to my desktop. Strangely, though, it seems that when I
> > do a simple 'emerge -av whatever' on my laptop it takes a very
> > long time for emerge to find the package and/or determine the
> > dependecies -  whatever it's doing behind that spinner. I can
> > definitely go into more detail about this if anyone's
> > interested. It's really been puzzling me!) So, as my final
> > project I've proposed to improve the time it takes to perform
> > a search using emerge. My professor suggested that I look into
> > implementing indexing.
> >
> > However, I've started looking at the code, and I must admit
> > I'm pretty overwhelmed! I don't know where to start. I was
> > wondering if anyone on here could give me a quick overview of
> > how the search function currently works, an idea as to what
> > could be modified or implemented in order to improve the
> > running time of this code, or any tip really as to where I
> > should start or what I should start looking at. I'd really
> > appreciate any help or advice!!
> >
> > Thanks a lot, and keep on making my Debian-using professor
> > jealous :]
> > Emma
> >
> >
> >
> > --
> > tvali
> >
> > From some forum: http://www.cooltests.com - if you know English.
> > By the way, over 120 you are very smart, over 140 you are a genius, and at
> > around 170 your head is pretty much like a trash can...
>
> I use eix:
> emerge eix
>
> ;-)
>
>
>


Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Pacho Ramos
On Sun, 23-11-2008 at 16:01 +0200, tvali wrote:
> Try esearch.
> 
> emerge esearch
> esearch ...
> 
> 2008/11/23 Emma Strubell <[EMAIL PROTECTED]>
> Hi everyone. My name is Emma, and I am completely new to this
> list. I've been using Gentoo since 2004, including Portage of
> course, and before I say anything else I'd like to say thanks
> to everyone for such a kickass package management system!!
> 
> Anyway, for my final project in my Data Structures &
> Algorithms class this semester, I would like to modify the
> search functionality in emerge. Something I've always noticed
> about 'emerge -s' or '-S' is that, in general, it takes a very
> long time to perform the searches. (Although, lately it does
> seem to be running faster, specifically on my laptop as
> opposed to my desktop. Strangely, though, it seems that when I
> do a simple 'emerge -av whatever' on my laptop it takes a very
> long time for emerge to find the package and/or determine the
> dependencies - whatever it's doing behind that spinner. I can
> definitely go into more detail about this if anyone's
> interested. It's really been puzzling me!) So, as my final
> project I've proposed to improve the time it takes to perform
> a search using emerge. My professor suggested that I look into
> implementing indexing.
> 
> However, I've started looking at the code, and I must admit
> I'm pretty overwhelmed! I don't know where to start. I was
> wondering if anyone on here could give me a quick overview of
> how the search function currently works, an idea as to what
> could be modified or implemented in order to improve the
> running time of this code, or any tip really as to where I
> should start or what I should start looking at. I'd really
> appreciate any help or advice!!
> 
> Thanks a lot, and keep on making my Debian-using professor
> jealous :]
> Emma
> 
> 
> 
> -- 
> tvali
> 
> From some forum: http://www.cooltests.com - if you know English.
> By the way, over 120 you are very smart, over 140 you are a genius, and at
> around 170 your head is pretty much like a trash can...

I use eix:
emerge eix

;-)




Re: [gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread tvali
Try esearch.

emerge esearch
esearch ...

2008/11/23 Emma Strubell <[EMAIL PROTECTED]>

> Hi everyone. My name is Emma, and I am completely new to this list. I've
> been using Gentoo since 2004, including Portage of course, and before I say
> anything else I'd like to say thanks to everyone for such a kickass package
> management system!!
>
> Anyway, for my final project in my Data Structures & Algorithms class this
> semester, I would like to modify the search functionality in emerge.
> Something I've always noticed about 'emerge -s' or '-S' is that, in general,
> it takes a very long time to perform the searches. (Although, lately it does
> seem to be running faster, specifically on my laptop as opposed to my
> desktop. Strangely, though, it seems that when I do a simple 'emerge -av
> whatever' on my laptop it takes a very long time for emerge to find the
> package and/or determine the dependencies - whatever it's doing behind that
> spinner. I can definitely go into more detail about this if anyone's
> interested. It's really been puzzling me!) So, as my final project I've
> proposed to improve the time it takes to perform a search using emerge. My
> professor suggested that I look into implementing indexing.
>
> However, I've started looking at the code, and I must admit I'm pretty
> overwhelmed! I don't know where to start. I was wondering if anyone on here
> could give me a quick overview of how the search function currently works,
> an idea as to what could be modified or implemented in order to improve the
> running time of this code, or any tip really as to where I should start or
> what I should start looking at. I'd really appreciate any help or advice!!
>
> Thanks a lot, and keep on making my Debian-using professor jealous :]
> Emma
>



-- 
tvali

From some forum: http://www.cooltests.com - if you know English.
By the way, over 120 you are very smart, over 140 you are a genius, and at
around 170 your head is pretty much like a trash can...


[gentoo-portage-dev] search functionality in emerge

2008-11-23 Thread Emma Strubell
Hi everyone. My name is Emma, and I am completely new to this list. I've
been using Gentoo since 2004, including Portage of course, and before I say
anything else I'd like to say thanks to everyone for such a kickass package
management system!!

Anyway, for my final project in my Data Structures & Algorithms class this
semester, I would like to modify the search functionality in emerge.
Something I've always noticed about 'emerge -s' or '-S' is that, in general,
it takes a very long time to perform the searches. (Although, lately it does
seem to be running faster, specifically on my laptop as opposed to my
desktop. Strangely, though, it seems that when I do a simple 'emerge -av
whatever' on my laptop it takes a very long time for emerge to find the
package and/or determine the dependencies - whatever it's doing behind that
spinner. I can definitely go into more detail about this if anyone's
interested. It's really been puzzling me!) So, as my final project I've
proposed to improve the time it takes to perform a search using emerge. My
professor suggested that I look into implementing indexing.

However, I've started looking at the code, and I must admit I'm pretty
overwhelmed! I don't know where to start. I was wondering if anyone on here
could give me a quick overview of how the search function currently works,
an idea as to what could be modified or implemented in order to improve the
running time of this code, or any tip really as to where I should start or
what I should start looking at. I'd really appreciate any help or advice!!

Thanks a lot, and keep on making my Debian-using professor jealous :]
Emma
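
For readers unfamiliar with the "indexing" idea mentioned above, here is a
minimal sketch of an inverted index over package names and descriptions. It
is only an illustration and not how emerge, esearch or eix actually work: the
function names (tokenize, build_index, search) are hypothetical, and the
sample package descriptions are made up rather than read from the Portage
tree. The point is that the expensive scan happens once, when the index is
built, so each later lookup is a dictionary access instead of a pass over
every package.

```python
import re
from collections import defaultdict

# Hypothetical sample data. The descriptions are illustrative only; a real
# index would be built from the metadata in the Portage tree.
PACKAGES = {
    "app-portage/esearch": "Replacement for emerge search using an index",
    "app-portage/eix": "Search and query ebuilds",
    "sys-apps/portage": "The package management system for Gentoo",
}


def tokenize(text):
    """Split text into a set of lowercase alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def build_index(packages):
    """Map every word to the set of package names it appears in."""
    index = defaultdict(set)
    for name, description in packages.items():
        for token in tokenize(name + " " + description):
            index[token].add(name)
    return index


def search(index, word):
    """Return the sorted package names whose name or description has word."""
    return sorted(index.get(word.lower(), set()))


if __name__ == "__main__":
    index = build_index(PACKAGES)
    for match in search(index, "search"):
        print(match)
```

This sketch only matches whole words. A regular-expression or substring
search of the kind 'emerge -s' accepts would need a different structure (for
example a trie or an n-gram index), and the index would have to be rebuilt or
updated whenever the tree is synced.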