Re: website catcher

2005-07-06 Thread Michael Ströder
jwaixs wrote:
> I need some kind of database that won't exit when the cgi-bin script
> has finished. This database needs to be open all the time and
> communicate very easily with the cgi-bin framework's main class.

Maybe long-running multi-threaded processes for FastCGI, SCGI or similar
are what you're looking for, instead of short-lived CGI programs forked
by the web server.
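
An untested sketch of that idea, assuming the third-party flup package
for the FastCGI side and a made-up build_and_parse_page() hook standing
in for the actual framework:

# One long-running FastCGI worker process: page_cache survives between
# requests, unlike a CGI script that exits after every hit.
from flup.server.fcgi import WSGIServer

page_cache = {}   # request path -> already-parsed page, held in memory

def application(environ, start_response):
    path = environ.get('PATH_INFO', '/')
    page = page_cache.get(path)
    if page is None:
        page = build_and_parse_page(path)   # hypothetical framework hook
        page_cache[path] = page
    start_response('200 OK', [('Content-Type', 'text/html')])
    return [page]

if __name__ == '__main__':
    WSGIServer(application).run()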

Ciao, Michael.


website catcher

2005-07-03 Thread jwaixs
Hello,

I'm busy building some kind of web page framework written in Python, but
there's a small problem in this framework. The framework should open a
page, parse it, take some other information out of it, and store the
result in some kind of fast storage. This storage needs to be very fast,
so that everyone who asks for this page gets the parsed page returned
from this storage (a cache?).

But how could I design a good web cache? Is this possible in Python,
given that it should always be running? That won't work with cgi-bin
pages, because they quit after they execute something. Or should it be
built in C and imported as a module or something?

Thank you,

Noud Aldenhoven



Re: website catcher

2005-07-03 Thread [EMAIL PROTECTED]
You can fetch the content of a URL like this:
http://www.python.org/doc/current/lib/node478.html. From there you can
parse it and then store the result, e.g. in a dictionary; you will have
a very well-performing solution that way.
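
Roughly like this, as a sketch (Python 2-era urllib, matching that
documentation page; parse_page() stands in for whatever parsing your
framework does):

import urllib

page_cache = {}   # url -> parsed page, kept in this process's memory

def get_parsed(url):
    # Fetch and parse the page only the first time it is asked for.
    if url not in page_cache:
        raw = urllib.urlopen(url).read()
        page_cache[url] = parse_page(raw)   # parse_page: your own parser
    return page_cache[url]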



Re: website catcher

2005-07-03 Thread jwaixs
Thank you, but that's not what I mean. I don't want some kind of
client-side parser. The page has already been parsed and is ready to be
read, but I want to store this page for further use. I need some kind of
database that won't exit when the cgi-bin script has finished. This
database needs to be open all the time and communicate very easily with
the cgi-bin framework's main class.



Re: website catcher

2005-07-03 Thread Diez B. Roggisch
jwaixs wrote:
> Thank you, but that's not what I mean. I don't want some kind of
> client-side parser. The page has already been parsed and is ready to
> be read, but I want to store this page for further use. I need some
> kind of database that won't exit when the cgi-bin script has finished.
> This database needs to be open all the time and communicate very
> easily with the cgi-bin framework's main class.

Why does it need to be open all the time? Store it in a pickled file and
load that pickle when you need it. Or not even as a pickle, just as a
plain file in the filesystem. Basically, what you are talking about is a
web server, so just use one.
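
A minimal sketch of that, assuming an arbitrary file name and a dict
mapping page name -> parsed HTML:

import os
import pickle

CACHE_FILE = 'pages.pkl'   # arbitrary location for the pickled page store

def save_pages(pages):
    # pages: dict mapping page name -> parsed HTML
    f = open(CACHE_FILE, 'wb')
    pickle.dump(pages, f)
    f.close()

def load_pages():
    # Returns an empty dict if nothing has been stored yet.
    if not os.path.exists(CACHE_FILE):
        return {}
    f = open(CACHE_FILE, 'rb')
    pages = pickle.load(f)
    f.close()
    return pages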

Diez


Re: website catcher

2005-07-03 Thread jwaixs
If I put the parsed websites in, for example, a hash table, it will be
at least five times faster than putting them in a file that needs to be
stored on a slow hard drive. Memory is a lot faster than hard disk
space. And if there are a lot of people asking for a page, all of them
have to open that file. If that's 10 requests in 5 minutes, there's no
real worry. If it's more than 10 requests per second, you really have a
big problem and the framework will probably crash or run terribly
slowly. That's why I want to open the file only once and keep it in the
server's memory, where it doesn't need to be opened each time someone
asks for it.

Noud Aldenhoven



Re: website catcher

2005-07-03 Thread Diez B. Roggisch
jwaixs wrote:
> If I put the parsed websites in, for example, a hash table, it will be
> at least five times faster than putting them in a file that needs to
> be stored on a slow hard drive. Memory is a lot faster than hard disk
> space. And if there are a lot of people asking for a page, all of them
> have to open that file. If that's 10 requests in 5 minutes, there's no
> real worry. If it's more than 10 requests per second, you really have
> a big problem and the framework will probably crash or run terribly
> slowly. That's why I want to open the file only once and keep it in
> the server's memory, where it doesn't need to be opened each time
> someone asks for it.

I don't think that's correct. An Apache serves static pages at high
speed - and "slow" hard drives manage about 32 MByte/s nowadays, which
equals 256 MBit/s. Is your machine connected to a GBit network? And if
it's for internet usage, do you have a GBit connection? If so, I envy
you...

And if your speed requirements really are that high, I wonder if Python
can be used at all. BTW, 10 requests per second of maybe 100 KB pages is
next to nothing - just 10 MBit/s. That's not really fast. And images and
the like are also usually served from the hard disk.

You are of course right that memory is faster than hard drives, but HDs
are (usually) faster than network IO - so that's your limiting factor,
if anything. And starting CGI subprocesses also introduces a lot of
overhead - better use FastCGI then.


I think that we're talking about two things here:

  - premature optimization on your side. Worry about speed later, if it
_is_ an issue. Not now.

  - what you seem to want is a convenient way of having data served to
you in a pythonesque way. I personally don't see anything wrong with
storing and retrieving pages from the HD - after all, that's where they
end up eventually anyway. So if you write yourself an HTMLRetrieval
class that abstracts that for you and

  1) takes a piece of HTML and stores it, maybe associated with some
metadata
  2) can retrieve these chunks based on some key

you are pretty much done - see the sketch below. If you want, you can
back it with an RDBMS, hoping that it will do the in-memory caching for
you. But remember that there will be no connection pooling with CGIs, so
that introduces overhead.
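
A rough sketch of such an HTMLRetrieval class, storing one pickled file
per key in a directory (the names and the on-disk layout are made up):

import os
import pickle

class HTMLRetrieval:
    """Stores chunks of parsed HTML on disk and retrieves them by key."""

    def __init__(self, directory):
        self.directory = directory
        if not os.path.isdir(directory):
            os.makedirs(directory)

    def _path(self, key):
        # Reduce an arbitrary key to a safe file name (collisions aside).
        safe = ''.join(c for c in key if c.isalnum()) or 'index'
        return os.path.join(self.directory, safe + '.pkl')

    def store(self, key, html, metadata=None):
        f = open(self._path(key), 'wb')
        pickle.dump({'html': html, 'meta': metadata or {}}, f)
        f.close()

    def retrieve(self, key):
        # Returns the stored chunk, or None if the key is unknown.
        path = self._path(key)
        if not os.path.exists(path):
            return None
        f = open(path, 'rb')
        chunk = pickle.load(f)
        f.close()
        return chunk

Usage would be something like cache = HTMLRetrieval('pagecache'),
cache.store('frontpage', html) after parsing, and later
cache.retrieve('frontpage') from the CGI.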

  Or you go for your own standalone process that serves the pages
through some RPC mechanism.

Or you ditch CGI altogether and use some web framework that serves from
a permanently running Python process with several worker threads - then
you can use in-process memory (e.g. global variables) to store those
pages. For that, I recommend Twisted.
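
For the Twisted route, a bare-bones version might look like this
(untested; build_page() is again a made-up stand-in for your framework's
parser/renderer):

from twisted.internet import reactor
from twisted.web import resource, server

page_cache = {}   # in-process memory, shared by all requests

class CachedPage(resource.Resource):
    isLeaf = True

    def render_GET(self, request):
        path = request.path
        if path not in page_cache:
            page_cache[path] = build_page(path)   # hypothetical hook
        request.setHeader('content-type', 'text/html')
        return page_cache[path]

reactor.listenTCP(8080, server.Site(CachedPage()))
reactor.run()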

Diez


Re: website catcher

2005-07-03 Thread jwaixs
Well, thank you for your advice. So I have a couple of solutions, but it
can't become a server of its own, which means I will deal with files.

Thank you for your advice; I'll first make it work... then the server.

Noud Aldenhoven



Re: website catcher

2005-07-03 Thread gene tani
Maybe look at HarvestMan:

http://cheeseshop.python.org/HarvestMan/1.4%20final



Re: website catcher

2005-07-03 Thread Mike Meyer
jwaixs [EMAIL PROTECTED] writes:

> If I put the parsed websites in, for example, a hash table, it will be
> at least five times faster than putting them in a file that needs to
> be stored on a slow hard drive. Memory is a lot faster than hard disk
> space. And if there are a lot of people asking for a page, all of them
> have to open that file. If that's 10 requests in 5 minutes, there's no
> real worry. If it's more than 10 requests per second, you really have
> a big problem and the framework will probably crash or run terribly
> slowly. That's why I want to open the file only once and keep it in
> the server's memory, where it doesn't need to be opened each time
> someone asks for it.

While Diez gave you some good reasons not to worry about this, and had
some great advice, he missed one important reason you shouldn't worry
about this:

Your OS almost certainly has a disk cache.

This means that if you get 10 requests for a page in a second, the
first one will come off the disk and wind up in the OS disk cache. The
next nine requests will get the pages from the OS disk cache, and not
go to the disk at all.

When you keep these pages in memory yourself, you're basically
declaring that they are so important that you don't trust the OS to
cache them properly. The exact details of how your use of extra memory
interacts with the disk cache vary with the OS, but there's a fair
chance that you're cutting down on the amount of disk cache the system
will have available.

In the end, if the OS disagrees with you about how important your
pages are, it will win. Your pages will get paged out to disk and will
have to be read back from disk even though you have them stored in
memory - with the extra overhead of a page fault when your process
tries to access the swapped-out page, at that.

A bunch of very smart people have spent a lot of time making modern
operating systems perform well. Worrying about things the OS is already
taking care of is generally a waste of time - a clear case of premature
optimization.

  mike
-- 
Mike Meyer [EMAIL PROTECTED]  http://www.mired.org/home/mwm/
Independent WWW/Perforce/FreeBSD/Unix consultant, email for more information.