Re: Large Data Set In Mod_Perl
On Fri, 2003-06-13 at 12:02, Patrick Mulvany wrote:
> However, if I ever heard of a case for using a fixed-width ascii file with
> spaced records, this is it.

Why make your life difficult? Just use a dbm file.

- Perrin
Re: Large Data Set In Mod_Perl
On Wed, May 28, 2003 at 10:07:39PM -0400, Dale Lancaster wrote:
> For the perl hash, I would key the hash on the combo of planet and date,
> something like:
>
> my %Planets = (
>     jupiter => {
>         "1900-01-01" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
>         "1900-01-02" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
>     },
>     neptune => {
>         "1900-01-01" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
>         "1900-01-02" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
>     },
> );

An alternative is to keep only record numbers in memory:

my $Planets = {
    jupiter => {
        1900 => {
            1 => {
                1 => 1,   # record number in a file
                2 => 2,
            },
            2 => { ... },
        },
    },
};

This would not require the entire dataset to be stored in memory, but rather an offset to a file position which could be randomly accessed.

However, if I ever heard of a case for using a fixed-width ascii file with spaced records, this is it. If you had one file per planet, and assuming that you wanted to start on 1900-01-01:

my $record_width = 90;
my $offset = ( ($year - 1900) * 372
             + ($month - 1) * 31
             + ($day - 1) ) * $record_width;
# 1900-01-01 would be offset 0
# 2003-06-13 would be offset 3463470

This format would require blank records to be inserted for non-existent dates such as 1900-02-30, but a simple script could auto-generate the file. One advantage of this is that the OS would file-cache the read-only file.

Just my thoughts, hope it helps.

Paddy
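Paddy's offset arithmetic can be wrapped up as a runnable sketch. The 90-byte record width and the per-planet file name are assumptions carried over from the message (the newline is assumed to be included in the 90 bytes):

```perl
use strict;
use warnings;

my $record_width = 90;   # fixed bytes per record, newline included

# Byte offset of a date's record, using Paddy's 372-slot "year"
# (12 months x 31 day-slots, with blank filler records for dates
# like 1900-02-30 that don't exist).
sub record_offset {
    my ($year, $month, $day) = @_;
    return ( ($year - 1900) * 372
           + ($month - 1) * 31
           + ($day - 1) ) * $record_width;
}

# Random access into the per-planet file ("$planet.dat" is a
# hypothetical naming scheme).
sub read_record {
    my ($planet, $year, $month, $day) = @_;
    open my $fh, '<', "$planet.dat" or die "can't open $planet.dat: $!";
    seek $fh, record_offset($year, $month, $day), 0 or die "seek failed: $!";
    read $fh, my $record, $record_width;
    return $record;
}
```

Because every lookup is a single seek into a read-only file, repeated hits land in the OS page cache, which is the advantage Paddy is pointing at.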
Re: Large Data Set In Mod_Perl
Perrin Harkins wrote:
> simran wrote:
> > I need to be able to say:
> >
> > * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_
>
> On the face of it, a relational database is best for that kind of query.
> However, if you won't get any fancier than that, you can get by with
> MLDBM or something similar.
>
> > Currently i do this using a postgres database, however, my question is,
> > is there a quicker way to do this in mod_perl - would a DB_File or some
> > other structure be better?

Query speed comes into question only when there is heavy use. Postgres has an EXPLAIN facility via psql: just add EXPLAIN before the query and you will get the cost of the query. By creating proper indexes you can get good optimization.

What if you add a table later and you need to join that with the planet table? If you keep your planet data somewhere else, then the access becomes cumbersome as well as slower. There are many ways to speed up PostgreSQL. I recommend the PostgreSQL book by Korry and Susan Douglas; I got it from Barnes and Noble. IMHO, stay with the relational database you are on and find ways to optimize.

> A DBM file will be faster. What you can do is build a key out of planet +
> date, so that you grab the right record with a single access. Either use
> MLDBM for storing hashes inside each record, or just a simple join/split
> approach.

This would be a good idea if you are implementing your own tool and you know what limitations you will be subject to.

> MySQL would probably also be faster than PostgreSQL for this kind of
> simple read-only querying, but not as fast as a DBM file. SDBM_File is
> the fastest DBM around, if you can live with the space limitations it has.
>
> > perhaps something such as copying the whole 800,000 rows to memory (as a
> > hash?) on apache startup?

PostgreSQL may have a way to 'stick' a table in memory like MySQL.

> That would be the fastest by far, but it will use a boatload of RAM.
> It's pretty easy to try, so test it and see if you can spare the RAM it
> requires.
>
> - Perrin
RE: Large Data Set In Mod_Perl
On Thu, 2003-05-29 at 13:10, Marc M. Adkins wrote:
> My original comment was regarding threads, not processes. I run on Windows
> and see only two Apache processes, yet I have a number of Perl interpreters
> running in their own ithreads. My understanding of Perl ithreads is that
> while the syntax tree is reused, data stored in the parent ithread is
> cloned.

Remember, this is an OS-level feature. Perl doesn't have to do anything. The OS keeps track of the fact that the pages in memory have not been touched since they were "copied" and doesn't actually bother to copy them.

> In addition, since I'm on Windows, I'm not convinced that the type of
> OS-level code sharing you're talking about is in fact done. Windows doesn't
> fork().

It's not about forking, it's about having a modern virtual memory system. Windows definitely has this feature.

- Perrin
RE: Large Data Set In Mod_Perl
> On Thu, 2003-05-29 at 12:59, Marc M. Adkins wrote:
> > That's news to me (not being facetious). I was under the impression that
> > cloning Perl 5.8 ithreads cloned everything, that there was no sharing of
> > read-only data.
>
> We're not talking about ithreads here, just processes. The data is
> shared by copy-on-write. It's an OS-level feature. See the mod_perl
> docs for more info.

My original comment was regarding threads, not processes. I run on Windows and see only two Apache processes, yet I have a number of Perl interpreters running in their own ithreads. My understanding of Perl ithreads is that while the syntax tree is reused, data stored in the parent ithread is cloned.

In addition, since I'm on Windows, I'm not convinced that the type of OS-level code sharing you're talking about is in fact done. Windows doesn't fork().

mma
RE: Large Data Set In Mod_Perl
On Thu, 2003-05-29 at 11:59, Marc M. Adkins wrote:
> > > perhaps something such as copying the whole 800,000 rows to
> > > memory (as a hash?) on apache startup?
> >
> > That would be the fastest by far, but it will use a boatload of RAM.
> > It's pretty easy to try, so test it and see if you can spare the RAM it
> > requires.
>
> Always one of my favorite solutions to this sort of problem (dumb and fast)
> but in mod_perl won't this eat RAM x number of mod_perl threads???

No. If you load the data during startup (before the fork) it will be shared unless you modify it.

- Perrin
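A minimal sketch of what "load during startup" can look like. The package name, file format, and column order here are hypothetical; the point is that the code runs once in the Apache parent (e.g. pulled in via `PerlRequire startup.pl`), so the hash is built before the fork and shared copy-on-write as long as children only read it:

```perl
# startup.pl -- loaded once in the Apache parent, before it forks children.
package My::PlanetData;

use strict;
use warnings;

our %Planets;   # keyed on "planet|date", values are field arrayrefs

# Hypothetical loader: reads pipe-delimited rows of
# planet|date|ra|dec|distance|altitude|azimuth|visibility
sub load {
    my ($file) = @_;
    open my $fh, '<', $file or die "can't open $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        my ($planet, $date, @fields) = split /\|/, $line;
        $Planets{"$planet|$date"} = \@fields;
    }
}

sub distance {
    my ($planet, $date) = @_;
    my $rec = $Planets{"$planet|$date"} or return;
    return $rec->[2];   # assumed column order: ra, dec, distance, ...
}

1;
```

The crucial rule is the one Perrin states: never write to %Planets from a child process, or the touched pages get copied and the sharing is lost for those pages.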
RE: Large Data Set In Mod_Perl
> > perhaps something such as copying the whole 800,000 rows to
> > memory (as a hash?) on apache startup?
>
> That would be the fastest by far, but it will use a boatload of RAM.
> It's pretty easy to try, so test it and see if you can spare the RAM it
> requires.

Always one of my favorite solutions to this sort of problem (dumb and fast), but in mod_perl won't this eat RAM x number of mod_perl threads???

In this case one of the advantages of the DBMS is that it is one copy of the data that everyone shares.

mma
Re: Large Data Set In Mod_Perl
Hi there,

On Wed, 28 May 2003, Perrin Harkins wrote:
> simran wrote:
[snip]
> > * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_
[snip]
> you can get by with MLDBM or something similar.

You might also want to investigate using a compiled C Btree library which could be tuned specifically to your dataset. Hard work.

[snip]
> > perhaps something such as copying the whole 800,000 rows to memory
[snip]
> That would be the fastest by far, but it will use a boatload of RAM.

To economise on memory you could compress the data (or part of it) before storage/lookup, using a fast compress/decompress algorithm. There would be a tradeoff between memory consumption and processor cycles, of course. That kind of thing can get a bit complicated... :)

73,
Ged.
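A rough sketch of Ged's compress-before-store idea, using Compress::Zlib (bundled with modern Perls). The record layout and the NUL separator are invented for illustration, and whether this actually saves memory on short numeric fields is exactly the tradeoff he mentions, so it would need measuring:

```perl
use strict;
use warnings;
use Compress::Zlib qw(compress uncompress);

# In-memory store: one deflated record per planet+date key.
my %store;

sub put_record {
    my ($planet, $date, @fields) = @_;
    $store{"$planet|$date"} = compress( join("\0", @fields) );
}

# Each lookup pays a decompress; that's the cycles-for-memory trade.
sub get_field {
    my ($planet, $date, $index) = @_;
    my $packed = $store{"$planet|$date"} or return;
    my @fields = split /\0/, uncompress($packed);
    return $fields[$index];
}
```

Compressing whole records one at a time keeps random access cheap; compressing larger blocks of rows together would compress better but force decompressing a block per lookup.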
Re: Large Data Set In Mod_Perl
I've dealt with fairly large sets, but not as static as yours.

If your only keys for searching are planet and date, then a Perl lookup with a hash will be faster overall, since a DB lookup involves connecting to the database and doing the standard prepare/execute/fetch, which could be as costly (for a single lookup) as the lookup itself. The actual lookup of the record in the database is probably as fast as or faster than Perl (especially after the initial lookup that primes the caches) if you have indexed the columns on the table properly.

If you are planning to do lots of lookups on this dataset, preloading the dataset into a Perl hash would definitely be the better approach. If you are doing only a few lookups over a given period, it may not be worth taking up lots of memory for no reason, and sticking with the db lookup would probably be best.

For the perl hash, I would key the hash on the combo of planet and date, something like:

my %Planets = (
    jupiter => {
        "1900-01-01" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
        "1900-01-02" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
    },
    neptune => {
        "1900-01-01" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
        "1900-01-02" => [ "5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set" ],
    },
);

You could also just combine the planet and date into a single string for the hash key, like "jupiter1900-01-01", but I'm not really sure whether this buys you any performance. It might even be slightly slower, since it's working on a much larger single hash rather than a two-dimensional hash. It might be interesting to benchmark it on your size of dataset to see what really happens.

As for using DB_File, it would probably land somewhere between the Perl hash approach and the standard SQL database interface.
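Dale's benchmarking suggestion can be sketched with the standard Benchmark module. The dataset below is a small synthetic stand-in (a few planets over a few years), not the real 800,000-row set, so the relative numbers would need re-checking at full size:

```perl
use strict;
use warnings;
use Benchmark qw(cmpthese);

# Build both shapes -- nested (planet, then date) and flat
# (planet.date concatenated) -- from the same synthetic records.
my (%nested, %flat);
my @rec = ("5h 39m 18s", "+22o 4.0'", 28.922, -15.128, -164.799, "set");
for my $planet (qw(jupiter neptune mercury mars moon)) {
    for my $year (1900 .. 1905) {
        for my $month (1 .. 12) {
            for my $day (1 .. 28) {
                my $date = sprintf "%04d-%02d-%02d", $year, $month, $day;
                $nested{$planet}{$date}  = [@rec];
                $flat{ $planet . $date } = [@rec];
            }
        }
    }
}

# Compare a single-field lookup through each shape.
cmpthese(200_000, {
    nested => sub { my $d = $nested{jupiter}{'1900-01-05'}[2] },
    flat   => sub { my $d = $flat{'jupiter1900-01-05'}[2] },
});
```

cmpthese prints a rate table comparing the two subs, which answers Dale's "might be interesting to benchmark it" directly; swapping in the real data only changes the build loop.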
dale

- Original Message -
From: "simran" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Wednesday, May 28, 2003 9:29 PM
Subject: Large Data Set In Mod_Perl

> Hi All,
>
> For one of the websites i have developed (/am developing), i have a
> dataset that i must refer to for some of the dynamic pages.
>
> The data is planetary data that is pretty much in spreadsheet format,
> aka, i have just under 800,000 "rows" of data. I don't do any complex
> searches or functions on the data. I simply need to look at certain
> columns at certain times.
>
> sample data set:
>
> planet  | date       | right_ascension | declination | distance | altitude | azimuth  | visibility
> --------+------------+-----------------+-------------+----------+----------+----------+-----------
> jupiter | 1900-01-01 | 15h 57m 7s      | -19° 37.2'  |    6.108 |   10.199 |   39.263 | up
> mars    | 1900-01-01 | 19h 2m 20s      | -23° 36.7'  |    2.401 |   14.764 |    -4.65 | up
> mercury | 1900-01-01 | 17h 15m 16s     | -21° 59.7'  |    1.151 |   14.041 |   20.846 | up
> moon    | 1900-01-01 | 18h 41m 17s     | -21° 21.8'  |     58.2 |   17.136 |    0.343 | transit
> neptune | 1900-01-01 | 5h 39m 18s      | +22° 4.0'   |   28.922 |  -15.128 | -164.799 | set
>
> I need to be able to say:
>
> * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_
>
> Currently i do this using a postgres database, however, my question is,
> is there a quicker way to do this in mod_perl - would a DB_File or some
> other structure be better?
>
> I would be interested in knowing if others have dealt with large data
> sets as above and what solutions they have used.
>
> A DB is quick, but is there something one can use in mod_perl that would
> be quicker? perhaps something such as copying the whole 800,000 rows to
> memory (as a hash?) on apache startup?
>
> simran
Re: Large Data Set In Mod_Perl
simran wrote:
> I need to be able to say:
>
> * Lookup the _distance_ for the planet _mercury_ on the date _1900-01-01_

On the face of it, a relational database is best for that kind of query. However, if you won't get any fancier than that, you can get by with MLDBM or something similar.

> Currently i do this using a postgres database, however, my question is,
> is there a quicker way to do this in mod_perl - would a DB_File or some
> other structure be better?

A DBM file will be faster. What you can do is build a key out of planet + date, so that you grab the right record with a single access. Either use MLDBM for storing hashes inside each record, or just a simple join/split approach.

MySQL would probably also be faster than PostgreSQL for this kind of simple read-only querying, but not as fast as a DBM file. SDBM_File is the fastest DBM around, if you can live with the space limitations it has.

> perhaps something such as copying the whole 800,000 rows to memory (as a
> hash?) on apache startup?

That would be the fastest by far, but it will use a boatload of RAM. It's pretty easy to try, so test it and see if you can spare the RAM it requires.

- Perrin
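Perrin's DBM suggestion -- one key per planet+date, with a plain join/split instead of MLDBM -- might look roughly like this. The 'planets' file name, the tab separator, and the field order are assumptions for the sketch; SDBM_File ships with Perl, but note its per-record size limit (around 1K for key plus value), which these short rows fit comfortably:

```perl
use strict;
use warnings;
use Fcntl qw(O_RDWR O_CREAT);
use SDBM_File;

# Tie a hash to an on-disk SDBM database (creates planets.pag/planets.dir).
tie my %db, 'SDBM_File', 'planets', O_RDWR | O_CREAT, 0644
    or die "can't tie SDBM file: $!";

# Write side: one record per planet+date, fields joined on a tab.
sub store_record {
    my ($planet, $date, @fields) = @_;
    $db{"$planet|$date"} = join "\t", @fields;
}

# Read side: a single DBM access fetches the whole record.
sub lookup {
    my ($planet, $date) = @_;
    my $raw = $db{"$planet|$date"} or return;
    return split /\t/, $raw;   # (ra, dec, distance, altitude, azimuth, visibility)
}

store_record('mercury', '1900-01-01',
    '17h 15m 16s', q{-21d 59.7'}, 1.151, 14.041, 20.846, 'up');
my ($ra, $dec, $distance) = lookup('mercury', '1900-01-01');
```

Under mod_perl the tie would be done once per child at startup rather than per request, and since the file is read-only after the build step, all children can share it safely.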