Re: [RevServer tips] Spreading the load or why wise developers use asynchronous workflows

Richard Gaskin Wed, 04 Aug 2010 12:28:51 -0700

wayne durden wrote:

This is all very interesting to me because I am interested in moving a
desktop app that processes datafiles up to 100,000 lines which can mean for
each line comparing against the remainder (in reality sorts cuts this down a
great deal), but this can run for minutes on a desktop app and I have got to
cut it down into asynchronous processing as per your article...

I don't know the specifics of your data or your needs, but lately I've been experimenting with a variety of different ways to store data, and I've found that for many tasks using column-based storage over row-based storage can speed up searches and comparisons by orders of magnitude.

This is where the old acronyms OLAP and OLTP come in, with the "A" being "analytical" (analytics, data mining; mostly read operations) and the "T" being "transaction" (posting as well as reading). That's an oversimplification, but spending some time following the links out from those Wikipedia articles can lead to all sorts of different ways to store and index data for task-specific needs, which can radically reduce CPU and RAM consumption.

For example, if you had a data set of 300,000 address records stored in eight fields, you could store them in eight files, each holding only the values for a given column. Finding addresses in a given zip code would then no longer need to traverse the whole data set and parse each line, but merely pick up the one file for zip codes and "repeat for each" with those. Any columns you're not interested in for a given search are left on disk and take up zero RAM.
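To make the column-per-file idea concrete, here is a minimal sketch in Python rather than Rev, purely so it can be run anywhere. The field names, the sample records, and the helpers write_columns and find_in_column are all made up for illustration, and it assumes no value contains a line break, so that line N of every column file belongs to record N.

```python
import os
import tempfile


def write_columns(records, field_names, directory):
    """Write one text file per column; line N of each file belongs to record N."""
    for position, name in enumerate(field_names):
        path = os.path.join(directory, name + ".txt")
        with open(path, "w", encoding="utf-8") as column_file:
            for record in records:
                column_file.write(record[position] + "\n")


def find_in_column(directory, name, wanted):
    """Scan a single column file and return the matching record numbers."""
    path = os.path.join(directory, name + ".txt")
    matches = []
    with open(path, encoding="utf-8") as column_file:
        for record_number, line in enumerate(column_file):
            if line.rstrip("\n") == wanted:
                matches.append(record_number)
    return matches


# Hypothetical sample data: three fields instead of eight.
FIELDS = ["name", "city", "zip"]
RECORDS = [
    ("Ann", "Los Angeles", "90031"),
    ("Bob", "Pasadena", "91101"),
    ("Cy", "Los Angeles", "90031"),
]


def demo():
    """Split the sample into column files, then search only the zip column."""
    with tempfile.TemporaryDirectory() as directory:
        write_columns(RECORDS, FIELDS, directory)
        return find_in_column(directory, "zip", "90031")
```

Only the zip file is opened during the search; the name and city files are never read.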

Then there are other things one can add in, like cardinal indexing of column values for one-step searches across data sets of any size.

Quick example using the zip code exercise again: You write an indexer that runs through the data set and produces a stack file in which each of the custom property keys of the stack is a zip code, and the value of each property is a list of the ID numbers of all the records that have that zip code.


With that index you can now search in one step:

  get the uZipCodes["90031"] of stack "ZipIndex.rev"

...and you have an instant list of the ID of every record with that zip code.
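For readers who want to see the shape of such an indexer, here is a rough Python sketch of the same idea, with a dictionary standing in for the stack's custom property set. The record layout (an ID paired with a tuple of fields) and the name build_zip_index are assumptions for illustration only.

```python
from collections import defaultdict


def build_zip_index(records, zip_position):
    """Map each zip code to the list of IDs of the records that carry it."""
    index = defaultdict(list)
    for record_id, fields in records:
        index[fields[zip_position]].append(record_id)
    return dict(index)


# Hypothetical sample: (record ID, (name, city, zip)).
SAMPLE = [
    (101, ("Ann", "Los Angeles", "90031")),
    (102, ("Bob", "Pasadena", "91101")),
    (103, ("Cy", "Los Angeles", "90031")),
]

ZIP_INDEX = build_zip_index(SAMPLE, zip_position=2)
```

Once the index is built, ZIP_INDEX["90031"] is a single dictionary lookup, no matter how many records went into it.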


How to get the data once you've found those IDs?

There are an infinite number of ways to store data, but if you used even just simple tab-delimited files you'd be surprised how quickly you can get to what you want using the seek command if you write an index first.

Such a master index could also be a simple list of properties in a stack (by far the most efficient way to load persistent arrays in Rev, much faster than arrayDecode), in which each element key is the ID number of the record and each value is just two lines: the byte offset to the start of the record, and the length of the record.

With that relatively small index you can get any record anywhere in even a giant file in four lines:


   open file tMyDataFile for read
   seek to tRecordStart in file tMyDataFile
   read from file tMyDataFile for tRecordLength
   close file tMyDataFile

On my slow Mac here I can use that to pull a record out of a 500 MB file containing 300,000 records in about 50 MICROseconds.

Since an index for a file like that will take only a few MBs it can be loaded in no time, and the seek command doesn't load the whole data file into RAM, so the only memory consumption for getting the record is just the record itself + the index + the engine's normal overhead.
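Here is one way the master index and the seek-based lookup might look end to end, again sketched in Python rather than Rev. It assumes a tab-delimited file with one record per line whose first field is the record ID; the file name, the sample rows, and the helpers build_master_index and fetch_record are hypothetical.

```python
import os
import tempfile


def build_master_index(path):
    """Map record ID -> (byte offset, byte length) for every line in the file."""
    index = {}
    offset = 0
    with open(path, "rb") as data_file:
        for line in data_file:
            # Assumption: the first tab-delimited field is the record ID.
            record_id = line.split(b"\t", 1)[0].decode("utf-8")
            index[record_id] = (offset, len(line))
            offset += len(line)
    return index


def fetch_record(path, index, record_id):
    """Seek straight to one record and read only its bytes."""
    start, length = index[record_id]
    with open(path, "rb") as data_file:
        data_file.seek(start)
        return data_file.read(length).decode("utf-8").rstrip("\n")


def demo():
    """Write a tiny sample file, index it, and pull one record back out."""
    lines = [b"101\tAnn\t90031\n", b"102\tBob\t91101\n", b"103\tCy\t90031\n"]
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "addresses.tsv")
        with open(path, "wb") as data_file:
            data_file.writelines(lines)
        index = build_master_index(path)
        return index, fetch_record(path, index, "102")
```

The file is opened in binary mode so that the offsets are true byte positions even when the records contain multi-byte characters.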

Combine that with the cardinal indexing described above and you can slice and dice data any number of ways really quickly.

Of course this is only suited for OLAP-style tasks, dependent on the data not changing frequently, so it can be worthwhile indexing it without the indexing adding more overhead than it's worth. FWIW, on my slow Mac I can write the master index and two or three columnar cardinal indices in well under a minute.

For all sorts of tasks in which data is read far more frequently than written, you can use methods like this to get ultra-fast results with minimal resource consumption.

If the data on the server is not modified there but merely used as a data repository for your searches, you could do the indexing tasks on your desktop and just upload the index stacks to your server along with a copy of the file. The server load will always be minimal, and you can do some relatively massive tasks well under even most shared hosting limits.

Of course you could also use MySQL, CouchDB, or any number of other off-the-shelf solutions for much of this, but for some tasks you may find you can write an indexer and retriever faster in Rev than you could dig up the syntax to do it in another language. :)

WARNING: Once you start exploring indexing techniques you may become addicted; you will find yourself daydreaming about new methods at odd hours of the day, and time formerly spent with the family will suddenly become spent on the web learning even better methods. You may find yourself thinking about ways to use Rev's union and intersect commands on results from index searches to implement even complex AND and OR queries in one step. Turning data inside out can cause your mind to cave in on itself, and worse, you may like it. You have been warned.
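For the curious, the AND/OR idea can be sketched with plain sets. This is Python, not Rev: Rev's union and intersect commands work on array keys, and Python sets are used here only as a stand-in. The ID lists and the helper names ids_matching_all and ids_matching_any are invented for illustration.

```python
def ids_matching_all(*id_lists):
    """AND query: IDs that appear in every index result."""
    result = set(id_lists[0])
    for ids in id_lists[1:]:
        result &= set(ids)
    return sorted(result)


def ids_matching_any(*id_lists):
    """OR query: IDs that appear in at least one index result."""
    result = set()
    for ids in id_lists:
        result |= set(ids)
    return sorted(result)


# Hypothetical results from two index lookups.
IN_ZIP_90031 = [101, 103, 107]
IN_CITY_LOS_ANGELES = [103, 105, 107]

BOTH = ids_matching_all(IN_ZIP_90031, IN_CITY_LOS_ANGELES)
EITHER = ids_matching_any(IN_ZIP_90031, IN_CITY_LOS_ANGELES)
```

Each query touches only the ID lists returned by the indexes, never the data file itself.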


--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv
