wayne durden wrote:

This is all very interesting to me because I am interested in moving a
desktop app that processes datafiles up to 100,000 lines which can mean for
each line comparing against the remainder (in reality sorts cut this down a
great deal), but this can run for minutes on a desktop app and I have got to
cut it down into asynchronous processing as per your article...

I don't know the specifics of your data or your needs, but lately I've been experimenting with a variety of different ways to store data, and I've found that for many tasks using column-based storage over row-based storage can speed up searches and comparisons by orders of magnitude.

This is where the old acronyms OLAP and OLTP come in, with the "A" being "analytical" (analytics, data mining; mostly read operations) and the "T" being "transaction" (posting as well as reading). That's an oversimplification, but spending some time following the links out from those two Wikipedia articles can lead to all sorts of different ways to store and index data for task-specific needs, which can radically reduce CPU and RAM consumption.

For example, if you had a data set of 300,000 address records with eight fields each, you could store them in eight files, each of which holds only the values for a given column. Finding the addresses in a given zip code would then no longer need to traverse the whole data set and parse each line, but merely pick up the one file for zip codes and "repeat for each" with those. Any columns you're not interested in for a given search are left on disk and take up zero RAM.
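To make the column-file idea concrete, here is a minimal sketch. It is written in Python rather than Rev purely for illustration, and the field names, the ".col" file extension, and the sample records are all made up. It assumes no value contains a line break, so that line N of every column file belongs to record N.

```python
import os
import tempfile

def split_into_columns(rows, field_names, out_dir):
    """Write one file per column; line N of each file belongs to record N."""
    paths = {}
    for col, name in enumerate(field_names):
        path = os.path.join(out_dir, name + ".col")
        with open(path, "w", encoding="utf-8", newline="\n") as f:
            for row in rows:
                f.write(row[col] + "\n")
        paths[name] = path
    return paths

def find_record_numbers(column_path, wanted):
    """Scan a single column file and return the matching record numbers."""
    with open(column_path, encoding="utf-8") as f:
        return [n for n, line in enumerate(f) if line.rstrip("\n") == wanted]

# Hypothetical sample data: three address records with three fields each.
fields = ["name", "city", "zip"]
rows = [
    ["Ann", "Los Angeles", "90031"],
    ["Bob", "Pasadena", "91101"],
    ["Cy", "Los Angeles", "90031"],
]
out_dir = tempfile.mkdtemp()
paths = split_into_columns(rows, fields, out_dir)

# Only the zip column file is opened; the other columns stay on disk.
matches = find_record_numbers(paths["zip"], "90031")
```

The search touches one small file instead of parsing every field of every record, which is where the savings come from.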

Then there are other things one can add in, like cardinal indexing of column values for one-step searches across data sets of any size.

Quick example using the zip code exercise again: You write an indexer that runs through the data set and produces a stack file in which each of the custom property keys of the stack is a zip code, and the value of each property is a list of the ID numbers of all the records that have that zip code.

With that index you can now search in one step:

  get the uZipCodes["90031"] of stack "ZipIndex.rev"

...and you have an instant list of the ID of every record with that zip code.
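For readers who want to see the shape of such an indexer, here is a rough sketch in Python rather than Rev. A plain dictionary stands in for the stack's custom property set, and the record IDs, field layout, and function name are assumptions made for the example.

```python
from collections import defaultdict

def build_zip_index(records, zip_field):
    """Map each zip code to the list of record IDs that carry it."""
    index = defaultdict(list)
    for record_id, fields in records.items():
        index[fields[zip_field]].append(record_id)
    return dict(index)

# Hypothetical records keyed by ID; the zip code is the third field.
records = {
    101: ["Ann", "Los Angeles", "90031"],
    102: ["Bob", "Pasadena", "91101"],
    103: ["Cy", "Los Angeles", "90031"],
}
zip_index = build_zip_index(records, zip_field=2)

# The search itself is a single key lookup, however large the data set is.
ids = zip_index.get("90031", [])
```

The indexing pass walks the whole data set once; every search after that is one lookup.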

How to get the data once you've found those IDs?

There are an infinite number of ways to store data, but if you used even just simple tab-delimited files you'd be surprised how quickly you can get to what you want using the seek command if you write an index first.

Such a master index could also be a simple list of properties in a stack (by far the most efficient way to load persistent arrays in Rev, much faster than arrayDecode), in which each element key is the ID number of the record and each value is just two lines: the byte offset to the start of the record, and the length of the record.

With that relatively small index you can get any record anywhere in even a giant file in four lines:

   open file tMyDataFile for read
   seek to tRecordStart in file tMyDataFile
   read from file tMyDataFile for tRecordLength
   close file tMyDataFile

On my slow Mac here I can use that to pull a record out of a 500 MB file containing 300,000 records in about 50 MICROseconds.

Since an index for a file like that will take only a few MBs it can be loaded in no time, and the seek command doesn't load the whole data file into RAM so the only memory consumption for getting the record is just the record itself + the index + the engine's normal overhead.
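The master index and the seek-based retrieval can be sketched as follows, again in Python rather than Rev. The sketch assumes a tab-delimited file with one record per line and the record ID in the first field; the file name and sample data are invented for the example.

```python
import os
import tempfile

def build_offset_index(path):
    """Map record ID -> (byte offset, byte length) for each line of the file."""
    index = {}
    offset = 0
    with open(path, "rb") as f:
        for line in f:
            record_id = line.split(b"\t", 1)[0].decode("utf-8")
            index[record_id] = (offset, len(line))
            offset += len(line)
    return index

def read_record(path, index, record_id):
    """Seek straight to one record and read only its bytes."""
    start, length = index[record_id]
    with open(path, "rb") as f:
        f.seek(start)
        return f.read(length).decode("utf-8").rstrip("\n")

# Hypothetical data file with three records.
path = os.path.join(tempfile.mkdtemp(), "addresses.tsv")
with open(path, "wb") as f:
    f.write(b"101\tAnn\t90031\n102\tBob\t91101\n103\tCy\t90031\n")

offsets = build_offset_index(path)
record = read_record(path, offsets, "102")
```

The file is opened in binary mode so that offsets are counted in bytes, not characters; otherwise multi-byte text would throw the seek positions off.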

Combine that with the cardinal indexing described above and you can slice and dice data any number of ways really quickly.

Of course this is only suited to OLAP-style tasks, where the data changes infrequently enough that indexing it doesn't add more overhead than it's worth. FWIW, on my slow Mac I can write the master index and two or three columnar cardinal indices in well under a minute.

For all sorts of tasks in which data is read far more frequently than written, you can use methods like this to get ultra-fast results with minimal resource consumption.

If the data on the server is not modified there but merely used as a data repository for your searches, you could do the indexing tasks on your desktop and just upload the index stacks to your server along with a copy of the file. The server load will always be minimal, and you can do some relatively massive tasks well under even most shared hosting limits.

Of course you could also use MySQL, CouchDB, or any number of other off-the-shelf solutions for much of this, but for some tasks you may find you can write an indexer and retriever faster in Rev than you could dig up the syntax to do it in another language. :)


WARNING: Once you start exploring indexing techniques you may become addicted; you will find yourself daydreaming about new methods at odd hours of the day, and time formerly spent with the family will suddenly be spent on the web learning even better methods. You may find yourself thinking about ways to use Rev's union and intersect commands on results from index searches to implement even complex AND and OR queries in one step. Turning data inside out can cause your mind to cave in on itself, and worse, you may like it. You have been warned.
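As a taste of those AND and OR queries, here is a small sketch in Python rather than Rev. Python sets stand in for Rev's union and intersect commands, which work on arrays, and both indexes and their contents are hypothetical.

```python
# Two hypothetical indexes, each mapping a column value to record IDs.
zip_index = {"90031": [101, 103, 104], "91101": [102]}
city_index = {"Los Angeles": [101, 103], "Pasadena": [102, 104]}

def ids_for(index, key):
    """Return the IDs for one index key as a set (empty if the key is absent)."""
    return set(index.get(key, ()))

# AND: records in zip 90031 that are also in Los Angeles (intersection).
both = ids_for(zip_index, "90031") & ids_for(city_index, "Los Angeles")

# OR: records in zip 91101 or in Los Angeles (union).
either = ids_for(zip_index, "91101") | ids_for(city_index, "Los Angeles")
```

Neither query touches the data file; only the matching IDs need to be fetched afterwards.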

--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv
_______________________________________________
use-revolution mailing list
