Hello everyone,

I've benefited immeasurably from the thoughtful comments of everyone on this list, so I'm throwing in my two cents for you to critique. The following is a primitive handler (because I'm a primitive scripter) I wrote in MetaCard a couple of years ago for reading and processing large text files.

The project was to index UniGene data files. These are flat-file database files from the Human Genome Project, and they run roughly 450 to 500 MB each. I processed a file simply by reading it one record at a time. Records vary from a few lines to many hundreds of lines, contain an unknown number of variables each, and are delimited by ">>" or "//"; the variable recordDelimiter is set accordingly. The data from each read is put into thisRecord and processed. Processing involves breaking each record down into its variables and writing them to separate index files, effectively creating a relational database.
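
To give a concrete (and purely illustrative) picture of that processing step, here is a minimal sketch, not my production code, that splits one record into name/value pairs. It assumes each line of a record begins with a variable name followed by its value, and that the record's ID is the second word of its first line; both assumptions are mine for the example.

function parseRecord thisRecord
  -- assumption: the record ID is the second word of the first line
  put word 2 of line 1 of thisRecord into recordID
  put empty into indexLines
  repeat for each line thisLine in thisRecord
    put word 1 of thisLine into varName -- assumed: the line starts with the variable name
    put word 2 to -1 of thisLine into varValue
    -- one tab-delimited index line per variable: name, ID, value
    put varName & tab & recordID & tab & varValue & return after indexLines
  end repeat
  return indexLines
end parseRecord

In practice the lines returned here would be appended to the corresponding index files, either right away or in batches, as discussed next.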

What made a big difference in processing speed was how frequently the variables in which processed data accumulated were written to the index files on disk and then emptied before continuing. Likewise, in searching and extracting information from the resulting index files, speed was improved by experimenting with the amount of data read as a search progressed. Perhaps, in your case, it is best to read in many lines at a time and process them all together. It would not be too difficult to write a script that adjusts the number of lines it reads and processes as it goes along to find an optimum. Reading and writing too often reduces speed, but so does processing too much data at a time. I haven't had an opportunity to study Richard's, Jacqueline's, and Xavier's posts on buffers, but I'm guessing that they deal with the optimal use of memory in this regard.
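
As an illustration of that flushing idea, here is a sketch (again, not my production code) of a handler that accumulates index lines in a global variable and writes them out every N records. The names gBuffer, gRecordCount, and appendToIndexFiles are made up for the example; appendToIndexFiles stands in for whatever routine actually appends the buffered lines to the index files.

on accumulateAndFlush newIndexLines
  global gBuffer, gRecordCount
  if gRecordCount is empty then put 0 into gRecordCount -- first call
  put 500 into kFlushEvery -- flush interval; tune this by experiment
  put newIndexLines after gBuffer
  add 1 to gRecordCount
  if gRecordCount mod kFlushEvery = 0 then
    appendToIndexFiles gBuffer -- hypothetical routine that writes to the index files
    put empty into gBuffer -- empty the accumulator so it never grows too large
  end if
end accumulateAndFlush

Raising or lowering kFlushEvery is exactly the kind of experiment I mean: too small and you pay for constant disk writes, too large and the accumulating variable itself slows things down.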

Below the handler is an example of one of my logs from indexing, done on a modest 300 MHz G3 iBook. Processing a 483 MB file took 29 minutes. This may seem like a long time, but it only has to be done once; after that, searches are done on the index files. Scroll down and you'll see another log, this one from one of the searches on the index files. A search for 2,065 genetic probes, including the merging and summarizing of related data from the index files, took 1.3 minutes. That isn't too shabby considering that most scientific web sites either restrict users to one query at a time (imagine submitting 2,065 queries) or, if they permit batch queries, offer little choice in the format of the output or merge files. This is all to say that we're pleased.


put "//" into recordDelimiter -- or ">>", depending on the source file
open file filePath for read
repeat
  read from file filePath until recordDelimiter
  put the result into resultOfRead -- empty on success, "eof" at the end of the file
  put it into thisRecord
  -- START PROCESSING AS DESIRED
  -- Script for processing thisRecord goes here.
  -- Monitor the variables or containers in which you allow data to accumulate,
  -- so that they don't get too big and slow down performance. Experiment to
  -- find the optimal number of times to write accumulated data to disk files.
  if resultOfRead is not empty then exit repeat -- we've reached the end of the file (eof)
end repeat
close file filePath


Here is a sample log from one of the runs to index the source file.
Index created: Fri, Oct 24, 2003 11:44:04 AM
Target file: "UniGene Human 23 Oct 2003.txt"
Index set name (folder): "UniGene Human 23 Oct 2003 Index"
Index key (first column of every index file): "ID"
Number of records in source: 127,835
Number of variables found: 12
Total indexing time: about 28.9 minutes
Data processed: 483.01 MB
Indexing performance: 279.02 KB per second

Here is a sample log from one of the searches run on the index files that were created.
Your data file: "Human Probes 2065 x 68.txt"
Target file: "Consolidated SEQUENCE.txt"
Target set (folder): "UniGene Human 23 Oct 2003 Index"
Extraction set (folder): "Human Extraction"
Observations (lines) searched: 127,835
Hits: 1,631 (chunks found)
Misses: 360 (chunks not found)
Total unique chunks: 1,991
Fuzzy hits: not flagged for extractions from consolidated files
Records extracted: 1,562 (records where chunks were found)
Times data file read: 43
Average read size: 1.03 MB
Average lines per read: 2,973
Total bytes read: 44.34 MB
Extraction time: about 1.3 minutes


        Regards,

        Gregory Lypny
        __________________________
        Associate Professor of Finance
        John Molson School of Business
        Concordia University
        Montreal, Canada
