Re: Reading a (BIG) text file one line at a time - in reality...

Richard Gaskin Tue, 23 Nov 2004 20:18:02 -0800

MisterX wrote:

Technology may have advanced in OS since the Apple ][ I started with but
still, VM is still a disk intensive task and more penalizing than you think
if not done right! Also remember that disk access is slower than ram and
let's not talk dma/cpuL1/cpuL2 counts of cache hits and misses... But...
Still a buffered read avoid any paging of memory (which goes back and forth to disk) and this saves about 90% of the time processing lines of a file.

As with most things in computing, it may be that the most accurate answer is "depends". ;) There are few absolutes in computing, with most solutions lying somewhere among the trade-offs between RAM and performance.

True, a buffered read is _generally_ faster, and certainly the most consistent across varied systems. But if this is a one-time project is the time savings in execution worth the time to script the buffered read routine? Is it a consumer app where you'd have no idea about system configuration, or an internal tool where you can run it on the system of your choice? Will most data files be this large, or is this rare in most day-to-day uses of the app?

These and other considerations may affect outcomes, as we'll see in the benchmark results below.

Again, I have a fully working free source code example...

And a blocking bug...

Since reading files is unrelated to the Geometry Manager issue (which can be worked around anyway in far less time than waiting for a fix, as discussed earlier), there was nothing to stop me from writing a quick benchmark script to try this out.

Before I could run the test I needed some data to test with, so I made a file with 100,000 lines, each with a randomly variable number of randomly-lengthed words, totalling 110.3 MBs when I generated it (the script used to generate it is copied below).

Then I wrote a quickie benchmark script which runs the three file-reading methods in question here:

Method 1 - Read all: Uses "put url..." to read the whole file in
                     one move, then uses "repeat for each" to
                     process lines.

Method 2 - Ready by line:  Uses "read from file...until cr", which is
                     the most convenient but of course the slowest
                     since it needs to compare each incoming character
                     as it reads.

Method 3 - Buffered: Uses an off-the-cuff buffering algorithm to grab
                     data from disk in 32k chunks, then finds lines
                     within the buffer.

Since the original question was about processing lines, the goal of all three methods was to do something with the contents of each line, so I arbitrarily chose to count the words in each line and add them to a total. While a meaningless task in itself, it allowed me to verify the accuracy of the methods by comparing the resulting word counts.

The benchmarking script is also copied below, and was run on a PowerBook G4 with a single 1 GHz processor and 768 MB total built-in memory.

It should be noted that the buffered-read method I wrote is likely not the fastest possible, but was thrown together quickly to get a reasonable test result. I'm sure one of the folks here can improve it, but given the overhead inherent in managing a buffer and multiple disk accesses the results were roughly in line with what I would expect.

I ran the test three times, each time with more apps open to further decrease the amount of available physical memory. In between each run I quit Rev to minimize side-effects from the last run, but note that I did not run multiple series in which the order of the tests was changed nor did I turn off networking, which may possibly affect outcomes. I did however include code outside of the timing to empty variables and free up time for the system to do housekeeping, and checked that all background apps were using zero or close to zero CPU time, so the results may not be too bad:


200 MB free RAM
---------------
Read all:      5.676 secs
Read by line: 87.298 secs
Read buffered: 7.017 secs


64 MB free RAM
--------------
Read all:      9.548 secs
Read by line: 92.606 secs
Read buffered: 7.601 secs


10 MB free RAM
--------------
Read all:     38.101 secs
Read by line: 94.309 secs
Read buffered: 7.734 secs

This test bears out the strengths and weaknesses of each method relative to the system configuration:

- If you have enough RAM for the data, reading the whole file at
  once can be a bit faster.

- If you don't have enough RAM for the whole file but there's enough
  for efficient swapping with VM, the buffered read was only faster
  by 2 secs per 110MBs of data.

- And of course if you starve your system before running such an
  intensive routine (I had to open two 3D apps, GoLive, PhotoShop,
  QuickTime Player, iTunes, and a bunch of others to bring
  available RAM down to 10MB), reading the whole file still beats
  reading each line by a huge margin but the VM swap gives the
  clear performance advantage to the buffered read.

- The buffered read method is the most scalable in terms of its
  relationship to available RAM, but less so in terms of data
  size: on small files the overhead of multiple disk accesses
  shows itself -- here are results on a 64k data file containing
  535 lines:

    Read all:      0.004 secs
    Read by line:  0.051 secs
    Read buffered: 0.011 secs

  I would guess that an optimized buffering routine would bring its
  results for small files more on par with the "read all" method,
  but not likely surpass it.

While fairly unscientific, the test was cost-effective in terms of letting me get back to work quickly. As with the Geometry Manager issue, it took longer to run the tests and write about them here than it did to write the script. ;) Carpe diem.

If any of you have time to improve the buffering method below I'd be interested in any significant changes to your test results.

--
 Richard Gaskin
 Fourth World Media Corporation
 __________________________________________________
 Rev tools and more: http://www.fourthworld.com/rev

--------------------------------------------------------
--
-- Script for generating random dummy data:
--
on mouseUp
  ask file "Name new test file:" with "TestData.txt"
  if it is empty then exit to top
  put it into tFile
  --
  open file tFile for text write
  repeat with i = 1 to 100000 -- lines
    if i mod 1000 = 0 then
      put i
      set the cursor to busy
    end if
    --
    put empty into tLine
    repeat random(100)+50 -- words, between 50 and 150 per line
      put empty into tWord
      repeat random(20) -- chars, between 1 and 20 per word
        put "x" after tWord
      end repeat
      put tWord & space after tLine
    end repeat
    write tLine & cr to file tFile
  end repeat
  close file tFile
end mouseUp


---------------------------------------------------------
--
-- Script for benchmarking three different methods
-- of reading a large file:

on mouseUp
  answer file "Select a text file:"
  if it is empty then exit to top
  put it into tFile
  --
  --
  -- Method 1: read all
  --
  put "Doing method 1..."
  get empty -- clear "it"
  wait 2 secs with messages -- give system some idle time
  put the millisecs into t
  --
  put 0 into tWordCount1
  put url ("file:"&tFile) into tVar
  repeat for each line tLine in tVar
    add the number of words of tLine to tWordCount1
  end repeat
  --
  put the millisecs - t into t1
  --
  --
  -- Method 2: read by line
  --
  put "Doing method 2..."
  put empty into tVar -- clear var
  get empty
  wait 2 secs with messages
  put the millisecs into t
  --
  put 0 into tWordCount2
  open file tFile for text read
  repeat
    read from file tFile until cr
    if it is empty then exit repeat
    add the number of words of it to tWordCount2
  end repeat
  close file tFile
  --
  put the millisecs - t into t2
  --
  --
  -- Method 3: buffered read
  --
  put "Doing method 3..."
  get empty
  wait 2 secs with messages
  put the millisecs into t
  --
  put 0 into tWordCount3
  open file tFile for text read
  put empty into tBuffer
  repeat
    read from file tFile for 32000
    if it is empty then exit repeat
    put it after tBuffer
    --
    if len(it) < 32000 then -- EOF
      put the number of lines of tBuffer into tNumLines
    else
      --  last line may be incomplete
      put the number of lines of tBuffer - 1 into tNumLines
    end if
    repeat tNumLines
      get line 1 of tBuffer
      add the number of words of it to tWordCount3
      delete line 1 of tBuffer
    end repeat
  end repeat
  --
  put the millisecs - t into t3
  --
  --
  put "Read all: "&t1/1000 &" secs    Read by line: "&t2/1000&\
      " secs    Read buffered: "&t3/1000 &" secs" \
      &cr& tWordCount1 && tWordCount2 && tWordCount3

end mouseUp


_______________________________________________
use-revolution mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/use-revolution

Re: Reading a (BIG) text file one line at a time - in reality...

Reply via email to