Jim Ault wrote:
--number of cols, and the length of the content before the column for
extracting could be the biggest factor.  Col 2 extract could be a lot faster
than col 9.  In most cases, knowing which column(s) you wish to extract will
mean you adjust your file format to put these closest to the first item.  If
you inherit the data or don't have a choice... c'est la guerre.

Excellent thoughts.

I modified the test to generate data rather than using the canned data supplied, adding this near the top of the test handler:

  put "4" into t
  put "s" into s  -- explicit seed; an uninitialized s evaluates to its own name, "s"
  -- Make cols:
  put empty into tRow
  repeat with i = 1 to 500
    put s & t into s
    put s & tab after tRow
  end repeat
  -- make rows:
  put empty into tData
  repeat with i = 1 to 500
    put tRow &cr after tData
  end repeat
  delete last char of tData
  -- Verify sizes:
  set the itemdel to tab
  answer "Cols: "&the number of items of line 1 of tData &\
      cr&"Rows: "& the number of lines of tData &\
      cr&"Size: "&len(tData)

This gave me a data set of 500 columns by 500 rows, with each column containing one more character than the one before it (2 chars in the first, 501 in the last), for a total size of 63,125,499 chars. I left the functions themselves unchanged.
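
The reported size can be cross-checked arithmetically. The sketch below is a hypothetical helper (expectedSize is not part of the original test) and assumes s starts out as the single character "s", which is what the reported figures imply: column k then holds k + 1 chars, each row adds one tab per column plus a cr, and the final cr is deleted.

```livecode
function expectedSize pCols, pRows
  -- Column k holds k + 1 chars (the seed "s" plus k copies of "4"),
  -- so the cells of one row total pCols * (pCols + 3) / 2 chars.
  put pCols * (pCols + 3) / 2 into tCellChars
  -- Each row also carries one tab per column plus a trailing cr.
  put tCellChars + pCols + 1 into tRowChars
  -- The final cr is deleted from the finished data.
  return pRows * tRowChars - 1
end expectedSize
```

Under those assumptions, expectedSize(500, 500) comes to 63125499, which matches the size shown by the answer dialog.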

Having it get column 490 gave me these results:

  Split: 32110 ms (0.16 lines/ms)
  Repeat: 3946 ms (1.27 lines/ms)
  Same results?: true

Getting column 2 gave me:

  Split: 39192 ms (0.13 lines/ms)
  Repeat: 2495 ms (2 lines/ms)
  Same results?: true


So then I tried a very horizontal data set of just 20 rows but with 2000 columns in each, for a total size of 40,100,019 chars.

Grabbing column 1999 from this data set gave me:

  Split:  7849 ms (0.03 lines/ms)
  Repeat: 2328 ms (0.09 lines/ms)
  Same results?: true


So I think what we're seeing is that the overhead of parsing applies to both methods.

On the one hand, the "split" command ramps more gracefully the more horizontal the data gets when accessing items at the end of each row: it was about 8 times slower than "repeat for each" for column 490 of 500, but only about 3.4 times slower for column 1999 of 2000. On the other hand, its performance remains roughly the same no matter which item is obtained, while "repeat for each" improves with items closer to the left. And in all cases tested, "repeat for each" continues to best "split" in overall performance.
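
The functions under test are not shown in this message, so the following is only a guess at their general shape, with hypothetical names (colByRepeat, colBySplit). It is meant to illustrate why the two methods could behave as described: a chunk expression such as "item n" has to scan the row from its start, while "split" has to convert the entire row no matter which element is wanted.

```livecode
-- Hypothetical sketch of a chunk-expression approach.
function colByRepeat pData, pCol
  set the itemDelimiter to tab
  put empty into tOut
  repeat for each line tLine in pData
    -- "item pCol" scans tLine from its first char,
    -- so columns near the left are cheaper to reach.
    put item pCol of tLine & cr after tOut
  end repeat
  delete last char of tOut
  return tOut
end colByRepeat

-- Hypothetical sketch of an array approach.
function colBySplit pData, pCol
  put empty into tOut
  repeat for each line tLine in pData
    -- Work on a copy so the loop variable is left untouched.
    put tLine into tCells
    -- split turns the whole row into an array keyed 1..n,
    -- so the work is the same whichever column is requested.
    split tCells by tab
    put tCells[pCol] & cr after tOut
  end repeat
  delete last char of tOut
  return tOut
end colBySplit
```

Both sketches return the requested column as a cr-delimited list, so their outputs can be compared directly, much like the "Same results?" check in the timings above.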

I imagine we could come up with a data set for which "split" outperforms "repeat for each". But my data sets are well under 20 MB each (more commonly < 5 MB), almost never exceed 150 columns, and a column in a given row would very rarely contain more than 1k, so these tests cover most real-world scenarios for my needs.

Just the same, if someone comes up with a real-world scenario in which "split" outperforms "repeat for each", I'd be very interested in learning what that data looks like and how it's used.

--
 Richard Gaskin
 Managing Editor, revJournal
 _______________________________________________________
 Rev tips, tutorials and more: http://www.revJournal.com