Re: Extracting a column

Richard Gaskin Mon, 10 Dec 2007 11:39:38 -0800

Jim Ault wrote:

--number of cols, and the length of the content before the column for
extracting could be the biggest factor.  Col 2 extract could be a lot faster
than col 9.  In most cases, knowing which column(s) you wish to extract will
mean you adjust your file format to put these closest to the first item.  If
you inherit the data or don't have a choice... c'est la guerre.


Excellent thoughts.

I modified the test to generate data rather than using the canned data supplied, adding this near the top of the test handler:


  put "4" into t
  -- Make cols:
  put empty into tRow
  repeat with i = 1 to 500
    -- s is never initialized, so it starts out as the literal "s"
    put s & t into s
    put s & tab after tRow
  end repeat
  -- make rows:
  put empty into tData
  repeat with i = 1 to 500
    put tRow &cr after tData
  end repeat
  delete last char of tData
  -- Verify sizes:
  set the itemdel to tab
  answer "Cols: "&the number of items of line 1 of tData &\
      cr&"Rows: "& the number of lines of tData &\
      cr&"Size: "&len(tData)

This gave me a data set of 500 cols with 500 rows, with each column containing one more character than the last, the longest being 501 chars, with a total size of 63,125,499 chars. I left the functions themselves unchanged.
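
The reported size can be cross-checked by hand. Here is a minimal sketch of that arithmetic, assuming the uninitialized variable s starts out as the one-character literal "s" (which is what would make the longest column 501 chars). The names tRowLen and tExpected are introduced here for illustration only:

```livecode
  put 0 into tRowLen
  repeat with i = 1 to 500
    -- column i holds i+1 chars, followed by one tab
    add (i + 1) + 1 to tRowLen
  end repeat
  -- every row gets a trailing cr; the very last cr is deleted
  put (tRowLen + 1) * 500 - 1 into tExpected
  answer tExpected
```

That works out to 126,250 chars per row and 63,125,499 chars overall, which agrees with the figure reported by the test.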


Having it get column 490 gave me these results:

  Split: 32110 ms (0.16 lines/ms)
  Repeat: 3946 ms (1.27 lines/ms)
  Same results?: true

Getting column 2 gave me:

  Split: 39192 ms (0.13 lines/ms)
  Repeat: 2495 ms (2 lines/ms)
  Same results?: true

So then I tried a very horizontal data set of just 20 rows but with 2000 columns in each, for a total size of 40,100,019 chars.


Grabbing column 1999 from this data set gave me:

  Split:  7849 ms (0.03 lines/ms)
  Repeat: 2328 ms (0.09 lines/ms)
  Same results?: true

So I think what we're seeing is that the overhead of parsing applies to both methods.

On the one hand, the "split" command ramps more gracefully the more horizontal the data gets when accessing items at the end of each row, but on the other hand its performance remains roughly the same no matter which item is obtained, while "repeat for each" shows improvement with items closer to the left. And in all cases tested, "repeat for each" continues to best "split" in overall performance.
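
For readers who don't have the earlier messages handy, the two approaches being compared look roughly like the sketch below. These are not the functions used in the timings above (those are not reproduced in this message); the handler names colByRepeat and colBySplit are hypothetical, and the "split ... by column" form is assumed to be available in your version of the engine:

```livecode
function colByRepeat pData, pCol
  -- One pass over the lines; "item pCol" scans each line from its start,
  -- so columns nearer the left are cheaper to reach
  set the itemDelimiter to tab
  put empty into tOut
  repeat for each line tLine in pData
    put item pCol of tLine & cr after tOut
  end repeat
  delete last char of tOut
  return tOut
end colByRepeat

function colBySplit pData, pCol
  -- Turns the whole table into an array keyed by column number,
  -- so the cost does not depend on which column is requested
  set the columnDelimiter to tab
  split pData by column
  return pData[pCol]
end colBySplit
```

Under those assumptions, colByRepeat(tData, 2) and colBySplit(tData, 2) should both return the second column as a return-delimited list, which is the kind of equivalence the "Same results?" check in the timings is verifying.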

I imagine we could come up with a data set for which "split" outperforms "repeat for each", but my data sets are well under 20 MBs each (more commonly < 5 MBs), almost never exceeding 150 columns, and each column in a given row would very rarely contain more than 1k, so these tests cover most real-world scenarios for my needs.

Just the same, if someone comes up with a real-world scenario in which "split" outperforms "repeat for each" I'd be very interested in learning what that data looks like and how it's used.


--
 Richard Gaskin
 Managing Editor, revJournal
 _______________________________________________________
 Rev tips, tutorials and more: http://www.revJournal.com