Re: Another Revolution Success Story

Alex Tweedly Thu, 01 Jul 2004 08:00:10 -0700

At 00:25 01/07/2004 -0700, Richard Gaskin wrote:

I asked around on this some time ago, including quite a few programmers far smarter than me. The best algorithm we could come up with was one which walks through the data char by char, keeping track of when it's in field data and when it leaves the field, noting that commas are escaped inconsistently in MS products and not all fields have their data enclosed in quotes (FM Pro-exported CSV does, but it's a smarter tool in general than most of the oddities that come out of Redmond <g>).
My post from 14 June 2002 with my own CSV2Tab function is at
<http://lists.runrev.com/pipermail/metacard/2002-June/001767.html>.
Hats off to anyone who can improve it's speed, and a bottle of
12-year-old single malt to anyone who can come up with an algorithm I
can use which is at least twice as fast.


Now there's a challenge I can relate to :-)

BUT - the speed of the conversion depends on the data ... Enclosed below is a version of the script which is between 10% and 90% faster - and probably has potential to go even faster than that.

It uses the same set up as you did - so there are no quotes left except those around fields.

Then instead of walking through the data char by char, it use "split()" to divide into an array; the array elements must then alternate between in-quotes and not-in-quotes.

Each array element has only the relevant processing applied.

Note - the speed of the original is (roughly) based on the number of characters, while the speed of the new version is (very roughly) based on the number of quoted fields - so for a file of mainly short fields, all of which are quoted, it is only 10% or so faster (and there could be cases where it would even be slower). For a file with many unquoted fields, or where each field is quite large, it will be significantly faster.

function CSV2TabNew pData
  local tNuData -- contains tabbed copy of data
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  --                          breaks which would be misread as records;
  --                          replaced later during dislay
  local tEscapedQuotePlaceholder -- used for keeping track of quotes in data
  local tInQuotedText -- flag set while reading data between quotes
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
  --                       distinction between quotes in data and those
  --                       used in delimiters
  --
  -- Normalize line endings:
  replace crlf with cr in pData          -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX
  --
  -- Put placeholder in escaped quote (non-delimiter) chars:
  replace ("\"&quote) with tEscapedQuotePlaceholder in pData
  --
  put space before pData   -- to avoid ambiguity of starting context
  split pData by quote
  put False into tInsideQuoted
  repeat for each element k in pData
    if (tInsideQuoted) then
      replace cr with tReturnPlaceholder in k
      put k after tNuData
      put False into tInsideQuoted
    else
      replace comma with tab in k
      put k after tNuData
      put true into tInsideQuoted
    end if
  end repeat
  --
  delete char 1 of tNuData -- remove the leading space
  replace tEscapedQuotePlaceholder with quote in tNuData
  return tNuData
end CSV2TabNew

Note also - this has about the same number of "fragilities" as the original (they both fail if the file is mal-formed in about the same number of ways). They also both fail if the original data contained any "escape"s (i.e. "\" chars) - they would be doubled in the original data and should be checked for before the set-up.

IMNSHO, CSV2Tab should be a built-in function. If there's some agreement on this and a willingness to vote for it I'll post the request to Bugzilla.

I'd suggest requesting that it be parameterized to handle the common variants of quoting and non-quoting. There's a good discussion of the problem (including ways it can go wrong beyond what we've talked about here), and a public domain implementation at http://www.python.org/peps/pep-0305.html#id7

The interface is perhaps wrong for Transcript, but the range of solutions it covers would be a good place to start.

-- Alex Tweedly.

---
Outgoing mail is certified Virus Free.
Checked by AVG anti-virus system (http://www.grisoft.com).
Version: 6.0.707 / Virus Database: 463 - Release Date: 15/06/2004

_______________________________________________
use-revolution mailing list
[EMAIL PROTECTED]
http://lists.runrev.com/mailman/listinfo/use-revolution

Re: Another Revolution Success Story

Reply via email to