At 00:25 01/07/2004 -0700, Richard Gaskin wrote:
> I asked around on this some time ago, including quite a few programmers far smarter than me. The best algorithm we could come up with was one which walks through the data char by char, keeping track of when it's in field data and when it leaves the field, noting that commas are escaped inconsistently in MS products and not all fields have their data enclosed in quotes (FM Pro-exported CSV does, but it's a smarter tool in general than most of the oddities that come out of Redmond <g>).
>
> My post from 14 June 2002 with my own CSV2Tab function is at <http://lists.runrev.com/pipermail/metacard/2002-June/001767.html>.
>
> Hats off to anyone who can improve its speed, and a bottle of 12-year-old single malt to anyone who can come up with an algorithm I can use which is at least twice as fast.

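
For comparison, the char-by-char approach described above can be sketched roughly as follows. This is not the CSV2Tab function from the linked post; the handler name CSV2TabCharWalk is made up for illustration, and the sketch assumes that quotes inside field data have already been swapped for a placeholder, so every quote it meets is a field delimiter.

```livecode
-- Hypothetical sketch only, not the function from the 2002 post.
-- Assumes every quote still present in pData is a field delimiter.
function CSV2TabCharWalk pData
  local tOut, tInQuotes
  put false into tInQuotes
  repeat for each char c in pData
    if c is quote then
      -- entering or leaving a quoted field; the quote itself is dropped
      put not tInQuotes into tInQuotes
    else if (c is comma) and (not tInQuotes) then
      put tab after tOut -- a comma outside quotes separates fields
    else if (c is cr) and tInQuotes then
      put numtochar(11) after tOut -- protect line breaks inside a field
    else
      put c after tOut
    end if
  end repeat
  return tOut
end CSV2TabCharWalk
```

Every character passes through the if-chain once, which is why the running time of this style tracks the total number of characters.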
Now there's a challenge I can relate to :-)
BUT - the speed of the conversion depends on the data ...
Enclosed below is a version of the script which is between 10% and 90% faster - and probably has potential to go even faster than that.
It uses the same set up as you did - so there are no quotes left except those around fields.
Then instead of walking through the data char by char, it uses "split" to divide the data into an array; the array elements must then alternate between in-quotes and not-in-quotes.
Each array element has only the relevant processing applied.
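
To make the alternation concrete, here is a small sketch on a one-line sample. The handler name ShowAlternation and the sample data are invented for illustration. It walks the array by numeric key, which makes the order explicit rather than relying on the order in which "repeat for each element" visits the array.

```livecode
-- Hypothetical demonstration handler; the sample data is made up.
function ShowAlternation
  local tData, tCount, tReport, i
  -- builds:  a,"b,c",d   with the leading space used by the script below
  put space & "a," & quote & "b,c" & quote & ",d" into tData
  split tData by quote
  -- tData[1] holds the text before the first quote (outside quotes)
  -- tData[2] holds b,c (inside quotes)
  -- tData[3] holds ,d  (outside quotes)
  put the number of lines of (the keys of tData) into tCount
  repeat with i = 1 to tCount
    if i mod 2 is 1 then
      put "outside:" && tData[i] & cr after tReport
    else
      put "inside:" && tData[i] & cr after tReport
    end if
  end repeat
  return tReport
end ShowAlternation
```

Odd-numbered elements lie outside quotes and even-numbered ones inside, because the leading space guarantees that element 1 is always outside.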
Note - the speed of the original is (roughly) based on the number of characters, while the speed of the new version is (very roughly) based on the number of quoted fields - so for a file of mainly short fields, all of which are quoted, it is only 10% or so faster (and there could be cases where it would even be slower). For a file with many unquoted fields, or where each field is quite large, it will be significantly faster.
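
A rough way to compare the two on your own data is a timing handler along these lines. It is only a sketch: the handler name CompareCSVTimings is invented, and it assumes the original CSV2Tab from the linked post is in the message path and takes the data as its single parameter.

```livecode
-- Hypothetical timing harness. Assumes CSV2Tab (from the linked post)
-- and CSV2TabNew are both available and each take one parameter.
on CompareCSVTimings pData
  local tStart, tOldTime, tNewTime, tResult
  put the milliseconds into tStart
  put CSV2Tab(pData) into tResult
  put the milliseconds - tStart into tOldTime
  --
  put the milliseconds into tStart
  put CSV2TabNew(pData) into tResult
  put the milliseconds - tStart into tNewTime
  --
  put "original:" && tOldTime && "ms, new:" && tNewTime && "ms"
end CompareCSVTimings
```

Running it on a file of short, fully quoted fields and again on a file of long or mostly unquoted fields should show the two extremes described above.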
function CSV2TabNew pData
  local tNuData -- contains tabbed copy of data
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  -- breaks which would be misread as records;
  -- replaced later during display
  local tEscapedQuotePlaceholder -- used for keeping track of quotes in data
  local tInsideQuoted -- flag set while reading data between quotes
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
  put numtochar(2) into tEscapedQuotePlaceholder -- used to simplify
  -- distinction between quotes in data and those
  -- used in delimiters
  --
  -- Normalize line endings:
  replace crlf with cr in pData -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX
  --
  -- Put placeholder in escaped quote (non-delimiter) chars:
  replace ("\" & quote) with tEscapedQuotePlaceholder in pData
  --
  put space before pData -- to avoid ambiguity of starting context
  split pData by quote
  put False into tInsideQuoted
  repeat for each element k in pData
    if (tInsideQuoted) then
      replace cr with tReturnPlaceholder in k
      put k after tNuData
      put False into tInsideQuoted
    else
      replace comma with tab in k
      put k after tNuData
      put true into tInsideQuoted
    end if
  end repeat
  --
  delete char 1 of tNuData -- remove the leading space
  replace tEscapedQuotePlaceholder with quote in tNuData
  return tNuData
end CSV2TabNew

Note also - this has about the same number of "fragilities" as the original (they both fail if the file is mal-formed in about the same number of ways). They also both fail if the original data contained any "escape"s (i.e. "\" chars) - they would be doubled in the original data and should be checked for before the set-up.
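
One possible way to do that check, sketched under the assumption that a literal backslash always arrives doubled: swap each doubled backslash for a placeholder before the escaped quotes are handled, and restore a single backslash at the very end. The helper name ProtectBackslashes and the idea of passing numtochar(3) as the placeholder are illustrative choices, not part of either script.

```livecode
-- Hypothetical helper; name and placeholder choice are illustrative.
function ProtectBackslashes pData, pPlaceholder
  -- "\\" below is two literal backslash characters; Transcript string
  -- literals have no escape sequences
  replace "\\" with pPlaceholder in pData
  return pData
end ProtectBackslashes
```

It would need to run before the escaped-quote replacement, so that a doubled backslash sitting directly in front of a closing quote is not misread as an escaped quote. After the loop, replacing the placeholder with "\" in tNuData puts the single backslashes back.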
> IMNSHO, CSV2Tab should be a built-in function. If there's some agreement on this and a willingness to vote for it I'll post the request to Bugzilla.

I'd suggest requesting that it be parameterized to handle the common variants of quoting and non-quoting. There's a good discussion of the problem (including ways it can go wrong beyond what we've talked about here), and a public domain implementation at
http://www.python.org/peps/pep-0305.html#id7
The interface is perhaps wrong for Transcript, but the range of solutions it covers would be a good place to start.
-- Alex Tweedly.
