Re: CSV again.

2015-10-29 Thread Alex Tweedly

I did. And I get
test"

as expected. I'm obviously missing something here - but let's go 
off-list until we figure it out.


Here's my test script
on mouseUp
   local tmp, t1

   put quote & "test" & CR & quote & quote & quote & CR into tmp

   put csvToTab3(tmp) into t1
   put t1 & CR after msg
   -- dump each char of the result together with its char code
   repeat for each char x in t1
      put chartonum(x) & ":" & x & CR after msg
   end repeat
   -- swap out the placeholder chars and TABs before displaying the result
   replace numtochar(29) with "" in t1
   replace numtochar(11) with "" in t1
   replace TAB with "" in t1
   put "[" & t1 & "]" & CR & CR after msg
end mouseUp

and my output is
test"

116:t
101:e
115:s
116:t
11:
34:"
10:

[test"
]

Do you get something different? Can you please send me the output?
Thanks
-- Alex.




On 30/10/2015 00:07, Mike Kerner wrote:

Try using exactly the string I sent: "test"""

I get test", when I think what you intend is test"


Re: CSV again.

2015-10-29 Thread Mike Kerner
Try using exactly the string I sent: "test"""

I get test", when I think what you intend is test"


Re: CSV again.

2015-10-29 Thread Alex Tweedly


On 29/10/2015 14:41, Mike Kerner wrote:

Belay that.  Let's do this on the list.


Sure ...
On Thu, Oct 29, 2015 at 10:22 AM, Mike Kerner wrote:


1) In v3, why did you remove the TAB substitution?  That just bit me.



Short answer : A bug.
Long answer : 2 bugs, but on the same line of code - so kind of just one 
bug really :-)

Very Long Answer :
I had a version (say, 2.9) which I tested properly. Then I added some 
more parameterization, and while doing that I thought "This line is 
wrong, it shouldn't be doing "replace TAB with ...", it should be using 
one of these new parameters". This was just plain wrong, so that's bug 
number 1.


Then I later realized that there was no case where I would need to do 
the "replace" as written - so I commented out the line (also, wrong - 
that's bug number 2).



Solution:
I enclose below a new version, csvToTab4. Only change (in the card 
script) is that line 37 changed from

-- replace pOldItemDelim with pNewTAB in theInsideStringSoFar
to
replace TAB with pNewTAB in theInsideStringSoFar

And with that change it does (AFAIK) properly produce numtochar(29) (or whatever 
you pass in as pNewTAB) for any embedded TAB chars.
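
A minimal usage sketch (not from the original post) of that round trip, assuming the CSVToTab4 below returns tNuData and keeping its default placeholders:

on mouseUp
   local tCSV, tTSV
   put quote & "a" & tab & "b" & quote & comma & "c" & CR into tCSV  -- one row: "a<TAB>b",c
   put CSVToTab4(tCSV) into tTSV
   -- quoted TABs come back as numtochar(29), quoted CRs as numtochar(11);
   -- make them visible for inspection
   replace numtochar(29) with "<TAB>" in tTSV
   replace numtochar(11) with "<CR>" in tTSV
   put tTSV & CR after msg   -- shows a<TAB>b, then a real TAB, then c
end mouseUp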


2) I'm not sure we should bore everyone else with the details on the 
list, but I'd like to pick your brain about some of the details of 
what you're thinking in various parts of this as I intend to do some 
tweaking and commenting for future reference.
Yeah, it would be great to improve the comments, and hopefully explain 
what it's doing.


On 29/10/2015 15:01, Mike Kerner wrote:

So beyond the embedded , I found another issue.  Let's say the string is
"test"""


The  is not handled.
Hmmm - in my testing it is. I give it the following (the last line is the 
same as the example you give):


INPUT

a,"b
c"
"cd"
"e"""

and get OUTPUT
abc
cd
e"

which I think is correct. Do you have a more complex test case, or do 
you get different results? Can you send me the case where you see the 
problem (off-list)?  Thanks.



Should you perhaps do your substitutions on the "inside", instead of on the
"passedQuote"?


Hmmm - tempting, but no.

Firstly, it would need to do the replace in the current item both for 
status = 'inside' and 'passedquote' because if you have input like

   "one two""three""fourfive"
the status goes from 'inside' to 'passedquote' to 'inside' to 
'passedquote' to etc. and for the latter TAB character it is 'passedquote'.


More generally, I want to do these substitutions in as few places as 
possible (i.e. so that I am passing the longest possible string to the 
engine to do a speedy 'replace'), so the best time to do that is after 
'passedquote'.
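
A quick way to see the decomposition the function below relies on (a sketch, not part of the original post): with the itemDel set to quote, two adjacent quotes yield an empty item, which is exactly what the 'passedquote' state tests for.

on mouseUp
   -- trace how "set the itemdel to quote" chops a row, and which state
   -- handles each piece (illustration only)
   local tRow, k
   put "x," & quote & "one" & quote & quote & "two" & quote & ",y" into tRow
   -- tRow is:  x,"one""two",y
   set the itemdel to quote
   repeat for each item k in tRow
      put "[" & k & "]" & CR after msg
   end repeat
   -- items seen:   [x,]  [one]  []  [two]  [,y]
   -- states used:  outside -> inside -> passedquote (empty item = doubled quote)
   --               -> inside -> passedquote -> falls through to outside
end mouseUp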


New version
function CSVToTab4 pData, pOldLineDelim, pOldItemDelim, pNewCR, pNewTAB
   -- fill in defaults
   if pOldLineDelim is empty then put CR into pOldLineDelim
   if pOldItemDelim is empty then put COMMA into pOldItemDelim
   if pNewCR is empty then put numtochar(11) into pNewCR   -- use VT for quoted CRs
   if pNewTAB is empty then put numtochar(29) into pNewTAB  -- use GS (group separator) for quoted TABs

   local tNuData -- contains tabbed copy of data

   local tStatus, theInsideStringSoFar

   -- Normalize line endings: REMOVED
   -- Will normally be correct already; only binfile: or similar could make this necessary,
   -- and that exceptional case should be the caller's responsibility

   put "outside" into tStatus
   set the itemdel to quote
   repeat for each item k in pData
      -- put tStatus && k & CR after msg
      switch tStatus

         case "inside"
            put k after theInsideStringSoFar
            put "passedquote" into tStatus
            next repeat

         case "passedquote"
            -- decide if it was a duplicated escapedQuote or a closing quote
            if k is empty then   -- it's a duplicated quote
               put quote after theInsideStringSoFar
               put "inside" into tStatus
               next repeat
            end if
            -- not empty - so we remain inside the cell, though we have left the quoted section
            -- NB this allows for quoted sub-strings within the cell content !!
            replace pOldLineDelim with pNewCR in theInsideStringSoFar
            replace TAB with pNewTAB in theInsideStringSoFar
            put theInsideStringSoFar after tNuData

         case "outside"
            replace pOldItemDelim with TAB in k
            -- and deal with the "empty trailing item" issue in Livecode
            replace (pNewTAB & pOldLineDelim) with pNewTAB & pNewTAB & CR in k
            put k after tNuData
            put "inside" into tStatus
            put empty into theInsideStringSoFar
            next repeat
         default
            put "defaulted"
            break
      end switch
   end repeat

   -- and finally deal with the trailing item issue in the input data
   -- i.e. the very last char is a quote, so there is no trigger to flush the
   --  last i

Re: CSV again.

2015-10-29 Thread Mike Kerner
So beyond the embedded , I found another issue.  Let's say the string is
"test"""


The  is not handled.

Should you perhaps do your substitutions on the "inside", instead of on the
"passedQuote"?

-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-29 Thread Mike Kerner
Alex,
So which version are you proposing as being current? Is there some reason
why you removed handling embedded TABs in v3?




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-19 Thread Kay C Lan
This topic reminds me of time. If you think CSV is a standard that has no
standard, making it difficult to program around, then don't even bother
attempting to work with time. Here's a good summary - make sure you watch
to the very end where he discusses the Google approach to one of the very
many idiosyncrasies of time you've probably never thought of:

https://www.youtube.com/watch?v=-5wpm-gesOY

Thought you may enjoy whilst nutting out your CSV algo.


Re: CSV again.

2015-10-19 Thread Alex Tweedly

On 19/10/2015 02:52, Mike Kerner wrote:

Well, there goes that idea.  There are tutorials right on Git, but it might
be easier if you (and anyone else so not-inclined to Git) post here and
those of us who are at least inclined to try will make do with doing that
work for you.



OK, OK, I know I need to learn Git / github - and I will soon - but just 
not today. I looked at some of the tutorials, and decided they would 
take a small amount of time. But, I have between 1/2 and one hour or so 
to work on my favourite hobby - Livecode - and I'd rather spend it 
updating my CSV script than learning a tutorial that I probably won't 
have time to complete.


Yes - your change for the "no trailing CR' case is better than mine - 
there's no need to test, just change it.


However, I later decided that that wasn't the best approach  in 
conjunction with another change.


The various versions of the script all have some initial replacements, like

  -- Normalize line endings:
  replace crlf with cr in pData  -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX

I put these in initially because I didn't fully understand how Runtime 
Revolution handled these (what can I say, I'd only been a RR user for a 
couple of weeks at the time :-) :-).   I now believe that, so long as 
the data came from a sensible place (i.e. a file, or a web site, or a 
database) and was pulled in in some sensible way (i.e. put URL "file:" 
or equivalent), then this is a non-issue. Otherwise, every real script 
that handled data would have this kind of thing in it - and they don't.   
So - I think the 'replace' statements can be removed.


Once they are out, then we see that "pData" is a read-only parameter, 
until we add this extra CR. Since a large part of the initial purpose 
was to be efficient (in CPU and in memory usage) so we can handle 
*large* datasets, it would be desirable to keep pData as read-only, 
hence avoiding both a memory copy and the additional memory used. So 
instead of adding a CR, we can do that by checking just after the loop 
whether or not the situation exists, and handling it there.
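
In code, the post-loop check could look roughly like this (a sketch under my own assumptions, reusing the parameter names from csvToTab4 elsewhere in the archive - the actual tail of the function is cut off in the archive copies):

   ...
   end repeat
   -- if the data ended on a closing quote there was no following item to
   -- trigger the flush, so do it here instead of appending a CR to pData
   if tStatus is "passedquote" then
      replace pOldLineDelim with pNewCR in theInsideStringSoFar
      replace TAB with pNewTAB in theInsideStringSoFar
      put theInsideStringSoFar after tNuData
   end if
   return tNuData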


So - given those two ideas, plus the need to parameterize, I upgraded 
the code to

 - not do the initial replacements
 - be fully parameterized for input delimiters
 - be fully parameterized for TAB or CR characters within quoted cells
 - and do all the quote replacement, etc.
(see code below)

I then tested three versions of code
 - the earlier csvToTab2 (i.e. adding the CR at the end)
 - this new version (called csvToTab3)
 - Peter's csvToArray
against 3 input datasets - the two from Richard's article, plus one 
testing the case of no trailing CR.


Fortunately, all 3 produce equivalent output (not identical, since 
Peter's produces an array, doesn't remove quotes in cells and doesn't do 
the same things with enclosed CR and TABs - but equivalent).


I also added to my test script the ability to choose how many copies of 
the input data file to put into the variable before calling each 
function - to allow simple benchmarking. (All the code for the functions 
and the test button is below)


With that we get (remember - equivalent results)

1 copy of data (sample 1 from Richard - 7 lines, 370 chars)

csvToTab2 0 msecs
csvToTab3 0 msecs
csvToArray   6539 msecs

20,000 copies of the data ( - 140,000 lines, 7.4Mb)
csvToTab2   690 msecs
csvToTab3   566 msecs
csvToArray   not tested


-- Alex.

Code for the test button

on mouseUp
   local tChosenFile

   put empty into msg
   answer file "CSV file to process"
   if the result is not "Cancel" then
  put it into tChosenFile
   else
  exit mouseUp
   end if

   local tmp, t1, tmp1
   put URL ("file:" & tChosenFile) into tmp1
   put the number of chars in tmp1 & CR & tmp1 & CR after msg

   local tTimes
   ask "How many multiples" with 1
   put it into tTimes
   repeat tTimes
  put tmp1 after tmp
   end repeat

   local time1
   put the millisecs into time1
   put csvToTab2(tmp) into t1
   put "Version 2 took" &&  the millisecs - time1 &CR after msg
   if tTimes = 1 then
  replace numtochar(29) with "" in t1
  replace numtochar(11) with "" in t1
  replace TAB with "" in t1
  put "[" & t1 & "]" & CR & CR  after msg
   end if

   put the millisecs into time1
   put csvToTab3(tmp) into t1
   put "Version 3 took" &&  the millisecs - time1 &CR after msg
   if tTimes = 1 then
  replace numtochar(29) with "" in t1
  replace numtochar(11) with "" in t1
  replace TAB with "" in t1
  put "[" & t1 & "]" & CR & CR  after msg
   end if


   put empty into tA
   put the millisecs into time1
   if tTimes = 1 then
  put csvToArray(tmp) into tA
  put "Version Array took" &&  the millisecs - time1 &CR after msg
  repeat for each key K in tA
 repeat for each key KK in tA[K]
put K && KK && tA[K][KK] &CR after msg
 end repeat
  end repeat
   end if
end mouseUp

Re: CSV again.

2015-10-18 Thread Mike Kerner
Well, there goes that idea.  There are tutorials right on Git, but it might
be easier if you (and anyone else so not-inclined to Git) post here and
those of us who are at least inclined to try will make do with doing that
work for you.

Anyway, here's what I have as the latest version, with a couple of things I
added to it, marked as "mikey"

function CSVToTab pData,pcoldelim
  local tNuData -- contains tabbed copy of data
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  --   breaks which would be misread as records;
  local tNuDelim  -- new character to replace the delimiter
  local tStatus, theInsideStringSoFar
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
  put numtochar(29) into tNuDelim
  --
  if pcoldelim is empty then put comma into pcoldelim
  -- Normalize line endings:
  replace crlf with cr in pData  -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX

  put CR after pData #last line may not properly parse, otherwise #mikey

  put "outside" into tStatus
  set the itemdel to quote
  repeat for each item k in pData
    -- put tStatus && k & CR after msg
    switch tStatus

      case "inside"
        put k after theInsideStringSoFar
        put "passedquote" into tStatus
        next repeat

      case "passedquote"
        -- decide if it was a duplicated escapedQuote or a closing quote
        if k is empty then   -- it's a duplicated quote
          put quote after theInsideStringSoFar
          put "inside" into tStatus
          next repeat
        end if
        -- not empty - so we remain inside the cell, though we have left the quoted section
        -- NB this allows for quoted sub-strings within the cell content !!
        replace cr with tReturnPlaceholder in theInsideStringSoFar
        put theInsideStringSoFar after tNuData

      case "outside"
        replace pcoldelim with tNuDelim in k
        -- and deal with the "empty trailing item" issue in Livecode
        replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
        put k after tNuData
        put "inside" into tStatus
        put empty into theInsideStringSoFar
        next repeat
      default
        put "defaulted"
        break
    end switch
  end repeat
  replace tNuDelim with tab in tNuData #mikey
  delete last char of tNuData #added at top to assist last line parse #mikey
  return tNuData
end CSVToTab







-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-18 Thread Alex Tweedly



On 18/10/2015 13:57, Mike Kerner wrote:

https://github.com/macMikey/LiveCode-Libraries/tree/master/csv

I've found some corner cases and made some others.



OK, I confess:

I've never used git or github, and I have no idea how to get access to 
these.  :-)


I know I need to learn, but honestly this is not the right time for me 
to do that - is there a 5-minute tutorial (or step-by-step instruction) 
that I can follow to at least get these files ?


Many thanks
Alex.



Re: CSV again.

2015-10-18 Thread Alex Tweedly



On 18/10/2015 03:17, Peter M. Brigham wrote:

At this point, finding a function that does the task at all -- reliably and 
taking into account most of the csv malformations we can anticipate -- would be 
a start. So far nothing has been unbreakable. Once we find an algorithm that 
does the job, we can focus on speeding it up.


That is indeed the issue.

There are two distinct problems, and the "best" solutions for each may 
be different.


1. Optimistic parser.

Properly parse any well-formed CSV data, in any idiosyncratic dialect of 
CSV that we may be interested in.


Or to put it otherwise, in general we are going to be parsing data 
produced by some program - it may take some oddball approach to CSV 
formatting, but it will be "correct" in the program's own terms. We are 
not (in this problem) trying to handle, e.g., hand-generated files that 
may contain errors, or have deliberate errors embedded. Thus, we do not 
expect things like mis-matched quotes, etc. - and it will be adequate to 
do "something reasonable" given bad input data.


2. Pessimistic parser.

Just the opposite - try to detect any arbitrary malformation with a 
sensible error message, and properly parse any well-formed CSV data in 
any dialect we might encounter.


And common to both
- adequate (optional) control over delimiters, escaped characters in the 
output, etc.

- efficiency (speed) matters

IMHO, we should also specify that the output should
 - remove the enclosing quotes from quoted cells
 - reduce doubled-quotes within a quoted cell to the appropriate single 
instance of a quote
in order that the TSV (or array, or whatever output format is chosen) 
does not need further processing to remove them; i.e. the output data is 
clean of any CSV formatting artifacts.
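
As a concrete illustration of that requirement (a sketch, not code from the thread): a cell arriving as "e""" should come out as e" with no CSV quoting left over.

function cleanQuotedCell pCell
   -- strip the enclosing quotes from a quoted cell and collapse the
   -- doubled quotes inside it to single ones (illustration only)
   if char 1 of pCell is quote and char -1 of pCell is quote then
      put char 2 to -2 of pCell into pCell        -- drop the enclosing quotes
      replace quote & quote with quote in pCell   -- "" -> "
   end if
   return pCell
end cleanQuotedCell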


Personally, I am a pragmatist, and I have always needed solution 1 above 
- whenever I've had to parse CSV data, it's because I had a real-world 
need to do so, and the data was coming from some well-behaved (even if 
very weird) application - so it was consistent and followed some kind of 
rules, however wacky those rules might be. Other people may have 
different needs.


So I believe that any proposed algorithm should be clear about which of 
these two distinct problems it is trying to solve, and should be judged 
accordingly. Then each of us can look for the most efficient solution to 
whichever one they most care about.


I do believe that any solution to problem 2 is also a solution to 
problem 1 - but I don't know if it can be as efficient while tackling 
that harder problem.


-- Alex.





Re: CSV again.

2015-10-18 Thread Mike Kerner
Consider them added.  They're called "Richard-1.csv" and "Richard-2.csv"




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-18 Thread Richard Gaskin

Mike Kerner wrote:

I don't have a corner case file, yet, but I'm going to start adding one to
Git in a minute...

On Sun, Oct 18, 2015 at 2:26 AM, Kay C Lan  wrote:


On Sun, Oct 18, 2015 at 10:17 AM, Peter M. Brigham 
wrote:

> At this point, finding a function that does the task at all -- reliably
> and taking into account most of the csv malformations we can anticipate
--
> would be a start.


The snippet included in my article is commonly used to test CSV parsers, 
which is unfortunate since it only covers a relatively small handful of 
edge cases - I added a case for in-data returns just below it:




Even then woefully incomplete, but hopefully worthwhile as at least a 
starting point.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com   http://www.FourthWorld.com



Re: CSV again.

2015-10-18 Thread Mike Kerner
https://github.com/macMikey/LiveCode-Libraries/tree/master/csv

I've found some corner cases and made some others.




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-18 Thread Mike Kerner
I don't have a corner case file, yet, but I'm going to start adding one to
Git in a minute...




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-17 Thread Kay C Lan
On Sun, Oct 18, 2015 at 10:17 AM, Peter M. Brigham  wrote:

> At this point, finding a function that does the task at all -- reliably
> and taking into account most of the csv malformations we can anticipate --
> would be a start.


Actually, having a standard mutant csv file to work on would be a start.
Probably two files: a plain text file that needs to be fed into the algo
and a pdf version which shows exactly how the data is supposed to appear.

What are people using for their example file? We need to check it contains
all possible mutations.


Re: CSV again.

2015-10-17 Thread Mike Kerner
Peter,
You're absolutely right, of course.  While we're at it, it would be
interesting to see what we come up with if we write it for LCB's modules...




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."

Re: CSV again.

2015-10-17 Thread Peter M. Brigham
At this point, finding a function that does the task at all -- reliably and 
taking into account most of the csv malformations we can anticipate -- would be 
a start. So far nothing has been unbreakable. Once we find an algorithm that 
does the job, we can focus on speeding it up.

That said, I don't know that my solution is optimized for speed very well. It 
takes 4-5 seconds to process a 986 record file. On an old slow machine, a 2008 
MacBook 2.1 GHz Intel Core Duo, but still….

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig





Re: CSV again.

2015-10-17 Thread Mike Kerner
The other thing that we are going to be interested in is finding the
fastest function that performs the task.




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-17 Thread Mike Kerner
I think that item is odd.  Quotes are, if memory serves, only supposed to
appear if they are double-quoted.  Between "f" and "g" you have three
quotes, and between "g" and "h" you only have one.  I believe that is not a
correct csv format.
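
For reference, the doubling convention being described works like this (a sketch, not from the thread):

on mouseUp
   -- build a CSV field from a value that contains a literal quote:
   -- wrap the field in quotes and double each embedded quote
   local tValue, tField
   put "def" & quote & "g" into tValue            -- raw value:  def"g
   replace quote with quote & quote in tValue     -- def""g
   put quote & tValue & quote into tField         -- "def""g"
   put tField & CR after msg
   -- a reader pairs the doubled quotes back up; an odd run of quotes inside
   -- a field (the three between "f" and "g" above) cannot be paired that way
end mouseUp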




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."


Re: CSV again.

2015-10-17 Thread Peter M. Brigham
On Oct 17, 2015, at 8:47 PM, Alex Tweedly wrote:

> Also, I think (i.e. I haven't yet run the code, since I don't have offsets() 
> available) there is another mis-formed case you don't properly detect :
> a,b,c,"def"""g"h",i,j,k

if I put this as one of the lines of my CSV data, it gets sorted into the array 
properly. I think. That is, the 4th item of the line is 

"def"""g"h"

 Do you get the same result?

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig





Re: CSV again.

2015-10-17 Thread Peter M. Brigham
Thanks for catching that. Change the if-then structure to:

if howmany(openQuoteChar,thisItem) <> howmany(closeQuoteChar,thisItem) then
return "This CSV data is not parsable (unclosed quotes in item)."
end if

Revised function:

function CSVtoArray pData
   -- by Peter M. Brigham, pmb...@gmail.com
   -- requires getDelimiters(), howmany(), offsets()
   put getDelimiters(pData,5) into tDelims
   put line 1 of tDelims into crChar
   put line 2 of tDelims into tabChar
   put line 3 of tDelims into commaChar
   put line 4 of tDelims into openQuoteChar
   put line 5 of tDelims into closeQuoteChar
   
   replace crlf with cr in pData  -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   
   if howmany(quote,pData) mod 2 = 1 then
  return "This CSV data is not parsable (unclosed quotes in data)."
   end if
   
   put offsets(quote,pData) into qOffsets
   if qOffsets > 0 then
  put 1 into counter
  repeat for each item q in qOffsets
 if counter mod 2 = 1 then put openQuoteChar into char q of pData
 else put closeQuoteChar into char q of pData
 add 1 to counter
  end repeat
   end if
   
   put offsets(cr,pData) into crOffsets
   repeat for each item r in crOffsets
  put char 1 to r of pData into upToHere
   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
 -- the cr is within a quoted string
 put crChar into char r of pData
  end if
   end repeat
   put offsets(tab,pData) into tabOffsets
   repeat for each item t in tabOffsets
  put char 1 to t of pData into upToHere
   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
 -- the tab is within a quoted string
 put tabChar into char t of pData
  end if
   end repeat
   put offsets(comma,pData) into commaOffsets
   repeat for each item c in commaOffsets
  put char 1 to c of pData into upToHere
   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
 -- the comma is within a quoted string
 put commaChar into char c of pData
  end if
   end repeat
   put 0 into lineCounter
   repeat for each line L in pData
  add 1 to lineCounter
  put 0 into itemCounter
  repeat for each item i in L
 add 1 to itemCounter
 put i into thisItem
  if howmany(openQuoteChar,thisItem) <> howmany(closeQuoteChar,thisItem) then
return "This CSV data is not parsable (unclosed quotes in item)."
 end if
 replace crChar with cr in thisItem
 replace tabChar with tab in thisItem
 replace commaChar with comma in thisItem
 replace openQuoteChar with quote in thisItem
 replace closeQuoteChar with quote in thisItem
 put thisItem into A[lineCounter][itemCounter]
  end repeat
   end repeat
   return A
end CSVtoArray
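
(howmany() is called above but not included in this part of the thread; a stand-in that simply counts non-overlapping occurrences, which may or may not match Peter's actual handler, would be:)

function howmany str, pContainer
   -- hypothetical substitute for Peter's howmany(): count non-overlapping
   -- occurrences of str in pContainer, same convention as offsets() further down
   local tCount, tSkip, tOffset
   put 0 into tCount
   put 0 into tSkip
   repeat
      put offset(str, pContainer, tSkip) into tOffset
      if tOffset = 0 then exit repeat
      add 1 to tCount
      add tOffset + length(str) - 1 to tSkip
   end repeat
   return tCount
end howmany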

--

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig



Re: CSV again.

2015-10-17 Thread Alex Tweedly

Ummm  surely at this point



  repeat for each item i in L
 add 1 to itemCounter
 put i into thisItem
 if howmany(quote,thisItem) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in item)."
 end if

...

howmany(quote,thisItem) must be 0 - all quotes have been replaced by 
either openQuoteChar or closeQuoteChar


Shouldn't this test be
   if howmany(openQuoteChar, thisItem) <> howmany(closeQuoteChar, thisItem) then



Also, I think (i.e. I haven't yet run the code, since I don't have 
offsets() available) there is another mis-formed case you don't properly 
detect :

a,b,c,"def"""g"h",i,j,k

The quoted cell contains the right number (i.e. a multiple of 2) of 
quotes, but they are not suitably adjacent, so they can't be properly 
interpreted as paired 'enclosed quotes'.   (I should say, none of the 
earlier versions detect this either - their intent was to make the best 
feasible result from well-formed data, and not to detect all malformed 
cases - but if this version is going to detect and give error returns 
for error inputs in some cases, then we should try to do it fully).
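
One way to make "suitably adjacent" concrete (a sketch of mine, not anything from the thread): between a cell's enclosing quotes, every run of consecutive quote characters has to have even length, otherwise the quotes cannot be paired up.

function quotedCellIsWellFormed pCell
   -- pCell includes its enclosing quotes, e.g. "def""g"  (illustration only)
   if char 1 of pCell is not quote or char -1 of pCell is not quote then return false
   local tInner, tRun, c
   put char 2 to -2 of pCell into tInner
   put 0 into tRun
   repeat for each char c in tInner
      if c is quote then
         add 1 to tRun
      else
         if tRun mod 2 = 1 then return false   -- odd run: an unpaired quote
         put 0 into tRun
      end if
   end repeat
   return (tRun mod 2 = 0)
end quotedCellIsWellFormed

-- quotedCellIsWellFormed on "e""" gives true; on "def"""g"h" it gives false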


-- Alex.


On 18/10/2015 00:41, Peter M. Brigham wrote:

So here's my attempt. It converts a CVS text to an array. Let's see if there's 
csv data that can break it.

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig

---

function CSVtoArray pData
-- by Peter M. Brigham, pmb...@gmail.com
-- requires getDelimiters(), howmany()
put getDelimiters(pData,5) into tDelims
put line 1 of tDelims into crChar
put line 2 of tDelims into tabChar
put line 3 of tDelims into commaChar
put line 4 of tDelims into openQuoteChar
put line 5 of tDelims into closeQuoteChar

replace crlf with cr in pData  -- Win to UNIX

replace numtochar(13) with cr in pData -- Mac to UNIX

if howmany(quote,pData) mod 2 = 1 then

   return "This CSV data is not parsable (unclosed quotes in data)."
end if

put offsets(quote,pData) into qOffsets

if qOffsets > 0 then
   put 1 into counter
   repeat for each item q in qOffsets
  if counter mod 2 = 1 then put openQuoteChar into char q of pData
  else put closeQuoteChar into char q of pData
  add 1 to counter
   end repeat
end if

put offsets(cr,pData) into crOffsets

repeat for each item r in crOffsets
   put char 1 to r of pData into upToHere
    if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
  -- the cr is within a quoted string
  put crChar into char r of pData
   end if
end repeat
put offsets(tab,pData) into tabOffsets
repeat for each item t in tabOffsets
   put char 1 to t of pData into upToHere
    if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
  -- the tab is within a quoted string
  put tabChar into char t of pData
   end if
end repeat
put offsets(comma,pData) into commaOffsets
repeat for each item c in commaOffsets
   put char 1 to c of pData into upToHere
    if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
  -- the comma is within a quoted string
  put commaChar into char c of pData
   end if
end repeat
put 0 into lineCounter
repeat for each line L in pData
   add 1 to lineCounter
   put 0 into itemCounter
   repeat for each item i in L
  add 1 to itemCounter
  put i into thisItem
  if howmany(quote,thisItem) mod 2 = 1 then
 return "This CSV data is not parsable (unclosed quotes in item)."
  end if
  replace crChar with cr in thisItem
  replace tabChar with tab in thisItem
  replace commaChar with comma in thisItem
  replace openQuoteChar with quote in thisItem
  replace closeQuoteChar with quote in thisItem
  put thisItem into A[lineCounter][itemCounter]
   end repeat
end repeat
return A
end CSVtoArray

function getDelimiters pText, nbr
-- returns a cr-delimited list of characters not found in the variable pText
-- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
-- usage: put getDelimiters(pText,2) into tDelims
--if tDelims begins with "Error" then exit to top -- or whatever
--put line 1 of tDelims into lineDivider
--put line 2 of tDelims into itemDivider
-- etc.
-- by Peter M. Brigham, pmb...@gmail.com — freeware

if pText = empty then return "Error: no text specified."

if nbr = empty then put 1 into nbr -- default 1 delimiter
put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
-- low ASCII values, excluding CR, LF, tab, etc.
put the number of items of baseList into maxNbr
if nbr > maxNbr then return "Error: max"

Re: CSV again.

2015-10-17 Thread Peter M. Brigham
My mistake, failed to include the offsets() handler:

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig

---

function offsets str, pContainer
   -- returns a comma-delimited list of all the offsets of str in pContainer
   -- returns 0 if not found
   -- note: offsets("xx","xx") returns "1,3,5" not "1,2,3,4,5"
   -- ie, overlapping offsets are not counted
   -- note: to get the last occurrence of a string in a container (often useful)
   -- use "item -1 of offsets(...)"
   -- by Peter M. Brigham, pmb...@gmail.com — freeware
   
   if str is not in pContainer then return 0
   put 0 into startPoint
   repeat
  put offset(str,pContainer,startPoint) into thisOffset
  if thisOffset = 0 then exit repeat
  add thisOffset to startPoint
  put startPoint & comma after offsetList
  add length(str)-1 to startPoint
   end repeat
   return item 1 to -1 of offsetList -- delete trailing comma
end offsets


On Oct 17, 2015, at 8:30 PM, Alex Tweedly wrote:

> Hi Peter,
> 
> it also requires offsets() - I can guess what it does, but it would be safer 
> to get the actual code you use :-)
> 
> Thanks
> -- Alex.
> 
> On 18/10/2015 00:41, Peter M. Brigham wrote:
>> So here's my attempt. It converts a CSV text to an array. Let's see if 
>> there's csv data that can break it.
>> 
>> -- Peter
>> 
>> Peter M. Brigham
>> pmb...@gmail.com
>> http://home.comcast.net/~pmbrig
>> 
>> ---
>> 
>> function CSVtoArray pData
>>-- by Peter M. Brigham, pmb...@gmail.com
>>-- requires getDelimiters(), howmany()
>>put getDelimiters(pData,5) into tDelims
>>put line 1 of tDelims into crChar
>>put line 2 of tDelims into tabChar
>>put line 3 of tDelims into commaChar
>>put line 4 of tDelims into openQuoteChar
>>put line 5 of tDelims into closeQuoteChar
>>replace crlf with cr in pData  -- Win to UNIX
>>replace numtochar(13) with cr in pData -- Mac to UNIX
>>if howmany(quote,pData) mod 2 = 1 then
>>   return "This CSV data is not parsable (unclosed quotes in data)."
>>end if
>>put offsets(quote,pData) into qOffsets
>>if qOffsets > 0 then
>>   put 1 into counter
>>   repeat for each item q in qOffsets
>>  if counter mod 2 = 1 then put openQuoteChar into char q of pData
>>  else put closeQuoteChar into char q of pData
>>  add 1 to counter
>>   end repeat
>>end if
>>put offsets(cr,pData) into crOffsets
>>repeat for each item r in crOffsets
>>   put char 1 to r of pData into upToHere
>>   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
>>  -- the cr is within a quoted string
>>  put crChar into char r of pData
>>   end if
>>end repeat
>>put offsets(tab,pData) into tabOffsets
>>repeat for each item t in tabOffsets
>>   put char 1 to t of pData into upToHere
>>   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
>>  -- the tab is within a quoted string
>>  put tabChar into char t of pData
>>   end if
>>end repeat
>>put offsets(comma,pData) into commaOffsets
>>repeat for each item c in commaOffsets
>>   put char 1 to c of pData into upToHere
>>   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
>>  -- the comma is within a quoted string
>>  put commaChar into char c of pData
>>   end if
>>end repeat
>>put 0 into lineCounter
>>repeat for each line L in pData
>>   add 1 to lineCounter
>>   put 0 into itemCounter
>>   repeat for each item i in L
>>  add 1 to itemCounter
>>  put i into thisItem
>>  if howmany(quote,thisItem) mod 2 = 1 then
>> return "This CSV data is not parsable (unclosed quotes in item)."
>>  end if
>>  replace crChar with cr in thisItem
>>  replace tabChar with tab in thisItem
>>  replace commaChar with comma in thisItem
>>  replace openQuoteChar with quote in thisItem
>>  replace closeQuoteChar with quote in thisItem
>>  put thisItem into A[lineCounter][itemCounter]
>>   end repeat
>>end repeat
>>return A
>> end CSVtoArray
>> 
>> function getDelimiters pText, nbr
>>-- returns a cr-delimited list of nbr characters
>>-- not found in the variable pText
>>-- use for delimiters for, eg, parsing text files, manipulating arrays, 
>> etc.
>>-- usage: put getDelimiters(pText,2) into tDelims
>>--if tDelims begins with "Error" then exit to top -- or whatever
>>--put line 1 of tDelims into lineDivider
>>--put line 2 of tDelims into itemDivider
>>-- etc.
>>-- by Peter M. Brigham, pmb...@gmail.com — freeware
>>if pText = empty then return "Error: no text specified."
>>if nbr = empty then put 1 into nbr -- default 1 delimiter
>>put "2,3,4,5,6,7,8,16,1

Re: CSV again.

2015-10-17 Thread Alex Tweedly

Hi Peter,

it also requires offsets() - I can guess what it does, but it would be 
safer to get the actual code you use :-)


Thanks
-- Alex.

On 18/10/2015 00:41, Peter M. Brigham wrote:

So here's my attempt. It converts a CSV text to an array. Let's see if there's 
csv data that can break it.

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig

---

function CSVtoArray pData
-- by Peter M. Brigham, pmb...@gmail.com
-- requires getDelimiters(), howmany()
put getDelimiters(pData,5) into tDelims
put line 1 of tDelims into crChar
put line 2 of tDelims into tabChar
put line 3 of tDelims into commaChar
put line 4 of tDelims into openQuoteChar
put line 5 of tDelims into closeQuoteChar

replace crlf with cr in pData  -- Win to UNIX

replace numtochar(13) with cr in pData -- Mac to UNIX

if howmany(quote,pData) mod 2 = 1 then

   return "This CSV data is not parsable (unclosed quotes in data)."
end if

put offsets(quote,pData) into qOffsets

if qOffsets > 0 then
   put 1 into counter
   repeat for each item q in qOffsets
  if counter mod 2 = 1 then put openQuoteChar into char q of pData
  else put closeQuoteChar into char q of pData
  add 1 to counter
   end repeat
end if

put offsets(cr,pData) into crOffsets

repeat for each item r in crOffsets
   put char 1 to r of pData into upToHere
   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
  -- the cr is within a quoted string
  put crChar into char r of pData
   end if
end repeat
put offsets(tab,pData) into tabOffsets
repeat for each item t in tabOffsets
   put char 1 to t of pData into upToHere
   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
  -- the tab is within a quoted string
  put tabChar into char t of pData
   end if
end repeat
put offsets(comma,pData) into commaOffsets
repeat for each item c in commaOffsets
   put char 1 to c of pData into upToHere
   if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
  -- the comma is within a quoted string
  put commaChar into char c of pData
   end if
end repeat
put 0 into lineCounter
repeat for each line L in pData
   add 1 to lineCounter
   put 0 into itemCounter
   repeat for each item i in L
  add 1 to itemCounter
  put i into thisItem
  if howmany(quote,thisItem) mod 2 = 1 then
 return "This CSV data is not parsable (unclosed quotes in item)."
  end if
  replace crChar with cr in thisItem
  replace tabChar with tab in thisItem
  replace commaChar with comma in thisItem
  replace openQuoteChar with quote in thisItem
  replace closeQuoteChar with quote in thisItem
  put thisItem into A[lineCounter][itemCounter]
   end repeat
end repeat
return A
end CSVtoArray

function getDelimiters pText, nbr
-- returns a cr-delimited list of nbr characters
-- not found in the variable pText
-- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
-- usage: put getDelimiters(pText,2) into tDelims
--if tDelims begins with "Error" then exit to top -- or whatever
--put line 1 of tDelims into lineDivider
--put line 2 of tDelims into itemDivider
-- etc.
-- by Peter M. Brigham, pmb...@gmail.com — freeware

if pText = empty then return "Error: no text specified."

if nbr = empty then put 1 into nbr -- default 1 delimiter
put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
-- low ASCII values, excluding CR, LF, tab, etc.
put the number of items of baseList into maxNbr
if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
repeat with tCount = 1 to nbr
   put true into failed
   repeat with i = 1 to the number of items of baseList
  put item i of baseList into testNbr
  put numtochar(testNbr) into testChar
  if testChar is not in pText then
 -- found one, store and get next delim
 put false into failed
 put testChar into line tCount of delimList
 exit repeat
  end if
   end repeat
   if failed then
  if tCount = 0 then
 return "Error: cannot get any delimiters."
  else if tCount = 1 then
 return "Error: can only get one delimiter."
  else
 return "Error: can only get" && tCount && "delimiters."
  end if
   end if
   delete item i of baseList
end repeat
return delimList
end getDelimiters

function howmany pStr, pContainer, pCaseSens
-- how many times pStr occurs in pContainer
-- note that howmany("00","000000") returns 3, not 5

Re: CSV again.

2015-10-17 Thread Peter M. Brigham
So here's my attempt. It converts a CSV text to an array. Let's see if there's 
csv data that can break it.

-- Peter

Peter M. Brigham
pmb...@gmail.com
http://home.comcast.net/~pmbrig

---

function CSVtoArray pData
   -- by Peter M. Brigham, pmb...@gmail.com
   -- requires getDelimiters(), howmany()
   put getDelimiters(pData,5) into tDelims
   put line 1 of tDelims into crChar
   put line 2 of tDelims into tabChar
   put line 3 of tDelims into commaChar
   put line 4 of tDelims into openQuoteChar
   put line 5 of tDelims into closeQuoteChar
   
   replace crlf with cr in pData  -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   
   if howmany(quote,pData) mod 2 = 1 then
  return "This CSV data is not parsable (unclosed quotes in data)."
   end if
   
   put offsets(quote,pData) into qOffsets
   if qOffsets > 0 then
  put 1 into counter
  repeat for each item q in qOffsets
 if counter mod 2 = 1 then put openQuoteChar into char q of pData
 else put closeQuoteChar into char q of pData
 add 1 to counter
  end repeat
   end if
   
   put offsets(cr,pData) into crOffsets
   repeat for each item r in crOffsets
  put char 1 to r of pData into upToHere
  if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
 -- the cr is within a quoted string
 put crChar into char r of pData
  end if
   end repeat
   put offsets(tab,pData) into tabOffsets
   repeat for each item t in tabOffsets
  put char 1 to t of pData into upToHere
  if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
 -- the tab is within a quoted string
 put tabChar into char t of pData
  end if
   end repeat
   put offsets(comma,pData) into commaOffsets
   repeat for each item c in commaOffsets
  put char 1 to c of pData into upToHere
  if howmany(openQuoteChar,upToHere) <> howmany(closeQuoteChar,upToHere) then
 -- the comma is within a quoted string
 put commaChar into char c of pData
  end if
   end repeat
   put 0 into lineCounter
   repeat for each line L in pData
  add 1 to lineCounter
  put 0 into itemCounter
  repeat for each item i in L
 add 1 to itemCounter
 put i into thisItem
 if howmany(quote,thisItem) mod 2 = 1 then
return "This CSV data is not parsable (unclosed quotes in item)."
 end if
 replace crChar with cr in thisItem
 replace tabChar with tab in thisItem
 replace commaChar with comma in thisItem
 replace openQuoteChar with quote in thisItem
 replace closeQuoteChar with quote in thisItem
 put thisItem into A[lineCounter][itemCounter]
  end repeat
   end repeat
   return A
end CSVtoArray
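
A quick usage sketch with simple, unquoted data (the example values are illustrative; it assumes CSVtoArray and its helpers getDelimiters(), howmany() and offsets() are all in the message path):

on mouseUp
   local tCSV, tArray
   put "a,b,c" & return & "d,e,f" & return into tCSV
   put CSVtoArray(tCSV) into tArray
   put tArray[2][3] into msg -- expect "f"
end mouseUp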

function getDelimiters pText, nbr
   -- returns a cr-delimited list of nbr characters
   -- not found in the variable pText
   -- use for delimiters for, eg, parsing text files, manipulating arrays, etc.
   -- usage: put getDelimiters(pText,2) into tDelims
   --if tDelims begins with "Error" then exit to top -- or whatever
   --put line 1 of tDelims into lineDivider
   --put line 2 of tDelims into itemDivider
   -- etc.
   -- by Peter M. Brigham, pmb...@gmail.com — freeware
   
   if pText = empty then return "Error: no text specified."
   if nbr = empty then put 1 into nbr -- default 1 delimiter
   put "2,3,4,5,6,7,8,16,17,18,19,20,21,22,23,24,25,26" into baseList
   -- low ASCII values, excluding CR, LF, tab, etc.
   put the number of items of baseList into maxNbr
   if nbr > maxNbr then return "Error: max" && maxNbr && "delimiters."
   repeat with tCount = 1 to nbr
  put true into failed
  repeat with i = 1 to the number of items of baseList
 put item i of baseList into testNbr
 put numtochar(testNbr) into testChar
 if testChar is not in pText then
-- found one, store and get next delim
put false into failed
put testChar into line tCount of delimList
exit repeat
 end if
  end repeat
  if failed then
 if tCount = 0 then
return "Error: cannot get any delimiters."
 else if tCount = 1 then
return "Error: can only get one delimiter."
 else
return "Error: can only get" && tCount && "delimiters."
 end if
  end if
  delete item i of baseList
   end repeat
   return delimList
end getDelimiters
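
A quick sanity check of the error-handling pattern described in the comments (the sample text is illustrative):

on mouseUp
   local tDelims
   put getDelimiters("text that already contains a" & tab & "tab", 2) into tDelims
   if tDelims begins with "Error" then
      answer tDelims
   else
      put chartonum(line 1 of tDelims) && chartonum(line 2 of tDelims) into msg -- expect "2 3" with the default baseList
   end if
end mouseUp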

function howmany pStr, pContainer, pCaseSens
   -- how many times pStr occurs in pContainer
   -- note that howmany("00","000000") returns 3, not 5
   -- ie,  overlapping matches are not counted
   -- by Peter M. Brigham, pmb...@gmail.com — freeware
   
   if pCaseSens = empty then put false into pCaseSens
   set the casesensitive to pCaseSens
   if pStr is not in pContainer then return 0
   put len(pContainer) into origLength
   replace pStr with ch
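
(The archive truncates the handler at this point. A minimal reconstruction consistent with its comments, counting by replace-and-measure, might look like the following; the body below is a sketch, not necessarily Peter's original code.)

function howmany pStr, pContainer, pCaseSens
   -- how many times pStr occurs in pContainer, counting non-overlapping matches
   if pCaseSens = empty then put false into pCaseSens
   set the caseSensitive to pCaseSens
   if pStr is not in pContainer then return 0
   put len(pContainer) into origLength
   replace pStr with empty in pContainer
   return (origLength - len(pContainer)) div len(pStr)
end howmany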

Re: CSV again.

2015-10-17 Thread Mike Kerner
I added it to my repository on GitHub if anyone wants to try to do this in
Git.

On Sat, Oct 17, 2015 at 10:53 AM, Mike Kerner 
wrote:

> I am going to put 4 on Git and have at it.
>
> 1) There are other assumptions being made, like assuming that the  and
>  don't appear in the incoming text.  Instead of hardcoding the interim
> substitutions, determine what the interim substitutions are going to be
> (can also allow the user to specify them).  Characters that we need to deal
> with are quote, ,, and comma.
>
> 2) In this version, you can specify the incoming column delimiter.  Add
> the ability for the caller to specify the record delimiter before, the
> column and record delimiters after, and what substitutions are going to be
> used, after.  For example, for embedded 's, perhaps the user wants <13>
> or even a string like a semicolon and a space
>
>
> On Sat, Oct 17, 2015 at 5:03 AM, Alex Tweedly  wrote:
>
>> Naturally it must be removed.
>>
>> But I have a more philosophical issue / question.
>>
>>
>> TSV (in and of itself) doesn't have any quotes, and so doesn't handle
>> quoted CRs or TABs.
>>
>> Currently, the 'old' version - as in Richard's published article, doesn't
>> handle TAB characters enclosed within a quoted cell. The 'new' version does
>> - but only by returning the data delimited by  instead of TAB, and
>> leaving enclosed TABs alone - a mistake, IMHO.
>>
>> I believe that what the converter should do is :
>>  - return TSV - i.e. delimited by TABs
>>  - replace quoted CR by  within quoted cells (as it does now)
>>  - replace quoted TABs by  within quoted cells
>>
>> Any comments or suggestions ?
>>
>> Thanks
>> Alex.
>>
>>
>> On 17/10/2015 02:34, Mike Kerner wrote:
>>
>>> It's safe as long as you remember to remove it at the end of the function
>>>
>>> On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly  wrote:
>>>
>>> Duh - replying to myself again :-)

 It looks as though that's exactly what you do mean - it certainly
 generates the problems you described earlier. And my one-line additional
 test would (does in my testing) solve it properly - without it, we don't
 get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
 line we do. And adding it is always safe (AFAICI).

 -- Alex.


 On 17/10/2015 00:03, Alex Tweedly wrote:

 Sorry, Mike, but can you describe what you mean by a "naked" line ?
> Is it simply one with no line delimiter after it ?
> i.e. could only happen on the very last line of a file of input ?
>
> Could that be solved by a simple test (after the various 'replace'
> statements)
>  if the last char of pData <> CR then put CR after pData
> before the parsing happens ?
>
> -- Alex.
>
>
> On 16/10/2015 17:19, Mike Kerner wrote:
>
> No, the problem isn't that LC use LF and CR for ascii(10) and ignores
>> ascii(13).  That's just a personal problem.
>>
>> The problem, here, is that the csv parser handles a naked line and a
>> terminated line differently.  If the line is terminated, it parses it
>> one
>> way, and if it is not, it parses it (incorrectly) a different way,
>> which
>> makes me wonder if this is the latest version.
>>
>> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
>> bobsnei...@iotecdigital.com>
>> wrote:
>>
>> But what if the cr or lf or crlf is inside quoted text, meaning it is
>> not
>>
>>> a delimiter? Oh, I'm afraid the deflector shield will be quite
>>> operational
>>> when your friends arrive.
>>>
>>> Bob S
>>>
>>>
>>> On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:
>>>
 Hi Mike,

 thanks for that additional info.

 I *think* (it's been 3 years) I left them as  (i.e.
 numtochar(29))

 because I had some data including normal TAB characters within the
>>> cells
>>> (!!( and thought  was a safer bet - though of course nothing is
>>> completely safe. It's then up to the caller to decide whether to do
>>> "replace numtochar(29) with TAB in ...", or do TAB escaping, or
>>> whatever
>>> they want.
>>>
>>> As for the other bigger problem  Oh dear = CR vs LF vs CRLF 

 Are you on Mac or Windows or Linux ?
 How is the LF delimited data getting into your app ?
 Maybe we should just add a "replace chartonum(13) with CR in pData"
 ?

 (I confess to being confused by this - I know that LC does

 auto-translation of line delimiters at various places, but I'm not
>>> sure
>>> when it is, or isn't, completely safe. Maybe the easiest thing is to
>>> jst do
>>> all the translations 
>>>
>>>replace CRLF with CR in pData
replace numtochar(10) with CR in pData
replace numtochar(13) with CR in pData

 -- Alex.
>>

Re: CSV again.

2015-10-17 Thread Mike Kerner
I am going to put 4 on Git and have at it.

1) There are other assumptions being made, like assuming that the <11> and
<29> don't appear in the incoming text.  Instead of hardcoding the interim
substitutions, determine what the interim substitutions are going to be
(can also allow the user to specify them).  Characters that we need to deal
with are quote, CR, and comma.

2) In this version, you can specify the incoming column delimiter.  Add the
ability for the caller to specify the record delimiter before, the column
and record delimiters after, and what substitutions are going to be used,
after.  For example, for embedded CRs, perhaps the user wants <13> or
even a string like a semicolon and a space
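
A sketch of what that caller-configurable output stage could look like, layered on top of the CSVToTab1 handler quoted elsewhere in this thread (the name CSVToCustom and its defaults are illustrative assumptions, not existing code):

function CSVToCustom pData, pInColDelim, pOutColDelim, pOutRowDelim, pNewCR
   -- defaults: comma-delimited columns in; TAB-delimited columns and
   -- CR-delimited records out; embedded CRs left as the numtochar(11) placeholder
   if pOutColDelim is empty then put tab into pOutColDelim
   if pOutRowDelim is empty then put return into pOutRowDelim
   if pNewCR is empty then put numtochar(11) into pNewCR
   local tOut
   put CSVToTab1(pData, pInColDelim) into tOut
   if pOutRowDelim is not return then replace return with pOutRowDelim in tOut
   replace numtochar(29) with pOutColDelim in tOut -- caller's choice of column delimiter
   replace numtochar(11) with pNewCR in tOut -- caller's choice for embedded CRs
   return tOut
end CSVToCustom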


On Sat, Oct 17, 2015 at 5:03 AM, Alex Tweedly  wrote:

> Naturally it must be removed.
>
> But I have a more philosophical issue / question.
>
>
> TSV (in and of itself) doesn't have any quotes, and so doesn't handle
> quoted CRs or TABs.
>
> Currently, the 'old' version - as in Richard's published article, doesn't
> handle TAB characters enclosed within a quoted cell. The 'new' version does
> - but only by returning the data delimited by  instead of TAB, and
> leaving enclosed TABs alone - a mistake, IMHO.
>
> I believe that what the converter should do is :
>  - return TSV - i.e. delimited by TABs
>  - replace quoted CR by  within quoted cells (as it does now)
>  - replace quoted TABs by  within quoted cells
>
> Any comments or suggestions ?
>
> Thanks
> Alex.
>
>
> On 17/10/2015 02:34, Mike Kerner wrote:
>
>> It's safe as long as you remember to remove it at the end of the function
>>
>> On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly  wrote:
>>
>> Duh - replying to myself again :-)
>>>
>>> It looks as though that's exactly what you do mean - it certainly
>>> generates the problems you described earlier. And my one-line additional
>>> test would (does in my testing) solve it properly - without it, we don't
>>> get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
>>> line we do. And adding it is always safe (AFAICI).
>>>
>>> -- Alex.
>>>
>>>
>>> On 17/10/2015 00:03, Alex Tweedly wrote:
>>>
>>> Sorry, Mike, but can you describe what you mean by a "naked" line ?
 Is it simply one with no line delimiter after it ?
 i.e. could only happen on the very last line of a file of input ?

 Could that be solved by a simple test (after the various 'replace'
 statements)
  if the last char of pData <> CR then put CR after pData
 before the parsing happens ?

 -- Alex.


 On 16/10/2015 17:19, Mike Kerner wrote:

 No, the problem isn't that LC use LF and CR for ascii(10) and ignores
> ascii(13).  That's just a personal problem.
>
> The problem, here, is that the csv parser handles a naked line and a
> terminated line differently.  If the line is terminated, it parses it
> one
> way, and if it is not, it parses it (incorrectly) a different way,
> which
> makes me wonder if this is the latest version.
>
> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
> bobsnei...@iotecdigital.com>
> wrote:
>
> But what if the cr or lf or crlf is inside quoted text, meaning it is
> not
>
>> a delimiter? Oh, I'm afraid the deflector shield will be quite
>> operational
>> when your friends arrive.
>>
>> Bob S
>>
>>
>> On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:
>>
>>> Hi Mike,
>>>
>>> thanks for that additional info.
>>>
>>> I *think* (it's been 3 years) I left them as  (i.e.
>>> numtochar(29))
>>>
>>> because I had some data including normal TAB characters within the
>> cells
>> (!!( and thought  was a safer bet - though of course nothing is
>> completely safe. It's then up to the caller to decide whether to do
>> "replace numtochar(29) with TAB in ...", or do TAB escaping, or
>> whatever
>> they want.
>>
>> As for the other bigger problem  Oh dear = CR vs LF vs CRLF 
>>>
>>> Are you on Mac or Windows or Linux ?
>>> How is the LF delimited data getting into your app ?
>>> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
>>>
>>> (I confess to being confused by this - I know that LC does
>>>
>>> auto-translation of line delimiters at various places, but I'm not
>> sure
>> when it is, or isn't, completely safe. Maybe the easiest thing is to
>> jst do
>> all the translations 
>>
>>replace CRLF with CR in pData
>>>replace numtochar(10) with CR in pData
>>>replace numtochar(13) with CR in pData
>>>
>>> -- Alex.
>>>

Re: CSV again.

2015-10-17 Thread Alex Tweedly

Naturally it must be removed.

But I have a more philosophical issue / question.


TSV (in and of itself) doesn't have any quotes, and so doesn't handle 
quoted CRs or TABs.


Currently, the 'old' version - as in Richard's published article, 
doesn't handle TAB characters enclosed within a quoted cell. The 'new' 
version does - but only by returning the data delimited by <29> instead 
of TAB, and leaving enclosed TABs alone - a mistake, IMHO.


I believe that what the converter should do is :
 - return TSV - i.e. delimited by TABs
 - replace quoted CR by <11> within quoted cells (as it does now)
 - replace quoted TABs by a placeholder within quoted cells
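
A minimal sketch of that post-processing, assuming the numtochar(29)-delimited output of a csvToTab-style handler (the field names and the numtochar(30) placeholder for embedded TABs are illustrative assumptions, not part of the existing code):

on mouseUp
   local tData
   put CSVToTab1(field "csvIn") into tData -- cells come back numtochar(29)-delimited
   replace tab with numtochar(30) in tData -- park TABs that were inside quoted cells
   replace numtochar(29) with tab in tData -- the cell delimiter becomes a real TAB
   put tData into field "tsvOut"
end mouseUp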

Any comments or suggestions ?

Thanks
Alex.

On 17/10/2015 02:34, Mike Kerner wrote:

It's safe as long as you remember to remove it at the end of the function

On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly  wrote:


Duh - replying to myself again :-)

It looks as though that's exactly what you do mean - it certainly
generates the problems you described earlier. And my one-line additional
test would (does in my testing) solve it properly - without it, we don't
get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
line we do. And adding it is always safe (AFAICI).

-- Alex.


On 17/10/2015 00:03, Alex Tweedly wrote:


Sorry, Mike, but can you describe what you mean by a "naked" line ?
Is it simply one with no line delimiter after it ?
i.e. could only happen on the very last line of a file of input ?

Could that be solved by a simple test (after the various 'replace'
statements)
 if the last char of pData <> CR then put CR after pData
before the parsing happens ?

-- Alex.


On 16/10/2015 17:19, Mike Kerner wrote:


No, the problem isn't that LC use LF and CR for ascii(10) and ignores
ascii(13).  That's just a personal problem.

The problem, here, is that the csv parser handles a naked line and a
terminated line differently.  If the line is terminated, it parses it one
way, and if it is not, it parses it (incorrectly) a different way, which
makes me wonder if this is the latest version.

On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
bobsnei...@iotecdigital.com>
wrote:

But what if the cr or lf or crlf is inside quoted text, meaning it is not

a delimiter? Oh, I'm afraid the deflector shield will be quite
operational
when your friends arrive.

Bob S


On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:

Hi Mike,

thanks for that additional info.

I *think* (it's been 3 years) I left them as  (i.e. numtochar(29))


because I had some data including normal TAB characters within the cells
(!!( and thought  was a safer bet - though of course nothing is
completely safe. It's then up to the caller to decide whether to do
"replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
they want.


As for the other bigger problem  Oh dear = CR vs LF vs CRLF 

Are you on Mac or Windows or Linux ?
How is the LF delimited data getting into your app ?
Maybe we should just add a "replace chartonum(13) with CR in pData" ?

(I confess to being confused by this - I know that LC does


auto-translation of line delimiters at various places, but I'm not sure
when it is, or isn't, completely safe. Maybe the easiest thing is to
jst do
all the translations 


   replace CRLF with CR in pData
   replace numtochar(10) with CR in pData
   replace numtochar(13) with CR in pData

-- Alex.




___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Mike Kerner
It's safe as long as you remember to remove it at the end of the function

On Fri, Oct 16, 2015 at 7:12 PM, Alex Tweedly  wrote:

> Duh - replying to myself again :-)
>
> It looks as though that's exactly what you do mean - it certainly
> generates the problems you described earlier. And my one-line additional
> test would (does in my testing) solve it properly - without it, we don't
> get a chance to flush "theInsideStringSoFar" to tNuData, with the extra
> line we do. And adding it is always safe (AFAICI).
>
> -- Alex.
>
>
> On 17/10/2015 00:03, Alex Tweedly wrote:
>
>> Sorry, Mike, but can you describe what you mean by a "naked" line ?
>> Is it simply one with no line delimiter after it ?
>> i.e. could only happen on the very last line of a file of input ?
>>
>> Could that be solved by a simple test (after the various 'replace'
>> statements)
>> if the last char of pData <> CR then put CR after pData
>> before the parsing happens ?
>>
>> -- Alex.
>>
>>
>> On 16/10/2015 17:19, Mike Kerner wrote:
>>
>>> No, the problem isn't that LC use LF and CR for ascii(10) and ignores
>>> ascii(13).  That's just a personal problem.
>>>
>>> The problem, here, is that the csv parser handles a naked line and a
>>> terminated line differently.  If the line is terminated, it parses it one
>>> way, and if it is not, it parses it (incorrectly) a different way, which
>>> makes me wonder if this is the latest version.
>>>
>>> On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar <
>>> bobsnei...@iotecdigital.com>
>>> wrote:
>>>
>>> But what if the cr or lf or crlf is inside quoted text, meaning it is not
 a delimiter? Oh, I'm afraid the deflector shield will be quite
 operational
 when your friends arrive.

 Bob S


 On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:
>
> Hi Mike,
>
> thanks for that additional info.
>
> I *think* (it's been 3 years) I left them as  (i.e. numtochar(29))
>
 because I had some data including normal TAB characters within the cells
 (!!( and thought  was a safer bet - though of course nothing is
 completely safe. It's then up to the caller to decide whether to do
 "replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
 they want.

> As for the other bigger problem  Oh dear = CR vs LF vs CRLF 
>
> Are you on Mac or Windows or Linux ?
> How is the LF delimited data getting into your app ?
> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
>
> (I confess to being confused by this - I know that LC does
>
 auto-translation of line delimiters at various places, but I'm not sure
 when it is, or isn't, completely safe. Maybe the easiest thing is to
 jst do
 all the translations 

>   replace CRLF with CR in pData
>   replace numtochar(10) with CR in pData
>   replace numtochar(13) with CR in pData
>
> -- Alex.
>




-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Alex Tweedly

Duh - replying to myself again :-)

It looks as though that's exactly what you do mean - it certainly 
generates the problems you described earlier. And my one-line additional 
test would (does in my testing) solve it properly - without it, we don't 
get a chance to flush "theInsideStringSoFar" to tNuData, with the extra 
line we do. And adding it is always safe (AFAICI).


-- Alex.

On 17/10/2015 00:03, Alex Tweedly wrote:

Sorry, Mike, but can you describe what you mean by a "naked" line ?
Is it simply one with no line delimiter after it ?
i.e. could only happen on the very last line of a file of input ?

Could that be solved by a simple test (after the various 'replace' 
statements)

if the last char of pData <> CR then put CR after pData
before the parsing happens ?

-- Alex.


On 16/10/2015 17:19, Mike Kerner wrote:

No, the problem isn't that LC use LF and CR for ascii(10) and ignores
ascii(13).  That's just a personal problem.

The problem, here, is that the csv parser handles a naked line and a
terminated line differently.  If the line is terminated, it parses it 
one

way, and if it is not, it parses it (incorrectly) a different way, which
makes me wonder if this is the latest version.

On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar 


wrote:

But what if the cr or lf or crlf is inside quoted text, meaning it 
is not
a delimiter? Oh, I'm afraid the deflector shield will be quite 
operational

when your friends arrive.

Bob S



On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:

Hi Mike,

thanks for that additional info.

I *think* (it's been 3 years) I left them as  (i.e. numtochar(29))
because I had some data including normal TAB characters within the 
cells

(!!( and thought  was a safer bet - though of course nothing is
completely safe. It's then up to the caller to decide whether to do
"replace numtochar(29) with TAB in ...", or do TAB escaping, or 
whatever

they want.

As for the other bigger problem  Oh dear = CR vs LF vs CRLF 

Are you on Mac or Windows or Linux ?
How is the LF delimited data getting into your app ?
Maybe we should just add a "replace chartonum(13) with CR in pData" ?

(I confess to being confused by this - I know that LC does

auto-translation of line delimiters at various places, but I'm not sure
when it is, or isn't, completely safe. Maybe the easiest thing is to 
jst do

all the translations 

  replace CRLF with CR in pData
  replace numtochar(10) with CR in pData
  replace numtochar(13) with CR in pData

-- Alex.





___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Alex Tweedly

Sorry, Mike, but can you describe what you mean by a "naked" line ?
Is it simply one with no line delimiter after it ?
i.e. could only happen on the very last line of a file of input ?

Could that be solved by a simple test (after the various 'replace' 
statements)

if the last char of pData <> CR then put CR after pData
before the parsing happens ?
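
A sketch of where that guard would sit alongside the normalization lines already in the handler (the handler name is illustrative; pData is passed by reference):

on normalizeForParsing @pData
   replace crlf with cr in pData -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX
   if the last char of pData is not cr then put cr after pData -- terminate a "naked" final line
end normalizeForParsing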

-- Alex.


On 16/10/2015 17:19, Mike Kerner wrote:

No, the problem isn't that LC use LF and CR for ascii(10) and ignores
ascii(13).  That's just a personal problem.

The problem, here, is that the csv parser handles a naked line and a
terminated line differently.  If the line is terminated, it parses it one
way, and if it is not, it parses it (incorrectly) a different way, which
makes me wonder if this is the latest version.

On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar 
wrote:


But what if the cr or lf or crlf is inside quoted text, meaning it is not
a delimiter? Oh, I'm afraid the deflector shield will be quite operational
when your friends arrive.

Bob S



On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:

Hi Mike,

thanks for that additional info.

I *think* (it's been 3 years) I left them as  (i.e. numtochar(29))

because I had some data including normal TAB characters within the cells
(!!( and thought  was a safer bet - though of course nothing is
completely safe. It's then up to the caller to decide whether to do
"replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
they want.

As for the other bigger problem    Oh dear = CR vs LF vs CRLF 

Are you on Mac or Windows or Linux ?
How is the LF delimited data getting into your app ?
Maybe we should just add a "replace chartonum(13) with CR in pData" ?

(I confess to being confused by this - I know that LC does

auto-translation of line delimiters at various places, but I'm not sure
when it is, or isn't, completely safe. Maybe the easiest thing is to jst do
all the translations 

  replace CRLF with CR in pData
  replace numtochar(10) with CR in pData
  replace numtochar(13) with CR in pData

-- Alex.


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your
subscription preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode







___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Alex Tweedly
It's likely (but of course not 100% guaranteed) that those characters 
have themselves been manipulated in a consistent way by either LC or any 
other subsystem - i.e. auto-translated or not.


Anyone who chooses to use those as genuinely different characters within 
quoted cells *deserves* to have their data be unreadable :-)


-- Alex.

On 16/10/2015 16:28, Bob Sneidar wrote:

But what if the cr or lf or crlf is inside quoted text, meaning it is not a 
delimiter? Oh, I'm afraid the deflector shield will be quite operational when 
your friends arrive.

Bob S



On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:

Hi Mike,

thanks for that additional info.

I *think* (it's been 3 years) I left them as  (i.e. numtochar(29)) because I had some data 
including normal TAB characters within the cells (!!( and thought  was a safer bet - though 
of course nothing is completely safe. It's then up to the caller to decide whether to do 
"replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever they want.

As for the other bigger problem    Oh dear = CR vs LF vs CRLF 

Are you on Mac or Windows or Linux ?
How is the LF delimited data getting into your app ?
Maybe we should just add a "replace chartonum(13) with CR in pData" ?

(I confess to being confused by this - I know that LC does auto-translation of 
line delimiters at various places, but I'm not sure when it is, or isn't, 
completely safe. Maybe the easiest thing is to jst do all the translations 

  replace CRLF with CR in pData
  replace numtochar(10) with CR in pData
  replace numtochar(13) with CR in pData

-- Alex.


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode



___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Bob Sneidar
The force is strong with this one.

Bob S


On Oct 16, 2015, at 09:19 , Mike Kerner 
mailto:mikeker...@roadrunner.com>> wrote:

No, the problem isn't that LC use LF and CR for ascii(10) and ignores
ascii(13).  That's just a personal problem.

The problem, here, is that the csv parser handles a naked line and a
terminated line differently.  If the line is terminated, it parses it one
way, and if it is not, it parses it (incorrectly) a different way, which
makes me wonder if this is the latest version.

On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar 
mailto:bobsnei...@iotecdigital.com>>
wrote:

But what if the cr or lf or crlf is inside quoted text, meaning it is not
a delimiter? Oh, I'm afraid the deflector shield will be quite operational
when your friends arrive.

Bob S

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Mike Kerner
No, the problem isn't that LC use LF and CR for ascii(10) and ignores
ascii(13).  That's just a personal problem.

The problem, here, is that the csv parser handles a naked line and a
terminated line differently.  If the line is terminated, it parses it one
way, and if it is not, it parses it (incorrectly) a different way, which
makes me wonder if this is the latest version.

On Fri, Oct 16, 2015 at 11:28 AM, Bob Sneidar 
wrote:

> But what if the cr or lf or crlf is inside quoted text, meaning it is not
> a delimiter? Oh, I'm afraid the deflector shield will be quite operational
> when your friends arrive.
>
> Bob S
>
>
> > On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:
> >
> > Hi Mike,
> >
> > thanks for that additional info.
> >
> > I *think* (it's been 3 years) I left them as  (i.e. numtochar(29))
> because I had some data including normal TAB characters within the cells
> (!!( and thought  was a safer bet - though of course nothing is
> completely safe. It's then up to the caller to decide whether to do
> "replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever
> they want.
> >
> > As for the other bigger problem    Oh dear = CR vs LF vs CRLF 
> >
> > Are you on Mac or Windows or Linux ?
> > How is the LF delimited data getting into your app ?
> > Maybe we should just add a "replace chartonum(13) with CR in pData" ?
> >
> > (I confess to being confused by this - I know that LC does
> auto-translation of line delimiters at various places, but I'm not sure
> when it is, or isn't, completely safe. Maybe the easiest thing is to jst do
> all the translations 
> >
> >  replace CRLF with CR in pData
> >  replace numtochar(10) with CR in pData
> >  replace numtochar(13) with CR in pData
> >
> > -- Alex.
>
>
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>



-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Bob Sneidar
But what if the cr or lf or crlf is inside quoted text, meaning it is not a 
delimiter? Oh, I'm afraid the deflector shield will be quite operational when 
your friends arrive.

Bob S


> On Oct 16, 2015, at 08:04 , Alex Tweedly  wrote:
> 
> Hi Mike,
> 
> thanks for that additional info.
> 
> I *think* (it's been 3 years) I left them as  (i.e. numtochar(29)) 
> because I had some data including normal TAB characters within the cells (!!( 
> and thought  was a safer bet - though of course nothing is completely 
> safe. It's then up to the caller to decide whether to do "replace 
> numtochar(29) with TAB in ...", or do TAB escaping, or whatever they want.
> 
> As for the other bigger problem    Oh dear = CR vs LF vs CRLF 
> 
> Are you on Mac or Windows or Linux ?
> How is the LF delimited data getting into your app ?
> Maybe we should just add a "replace chartonum(13) with CR in pData" ?
> 
> (I confess to being confused by this - I know that LC does auto-translation 
> of line delimiters at various places, but I'm not sure when it is, or isn't, 
> completely safe. Maybe the easiest thing is to jst do all the translations 
> 
> 
>  replace CRLF with CR in pData
>  replace numtochar(10) with CR in pData
>  replace numtochar(13) with CR in pData
> 
> -- Alex.


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Alex Tweedly

Hi Mike,

thanks for that additional info.

I *think* (it's been 3 years) I left them as <29> (i.e. numtochar(29)) 
because I had some data including normal TAB characters within the cells 
(!!) and thought <29> was a safer bet - though of course nothing is 
completely safe. It's then up to the caller to decide whether to do 
"replace numtochar(29) with TAB in ...", or do TAB escaping, or whatever 
they want.
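
A caller-side sketch of that choice (pConverted is assumed to hold the numtochar(29)-delimited result; the field name is illustrative):

on handleConverted pConverted
   -- pConverted: the numtochar(29)-delimited text a csvToTab-style handler returns
   replace numtochar(29) with tab in pConverted -- this caller opts for real TABs
   -- numtochar(29) could equally be replaced with an escape sequence, or left alone
   set the text of field "grid" to pConverted -- "grid" is an illustrative field name
end handleConverted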


As for the other bigger problem    Oh dear = CR vs LF vs CRLF 

Are you on Mac or Windows or Linux ?
How is the LF delimited data getting into your app ?
Maybe we should just add a "replace chartonum(13) with CR in pData" ?

(I confess to being confused by this - I know that LC does 
auto-translation of line delimiters at various places, but I'm not sure 
when it is, or isn't, completely safe. Maybe the easiest thing is to just 
do all the translations 


  replace CRLF with CR in pData
  replace numtochar(10) with CR in pData
  replace numtochar(13) with CR in pData

-- Alex.

On 16/10/2015 12:48, Mike Kerner wrote:

Richard,
Yes, I understand it was a Pascal Pun, and then in 2012, when this thread
originally happened, it became something more, sort of a version pun on a
pascal pun, if you will.

Rather than posting fixes to the one on your blog, let's go through the
"state of the art" and work on that, instead, if it needs it.


Alex,
I see at least two issues with this version:
First of all, you never substitute tab for tNuDelim, so the string you
return is numtochar(29) delimited, not tab-delimited.
The last line of your function, before the "return tNuData" line should be
"replace tNuDelim with tab"

Second of all, I get two different results in my sample, depending on
whether or not the string is ...ERRR -terminated or not
After fixing the problem, above,

When I run
"A","","C"
I get
A  
i.e. the "C" is missing

NOW, if I send
"A","","C"
A   C 

I haven't looked for that bug, yet.

On Thu, Oct 15, 2015 at 10:55 PM, Alex Tweedly  wrote:


H ... my quick test of what was csv4Tab, but is now called csvToTab1 -
see below - gives me
(showing results with a colon  ':' for the cell delimiter, i.e. replacing
numtochar(29) from the code in the previous use-list code

a,b,c   ---> a:b:c
"a","","c" ---> a::c

Now to me, that's what it should give - so I think it gets it right :-)

Question is
a. do you get the same result ?
 if not, what do you get ?  OR can you try with the code below
 if you do, but disagree that this is right, what do you think it
should give ?

-- Alex

function CSVToTab1 pData,pcoldelim
local tNuData -- contains tabbed copy of data
local tReturnPlaceholder -- replaces cr in field data to avoid line
--   breaks which would be misread as records;
local tNuDelim  -- new character to replace the delimiter
local tStatus, theInsideStringSoFar
--
put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
put numtochar(29) into tNuDelim
--
if pcoldelim is empty then put comma into pcoldelim
-- Normalize line endings:
replace crlf with cr in pData  -- Win to UNIX
replace numtochar(13) with cr in pData -- Mac to UNIX

put "outside" into tStatus
set the itemdel to quote
repeat for each item k in pData
   -- put tStatus && k & CR after msg
   switch tStatus

  case "inside"
 put k after theInsideStringSoFar
 put "passedquote" into tStatus
 next repeat

  case "passedquote"
 -- decide if it was a duplicated escapedQuote or a closing quote
 if k is empty then   -- it's a duplicated quote
put quote after theInsideStringSoFar
put "inside" into tStatus
next repeat
 end if
 -- not empty - so we remain inside the cell, though we have left the quoted section
 -- NB this allows for quoted sub-strings within the cell content !!
 replace cr with tReturnPlaceholder in theInsideStringSoFar
 put theInsideStringSoFar after tNuData

  case "outside"
 replace pcoldelim with tNuDelim in k
 -- and deal with the "empty trailing item" issue in Livecode
 replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
 put k after tNuData
 put "inside" into tStatus
 put empty into theInsideStringSoFar
 next repeat
  default
 put "defaulted"
 break
   end switch
end repeat
return tNuData
end CSVToTab1


On 16/10/2015 01:34, Mike Kerner wrote:


csv4 does not handle it, and it comes up with a different result from csv2
(which is also wrong).  I sent Richard proposed changes to csv2 which
addresses that issue, but I'll wait while we collectively try to remember
what the latest and greatest csv parser algorithm is before I try to come
up with more ways to break or fix it.

On 

Re: CSV again.

2015-10-16 Thread Bob Sneidar
Someone wrote a piece years ago about why no one who wanted to maintain his 
sanity should attempt to write an XML to CSV parser. In the process of writing 
the piece, his mind degenerated until he was blathering on about non-sensical 
things. The devil had finished his work on the poor soul.

I do not think there *IS* a way to cover all the exceptions in a CSV parser. 
CSV does not lend itself to correct parsing. Just trying to figure out how to 
deal with a cr or lf inside quoted text will get you therapy. Never mind a 
thousands separator in a numeric non-quoted value, which will probably get you 
a stay in a very quiet hotel room where you can't find the checkout desk.

Bob S


On Oct 15, 2015, at 16:34 , Richard Gaskin 
mailto:ambassa...@fourthworld.com>> wrote:

So this seems like a good time to once again bring together the best minds in 
our community (are you listening Alex Tweedly, Geoff Canyon, Mark Weider, Dick 
Kreisel, and others?) to see if we can revisit CSV parsing and come up with a 
function that can parse it into tabs efficiently, while taking into account all 
of the really stupid exceptions that have crept into the world since that 
really stupid format was first popularized.

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-16 Thread Mike Kerner
Richard,
Yes, I understand it was a Pascal Pun, and then in 2012, when this thread
originally happened, it became something more, sort of a version pun on a
pascal pun, if you will.

Rather than posting fixes to the one on your blog, let's go through the
"state of the art" and work on that, instead, if it needs it.


Alex,
I see at least two issues with this version:
First of all, you never substitute tab for tNuDelim, so the string you
return is numtochar(29) delimited, not tab-delimited.
The last line of your function, before the "return tNuData" line should be
"replace tNuDelim with tab"

Second of all, I get two different results in my sample, depending on
whether or not the string is ...ERRR -terminated or not
After fixing the problem, above,

When I run
"A","","C"
I get
A  
i.e. the "C" is missing

NOW, if I send
"A","","C"
A   C 

I haven't looked for that bug, yet.

On Thu, Oct 15, 2015 at 10:55 PM, Alex Tweedly  wrote:

> H ... my quick test of what was csv4Tab, but is now called csvToTab1 -
> see below - gives me
> (showing results with a colon  ':' for the cell delimiter, i.e. replacing
> numtochar(29) from the code in the previous use-list code
>
> a,b,c   ---> a:b:c
> "a","","c" ---> a::c
>
> Now to me, that's what it should give - so I think it gets it right :-)
>
> Question is
> a. do you get the same result ?
> if not, what do you get ?  OR can you try with the code below
> if you do, but disagree that this is right, what do you think it
> should give ?
>
> -- Alex
>
> function CSVToTab1 pData,pcoldelim
>local tNuData -- contains tabbed copy of data
>local tReturnPlaceholder -- replaces cr in field data to avoid line
>--   breaks which would be misread as records;
>local tNuDelim  -- new character to replace the delimiter
>local tStatus, theInsideStringSoFar
>--
>put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
>put numtochar(29) into tNuDelim
>--
>if pcoldelim is empty then put comma into pcoldelim
>-- Normalize line endings:
>replace crlf with cr in pData  -- Win to UNIX
>replace numtochar(13) with cr in pData -- Mac to UNIX
>
>put "outside" into tStatus
>set the itemdel to quote
>repeat for each item k in pData
>   -- put tStatus && k & CR after msg
>   switch tStatus
>
>  case "inside"
> put k after theInsideStringSoFar
> put "passedquote" into tStatus
> next repeat
>
>  case "passedquote"
> -- decide if it was a duplicated escapedQuote or a closing
> quote
> if k is empty then   -- it's a duplicated quote
>put quote after theInsideStringSoFar
>put "inside" into tStatus
>next repeat
> end if
> -- not empty - so we remain inside the cell, though we have
> left the quoted section
> -- NB this allows for quoted sub-strings within the cell
> content !!
> replace cr with tReturnPlaceholder in theInsideStringSoFar
> put theInsideStringSoFar after tNuData
>
>  case "outside"
> replace pcoldelim with tNuDelim in k
> -- and deal with the "empty trailing item" issue in Livecode
> replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
> put k after tNuData
> put "inside" into tStatus
> put empty into theInsideStringSoFar
> next repeat
>  default
> put "defaulted"
> break
>   end switch
>end repeat
>return tNuData
> end CSVToTab1
>
>
> On 16/10/2015 01:34, Mike Kerner wrote:
>
>> csv4 does not handle it, and it comes up with a different result from csv2
>> (which is also wrong).  I sent Richard proposed changes to csv2 which
>> addresses that issue, but I'll wait while we collectively try to remember
>> what the latest and greatest csv parser algorithm is before I try to come
>> up with more ways to break or fix it.
>>
>> On Thu, Oct 15, 2015 at 8:24 PM, Alex Tweedly  wrote:
>>
>> Richard et al.,
>>>
>>> sometime after that article, there was a further thread on the use-list.
>>> Pete Haworth found a case not properly covered by the version on the
>>> article, and I came up with a revised version (cutely called csv4Tab !! -
>>> csv3Tab was an interim, deeply buggy attempt)
>>>
>>> (It's in
>>> http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )
>>>
>>> It *looks* from that thread (
>>> http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as
>>> though this case had been discussed, and the re-write should properly
>>> handle it - but I haven't yet had time to try it. My laptop has been
>>> replaced in the meantime, and I can't find my test stack, and recreating
>>> it
>>> and finding the test data is a bit too much for after 1am:-)
>>>
>>> So I'll try it tomorrow; hopefully csv4Tab() will already work for 

Re: CSV again.

2015-10-15 Thread Alex Tweedly
H ... my quick test of what was csv4Tab, but is now called csvToTab1 
- see below - gives me
(showing results with a colon  ':' for the cell delimiter, i.e. 
replacing numtochar(29) from the code in the previous use-list code


a,b,c   ---> a:b:c
"a","","c" ---> a::c

Now to me, that's what it should give - so I think it gets it right :-)

Question is
a. do you get the same result ?
if not, what do you get ?  OR can you try with the code below
if you do, but disagree that this is right, what do you think it 
should give ?


-- Alex

function CSVToTab1 pData,pcoldelim
   local tNuData -- contains tabbed copy of data
   local tReturnPlaceholder -- replaces cr in field data to avoid line
   --   breaks which would be misread as records;
   local tNuDelim  -- new character to replace the delimiter
   local tStatus, theInsideStringSoFar
   --
   put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
   put numtochar(29) into tNuDelim
   --
   if pcoldelim is empty then put comma into pcoldelim
   -- Normalize line endings:
   replace crlf with cr in pData  -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX

   put "outside" into tStatus
   set the itemdel to quote
   repeat for each item k in pData
  -- put tStatus && k & CR after msg
  switch tStatus

 case "inside"
put k after theInsideStringSoFar
put "passedquote" into tStatus
next repeat

 case "passedquote"
-- decide if it was a duplicated escapedQuote or a closing quote
if k is empty then   -- it's a duplicated quote
   put quote after theInsideStringSoFar
   put "inside" into tStatus
   next repeat
end if
-- not empty - so we remain inside the cell, though we have left the quoted section
-- NB this allows for quoted sub-strings within the cell content !!

replace cr with tReturnPlaceholder in theInsideStringSoFar
put theInsideStringSoFar after tNuData

 case "outside"
replace pcoldelim with tNuDelim in k
-- and deal with the "empty trailing item" issue in Livecode
replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
put k after tNuData
put "inside" into tStatus
put empty into theInsideStringSoFar
next repeat
 default
put "defaulted"
break
  end switch
   end repeat
   return tNuData
end CSVToTab1
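
A harness sketch for reproducing the colon-style display above (the wiring is illustrative; the second input line is the "a","","c" case from this message, and the output will of course reflect whatever issues are discussed elsewhere in the thread):

on mouseUp
   local tIn, tOut
   put "a,b,c" & return into tIn
   put quote & "a" & quote & comma & quote & quote & comma & quote & "c" & quote & return after tIn
   put CSVToTab1(tIn) into tOut
   replace numtochar(29) with ":" in tOut -- show the cell delimiter as a colon
   put tOut into msg
end mouseUp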

On 16/10/2015 01:34, Mike Kerner wrote:

csv4 does not handle it, and it comes up with a different result from csv2
(which is also wrong).  I sent Richard proposed changes to csv2 which
addresses that issue, but I'll wait while we collectively try to remember
what the latest and greatest csv parser algorithm is before I try to come
up with more ways to break or fix it.

On Thu, Oct 15, 2015 at 8:24 PM, Alex Tweedly  wrote:


Richard et al.,

sometime after that article, there was a further thread on the use-list.
Pete Haworth found a case not properly covered by the version on the
article, and I came up with a revised version (cutely called csv4Tab !! -
csv3Tab was an interim, deeply buggy attempt)

(It's in
http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )

It *looks* from that thread (
http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as
though this case had been discussed, and the re-write should properly
handle it - but I haven't yet had time to try it. My laptop has been
replaced in the meantime, and I can't find my test stack, and recreating it
and finding the test data is a bit too much for after 1am:-)

So I'll try it tomorrow; hopefully csv4Tab() will already work for this
case. If it doesn't, we can try again :-)

-- Alex.


On 16/10/2015 00:34, Richard Gaskin wrote:


Mike Kerner wrote:

Alex, Richard, etc.

What do we consider the latest version of the csv parser?  I think I
found a bug in Richard's CSV2Text code, and proposed changes, but he
wanted the discussion to go down over here, first.  Then I noticed
that csv4Text is out over here, which makes 2, I guess, a bit long in
the tooth.

The version referred to here as "Richard's" is the famous Tweedly algo,
in the middle of this page:


Alex came up with that after a bunch of us here had a long discussion
about the many variants of CSV running around, and how stupidly complex
they are to parse (see the details in that article).

Mike wrote me this afternoon letting me know that there's yet another
exception that doesn't seem to be accounted for there:

"value","","value"

I had thought we'd covered that in the earlier discussion, but perhaps
not.

So this seems like a good time to once again bring together the best
minds in our community (are you listening Alex Tweedly, Geoff Canyon, Mark
Weider, Dick

Re: CSV again.

2015-10-15 Thread Richard Gaskin

Mike Kerner wrote:

For everyone trying to get back up to speed on CSV, here's the closest
thing to a "Standard", RFC 4180:
https://tools.ietf.org/html/rfc4180


Unfortunately the "format" was around for so long before that RFC, and 
so many big companies have ignored the RFC since, that it doesn't 
reflect the staggeringly rich variety of escapes and quoting conventions 
found in real-world data.


Doesn't hurt to make sure whatever we come up with handles the spec, but 
the spec doesn't handle a lot of what the world calls "CSV".


(And don't even get me started about files delimited by things other 
than commas that identify themselves by the acronym for "Comma-Separated 
Values").


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com  http://www.FourthWorld.com

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Richard Gaskin

Mike Kerner wrote:


csv4 does not handle it, and it comes up with a different result from csv2
(which is also wrong).  I sent Richard proposed changes to csv2 which
addresses that issue, but I'll wait while we collectively try to remember
what the latest and greatest csv parser algorithm is before I try to come
up with more ways to break or fix it.


If the fix you provided addresses the issues you found it would be 
helpful to post it here so others can test it.


There was a naming issue in previous versions I'd like to address here:

The function named "CSV2Tab" uses a naming convention popular among some 
Pascal programmers back in the day (but apparently less common today), 
in which conversion functions use a numeral "2" instead of "to" to more 
readily distinguish the text on either side.


Along the way it seems some believed it was a version number embedded in 
the middle of the function name, and during our discussion we started 
seeing things like "CSV3Tab", "CSV4Tab", etc.


The version at my page may well be what was originally named "CSV4Tab", 
but renamed once it became the final version when I posted it to my 
site.  To the best of my knowledge the version posted on my page was the 
most robust available at the time I posted it.


Making things even more confusing, I believe there were at least two 
versions named "CSV4Tab", so I believe it may take some digging to find 
the latest and greatest.  And keep in mind that given the many weird 
things about the many very different implementations of CSV, the latest 
may not be the greatest.


A few years ago I stopped using the older "2" convention for converters, 
so no hand-slapping needed; already did it myself.  And I've never 
embedded a version number in a handler name except during testing, so if 
you ever see code from me that has a numeral in it rest assured it's not 
a version number; if its meaning is unclear feel free to ask.



So all that said, two notes going forward:

1. When we get a good CSV algo here, the version I post at my page will 
be named "CSVToTab" to avoid such misunderstandings in the future.


2. While we're experimenting here perhaps we could add a version number 
at the end of the function name if it needs to be distinguished for 
comparison purposes (e.g., "CSVtoTab5").


I'll wait to update the page until we have a good one, and when I do I'll 
also provide a link back to the source post for reference.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com  http://www.FourthWorld.com

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Mike Kerner
For everyone trying to get back up to speed on CSV, here's the closest
thing to a "Standard", RFC 4180:
https://tools.ietf.org/html/rfc4180

On Thu, Oct 15, 2015 at 8:34 PM, Peter Haworth  wrote:

> Right I remember that although not what the exact problem was.  In any
> case, csv4Tab has been working fine in my SQLiteAdmin program for at least
> a couple of years now, but I have no idea what flavor of csv files have
> been imported.
>
> Pete
> lcSQL Software 
> Home of lcStackBrowser  and
> SQLiteAdmin 
>
> On Thu, Oct 15, 2015 at 5:24 PM, Alex Tweedly  wrote:
>
> > Richard et al.,
> >
> > sometime after that article, there was a further thread on the use-list.
> > Pete Haworth found a case not properly covered by the version on the
> > article, and I came up with a revised version (cutely called csv4Tab !! -
> > csv3Tab was an interim, deeply buggy attempt)
> >
> > (It's in
> > http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )
> >
> > It *looks* from that thread (
> > http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as
> > though this case had been discussed, and the re-write should properly
> > handle it - but I haven't yet had time to try it. My laptop has been
> > replaced in the meantime, and I can't find my test stack, and recreating
> it
> > and finding the test data is a bit too much for after 1am:-)
> >
> > So I'll try it tomorrow; hopefully csv4Tab() will already work for this
> > case. If it doesn't, we can try again :-)
> >
> > -- Alex.
> >
> > On 16/10/2015 00:34, Richard Gaskin wrote:
> >
> >> Mike Kerner wrote:
> >> > Alex, Richard, etc.
> >> >
> >> > What do we consider the latest version of the csv parser?  I think I
> >> > found a bug in Richard's CSV2Text code, and proposed changes, but he
> >> > wanted the discussion to go down over here, first.  Then I noticed
> >> > that csv4Text is out over here, which makes 2, I guess, a bit long in
> >> > the tooth.
> >>
> >> The version referred to here as "Richard's" is the famous Tweedly algo,
> >> in the middle of this page:
> >> 
> >>
> >> Alex came up with that after a bunch of us here had a long discussion
> >> about the many variants of CSV running around, and how stupidly complex
> >> they are to parse (see the details in that article).
> >>
> >> Mike wrote me this afternoon letting me know that there's yet another
> >> exception that doesn't seem to be accounted for there:
> >>
> >>"value","","value"
> >>
> >> I had thought we'd covered that in the earlier discussion, but perhaps
> >> not.
> >>
> >> So this seems like a good time to once again bring together the best
> >> minds in our community (are you listening Alex Tweedly, Geoff Canyon,
> Mark
> >> Weider, Dick Kreisel, and others?) to see if we can revisit CSV parsing
> and
> >> come up with a function that can parse it into tabs efficiently, while
> >> taking into account all of the really stupid exceptions that have crept
> >> into the world since that really stupid format was first popularized.
> >>
> >> When we're done I'll update the article, and add even more sarcastic
> >> comments about what a really dumb idea it was to have encouraged people
> to
> >> delimit text with a character so frequently appearing in text.
> >>
> >> --
> >>  Richard Gaskin
> >>  Fourth World Systems
> >>  Software Design and Development for the Desktop, Mobile, and the Web
> >>  
> >>  ambassa...@fourthworld.com http://www.FourthWorld.com
> >>
> >>
> >> ___
> >> use-livecode mailing list
> >> use-livecode@lists.runrev.com
> >> Please visit this url to subscribe, unsubscribe and manage your
> >> subscription preferences:
> >> http://lists.runrev.com/mailman/listinfo/use-livecode
> >>
> >
> >
> > ___
> > use-livecode mailing list
> > use-livecode@lists.runrev.com
> > Please visit this url to subscribe, unsubscribe and manage your
> > subscription preferences:
> > http://lists.runrev.com/mailman/listinfo/use-livecode
> >
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>



-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livec

Re: CSV again.

2015-10-15 Thread Mike Kerner
csv4 does not handle it, and it comes up with a different result from csv2
(which is also wrong).  I sent Richard proposed changes to csv2 which
addresses that issue, but I'll wait while we collectively try to remember
what the latest and greatest csv parser algorithm is before I try to come
up with more ways to break or fix it.

On Thu, Oct 15, 2015 at 8:24 PM, Alex Tweedly  wrote:

> Richard et al.,
>
> sometime after that article, there was a further thread on the use-list.
> Pete Haworth found a case not properly covered by the version on the
> article, and I came up with a revised version (cutely called csv4Tab !! -
> csv3Tab was an interim, deeply buggy attempt)
>
> (It's in
> http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )
>
> It *looks* from that thread (
> http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as
> though this case had been discussed, and the re-write should properly
> handle it - but I haven't yet had time to try it. My laptop has been
> replaced in the meantime, and I can't find my test stack, and recreating it
> and finding the test data is a bit too much for after 1am:-)
>
> So I'll try it tomorrow; hopefully csv4Tab() will already work for this
> case. If it doesn't, we can try again :-)
>
> -- Alex.
>
>
> On 16/10/2015 00:34, Richard Gaskin wrote:
>
>> Mike Kerner wrote:
>> > Alex, Richard, etc.
>> >
>> > What do we consider the latest version of the csv parser?  I think I
>> > found a bug in Richard's CSV2Text code, and proposed changes, but he
>> > wanted the discussion to go down over here, first.  Then I noticed
>> > that csv4Text is out over here, which makes 2, I guess, a bit long in
>> > the tooth.
>>
>> The version referred to here as "Richard's" is the famous Tweedly algo,
>> in the middle of this page:
>> 
>>
>> Alex came up with that after a bunch of us here had a long discussion
>> about the many variants of CSV running around, and how stupidly complex
>> they are to parse (see the details in that article).
>>
>> Mike wrote me this afternoon letting me know that there's yet another
>> exception that doesn't seem to be accounted for there:
>>
>>"value","","value"
>>
>> I had thought we'd covered that in the earlier discussion, but perhaps
>> not.
>>
>> So this seems like a good time to once again bring together the best
>> minds in our community (are you listening Alex Tweedly, Geoff Canyon, Mark
>> Weider, Dick Kreisel, and others?) to see if we can revisit CSV parsing and
>> come up with a function that can parse it into tabs efficiently, while
>> taking into account all of the really stupid exceptions that have crept
>> into the world since that really stupid format was first popularized.
>>
>> When we're done I'll update the article, and add even more sarcastic
>> comments about what a really dumb idea it was to have encouraged people to
>> delimit text with a character so frequently appearing in text.
>>
>> --
>>  Richard Gaskin
>>  Fourth World Systems
>>  Software Design and Development for the Desktop, Mobile, and the Web
>>  
>>  ambassa...@fourthworld.com http://www.FourthWorld.com
>>
>>
>> ___
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>
>
>
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>



-- 
On the first day, God created the heavens and the Earth
On the second day, God created the oceans.
On the third day, God put the animals on hold for a few hours,
   and did a little diving.
And God said, "This is good."
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Peter Haworth
Right I remember that although not what the exact problem was.  In any
case, csv4Tab has been working fine in my SQLiteAdmin program for at least
a couple of years now, but I have no idea what flavor of csv files have
been imported.

Pete
lcSQL Software 
Home of lcStackBrowser  and
SQLiteAdmin 

On Thu, Oct 15, 2015 at 5:24 PM, Alex Tweedly  wrote:

> Richard et al.,
>
> sometime after that article, there was a further thread on the use-list.
> Pete Haworth found a case not properly covered by the version on the
> article, and I came up with a revised version (cutely called csv4Tab !! -
> csv3Tab was an interim, deeply buggy attempt)
>
> (It's in
> http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )
>
> It *looks* from that thread (
> http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as
> though this case had been discussed, and the re-write should properly
> handle it - but I haven't yet had time to try it. My laptop has been
> replaced in the meantime, and I can't find my test stack, and recreating it
> and finding the test data is a bit too much for after 1am:-)
>
> So I'll try it tomorrow; hopefully csv4Tab() will already work for this
> case. If it doesn't, we can try again :-)
>
> -- Alex.
>
> On 16/10/2015 00:34, Richard Gaskin wrote:
>
>> Mike Kerner wrote:
>> > Alex, Richard, etc.
>> >
>> > What do we consider the latest version of the csv parser?  I think I
>> > found a bug in Richard's CSV2Text code, and proposed changes, but he
>> > wanted the discussion to go down over here, first.  Then I noticed
>> > that csv4Text is out over here, which makes 2, I guess, a bit long in
>> > the tooth.
>>
>> The version referred to here as "Richard's" is the famous Tweedly algo,
>> in the middle of this page:
>> 
>>
>> Alex came up with that after a bunch of us here had a long discussion
>> about the many variants of CSV running around, and how stupidly complex
>> they are to parse (see the details in that article).
>>
>> Mike wrote me this afternoon letting me know that there's yet another
>> exception that doesn't seem to be accounted for there:
>>
>>"value","","value"
>>
>> I had thought we'd covered that in the earlier discussion, but perhaps
>> not.
>>
>> So this seems like a good time to once again bring together the best
>> minds in our community (are you listening Alex Tweedly, Geoff Canyon, Mark
>> Weider, Dick Kreisel, and others?) to see if we can revisit CSV parsing and
>> come up with a function that can parse it into tabs efficiently, while
>> taking into account all of the really stupid exceptions that have crept
>> into the world since that really stupid format was first popularized.
>>
>> When we're done I'll update the article, and add even more sarcastic
>> comments about what a really dumb idea it was to have encouraged people to
>> delimit text with a character so frequently appearing in text.
>>
>> --
>>  Richard Gaskin
>>  Fourth World Systems
>>  Software Design and Development for the Desktop, Mobile, and the Web
>>  
>>  ambassa...@fourthworld.com http://www.FourthWorld.com
>>
>>
>> ___
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>>
>
>
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Alex Tweedly

Richard et al.,

sometime after that article, there was a further thread on the use-list. 
Pete Haworth found a case not properly covered by the version on the 
article, and I came up with a revised version (cutely called csv4Tab !! 
- csv3Tab was an interim, deeply buggy attempt)


(It's in 
http://lists.runrev.com/pipermail/use-livecode/2012-May/172275.html )


It *looks* from that thread ( 
http://lists.runrev.com/pipermail/use-livecode/2012-May/172191.html ) as 
though this case had been discussed, and the re-write should properly 
handle it - but I haven't yet had time to try it. My laptop has been 
replaced in the meantime, and I can't find my test stack, and recreating 
it and finding the test data is a bit too much for after 1am:-)


So I'll try it tomorrow; hopefully csv4Tab() will already work for this 
case. If it doesn't, we can try again :-)


-- Alex.

On 16/10/2015 00:34, Richard Gaskin wrote:

Mike Kerner wrote:
> Alex, Richard, etc.
>
> What do we consider the latest version of the csv parser?  I think I
> found a bug in Richard's CSV2Text code, and proposed changes, but he
> wanted the discussion to go down over here, first.  Then I noticed
> that csv4Text is out over here, which makes 2, I guess, a bit long in
> the tooth.

The version referred to here as "Richard's" is the famous Tweedly 
algo, in the middle of this page:



Alex came up with that after a bunch of us here had a long 
discussion about the many variants of CSV running around, and how 
stupidly complex they are to parse (see the details in that article).


Mike wrote me this afternoon letting me know that there's yet another 
exception that doesn't seem to be accounted for there:


   "value","","value"

I had thought we'd covered that in the earlier discussion, but perhaps 
not.


So this seems like a good time to once again bring together the best 
minds in our community (are you listening Alex Tweedly, Geoff Canyon, 
Mark Weider, Dick Kreisel, and others?) to see if we can revisit CSV 
parsing and come up with a function that can parse it into tabs 
efficiently, while taking into account all of the really stupid 
exceptions that have crept into the world since that really stupid 
format was first popularized.


When we're done I'll update the article, and add even more sarcastic 
comments about what a really dumb idea it was to have encouraged 
people to delimit text with a character so frequently appearing in text.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com http://www.FourthWorld.com


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your 
subscription preferences:

http://lists.runrev.com/mailman/listinfo/use-livecode



___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Tim Selander


So, tell us what you really think about .CSV, Richard!  :-)

Tim Selander
Tokyo, Japan

On 15/10/16 8:34, Richard Gaskin wrote:

 stupidly complex
really stupid 
stupid format 
really dumb idea



___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Richard Gaskin

Mike Kerner wrote:
> Alex, Richard, etc.
>
> What do we consider the latest version of the csv parser?  I think I
> found a bug in Richard's CSV2Text code, and proposed changes, but he
> wanted the discussion to go down over here, first.  Then I noticed
> that csv4Text is out over here, which makes 2, I guess, a bit long in
> the tooth.

The version referred to here as "Richard's" is the famous Tweedly algo, 
in the middle of this page:



Alex came up with that after a bunch of us here had a long discussion 
about the many variants of CSV running around, and how stupidly complex 
they are to parse (see the details in that article).


Mike wrote me this afternoon letting me know that there's yet another 
exception that doesn't seem to be accounted for there:


   "value","","value"

I had thought we'd covered that in the earlier discussion, but perhaps not.

So this seems like a good time to once again bring together the best 
minds in our community (are you listening Alex Tweedly, Geoff Canyon, 
Mark Weider, Dick Kreisel, and others?) to see if we can revisit CSV 
parsing and come up with a function that can parse it into tabs 
efficiently, while taking into account all of the really stupid 
exceptions that have crept into the world since that really stupid 
format was first popularized.


When we're done I'll update the article, and add even more sarcastic 
comments about what a really dumb idea it was to have encouraged people 
to delimit text with a character so frequently appearing in text.


--
 Richard Gaskin
 Fourth World Systems
 Software Design and Development for the Desktop, Mobile, and the Web
 
 ambassa...@fourthworld.com  http://www.FourthWorld.com


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2015-10-15 Thread Mike Kerner
Alex, Richard, etc.

What do we consider the latest version of the csv parser?  I think I found
a bug in Richard's CSV2Text code, and proposed changes, but he wanted the
discussion to go down over here, first.  Then I noticed that csv4Text is
out over here, which makes 2, I guess, a bit long in the tooth.
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-16 Thread Alex Tweedly

On 16/05/2012 00:35, Peter Haworth wrote:

Thanks Alex.

I ran the same data though your new handler and it seems to have worked
fine.

There was a recent discussion on some of these corner case issues on the
sqlite list so I'll go grab their test cases and see what happens.

As far as performance, the new handler took approx 2 1/2 times longer than
the CSV3 version on my 48k rows/17 columns dataset, but that's still only
about 1 second so definitely not a concern as mentioned previously.

I tried it out with this new test data. It has the odd characteristic of 
having partially quoted strings within the cell content; I've adjusted 
the script to allow for that (by removing one logic check). I've also 
added a line to add an extra empty item at the end of a line whenever 
the last item is already empty (i.e. to deal with Livecode's method of 
ignoring blank trailing items).


With these changes, csv4Tab() gets same results as the original 
csv2Tab() did, and they fit with what I think is correct for this 
strange data set :-)


Performance is still better than csv2Tab was, but sadly not as quick as 
(the incorrect) csv3Tab was.



function CSV4Tab pData,pcoldelim
   local tNuData -- contains tabbed copy of data
   local tReturnPlaceholder -- replaces cr in field data to avoid line
   --   breaks which would be misread as records;
   local tNuDelim  -- new character to replace the delimiter
   local tStatus, theInsideStringSoFar
   --
   put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
   put numtochar(29) into tNuDelim
   --
   if pcoldelim is empty then put comma into pcoldelim
   -- Normalize line endings:
   replace crlf with cr in pData  -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX

   put "outside" into tStatus
   set the itemdel to quote
   repeat for each item k in pData
      -- put tStatus && k & CR after msg
      switch tStatus

         case "inside"
            put k after theInsideStringSoFar
            put "passedquote" into tStatus
            next repeat

         case "passedquote"
            -- decide if it was a duplicated escapedQuote or a closing quote
            if k is empty then   -- it's a duplicated quote
               put quote after theInsideStringSoFar
               put "inside" into tStatus
               next repeat
            end if
            -- not empty - so we remain inside the cell, though we have left the quoted section
            -- NB this allows for quoted sub-strings within the cell content !!
            replace cr with tReturnPlaceholder in theInsideStringSoFar
            put theInsideStringSoFar after tNuData

         case "outside"
            replace pcoldelim with tNuDelim in k
            -- and deal with the "empty trailing item" issue in Livecode
            replace (tNuDelim & CR) with tNuDelim & tNuDelim & CR in k
            put k after tNuData
            put "inside" into tStatus
            put empty into theInsideStringSoFar
            next repeat
         default
            put "defaulted"
            break
      end switch
   end repeat
   return tNuData
end CSV4Tab
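
To turn the result into genuinely tab-delimited text (a sketch only; tRawCSV and tData are
placeholder variable names), the group-separator and vertical-tab placeholders have to be
mapped back afterwards:

   put CSV4Tab(tRawCSV) into tData
   replace numtochar(29) with tab in tData   -- cell delimiter becomes a real tab
   -- numtochar(11) marks returns that were inside quoted cells; only map it back
   -- at display time, since doing so re-introduces hard line breaks in the record
   replace numtochar(11) with cr in tData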


-- Alex.

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Peter Haworth
Thanks Alex.

I ran the same data though your new handler and it seems to have worked
fine.

There was a recent discussion on some of these corner case issues on the
sqlite list so I'll go grab their test cases and see what happens.

As far as performance, the new handler took approx 2 1/2 times longer than
the CSV3 version on my 48k rows/17 columns dataset, but that's still only
about 1 second so definitely not a concern as mentioned previously.

Pete
lcSQL Software 



On Tue, May 15, 2012 at 3:54 PM, Alex Tweedly  wrote:

> On 15/05/2012 18:26, Bob Sneidar wrote:
>
>>   Another good developer lost to the csv parsing chasm of hell. We
>> won't be hearing from Alex again. ;-)
>>
>>  Don't worry Bob, I'm just a tourist here in the chasm, I'm not moving in
> :-)
>
> Pete - please try this out on your data. AFAICT it should handle all the
> cases discussed here, and has the added benefit of being simpler and
> (slightly) easier to understand. Also, it uses no "global replace"s, so it
> would be much easier to modify it to handle very large files by reading
> bufferfulls at a time.
>
> -- Alex.
>
>  function CSV4Tab pData,pcoldelim
>>local tNuData -- contains tabbed copy of data
>>local tReturnPlaceholder -- replaces cr in field data to avoid line
>>--   breaks which would be misread as records;
>>local tStatus, theInsideStringSoFar
>>--
>>put numtochar(11) into tReturnPlaceholder -- vertical tab as
>> placeholder
>>--
>>if pcoldelim is empty then put comma into pcoldelim
>>-- Normalize line endings:
>>replace crlf with cr in pData  -- Win to UNIX
>>replace numtochar(13) with cr in pData -- Mac to UNIX
>>
>>put "outside" into tStatus
>>set the itemdel to quote
>>repeat for each item k in pData
>>switch tStatus
>>
>>case "inside"
>>put k after theInsideStringSoFar
>>put "passedquote" into tStatus
>>next repeat
>>
>>case "passedquote"
>>-- decide if it was a duplicated escapedQuote or a closing
>> quote
>>if k is empty then   -- it's a duplicated quote
>>put quote after theInsideStringSoFar
>>put "inside" into tStatus
>>next repeat
>>end if
>>-- not empty - so we should have a delimiter here
>>if char 1 of k = pcoldelim or char 1 of k = cr then
>>-- as we expect - we have just left the quoted string
>>replace cr with tReturnPlaceholder in
>> theInsideStringSoFar
>>put theInsideStringSoFar after tNuData
>>-- and then deal with this outside item
>>-- by falling through into the 'outsie' case
>>else
>>put "bad logic"
>>break
>>end if
>>
>>case "outside"
>>replace pcoldelim with numtochar(29) in k
>>put k after tNuData
>>put "inside" into tStatus
>>put empty into theInsideStringSoFar
>>next repeat
>>default
>>put "defaulted"
>>break
>>end switch
>>end repeat
>>return tNuData
>> end CSV4Tab
>>
>>
>
> __**_
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/**mailman/listinfo/use-livecode
>
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Bob Sneidar
hmmm... How are the hotels?

Bob


On May 15, 2012, at 3:54 PM, Alex Tweedly wrote:

> On 15/05/2012 18:26, Bob Sneidar wrote:
>>   Another good developer lost to the csv parsing chasm of hell. We 
>> won't be hearing from Alex again. ;-)
>> 
> Don't worry Bob, I'm just a tourist here in the chasm, I'm not moving in :-)


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Alex Tweedly

On 15/05/2012 18:26, Bob Sneidar wrote:

  Another good developer lost to the csv parsing chasm of hell. We won't 
be hearing from Alex again. ;-)


Don't worry Bob, I'm just a tourist here in the chasm, I'm not moving in :-)

Pete - please try this out on your data. AFAICT it should handle all the 
cases discussed here, and has the added benefit of being simpler and 
(slightly) easier to understand. Also, it uses no "global replace"s, so 
it would be much easier to modify it to handle very large files by 
reading bufferfulls at a time.


-- Alex.


function CSV4Tab pData,pcoldelim
   local tNuData -- contains tabbed copy of data
   local tReturnPlaceholder -- replaces cr in field data to avoid line
   --   breaks which would be misread as records;
   local tStatus, theInsideStringSoFar
   --
   put numtochar(11) into tReturnPlaceholder -- vertical tab as placeholder
   --
   if pcoldelim is empty then put comma into pcoldelim
   -- Normalize line endings:
   replace crlf with cr in pData  -- Win to UNIX
   replace numtochar(13) with cr in pData -- Mac to UNIX

   put "outside" into tStatus
   set the itemdel to quote
   repeat for each item k in pData
      switch tStatus

         case "inside"
            put k after theInsideStringSoFar
            put "passedquote" into tStatus
            next repeat

         case "passedquote"
            -- decide if it was a duplicated escapedQuote or a closing quote
            if k is empty then   -- it's a duplicated quote
               put quote after theInsideStringSoFar
               put "inside" into tStatus
               next repeat
            end if
            -- not empty - so we should have a delimiter here
            if char 1 of k = pcoldelim or char 1 of k = cr then
               -- as we expect - we have just left the quoted string
               replace cr with tReturnPlaceholder in theInsideStringSoFar
               put theInsideStringSoFar after tNuData
               -- and then deal with this outside item
               -- by falling through into the 'outside' case
            else
               put "bad logic"
               break
            end if

         case "outside"
            replace pcoldelim with numtochar(29) in k
            put k after tNuData
            put "inside" into tStatus
            put empty into theInsideStringSoFar
            next repeat
         default
            put "defaulted"
            break
      end switch
   end repeat
   return tNuData
end CSV4Tab




___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Peter Haworth
Thanks for everyone's kind thoughts in this time of turmoil.  I wish I had
a choice but I don't so I'll just keep on bearing the csv cross of shame.

Pete
lcSQL Software 



On Tue, May 15, 2012 at 12:41 PM, Mark Wieder wrote:

> Bob-
>
> Tuesday, May 15, 2012, 10:26:41 AM, you wrote:
>
> >  Another good developer lost to the csv parsing chasm of
> > hell. We won't be hearing from Alex again. ;-)
>
> Alas, I fear Pete is following down that lonesome road. It's too bad,
> they were such nice members of the community - I'll quite miss them.
>
> --
> -Mark Wieder
>  mwie...@ahsoftware.net
>
>
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Mark Wieder
Bob-

Tuesday, May 15, 2012, 10:26:41 AM, you wrote:

>  Another good developer lost to the csv parsing chasm of
> hell. We won't be hearing from Alex again. ;-)

Alas, I fear Pete is following down that lonesome road. It's too bad,
they were such nice members of the community - I'll quite miss them.

-- 
-Mark Wieder
 mwie...@ahsoftware.net


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Bob Sneidar
That is a perfect case example of why CSV parsers can never be perfect. 
Unescaped delimiters in field contents should never have been allowed when they 
came up with the "standard" for CSV files. 

Bob


On May 15, 2012, at 10:51 AM, Peter Haworth wrote:

> I'll probably have to implement some sort of mechanism for reading in a
> given number of lines.  BUT..  a carriage return in the middle of a quoted
> cell will be taken by the read for x lines command to be the end of a line
> so I could end up with a partial line in my read buffer.


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Peter Haworth
Thanks Alex, all good points.

I'm still trying to figure out why the program that created the csv file
used this problematic string since it only happened for one cell - all
other empty cells simply had two consecutive commas. Nevertheless, the
other cases you cited are definitely valid so I guess the function will
need to handle them.

As for performance, it's obviously good to do the parsing as efficiently as
possible but my use of the function is to use its output to issue INSERT
statements against an sqlite database.  So we're talking milliseconds for
the parsing vs seconds (or maybe even minutes depending on how much data is
involved) for the INSERT command.  I'd be fine with the parsing taking
longer to handle more corner cases.

Not really anything to do with the parsing but I'm also facing another
issue in this context and that's csv files that are too large to read
completely into memory in one go.  I have one guy who wants to import a 44
gigabyte file!

I'll probably have to implement some sort of mechanism for reading in a
given number of lines.  BUT..  a carriage return in the middle of a quoted
cell will be taken by the read for x lines command to be the end of a line
so I could end up with a partial line in my read buffer.

I may end up just declaring a maximum file size in the documentation and
leaving it up to the user to break up the file into multiple smaller files.
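
One possible direction, offered only as a rough sketch (the handler names importBigCSV and
quoteCount are made up, and a real version would want a bigger read buffer for speed): read a
physical line at a time and only treat a return as a record boundary once the number of quote
characters seen so far is even, i.e. the read did not stop inside a quoted cell.

function quoteCount pText
   -- count the quote characters in pText
   local tCount
   put 0 into tCount
   repeat for each char c in pText
      if c is quote then add 1 to tCount
   end repeat
   return tCount
end quoteCount

on importBigCSV pPath
   local tRecord, tQuotes
   put 0 into tQuotes
   open file pPath for read
   repeat forever
      read from file pPath until return   -- one physical line, including the return
      if it is empty and the result is "eof" then exit repeat
      put it after tRecord
      add quoteCount(it) to tQuotes
      if tQuotes mod 2 = 0 then
         -- an even quote count means this return really ends a CSV record,
         -- so tRecord can now be handed to the parser / the INSERT code
         put 0 into tQuotes
         put empty into tRecord
      end if
      -- odd count: the return we just read was inside a quoted cell, keep reading
   end repeat
   close file pPath
end importBigCSV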

Thanks for your help on this Alex, much appreciated.


Pete
lcSQL Software 



On Tue, May 15, 2012 at 10:02 AM, Alex Tweedly  wrote:

> Unfortunately, that's not enough to fix it, Peter.
>
> The problem case you have identified is where the CSV exporter has decided
> to quote even empty cells. This wasn't covered in the original samples, or
> in any cases I've had to deal with.
>
> Your workaround uses the sequence comma & quote & quote & comma to
> attempt to identify this case - but that only identifies it when it occurs
> in the "interior" cells within a record (line). You'd need to extend it to
> also cover the first cell in the line -
>  i.e. cr & quote & quote & comma
> and the last cell on the line
>  i.e. comma & quote & quote & cr
> and even the *only* cell on the line
>  i.e. cr & quote & quote & cr
>
> and then subsequently un-replace each of those appropriately.
>
> BUT - there's an even worse problem - any of these sequences *can* occur
> within a quoted string - e.g.   abc,"this cell contains an escaped quote
> ,"", within it", another cell
>
> Basically - the original idea ONLY works if the only time two quotes
> appear as consecutive characters is as an escaped quote within a quoted
> cell.(hmmm - that means there is another nasty corner case - where the
> escaped quote appears as the first character within a quoted cell, e.g.
> abc,"""quoted string""",def !!)
>
> Fixing this is going to require checking for the doubled quote and acting
> differently within the loop that alternates between 'inside' and 'outside'
> quoted cells; and of course that alternation depends on the discovery of
> quotes (and hence needs to look-ahead at subsequent characters to detect
> the doubled cases.
>
> I'll have a go at re-writing it using that method - but it is basically a
> re-write from scratch, so it may take an hour or two to make sure I've got
> all the cases covered (and I don't yet have any prediction about the
> performance).
>
> If you could send me your test data off-list that would be helpful.
>
> Thanks
> -- Alex.
>
>
> On 15/05/2012 02:00, Peter Haworth wrote:
>
>> Hi Alex,
>> Just to clarify, this was two double quotes with a comma right before
>> and right after them, not an escaped double quote in the middle of string.
>>
>> I've made a fix to this which works, subject to your approval
>>
>> I changed the line:
>>
>> replace quote & quote with tEscapedQuotePlaceholder in pData
>>
>>
>> to these three lines:
>>
>>
>> replace comma & quote & quote & comma with numToChar(31) in pData
>>
>> replace quote & quote with tEscapedQuotePlaceholder in pData
>>
>> replace numToChar(31) with comma & quote & quote & comma in pData
>>
>>
>>
>> That seems to have fixed it.
>>
>>
>> Pete
>> lcSQL Software
>>
>>
>>
>>
>> On Mon, May 14, 2012 at 2:50 PM, Peter Haworth  wrote:
>>
>>  However, I have found another corner case and that is two consecutive
>>> double quote characters with no intervening characters.  I'm still
>>> checking
>>> into it for sure, but it looks like what happens with that after running
>>> it
>>> through your function is a single quote character.  Any thoughts on that?
>>>
>> __**_
>>
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>> http://lists.runrev.com/**mailman/listinfo/use-livecode
>>
>>
>
> __**_
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscrib

Re: CSV again.

2012-05-15 Thread Bob Sneidar
 Another good developer lost to the csv parsing chasm of hell. We won't 
be hearing from Alex again. ;-)

Bob


On May 15, 2012, at 10:02 AM, Alex Tweedly wrote:

> Unfortunately, that's not enough to fix it, Peter.
> 
> The problem case you have identified is where the CSV exporter has decided to 
> quote even empty cells. This wasn't covered in the original samples, or in 
> any cases I've had to deal with.
> 
> Your workaround uses the sequence comma & quote & quote & comma to attempt 
> to identify this case - but that only identifies it when it occurs in the 
> "interior" cells within a record (line). You'd need to extend it to also 
> cover the first cell in the line -
>  i.e. cr & quote & quote & comma
> and the last cell on the line
>  i.e. comma & quote & quote & cr
> and even the *only* cell on the line
>  i.e. cr & quote & quote & cr
> 
> and then subsequently un-replace each of those appropriately.
> 
> BUT - there's an even worse problem - any of these sequences *can* occur 
> within a quoted string - e.g.   abc,"this cell contains an escaped quote ,"", 
> within it", another cell
> 
> Basically - the original idea ONLY works if the only time two quotes appear 
> as consecutive characters is as an escaped quote within a quoted cell.
> (hmmm - that means there is another nasty corner case - where the escaped 
> quote appears as the first character within a quoted cell, e.g. abc,"""quoted 
> string""",def !!)
> 
> Fixing this is going to require checking for the doubled quote and acting 
> differently within the loop that alternates between 'inside' and 'outside' 
> quoted cells; and of course that alternation depends on the discovery of 
> quotes (and hence needs to look-ahead at subsequent characters to detect the 
> doubled cases.
> 
> I'll have a go at re-writing it using that method - but it is basically a 
> re-write from scratch, so it may take an hour or two to make sure I've got 
> all the cases covered (and I don't yet have any prediction about the 
> performance).
> 
> If you could send me your test data off-list that would be helpful.
> 
> Thanks
> -- Alex.
> 
> On 15/05/2012 02:00, Peter Haworth wrote:
>> Hi Alex,
>> Just to clarify, this was two double quotes with a comma right before
>> and right after them, not an escaped double quote in the middle of string.
>> 
>> I've made a fix to this which works, subject to your approval
>> 
>> I changed the line:
>> 
>> replace quote & quote with tEscapedQuotePlaceholder in pData
>> 
>> 
>> to these three lines:
>> 
>> 
>> replace comma & quote & quote & comma with numToChar(31) in pData
>> 
>> replace quote & quote with tEscapedQuotePlaceholder in pData
>> 
>> replace numToChar(31) with comma & quote & quote & comma in pData
>> 
>> 
>> That seems to have fixed it.
>> 
>> 
>> Pete
>> lcSQL Software
>> 
>> 
>> 
>> On Mon, May 14, 2012 at 2:50 PM, Peter Haworth  wrote:
>> 
>>> However, I have found another corner case and that is two consecutive
>>> double quote characters with no intervening characters.  I'm still checking
>>> into it for sure, but it looks like what happens with that after running it
>>> through your function is a single quote character.  Any thoughts on that?
>> ___
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your subscription 
>> preferences:
>> http://lists.runrev.com/mailman/listinfo/use-livecode
>> 
> 
> 
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your subscription 
> preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode


___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-15 Thread Alex Tweedly

Unfortunately, that's not enough to fix it, Peter.

The problem case you have identified is where the CSV exporter has 
decided to quote even empty cells. This wasn't covered in the original 
samples, or in any cases I've had to deal with.


Your workaround uses the sequence comma & quote & quote & comma to 
attempt to identify this case - but that only identifies it when it 
occurs in the "interior" cells within a record (line). You'd need to 
extend it to also cover the first cell in the line -

  i.e. cr & quote & quote & comma
and the last cell on the line
  i.e. comma & quote & quote & cr
and even the *only* cell on the line
  i.e. cr & quote & quote & cr

and then subsequently un-replace each of those appropriately.

BUT - there's an even worse problem - any of these sequences *can* occur 
within a quoted string - e.g.   abc,"this cell contains an escaped quote 
,"", within it", another cell


Basically - the original idea ONLY works if the only time two quotes 
appear as consecutive characters is as an escaped quote within a quoted 
cell.(hmmm - that means there is another nasty corner case - where 
the escaped quote appears as the first character within a quoted cell, 
e.g. abc,"""quoted string""",def !!)


Fixing this is going to require checking for the doubled quote and 
acting differently within the loop that alternates between 'inside' and 
'outside' quoted cells; and of course that alternation depends on the 
discovery of quotes (and hence needs to look ahead at subsequent 
characters to detect the doubled cases).
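
To see the idea in miniature (a sketch of the underlying observation only, not the eventual 
function): once the itemDelimiter is set to quote, the doubled quote inside a quoted cell 
shows up as an empty item, which is what the look-ahead can key on.

on mouseUp
   local tSample, tParts
   -- tSample is   abc,"x""y",def   i.e. a quoted cell containing an escaped quote
   put "abc," & quote & "x" & quote & quote & "y" & quote & ",def" into tSample
   set the itemDelimiter to quote
   repeat for each item k in tSample
      put k & return after tParts
   end repeat
   put tParts   -- the items are:  abc,  /  x  /  (empty)  /  y  /  ,def
end mouseUp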


I'll have a go at re-writing it using that method - but it is basically 
a re-write from scratch, so it may take an hour or two to make sure I've 
got all the cases covered (and I don't yet have any prediction about the 
performance).


If you could send me your test data off-list that would be helpful.

Thanks
-- Alex.

On 15/05/2012 02:00, Peter Haworth wrote:

Hi Alex,
Just to clarify, this was two double quotes with a comma right before
and right after them, not an escaped double quote in the middle of string.

I've made a fix to this which works, subject to your approval

I changed the line:

replace quote & quote with tEscapedQuotePlaceholder in pData


to these three lines:


replace comma & quote & quote & comma with numToChar(31) in pData

replace quote & quote with tEscapedQuotePlaceholder in pData

replace numToChar(31) with comma & quote & quote & comma in pData


That seems to have fixed it.


Pete
lcSQL Software



On Mon, May 14, 2012 at 2:50 PM, Peter Haworth  wrote:


However, I have found another corner case and that is two consecutive
double quote characters with no intervening characters.  I'm still checking
into it for sure, but it looks like what happens with that after running it
through your function is a single quote character.  Any thoughts on that?

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode




___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-14 Thread Peter Haworth
Hi Alex,
Just to clarify, this was two double quotes with a comma right before
and right after them, not an escaped double quote in the middle of string.

I've made a fix to this which works, subject to your approval

I changed the line:

replace quote & quote with tEscapedQuotePlaceholder in pData


to these three lines:


replace comma & quote & quote & comma with numToChar(31) in pData

replace quote & quote with tEscapedQuotePlaceholder in pData

replace numToChar(31) with comma & quote & quote & comma in pData


That seems to have fixed it.


Pete
lcSQL Software 



On Mon, May 14, 2012 at 2:50 PM, Peter Haworth  wrote:

> However, I have found another corner case and that is two consecutive
> double quote characters with no intervening characters.  I'm still checking
> into it for sure, but it looks like what happens with that after running it
> through your function is a single quote character.  Any thoughts on that?
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-14 Thread Alex Tweedly
Yeah, the "training empty item" problem has been much discussed, and 
there are good reasons for keeping it as it is (even apart from the need 
to not break existing code).


In similar circumstances, I've done

   replace (comma & CR) with (comma & space & CR) in tVariable

but in your case, even a space may not be exactly the same as totally empty.

Could you replace the empty trailing item with a quoted item ?
i.e.
   replace (comma & CR) with (comma & quote & quote & CR) in tVariable
without any unpleasant side-effects ?

-- Alex.

On 14/05/2012 21:00, Peter Haworth wrote:

I've just been checking out Alex's new csv parser and it is indeed much
faster than the original, closer to 50% than 40% in my test case.

However, I've also run into a Livecode issue while doing all this.  This
has come up before in the context of what LC thinks is a line, there's a
similar issue/confusion/whatever with items.

Let's say you have a string "1,2,3,4,5,6" - LC thinks there are 6 items in
it, no problem

Now change the string to "1,2,3,4,5,6," (note the trailing comma) - LC
still thinks there are 6 items in that string.

So to LC, "1,2,3,4,5,6" and "1,2,3,4,5,6," are equivalent in terms of the
number of items in them.  In the context of parsing csv files, they
definitely are not.

Pete
lcSQL Software



On Mon, May 7, 2012 at 4:30 PM, Alex Tweedly  wrote:


Some years ago, this list discussed the difficulties of parsing
comma-separated-value file format; Richard Gaskin has a great article about
it at 
http://www.fourthworld.com/embassy/articles/csv-must-die.html

Following that discussion, I came up with some code to parse CSV in
Livecode which was significantly faster than the straightforwards methods
(quoted in the above article). At the time, I put that speed gain down to
two factors

1. a way of looking at the problem "sideways" that enables a different
approach
2. a 'clever' use of split + array access

Recently the topic came up again, and I looked at the code again; I now
realize that in fact the speed gain came entirely from the first of those
two factors, and using split + arrays was not helpful. Livecode's chunk
handling is (in this case) faster than using arrays (my only excuse is that
I was new to Livecode, and so I was using techniques I was familiar with
from other languages). So I revised the code to use chunk handling rather
than split+arrays, and the resulting code runs about 40% faster, with the
added benefit of being slightly easier to read and understand.  The only
slightly mind-bending feature of the new code is the use of

set the lineDelimiter to quote
repeat for each line k in pData 

I find it hard to think about "lines" that aren't actually lines :-)

So - for anyone who needs or wants more speed, here's the code

  function CSV3Tab pData,pcoldelim

  local tNuData -- contains tabbed copy of data
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  --   breaks which would be misread as records;
  --   replaced later during display
  local tEscapedQuotePlaceholder -- used for keeping track of quotes
  --   in data
  local tInQuotedText -- flag set while reading data between quotes
  local tInsideQuoted, k
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as
  --   placeholder
  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
  --   distinction between quotes in data and those
  --   used in delimiters
  --
  if pcoldelim is empty then put comma into pcoldelim
  -- Normalize line endings:
  replace crlf with cr in pData  -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX
  --
  -- Put placeholder in escaped quote (non-delimiter) chars:
  replace ("\""e) with tEscapedQuotePlaceholder in pData
  replace quote"e with tEscapedQuotePlaceholder in pData
  --
  put space before pData   -- to avoid ambiguity of starting context
  put False into tInsideQuoted
  set the linedel to quote
  repeat for each line k in pData
if (tInsideQuoted) then
  replace cr with tReturnPlaceholder in k
  put k after tNuData
  put False into tInsideQuoted
else
  replace pcoldelim with numtochar(29) in k
  put k after tNuData
  put true into tInsideQuoted
end if
  end repeat
  --
  delete char 1 of tNuData -- remove the leading space
  replace tEscapedQuotePlaceholder with quote in tNuData
  return tNuData
end CSV3Tab



-- Alex.

__**_
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your
subscription preferences:
http://lists.runrev.com/**mailman/listinfo/use-livecode



Re: CSV again.

2012-05-14 Thread Bob Sneidar
This has been discussed before. The last delimiter is not considered when 
parsing lines, items and words. The contents of the "object" are the only thing 
that is considered, so if nothing comes after the last delimiter, LC says, 
"Nothing to see here. Moving along..."

That being said, "item1,,item2" results in 3 items. Go figure. So the rule is 
empty words,items,lines are counted unless they are the last word,item,line. 
See? Simple. ;-) 

Bob


On May 14, 2012, at 1:00 PM, Peter Haworth wrote:

> I've just been checking out Alex's new csv parser and it is indeed much
> faster than the original, closer to 50% than 40% in my test case.
> 
> However, I've also run into a Livecode issue while doing all this.  This
> has come up before in the context of what LC thinks is a line, there's a
> similar issue/confusion/whatever with items.
> 
> Let's say you have a string "1,2,3,4,5,6" - LC thinks there are 6 items in
> it, no problem
> 
> Now change the string to "1,2,3,4,5,6," (note the trailing comma) - LC
> still thinks there are 6 items in that string.
> 
> So to LC, "1,2,3,4,5,6" and "1,2,3,4,5,6," are equivalent in terms of the
> number of items in them.  In the context of parsing csv files, they
> definitely are not.
> 
> Pete
> lcSQL Software 
> 
> 
> 
> On Mon, May 7, 2012 at 4:30 PM, Alex Tweedly  wrote:
> 
>> Some years ago, this list discussed the difficulties of parsing
>> comma-separated-value file format; Richard Gaskin has a great article about
>> it at 
>> http://www.fourthworld.com/embassy/articles/csv-must-die.html
>> 
>> Following that discussion, I came up with some code to parse CSV in
>> Livecode which was significantly faster than the straightforwards methods
>> (quoted in the above article). At the time, I put that speed gain down to
>> two factors
>> 
>> 1. a way of looking at the problem "sideways" that enables a different
>> approach
>> 2. a 'clever' use of split + array access
>> 
>> Recently the topic came up again, and I looked at the code again; I now
>> realize that in fact the speed gain came entirely from the first of those
>> two factors, and using split + arrays was not helpful. Livecode's chunk
>> handling is (in this case) faster than using arrays (my only excuse is that
>> I was new to Livecode, and so I was using techniques I was familiar with
>> from other languages). So I revised the code to use chunk handling rather
>> than split+arrays, and the resulting code runs about 40% faster, with the
>> added benefit of being slightly easier to read and understand.  The only
>> slightly mind-bending feature of the new code is the use of
>> 
>>   set the lineDelimiter to quote
>>   repeat for each line k in pData 
>> 
>> I find it hard to think about "lines" that aren't actually lines :-)
>> 
>> So - for anyone who needs or wants more speed, here's the code
>> 
>> function CSV3Tab pData,pcoldelim
>>> local tNuData -- contains tabbed copy of data
>>> local tReturnPlaceholder -- replaces cr in field data to avoid line
>>> --   breaks which would be misread as records;
>>> --   replaced later during display
>>> local tEscapedQuotePlaceholder -- used for keeping track of quotes
>>> --   in data
>>> local tInQuotedText -- flag set while reading data between quotes
>>> local tInsideQuoted, k
>>> --
>>> put numtochar(11) into tReturnPlaceholder -- vertical tab as
>>> --   placeholder
>>> put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
>>> --   distinction between quotes in data and those
>>> --   used in delimiters
>>> --
>>> if pcoldelim is empty then put comma into pcoldelim
>>> -- Normalize line endings:
>>> replace crlf with cr in pData  -- Win to UNIX
>>> replace numtochar(13) with cr in pData -- Mac to UNIX
>>> --
>>> -- Put placeholder in escaped quote (non-delimiter) chars:
>>> replace ("\" & quote) with tEscapedQuotePlaceholder in pData
>>> replace quote & quote with tEscapedQuotePlaceholder in pData
>>> --
>>> put space before pData   -- to avoid ambiguity of starting context
>>> put False into tInsideQuoted
>>> set the linedel to quote
>>> repeat for each line k in pData
>>>   if (tInsideQuoted) then
>>> replace cr with tReturnPlaceholder in k
>>> put k after tNuData
>>> put False into tInsideQuoted
>>>   else
>>> replace pcoldelim with numtochar(29) in k
>>> put k after tNuData
>>> put true into tInsideQuoted
>>>   end if
>>> end repeat
>>> --
>>> delete char 1 of tNuData -- remove the leading space
>>> replace tEscapedQuotePlaceholder with quote in tNuData
>>> return tNuData
>>> end CSV3Tab
>>> 
>>> 
>> -- Alex.
>> 
>> ___
>> use-livecode mailing list
>> use-livecode@lists.runrev.com
>> Please visit this url to subscribe, unsubscribe and manage your
>> subscription preferences:
>

Re: CSV again.

2012-05-14 Thread Peter Haworth
I've just been checking out Alex's new csv parser and it is indeed much
faster than the original, closer to 50% than 40% in my test case.

However, I've also run into a Livecode issue while doing all this.  This
has come up before in the context of what LC thinks is a line, there's a
similar issue/confusion/whatever with items.

Let's say you have a string "1,2,3,4,5,6" - LC thinks there are 6 items in
it, no problem

Now change the string to "1,2,3,4,5,6," (note the trailing comma) - LC
still thinks there are 6 items in that string.

So to LC, "1,2,3,4,5,6" and "1,2,3,4,5,6," are equivalent in terms of the
number of items in them.  In the context of parsing csv files, they
definitely are not.
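
A minimal sketch of the behaviour (a hypothetical button script; the
variable names are illustrative only, not from any code in this thread):

on mouseUp
   local tFull, tTrailing
   put the number of items of "1,2,3,4,5,6" into tFull       -- 6
   put the number of items of "1,2,3,4,5,6," into tTrailing  -- also 6
   put tFull && tTrailing into msg  -- shows "6 6": the trailing comma does
   --   not create an empty seventh item, so a CSV record that ends in an
   --   empty field silently loses that field if you just count items
end mouseUp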

Pete
lcSQL Software 



On Mon, May 7, 2012 at 4:30 PM, Alex Tweedly  wrote:

> Some years ago, this list discussed the difficulties of parsing
> comma-separated-value file format; Richard Gaskin has a great article about
> it at 
> http://www.fourthworld.com/embassy/articles/csv-must-die.html
>
> Following that discussion, I came up with some code to parse CSV in
> Livecode which was significantly faster than the straightforward methods
> (quoted in the above article). At the time, I put that speed gain down to
> two factors
>
> 1. a way of looking at the problem "sideways" that enables a different
> approach
> 2. a 'clever' use of split + array access
>
> Recently the topic came up again, and I looked at the code again; I now
> realize that in fact the speed gain came entirely from the first of those
> two factors, and using split + arrays was not helpful. Livecode's chunk
> handling is (in this case) faster than using arrays (my only excuse is that
> I was new to Livecode, and so I was using techniques I was familiar with
> from other languages). So I revised the code to use chunk handling rather
> than split+arrays, and the resulting code runs about 40% faster, with the
> added benefit of being slightly easier to read and understand.  The only
> slightly mind-bending feature of the new code is the use of
>
>set the lineDelimiter to quote
>repeat for each line k in pData 
>
> I find it hard to think about "lines" that aren't actually lines :-)
>
> So - for anyone who needs or wants more speed, here's the code
>
>  function CSV3Tab pData,pcoldelim
>>  local tNuData -- contains tabbed copy of data
>>  local tReturnPlaceholder -- replaces cr in field data to avoid line
>>  --   breaks which would be misread as records;
>>  --   replaced later during display
>>  local tEscapedQuotePlaceholder -- used for keeping track of quotes
>>  --   in data
>>  local tInQuotedText -- flag set while reading data between quotes
>>  local tInsideQuoted, k
>>  --
>>  put numtochar(11) into tReturnPlaceholder -- vertical tab as
>>  --   placeholder
>>  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
>>  --   distinction between quotes in data and those
>>  --   used in delimiters
>>  --
>>  if pcoldelim is empty then put comma into pcoldelim
>>  -- Normalize line endings:
>>  replace crlf with cr in pData  -- Win to UNIX
>>  replace numtochar(13) with cr in pData -- Mac to UNIX
>>  --
>>  -- Put placeholder in escaped quote (non-delimiter) chars:
>>  replace ("\" & quote) with tEscapedQuotePlaceholder in pData
>>  replace (quote & quote) with tEscapedQuotePlaceholder in pData
>>  --
>>  put space before pData   -- to avoid ambiguity of starting context
>>  put False into tInsideQuoted
>>  set the linedel to quote
>>  repeat for each line k in pData
>>if (tInsideQuoted) then
>>  replace cr with tReturnPlaceholder in k
>>  put k after tNuData
>>  put False into tInsideQuoted
>>else
>>  replace pcoldelim with numtochar(29) in k
>>  put k after tNuData
>>  put true into tInsideQuoted
>>end if
>>  end repeat
>>  --
>>  delete char 1 of tNuData -- remove the leading space
>>  replace tEscapedQuotePlaceholder with quote in tNuData
>>  return tNuData
>> end CSV3Tab
>>
>>
> -- Alex.
>
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


Re: CSV again.

2012-05-07 Thread Peter Haworth
Thanks for this Alex!

For list members, I am indebted to Alex for his original csv parsing code
which I used, with his permission, in my SQLiteAdmin application.

I will check out this code and see how it compares to the code currently
embedded in SQLiteAdmin.

Pete
lcSQL Software 



On Mon, May 7, 2012 at 4:30 PM, Alex Tweedly  wrote:

> Some years ago, this list discussed the difficulties of parsing
> comma-separated-value file format; Richard Gaskin has a great article about
> it at 
> http://www.fourthworld.com/embassy/articles/csv-must-die.html
>
> Following that discussion, I came up with some code to parse CSV in
> Livecode which was significantly faster than the straightforward methods
> (quoted in the above article). At the time, I put that speed gain down to
> two factors
>
> 1. a way of looking at the problem "sideways" that enables a different
> approach
> 2. a 'clever' use of split + array access
>
> Recently the topic came up again, and I looked at the code again; I now
> realize that in fact the speed gain came entirely from the first of those
> two factors, and using split + arrays was not helpful. Livecode's chunk
> handling is (in this case) faster than using arrays (my only excuse is that
> I was new to Livecode, and so I was using techniques I was familiar with
> from other languages). So I revised the code to use chunk handling rather
> than split+arrays, and the resulting code runs about 40% faster, with the
> added benefit of being slightly easier to read and understand.  The only
> slightly mind-bending feature of the new code is the use of
>
>set the lineDelimiter to quote
>repeat for each line k in pData 
>
> I find it hard to think about "lines" that aren't actually lines :-)
>
> So - for anyone who needs or wants more speed, here's the code
>
>  function CSV3Tab pData,pcoldelim
>>  local tNuData -- contains tabbed copy of data
>>  local tReturnPlaceholder -- replaces cr in field data to avoid line
>>  --   breaks which would be misread as records;
>>  --   replaced later during display
>>  local tEscapedQuotePlaceholder -- used for keeping track of quotes
>>  --   in data
>>  local tInQuotedText -- flag set while reading data between quotes
>>  local tInsideQuoted, k
>>  --
>>  put numtochar(11) into tReturnPlaceholder -- vertical tab as
>>  --   placeholder
>>  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
>>  --   distinction between quotes in data and those
>>  --   used in delimiters
>>  --
>>  if pcoldelim is empty then put comma into pcoldelim
>>  -- Normalize line endings:
>>  replace crlf with cr in pData  -- Win to UNIX
>>  replace numtochar(13) with cr in pData -- Mac to UNIX
>>  --
>>  -- Put placeholder in escaped quote (non-delimiter) chars:
>>  replace ("\" & quote) with tEscapedQuotePlaceholder in pData
>>  replace (quote & quote) with tEscapedQuotePlaceholder in pData
>>  --
>>  put space before pData   -- to avoid ambiguity of starting context
>>  put False into tInsideQuoted
>>  set the linedel to quote
>>  repeat for each line k in pData
>>if (tInsideQuoted) then
>>  replace cr with tReturnPlaceholder in k
>>  put k after tNuData
>>  put False into tInsideQuoted
>>else
>>  replace pcoldelim with numtochar(29) in k
>>  put k after tNuData
>>  put true into tInsideQuoted
>>end if
>>  end repeat
>>  --
>>  delete char 1 of tNuData -- remove the leading space
>>  replace tEscapedQuotePlaceholder with quote in tNuData
>>  return tNuData
>> end CSV3Tab
>>
>>
> -- Alex.
>
> ___
> use-livecode mailing list
> use-livecode@lists.runrev.com
> Please visit this url to subscribe, unsubscribe and manage your
> subscription preferences:
> http://lists.runrev.com/mailman/listinfo/use-livecode
>
___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode


CSV again.

2012-05-07 Thread Alex Tweedly
Some years ago, this list discussed the difficulties of parsing 
comma-separated-value file format; Richard Gaskin has a great article 
about it at http://www.fourthworld.com/embassy/articles/csv-must-die.html


Following that discussion, I came up with some code to parse CSV in 
Livecode which was significantly faster than the straightforward 
methods (quoted in the above article). At the time, I put that speed 
gain down to two factors


1. a way of looking at the problem "sideways" that enables a different 
approach

2. a 'clever' use of split + array access

Recently the topic came up again, and I looked at the code again; I now 
realize that in fact the speed gain came entirely from the first of 
those two factors, and using split + arrays was not helpful. Livecode's 
chunk handling is (in this case) faster than using arrays (my only 
excuse is that I was new to Livecode, and so I was using techniques I 
was familiar with from other languages). So I revised the code to use 
chunk handling rather than split+arrays, and the resulting code runs 
about 40% faster, with the added benefit of being slightly easier to 
read and understand.  The only slightly mind-bending feature of the new 
code is the use of


set the lineDelimiter to quote
repeat for each line k in pData 

I find it hard to think about "lines" that aren't actually lines :-)
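
To see what those "lines" look like, here is a small illustrative sketch 
(not part of the parser; the sample string and handler are made up for 
this example):

on mouseUp
   local tSample, tChunk
   put "a," & quote & "b,c" & quote & ",d" into tSample  -- the CSV record  a,"b,c",d
   set the lineDelimiter to quote
   repeat for each line tChunk in tSample
      put "[" & tChunk & "]" & return after msg
   end repeat
   -- msg shows [a,] then [b,c] then [,d], one chunk per line;
   -- the chunks alternate outside / inside / outside the quotes, which is
   -- why the loop below flips tInsideQuoted on every pass and only turns
   -- pcoldelim into a field separator in the "outside" chunks
end mouseUp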

So - for anyone who needs or wants more speed, here's the code


function CSV3Tab pData,pcoldelim
  local tNuData -- contains tabbed copy of data
  local tReturnPlaceholder -- replaces cr in field data to avoid line
  --   breaks which would be misread as records;
  --   replaced later during display
  local tEscapedQuotePlaceholder -- used for keeping track of quotes
  --   in data
  local tInQuotedText -- flag set while reading data between quotes
  local tInsideQuoted, k
  --
  put numtochar(11) into tReturnPlaceholder -- vertical tab as
  --   placeholder
  put numtochar(2)  into tEscapedQuotePlaceholder -- used to simplify
  --   distinction between quotes in data and those
  --   used in delimiters
  --
  if pcoldelim is empty then put comma into pcoldelim
  -- Normalize line endings:
  replace crlf with cr in pData  -- Win to UNIX
  replace numtochar(13) with cr in pData -- Mac to UNIX
  --
  -- Put placeholder in escaped quote (non-delimiter) chars:
  replace ("\" & quote) with tEscapedQuotePlaceholder in pData
  replace (quote & quote) with tEscapedQuotePlaceholder in pData
  --
  put space before pData   -- to avoid ambiguity of starting context
  put False into tInsideQuoted
  set the linedel to quote
  repeat for each line k in pData
if (tInsideQuoted) then
  replace cr with tReturnPlaceholder in k
  put k after tNuData
  put False into tInsideQuoted
else
  replace pcoldelim with numtochar(29) in k
  put k after tNuData
  put true into tInsideQuoted
end if
  end repeat
  --
  delete char 1 of tNuData -- remove the leading space
  replace tEscapedQuotePlaceholder with quote in tNuData
  return tNuData
end CSV3Tab
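
For reference, a minimal sketch of consuming the result (the file name, 
handler and variable names are hypothetical, and it assumes CSV3Tab is in 
the message path). CSV3Tab leaves numtochar(29) between fields and 
numtochar(11) where a quoted field contained a line break, so a caller 
needs to split on those and restore the embedded returns for display:

on mouseUp
   local tData, tRecord, tField
   put URL "file:test.csv" into tData           -- hypothetical input file
   put CSV3Tab(tData) into tData
   set the itemDelimiter to numtochar(29)       -- CSV3Tab's field separator
   repeat for each line tRecord in tData        -- one record per cr
      repeat for each item tField in tRecord
         replace numtochar(11) with cr in tField  -- restore embedded line breaks
         -- ... use tField here ...
      end repeat
   end repeat
end mouseUp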



-- Alex.

___
use-livecode mailing list
use-livecode@lists.runrev.com
Please visit this url to subscribe, unsubscribe and manage your subscription 
preferences:
http://lists.runrev.com/mailman/listinfo/use-livecode