Re: [Unicon-group] Question regarding any parser for CSV format files

Steve Wampler Tue, 14 Oct 2008 10:30:34 -0700


I had some fun playing with this (parsing CSV files) [I almost
put it out as a programming challenge but decided most people
wouldn't want to bother...].  The attached code is a generalization
of what Bruce needs, and draws in part from some code he wrote
to handle his case.  Among the generalizations:


- While the field separator defaults to comma, any set of
  characters that doesn't include double quotes or newlines
  can be used to separate fields.

- White space (normally blanks and tabs) surrounding fields
  is removed.  However, if blanks and/or tabs are used as
  separators, then they are treated as separators and not
  as whitespace.

- While normally every separator denotes a field separation,
  you can specify that spans of one or more separators
  denote a single field separation.

- Double quotes can (must) be used to enclose fields that
  contain separator characters or double quotes - and
  those embedded double quotes must be doubled themselves.
  (So, to embed "five" (6 characters long) as a field, the
  field must appear as """five""" (8 characters enclosed in
  double quotes).)
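
To make the quoting rule concrete, here is a minimal sketch of the inverse
operation: producing a field in the form the parser expects.  It is not part
of the attached code; csvQuote is a hypothetical helper name, and the sketch
assumes the separator is given as a cset (defaulting to comma).

```unicon
# Hypothetical helper (not in the attached code): quote one field for
# output, following the doubling rule described in the list above.
procedure csvQuote(field, sep)
    local r
    /sep := ','
    # Fields with no separator and no double quote are written as-is
    if not upto(sep ++ '"', field) then return field
    r := "\""
    field ? {
        # Copy the text before each embedded quote, then emit the quote doubled
        while r ||:= tab(upto('"')) do {
            move(1)
            r ||:= "\"\""
            }
        r ||:= tab(0)       # Whatever follows the last embedded quote
        }
    return r || "\""
end

procedure main()
    write(csvQuote("plain"))        # writes plain
    write(csvQuote("a,b"))          # writes "a,b"
    write(csvQuote("\"five\""))     # writes """five"""
end
```

The last call reproduces the "five" example: the 6-character field becomes
8 characters between the enclosing quotes.  The sketch does not quote fields
merely because they have leading or trailing whitespace, so such whitespace
would still be trimmed when the result is parsed back.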

Also, to make it more fun (for me) to work on, I've used some
features from the UniLib package library:

- a zapPrefix(s, prefix) procedure
- "anonymous functions"

Neither is needed - alternative approaches work just as well
(and you can write "anonymous functions" without using UniLib
support).
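
As one possible illustration of such an alternative (an assumption about what
would work, not the approach used in the attached code), ordinary procedures
can stand in for the anonymous functions, since procedures are values that
can be assigned to a global.  The names skipOne, skipSpan and fields are
hypothetical, and this sketch ignores quoting and whitespace trimming.

```unicon
global spanner

# Non-spanning case: skip exactly one separator character
procedure skipOne(sep)
    return move(1)
end

# Spanning case: skip a run of one or more separator characters
procedure skipSpan(sep)
    return tab(many(sep))
end

# Split s on the cset sep, using whichever skipper is stored in spanner
procedure fields(s, sep)
    local L
    L := []
    s ? {
        repeat {
            put(L, tab(upto(sep) | 0))  # Text up to next separator (or end)
            spanner(sep) | break        # Stop when nothing is left to skip
            }
        }
    return L
end

procedure main(args)
    spanner := if "--span" == !args then skipSpan else skipOne
    every write("'", !fields("a,,,b,", ','), "'")
end
```

Run without arguments, this writes five fields for "a,,,b," (the last one
empty); with --span it writes three.  Both skippers operate on the caller's
scanning environment, because &subject and &pos are still in effect inside a
procedure called from a scanning expression.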


--
Steve Wampler -- [EMAIL PROTECTED]
The gods that smiled on your birth are now laughing out loud.

# <p>
# Process a <b>properly formatted</b> CSV file.
#   You can change the separator from the default <tt>,</tt>
#   to any set of characters using the <tt>--separator=CSET</tt>
#   argument.
# </p>
# <p>
#   Each separator denotes a field separation (e.g. "a,,,b,"
#   contains five fields) unless you give the <tt>--span</tt> argument.
#   If you give the <tt>--span</tt> argument, then the fields
#   are considered to be separated by spans of <i>one or more</i>
#   separators instead of exactly one, and the above input
#   contains three fields instead of five (in both cases, the last
#   field is the empty string).
# </p>
# <p>
#  Assumes that:<br><br>
#  a. Fields containing the separator are enclosed in double
#     quotes.<br>
#  b. Fields containing double quotes have those quotes
#     duplicated with the entire field enclosed in double
#     quotes.<br>
#  c. Leading and trailing whitespace (blanks and tabs) is
#     to be removed unless used as separators.<br>
#  d. Double quotes and newlines are <i>never</i> separators.<br>
#</p>

import Utils

global spanner

procedure main(args)

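    # Allow changing the separator from the default
    sep := zapPrefix(!args, "--separator=")
    # Some characters are illegal as separators.  The \sep guard skips
    #   this step while sep is still null (e.g. no arguments were given);
    #   genCSV then supplies the default separator.
    \sep --:= "\"\n"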

    # An 'anonymous procedure' handles spanning versus non-spanning cases
    spanner := if "--span" == !args then
                   makeProc { repeat { sep := @&source; tab(many(\sep[1])) } }
               else
                   makeProc { repeat { @&source; move(1) } }

    every line := parseCSV(!&input, sep) do {
        # Code to process the array of values from each line goes
        #   here in place of this simple output of each field on
        #   a separate line, enclosed in single quotes.
        every write("\t'",!line,"'")
        }

end

# Produce a list of CSV values from a string
procedure parseCSV(s, sep)
    local A
    every put(A := [], genCSV(s, sep))
    return A
end

# Generate the CSV values from a string.  An improperly formatted string
# terminates the generation of values.
procedure genCSV(s, sep)
    local WS, r
    initial {
        # Default to not treating spans of separators as single separators
        /spanner := makeProc { repeat { @&source; move(1) } }
        }

    /sep := ","
    WS := ' \t\n' -- sep  # Can't include separators in whitespace
                          #  (The '\n' is a trick so there's always
                          #  something at the end of the last field.)

    (s||WS) ? {
        while not pos(0) do {
            tab(many(WS))
            if ="\"" then {    # Quoted string
                r := ""
                while r ||:= 1(tab(upto('"')),move(1)) do {
                    if not ="\"" then break     # End of quoted string
                    r ||:= "\""
                    }
                tab((upto(sep))|0)      
                spanner(sep)
                suspend r
                }
            else suspend trim(tab((upto(sep))|0)\1,WS) do spanner(sep)
            }
        }

end
