Re: CSV module: incorrectly parsed file.
7stud wrote: > On Feb 17, 9:11 pm, 7stud <[EMAIL PROTECTED]> wrote: >> On Feb 17, 7:09 pm, Christopher Barrington-Leigh >> >> >> >> <[EMAIL PROTECTED]> wrote: >>> Here is a file "test.csv" >>> number,name,description,value >>> 1,"wer","tape 2"",5 >>> 1,vvv,"hoohaa",2 >>> I want to convert it to tab-separated without those silly quotes. Note >>> in the second line that a field is 'tape 2"' , ie two inches: there is >>> a double quote in the string. >>> When I use csv module to read this: >>> import sys >>> outf=open(sys.argv[1]+'.tsv','wt') >>> import csv >>> reader=csv.reader(open(sys.argv[1], "rb")) >>> for row in reader: >>> outf.write('\t'.join([rr.strip() for rr in row]) +'\n') >>> it mangles it, messing up the double double-quote. >>> Can anyone help me? How do I use CSV to get it right? >>> Tjhanks! >>> c >> Try this: >> >> infile = open('data.txt') >> outfile = open('outfile.txt', 'w') >> >> for line in infile: >> pieces = line.strip().split(',') >> >> data = [] >> for piece in pieces: >> if piece[0] == '"': >> data.append(piece[1:-2]) >> else: >> data.append(piece) >> >> out_line = '%s\n' % '\t'.join(data) >> outfile.write(out_line) > > Whoops. The line: > > data.append(piece[1:-2]) > > should be: > > data.append(piece[1:-1]) > Even when you have done all this you will still have problems. As Andrew pointed out the form is ambiguous, and you'd just better hope none of your data items look like Nails 2", soldiers for the use of because then you will be completely screwed. So there's a need for a certain amount of visual scrutiny of the data: I would definitely write a validation program first that tries to read the data and catches any exceptions like unmatched quotes or the wrong number of items in a line. If there aren't too many (and there usually aren't) just edit them out of your input data by hand. If this is to be a regular task then you'll have to program to recognize and correct the common error cases. regards Steve -- Steve Holden+1 571 484 6266 +1 800 494 3119 Holden Web LLC http://www.holdenweb.com/ -- http://mail.python.org/mailman/listinfo/python-list
Re: CSV module: incorrectly parsed file.
On Feb 17, 9:11 pm, 7stud <[EMAIL PROTECTED]> wrote: > On Feb 17, 7:09 pm, Christopher Barrington-Leigh > > > > <[EMAIL PROTECTED]> wrote: > > Here is a file "test.csv" > > number,name,description,value > > 1,"wer","tape 2"",5 > > 1,vvv,"hoohaa",2 > > > I want to convert it to tab-separated without those silly quotes. Note > > in the second line that a field is 'tape 2"' , ie two inches: there is > > a double quote in the string. > > > When I use csv module to read this: > > > import sys > > outf=open(sys.argv[1]+'.tsv','wt') > > import csv > > reader=csv.reader(open(sys.argv[1], "rb")) > > for row in reader: > > outf.write('\t'.join([rr.strip() for rr in row]) +'\n') > > > it mangles it, messing up the double double-quote. > > Can anyone help me? How do I use CSV to get it right? > > Tjhanks! > > c > > Try this: > > infile = open('data.txt') > outfile = open('outfile.txt', 'w') > > for line in infile: > pieces = line.strip().split(',') > > data = [] > for piece in pieces: > if piece[0] == '"': > data.append(piece[1:-2]) > else: > data.append(piece) > > out_line = '%s\n' % '\t'.join(data) > outfile.write(out_line) Whoops. The line: data.append(piece[1:-2]) should be: data.append(piece[1:-1]) -- http://mail.python.org/mailman/listinfo/python-list
Re: CSV module: incorrectly parsed file.
On Feb 17, 7:09 pm, Christopher Barrington-Leigh <[EMAIL PROTECTED]> wrote: > Here is a file "test.csv" > number,name,description,value > 1,"wer","tape 2"",5 > 1,vvv,"hoohaa",2 > > I want to convert it to tab-separated without those silly quotes. Note > in the second line that a field is 'tape 2"' , ie two inches: there is > a double quote in the string. > > When I use csv module to read this: > > import sys > outf=open(sys.argv[1]+'.tsv','wt') > import csv > reader=csv.reader(open(sys.argv[1], "rb")) > for row in reader: > outf.write('\t'.join([rr.strip() for rr in row]) +'\n') > > it mangles it, messing up the double double-quote. > Can anyone help me? How do I use CSV to get it right? > Tjhanks! > c Try this: infile = open('data.txt') outfile = open('outfile.txt', 'w') for line in infile: pieces = line.strip().split(',') data = [] for piece in pieces: if piece[0] == '"': data.append(piece[1:-2]) else: data.append(piece) out_line = '%s\n' % '\t'.join(data) outfile.write(out_line) -- http://mail.python.org/mailman/listinfo/python-list
Re: CSV module: incorrectly parsed file.
On Feb 17, 8:09 pm, Christopher Barrington-Leigh <[EMAIL PROTECTED]> wrote: > Here is a file "test.csv" > number,name,description,value > 1,"wer","tape 2"",5 > 1,vvv,"hoohaa",2 > > I want to convert it to tab-separated without those silly quotes. Note > in the second line that a field is 'tape 2"' , ie two inches: there is > a double quote in the string. > What is needed to disambiguate this data is to only accept closing quotes if they are followed by a comma or the end of the line. In pyparsing, you can define your own quoted string format. Here is one solution using pyparsing. At the end, you can extract the data by field name, and print it out however you choose: data = """\ number,name,description,value 1,"wer","tape 2"",5 1,vvv,"hoohaa",2""" from pyparsing import * # very special definition of a quoted string, that ends with a " only if # followed by a , or the end of line quotedString = ('"' + ZeroOrMore(CharsNotIn('"')|('"' + ~FollowedBy(','|lineEnd))) + '"') quotedString.setParseAction(keepOriginalText, removeQuotes) integer = Word(nums).setParseAction(lambda toks:int(toks[0])) value = integer | quotedString | Word(printables.replace(",","")) # first pass, just parse the comma-separated values for line in data.splitlines(): print delimitedList(value).parseString(line) print # now second pass, assign field names using names from first line names = data.splitlines()[0].split(',') def setValueNames(tokens): for k,v in zip(names,tokens): tokens[k] = v lineDef = delimitedList(value).setParseAction(setValueNames) # parse each line, and extract data by field name for line in data.splitlines()[1:]: results = lineDef.parseString(line) print "Desc:", results.description print results.dump() Prints: ['number', 'name', 'description', 'value'] [1, 'wer', 'tape 2"', 5] [1, 'vvv', 'hoohaa', 2] Desc: tape 2" [1, 'wer', 'tape 2"', 5] - description: tape 2" - name: wer - number: 1 - value : 5 Desc: hoohaa [1, 'vvv', 'hoohaa', 2] - description: hoohaa - name: vvv - number: 1 - value : 2 -- Paul -- http://mail.python.org/mailman/listinfo/python-list
Re: CSV module: incorrectly parsed file.
>Here is a file "test.csv" >number,name,description,value >1,"wer","tape 2"",5 >1,vvv,"hoohaa",2 > >I want to convert it to tab-separated without those silly quotes. Note >in the second line that a field is 'tape 2"' , ie two inches: there is >a double quote in the string. The input format is ambiguous - how is the parser to distinguish between a double-quote in the field, and the double-quote that delimits the field? Excel would have written that field as "tape 2""" (it doubles double-quotes that appear within a field). You can turn off the double-double-quote handling by passing "doublequote=False" to the parser, but the results still might not be what you want (because the format is ambiguous). -- Andrew McNamara, Senior Developer, Object Craft http://www.object-craft.com.au/ -- http://mail.python.org/mailman/listinfo/python-list