[issue30034] csv reader chokes on bad quoting in large files

2017-04-15 Thread Keith Erskine

Keith Erskine added the comment:

OK Terry.  Thank you everybody for your thoughts and suggestions.

--

___
Python tracker <rep...@bugs.python.org>
<http://bugs.python.org/issue30034>
___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue30034] csv reader chokes on bad quoting in large files

2017-04-11 Thread Keith Erskine

Keith Erskine added the comment:

I should have said, Peter, an odd number of quotes does not necessarily mean 
the quoting is bad.  For example, a line of:
a,b",c
will parse fine as ['a', 'b"', 'c'].  Figuring out bad quoting is not easy, but 
if we know that there are no multiline fields in the file, then at least the 
parsing can stop at the end of the line.
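A quick check with the csv module itself bears this out (a minimal sketch; io.StringIO just stands in for a real file object):

```python
import csv
import io

line = 'a,b",c'

# The quote count is odd...
print(line.count('"') % 2)  # 1

# ...but the line still parses without error: a double quote in the
# middle of an unquoted field is simply kept literally.
row = next(csv.reader(io.StringIO(line + '\n')))
print(row)  # ['a', 'b"', 'c']
```

So an even-quote-count check is a useful heuristic for spotting *suspect* lines, not a definitive test of bad quoting.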


[issue30034] csv reader chokes on bad quoting in large files

2017-04-11 Thread Keith Erskine

Keith Erskine added the comment:

The csv reader already tolerates some bad CSV - that's what I believe "strict" 
is for - but only in one specific scenario.  My request is to make that 
"strict" attribute a bit more useful.

Thank you for your suggestion, Peter.  I have toyed with the idea of looking 
for an even number of double quotes in each line, but thank you for your neat 
way of encapsulating it.  (I already have to strip null bytes out of the input 
data because they break csv, see issue #27580).
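For what it's worth, my null-byte filter looks roughly like this (a sketch; strip_nulls is my own helper, not part of the csv API):

```python
import csv
import io

def strip_nulls(lines):
    """Remove NUL characters from each line before csv.reader sees it;
    they otherwise make the reader error out (see issue #27580)."""
    for line in lines:
        yield line.replace('\x00', '')

f = io.StringIO('a,\x00b,c\n')  # stand-in for a real file object
print(next(csv.reader(strip_nulls(f))))  # ['a', 'b', 'c']
```

Since csv.reader accepts any iterable of lines, this kind of generator can be chained in front of it with no other changes to the reading code.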


[issue30034] csv reader chokes on bad quoting in large files

2017-04-10 Thread Keith Erskine

Keith Erskine added the comment:

As you say, David, however much we would like the world to stick to a given CSV 
standard, the reality is that people don't, which is all the more reason for 
making the csv reader flexible and forgiving.

The csv module can and should be used for more than just 
"comma-separated-values" files.  I use it for all sorts of different delimited 
files, and it works very well.  Pandas uses it, as I'm sure do many other 
packages.  It's such a good module, it would be a pity to restrict its scope to 
just Excel-related scenarios.  Parsing delimited files is undoubtedly complex, 
and painfully slow if done with pure Python, so the more that can be done in C 
the better.

I'm no C programmer, but my guesstimate is that the coding changes I'm 
proposing are relatively modest.  In the IN_QUOTED_FIELD section 
(https://github.com/python/cpython/blob/master/Modules/_csv.c#L690), it would 
mean checking for newline characters if the new "multiline" attribute is False 
(and probably "strict" is False too).  Of course there is more to this change 
than just that, but I'm guessing not that much more.


[issue30034] csv reader chokes on bad quoting in large files

2017-04-10 Thread Keith Erskine

Keith Erskine added the comment:

The csv reader already handles a certain amount of bad formatting.  For 
example, using default behavior, the following file:

a,b,c
d,"e"X,f
g,h,i

is read as:
['a', 'b', 'c']
['d', 'eX', 'f']
['g', 'h', 'i']
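To make this concrete (a minimal reproduction using in-memory data rather than a file), the same input behaves quite differently depending on "strict":

```python
import csv
import io

data = 'a,b,c\nd,"e"X,f\ng,h,i\n'

# Default (strict=False): the stray X after the closing quote is
# silently appended to the field.
print(list(csv.reader(io.StringIO(data))))
# [['a', 'b', 'c'], ['d', 'eX', 'f'], ['g', 'h', 'i']]

# With strict=True the same input raises csv.Error instead.
try:
    list(csv.reader(io.StringIO(data), strict=True))
except csv.Error as e:
    print('strict mode raised:', e)
```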

It seems reasonable that csv should be able to handle delimited files that are 
not perfectly formatted.  After all, even the CSV "standard" isn't really a 
standard.  When dealing with large (10GB+) files, it's a pain if csv cannot 
read the file because of just one misplaced quote character.  Besides, data 
files are only going to get bigger.

Also, I have to say, I've been dealing with large ETL jobs for over 15 years 
now and I'm struggling to think of a time when I've ever seen a multiline CSV 
file.  Of course, we all have different experiences.


[issue30034] csv reader chokes on bad quoting in large files

2017-04-10 Thread Keith Erskine

Keith Erskine added the comment:

Perhaps I should add what I would prefer the csv reader to return in my example 
above.  That would be:

['a', 'b', 'c']
['d', 'e,f']
['g', 'h', 'i']

Yes, the second line is still mangled but at least the csv reader would carry 
on and read the third line correctly.


[issue30034] csv reader chokes on bad quoting in large files

2017-04-10 Thread Keith Erskine

New submission from Keith Erskine:

If a csv file has a quote character at the beginning of a field but no closing 
quote, the csv module will keep reading the file until the very end in an 
attempt to close out the field.  It's true this situation occurs only when the 
quoting in a csv file is incorrect, but it would be extremely helpful if the 
csv reader could be told to stop reading each row of fields when it encounters 
a newline character, even if it is within a quoted field at the time.  At the 
moment, with large files, the csv reader will typically error out in this 
situation once a field exceeds the limit set by csv.field_size_limit().  
Furthermore, this is not an easy situation to trap with custom code.

Here's an example of what I'm talking about.  For a csv file with the 
following content:
a,b,c
d,"e,f
g,h,i

This code:

import csv
with open('file.txt') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

returns:
['a', 'b', 'c']
['d', 'e,f\ng,h,i\n']

Note that the whole of the file after "e", including delimiters and newlines, 
has been added to the second field on the second line. This is correct csv 
behavior but is very unhelpful to me in this situation.

On the grounds that most csv files do not have multiline values within them, 
perhaps a new dialect attribute called "multiline" could be added to the csv 
module, that defaults to True for backwards compatibility.  It would indicate 
whether the csv file has any field values within it that span more than one 
line.  If multiline is False, then the "parse_process_char" function in "_csv" 
would always close out a row of fields when it encounters a newline character.  
It might be best if this multiline attribute were taken into account only when 
"strict" is False.
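In the meantime, the proposed behavior can be approximated in pure Python by feeding the reader one physical line at a time (a sketch; per_line_reader is a hypothetical helper, not part of the csv module, and it is slower than letting _csv consume the file directly - and it obviously breaks any genuinely multiline fields):

```python
import csv
import io

def per_line_reader(f, **fmtparams):
    """Parse each physical line as its own one-line CSV document, so an
    unclosed quote can mangle only its own row, never the whole file."""
    for line in f:
        yield from csv.reader([line], **fmtparams)

f = io.StringIO('a,b,c\nd,"e,f\ng,h,i\n')
for row in per_line_reader(f):
    print(row)
# ['a', 'b', 'c']
# ['d', 'e,f\n']   <- still mangled, but contained to this row
# ['g', 'h', 'i']
```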

Right now, I do get badly-formatted files like this, and I cannot ask the 
source for a new file.  I have to manually correct the file using a mixture of 
custom scripts and vi before the csv module will read it. It would be very 
helpful if csv would handle this directly.

--
messages: 291453
nosy: keef604
priority: normal
severity: normal
status: open
title: csv reader chokes on bad quoting in large files
type: enhancement
versions: Python 3.7
