[issue30034] csv reader chokes on bad quoting in large files
Keith Erskine added the comment:

OK, Terry. Thank you, everybody, for your thoughts and suggestions.

--
Python tracker <rep...@bugs.python.org> <http://bugs.python.org/issue30034>
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
Keith Erskine added the comment:

I should have said, Peter, that an odd number of quotes does not necessarily mean the quoting is bad. For example, this line:

a,b",c

will parse fine as ['a', 'b"', 'c']. Figuring out bad quoting is not easy, but if we know that there are no multiline fields in the file, then at least the parsing can stop at the end of the line.
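To check that claim (a minimal sketch; the stray quote falls mid-field, so the default dialect treats it as a literal character):

```python
import csv

# A quote character that does not open a field is taken literally by
# the default dialect, so this line parses without error.
rows = list(csv.reader(['a,b",c']))
print(rows)  # [['a', 'b"', 'c']]
```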
Keith Erskine added the comment:

The csv reader already supports bad CSV - that's what I believe "strict" is for - but only in one specific scenario. My request is to make that "strict" attribute a bit more useful. Thank you for your suggestion, Peter. I have toyed with the idea of looking for an even number of double quotes in each line, but thank you for your neat way of encapsulating it. (I already have to strip null bytes out of the input data because they break csv; see issue #27580.)
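Peter's snippet isn't reproduced in this thread, so the following is only a guess at the kind of pre-check meant; `clean_lines` and `suspicious_lines` are names made up for illustration. Counting quote characters per line flags likely trouble spots (a hint, not proof, of bad quoting), and null bytes can be stripped in the same pass:

```python
def clean_lines(f):
    """Strip null bytes, which break the csv module (issue #27580)."""
    for line in f:
        yield line.replace('\0', '')

def suspicious_lines(lines, quotechar='"'):
    """Yield (lineno, line) for lines containing an odd number of
    quote characters -- a hint, not proof, of bad quoting."""
    for lineno, line in enumerate(lines, 1):
        if line.count(quotechar) % 2 == 1:
            yield lineno, line

data = ['a,b,c\n', 'd,"e,f\n', 'g,h,i\n']
print(list(suspicious_lines(clean_lines(data))))
# [(2, 'd,"e,f\n')]
```

Note that a line like a,b",c also has an odd quote count even though it parses fine, so this check can only flag lines for inspection, not reject them outright.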
Keith Erskine added the comment:

As you say, David, however much we would like the world to stick to a given CSV standard, the reality is that people don't, which is all the more reason to make the csv reader flexible and forgiving. The csv module can and should be used for more than just "comma-separated-values" files. I use it for all sorts of different delimited files, and it works very well. Pandas uses it, as I'm sure do many other packages. It's such a good module that it would be a pity to restrict its scope to just Excel-related scenarios.

Parsing delimited files is undoubtedly complex, and painfully slow if done in pure Python, so the more that can be done in C the better. I'm no C programmer, but my guesstimate is that the coding changes I'm proposing are relatively modest. In the IN_QUOTED_FIELD section (https://github.com/python/cpython/blob/master/Modules/_csv.c#L690), it would mean checking for newline characters when the new "multiline" attribute is False (and probably when "strict" is False too). Of course there is more to this change than just that, but I'm guessing not that much more.
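For the record, the proposed option would presumably be used along these lines (a hypothetical sketch only; "multiline" does not exist in any released csv module, so this snippet will not run today):

```python
import csv

with open('file.txt', newline='') as f:
    # Hypothetical: multiline=False would end every row at a newline,
    # even inside an unclosed quoted field.  This keyword is the
    # proposal under discussion, not a real csv.reader parameter.
    for row in csv.reader(f, multiline=False):
        print(row)
```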
Keith Erskine added the comment:

The csv reader already handles a certain amount of bad formatting. For example, with default behavior, the following file:

a,b,c
d,"e"X,f
g,h,i

is read as:

['a', 'b', 'c']
['d', 'eX', 'f']
['g', 'h', 'i']

It seems reasonable that csv should be able to handle delimited files that are not perfectly formatted. After all, even the CSV "standard" isn't really a standard. When dealing with large (10GB+) files, it's a pain if csv cannot read the file because of just one misplaced quote character. Besides, data files are only going to get bigger.

Also, I have to say, I've been dealing with large ETL jobs for over 15 years now, and I'm struggling to think of a time when I've ever seen a multiline CSV file. Of course, we all have different experiences.
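That example is easy to reproduce, and the stray-quote case is exactly what "strict" already controls (a minimal runnable sketch):

```python
import csv

lines = ['a,b,c', 'd,"e"X,f', 'g,h,i']

# Default (strict=False): the stray X after the closing quote is
# silently folded into the field.
for row in csv.reader(lines):
    print(row)
# ['a', 'b', 'c']
# ['d', 'eX', 'f']
# ['g', 'h', 'i']

# With strict=True the same input raises csv.Error instead.
try:
    list(csv.reader(lines, strict=True))
except csv.Error as e:
    print('csv.Error:', e)
```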
Keith Erskine added the comment:

Perhaps I should add what I would prefer the csv reader to return in my example above. That would be:

['a', 'b', 'c']
['d', 'e,f']
['g', 'h', 'i']

Yes, the second line is still mangled, but at least the csv reader would carry on and read the third line correctly.
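A pure-Python workaround (a sketch of mine, not something proposed in the thread) produces exactly that output today: parse each physical line as its own one-row CSV document, so an unclosed quote can never swallow the following lines. In non-strict mode the reader closes out an unterminated quoted field when its input ends:

```python
import csv

def read_singleline_csv(lines):
    """Parse each physical line independently, so a quoted field can
    never span lines.  Assumes the file has no genuine multiline
    fields.  Trailing newlines are stripped so an unclosed quoted
    field does not keep the line terminator in its value."""
    for line in lines:
        for row in csv.reader([line.rstrip('\r\n')]):
            yield row

data = ['a,b,c\n', 'd,"e,f\n', 'g,h,i\n']
for row in read_singleline_csv(data):
    print(row)
# ['a', 'b', 'c']
# ['d', 'e,f']
# ['g', 'h', 'i']
```

The obvious cost is that a legitimate multiline field would be split across rows, which is why a proper "multiline" dialect attribute handled in C would be preferable for large files.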
New submission from Keith Erskine:

If a csv file has a quote character at the beginning of a field but no closing quote, the csv module will keep reading the file until the very end in an attempt to close out the field. True, this situation occurs only when the quoting in a csv file is incorrect, but it would be extremely helpful if the csv reader could be told to stop reading each row of fields when it encounters a newline character, even if it is within a quoted field at the time. At the moment, with large files, the csv reader will typically error out in this situation once it reads the maximum size of a string. Furthermore, this is not an easy situation to trap with custom code.

Here's an example of what I'm talking about. For a csv file with the following content:

a,b,c
d,"e,f
g,h,i

this code:

import csv
with open('file.txt') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

returns:

['a', 'b', 'c']
['d', 'e,f\ng,h,i\n']

Note that the whole of the file after "e", including delimiters and newlines, has been added to the second field on the second line. This is correct csv behavior but is very unhelpful in this situation.

On the grounds that most csv files do not have multiline values within them, perhaps a new dialect attribute called "multiline" could be added to the csv module, defaulting to True for backwards compatibility. It would indicate whether the csv file has any field values that span more than one line. If multiline is False, then the "parse_process_char" function in "_csv" would always close out a row of fields when it encounters a newline character. It might be best if this multiline attribute were taken into account only when "strict" is False.

Right now, I do get badly formatted files like this, and I cannot ask the source for a new file. I have to correct the file manually with a mixture of custom scripts and vi before the csv module will read it. It would be very helpful if csv handled this directly.
--
messages: 291453
nosy: keef604
priority: normal
severity: normal
status: open
title: csv reader chokes on bad quoting in large files
type: enhancement
versions: Python 3.7