[issue2078] CSV Sniffer does not function properly on single column .csv files
Skip Montanaro added the comment:

I can't see a great reason to change the behavior. I've attached my current patch for csv.py and test_csv.py in case someone else wants to pick it up later.

--
keywords: +patch
priority: -> low
resolution: -> postponed
status: open -> closed
Added file: http://bugs.python.org/file10020/csv.diff

__ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2078 __
___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2078] CSV Sniffer does not function properly on single column .csv files
Amaury Forgeot d'Arc added the comment:

> It works entirely based on character frequencies.

Does it make sense to restrict delimiters to a reasonable set of characters? Usual punctuation, spaces, tabs... what else?

--
nosy: +amaury.forgeotdarc
[issue2078] CSV Sniffer does not function properly on single column .csv files
Skip Montanaro added the comment:

    >> It works entirely based on character frequencies.

    Amaury> Does it make sense to restrict delimiters to a reasonable set of
    Amaury> characters? Usual punctuations, spaces, tabs... what else?

There is an optional delimiters argument to the sniff() method which defaults to None. I would be happier if the default was the usual suspects (NeoOffice refuses to guess, but offers TAB, space, semicolon and comma as the default separators when importing a CSV file; Excel seems to just figure it out). That would change the behavior though. With no delimiter set it's generally going to find something, though it may pick incorrectly. With a delimiter set whose characters don't occur in the file it's going to raise an exception. I'm not sure this would be a good tradeoff, and it would just break existing code.

Skip
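A minimal sketch of the tradeoff described above, written for Python 3 rather than the 2.x interpreters used in this thread. The helper try_sniff and both sample strings are made up for illustration; only csv.Sniffer().sniff() and csv.Error come from the standard library.

```python
import csv

def try_sniff(sample, delimiters=None):
    """Return the sniffed delimiter, or None if the sniffer gives up."""
    try:
        return csv.Sniffer().sniff(sample, delimiters).delimiter
    except csv.Error:
        return None

# Hypothetical samples: one single-column file, one two-column file.
single_column = "Sequence\nAAGINRDSL\nAAIANHQVL\n"
two_columns = "name,score\nalice,10\nbob,20\n"

# delimiters=None: every character, letters included, is a candidate,
# so sniffing only the header line may settle on a letter.
print(repr(try_sniff("Sequence\n")))

# Restricted set: none of these characters occurs in the single-column
# sample, so sniff() raises csv.Error and the helper returns None.
print(repr(try_sniff(single_column, ",\t;")))

# The same restricted set on a file that really is comma separated.
print(repr(try_sniff(two_columns, ",\t;")))
```

Whether a restricted default would be worth the change in behavior is exactly the open question in this message; the sketch only shows the two outcomes side by side.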
[issue2078] CSV Sniffer does not function properly on single column .csv files
Skip Montanaro added the comment:

    Jean-Philippe> You're right, it does seem that using f.read(1024) to
    Jean-Philippe> feed the sniffer works OK in my case and allows me to
    Jean-Philippe> instantiate the DictReader correctly... Why that is I'm
    Jean-Philippe> not sure though...

It works entirely based on character frequencies. The more characters you feed it the better it should be at guessing the correct delimiter. In particular, it pays attention to the frequency of the possible delimiters per line and assumes the number of columns is the same for each line. (Well, there's one place where it does use some knowledge of the structure of a csv file, so my earlier assertion was incorrect.) If you only feed it one line it can't really use that frequency-per-line information.

    Jean-Philippe> I was submitting the first line as I thought it was the
    Jean-Philippe> right sample to provide the sniffer for it to sniff the
    Jean-Philippe> correct dialect regardless of the file format and file
    Jean-Philippe> content.

That's a good guess, but not quite spot on in this case. In particular, the character frequencies in the first line tend to be much different than the other lines because it is usually a row of column headers, while the remainder of the file (though not always ;-) is a table of numbers.

Skip
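To make the frequency-per-line idea concrete, here is a simplified illustration in Python 3. It is not the code in csv.py: the function names and the explicit candidate sets are assumptions chosen for the example, and the real sniffer is more elaborate.

```python
def per_line_counts(sample, candidates=",;\t"):
    """Map each candidate character to its count on every non-empty line."""
    lines = [line for line in sample.splitlines() if line]
    return {ch: [line.count(ch) for line in lines] for ch in candidates}

def consistent_candidates(sample, candidates=",;\t"):
    """Candidates that occur at least once and equally often on every line."""
    counts = per_line_counts(sample, candidates)
    return [ch for ch, per_line in counts.items()
            if per_line and per_line[0] > 0 and len(set(per_line)) == 1]

# A small comma-separated table: the comma appears once on every line.
table = "name,score\nalice,10\nbob,20\n"
print(per_line_counts(table))
print(consistent_candidates(table))

# With a single line, every character that occurs is trivially
# "consistent", so the per-line information cannot rule anything out.
print(consistent_candidates("Sequence", candidates="ce"))

# With more lines, the letters stop looking like delimiters.
print(consistent_candidates("Sequence\nAAGINRDSL\nAAIANHQVL\n", candidates="ce"))
```

This is why a larger sample such as f.read(1024) gives the sniffer more to work with than f.readline().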
[issue2078] CSV Sniffer does not function properly on single column .csv files
Jean-Philippe Laverdure added the comment:

Hello and sorry for the late reply.

Wolfgang: sorry about my misuse of the csv.DictReader constructor, that was a mistake on my part.

However, it still is not functioning as I think it should/could. Look at this.

Using this content:

    Sequence
    AAGINRDSL
    AAIANHQVL

and this piece of code:

    f = open(sys.argv[-1], 'r')
    dialect = csv.Sniffer().sniff(f.readline())
    f.seek(0)
    reader = csv.DictReader(f, dialect=dialect)
    for line in reader:
        print line

I get this result:

    {'Sequen': 'AAGINRDSL', 'e': None}
    {'Sequen': 'AAIANHQVL', 'e': None}

When I really should be getting this:

    {'Sequence': 'AAGINRDSL'}
    {'Sequence': 'AAIANHQVL'}

The fact is this code is in use in an application where users can submit a .csv file produced by Excel for treatment. The file must contain a Sequence column since that is what the treatment is run on. Now I had to make the following changes to my code to account for the fact that some users submit a single column file (since only the Sequence column is required for treatment):

    f = open(sys.argv[-1], 'r')
    try:
        dialect = csv.Sniffer().sniff(f.readline(), [',', '\t'])
        f.seek(0)
        reader = csv.DictReader(f, dialect=dialect)
    except:
        print 'caught csv sniff() exception'
        f.seek(0)
        reader = csv.DictReader(f)
    for line in reader:
        Do what I need to do

Which really feels like a patched use of a buggy implementation of the Sniffer class.

I understand the issues raised by Skip in regards to figuring out a delimiter at all costs... But really, the Sniffer class should work appropriately when a single column .csv file is submitted.
[issue2078] CSV Sniffer does not function properly on single column .csv files
Skip Montanaro added the comment:

    Jean-Philippe> The fact is this code is in use in an application where
    Jean-Philippe> users can submit a .csv file produced by Excel for
    Jean-Philippe> treatment. The file must contain a Sequence column
    Jean-Philippe> since that is what the treatment is run on. Now I had to
    Jean-Philippe> make the following changes to my code to account for the
    Jean-Philippe> fact that some users submit a single column file (since
    Jean-Philippe> only the Sequence column is required for treatment):

    Jean-Philippe> f = open(sys.argv[-1], 'r')
    Jean-Philippe> try:
    Jean-Philippe>     dialect = csv.Sniffer().sniff(f.readline(), [',', '\t'])
    Jean-Philippe>     f.seek(0)
    Jean-Philippe>     reader = csv.DictReader(f, dialect=dialect)
    Jean-Philippe> except:
    Jean-Philippe>     print 'caught csv sniff() exception'
    Jean-Philippe>     f.seek(0)
    Jean-Philippe>     reader = csv.DictReader(f)
    Jean-Philippe> for line in reader:
    Jean-Philippe>     Do what I need to do

What exceptions are you catching? Why are you only giving it a single line of input as a sample? What happens if you instead use f.read(1024) as the sample?

When there is only a single column in the file and you give it a delimiter set which doesn't include any characters in the file it (I think correctly) raises an exception to tell you that it couldn't determine the delimiter:

    >>> import csv
    >>> f = open("listB2Mforblast.csv")
    >>> dialect = csv.Sniffer().sniff(f.read(1024))
    >>> dialect.delimiter
    ''
    >>> f.seek(0)
    >>> dialect = csv.Sniffer().sniff(f.read(1024), ",\t :;")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/skip/local/lib/python2.6/csv.py", line 161, in sniff
        raise Error, "Could not determine delimiter"
    _csv.Error: Could not determine delimiter

In that case, use csv.excel as the dialect. It doesn't matter what you use as the delimiter if it doesn't occur in the file, and if it can't figure out the delimiter it's also not going to guess the quotechar.

    >>> try:
    ...     dialect = csv.Sniffer().sniff(f.read(1024), ",\t :;")
    ... except csv.Error:
    ...     dialect = csv.excel
    ...

I personally don't much like the sniffer. It doesn't use any knowledge of the structure of a CSV file to guess the delimiter and quotechar (and those are the only two parameters it does guess). I would prefer if it just went away, but folks use it so it's likely to remain in its current form for the foreseeable future.

Skip
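The fallback suggested above can be wrapped in a small helper. This is a sketch for Python 3 under stated assumptions: the function name open_dict_reader, its parameters, and the in-memory sample are hypothetical, while the delimiter set and the 1024-character sample size are the ones used in the session above.

```python
import csv
import io

def open_dict_reader(f, delimiters=",\t :;", sample_size=1024):
    """Return a DictReader for f, falling back to csv.excel if sniffing fails."""
    sample = f.read(sample_size)
    f.seek(0)  # rewind so the reader starts at the header row
    try:
        dialect = csv.Sniffer().sniff(sample, delimiters)
    except csv.Error:
        # None of the allowed delimiters occurs consistently; for a
        # single-column file the choice of delimiter does not matter.
        dialect = csv.excel
    return csv.DictReader(f, dialect=dialect)

# Hypothetical single-column input standing in for an uploaded file.
f = io.StringIO("Sequence\nAAGINRDSL\nAAIANHQVL\n")
rows = list(open_dict_reader(f))
print(rows)
```

Catching csv.Error specifically, rather than using a bare except, keeps unrelated failures such as a missing file visible to the caller.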
[issue2078] CSV Sniffer does not function properly on single column .csv files
Jean-Philippe Laverdure added the comment:

Hi Skip,

You're right, it does seem that using f.read(1024) to feed the sniffer works OK in my case and allows me to instantiate the DictReader correctly... Why that is I'm not sure though...

I was submitting the first line as I thought it was the right sample to provide the sniffer for it to sniff the correct dialect regardless of the file format and file content.

And yes, 'except csv.Error' is certainly a better way to trap my desired exception... I guess I'm a bit of a n00b using Python.

Thanks for the help. Python really has a great community!
[issue2078] CSV Sniffer does not function properly on single column .csv files
Wolfgang Langner added the comment:

In this case it is not really possible to sniff the right delimiter. Disallowing digits or letters is not a good solution. I think the current behavior is OK, and at this time I see no way to improve it.
[issue2078] CSV Sniffer does not function properly on single column .csv files
Skip Montanaro added the comment:

    Wolfgang> In this case it is not really possible to sniff the right
    Wolfgang> delimiter. To not allow digits or letters is not a good
    Wolfgang> solution. I think the behavior as now is ok, and at this time
    Wolfgang> I see no way to improve it.

I mostly agree. I'm waiting for the original submitter to chime in though.

Skip
[issue2078] CSV Sniffer does not function properly on single column .csv files
Skip Montanaro added the comment:

What do you think the delimiter should be for this csv file?

    43.4e12
    147483648
    47483648

What about this one?

    abcdef
    bcdefg
    cdefgh

And this?

    abc8def
    bcd8efg
    cde8fgh

If I force the sniffer to not allow digits or letters as delimiters I can get the sniffer to return comma as the delimiter in all three cases. I'm not certain that's correct in the third case though.

--
assignee: -> skip.montanaro
nosy: +skip.montanaro
[issue2078] CSV Sniffer does not function properly on single column .csv files
Wolfgang Langner added the comment:

The sniffer returns a dialect that is not really correct, because the delimiter is set to some value even though in this case there is no delimiter. In effect, it returns a random delimiter when there is not really one.

But your usage of DictReader is wrong. It has to be called as csv.DictReader(file, dialect=dialect), and then it works in this example.

This could be closed.

--
nosy: +tds333
versions: +Python 2.6, Python 3.0
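The constructor point can be checked in isolation. The second positional parameter of csv.DictReader is fieldnames, so a dialect passed positionally is taken as the list of column names. A small Python 3 sketch, with a made-up in-memory sample and csv.excel standing in for a sniffed dialect:

```python
import csv
import io

# Hypothetical single-column data standing in for the attached file.
data = "Sequence\nAAGINRDSL\nAAIANHQVL\n"

# Positional: the dialect lands in the `fieldnames` slot. Depending on
# the Python version this fails at construction or on the first row,
# so both steps sit inside the try block.
try:
    reader = csv.DictReader(io.StringIO(data), csv.excel)
    next(reader)
    positional_error = None
except TypeError as exc:
    positional_error = exc
print(repr(positional_error))

# Keyword: the header row supplies the field names as intended.
reader = csv.DictReader(io.StringIO(data), dialect=csv.excel)
first_row = next(reader)
print(first_row)
```

This only covers the calling convention; it says nothing about which delimiter the sniffer would have picked for the file.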
[issue2078] CSV Sniffer does not function properly on single column .csv files
Changes by Jean-Philippe Laverdure:

--
components: +Library (Lib) -Extension Modules
versions: +Python 2.4
[issue2078] CSV Sniffer does not function properly on single column .csv files
New submission from Jean-Philippe Laverdure:

When attempting to sniff() the dialect for the attached .csv file, csv.Sniffer.sniff() returns an unusable dialect:

    >>> import csv
    >>> file = open('listB2Mforblast.csv', 'r')
    >>> dialect = csv.Sniffer().sniff(file.readline())
    >>> file.seek(0)
    >>> file.readline()
    >>> file.seek(0)
    >>> reader = csv.DictReader(file, dialect)
    >>> reader.next()
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/soft/bioinfo/linux/python-2.5/lib/python2.5/csv.py", line 93, in next
        d = dict(zip(self.fieldnames, row))
    TypeError: zip argument #1 must support iteration

However, this works fine:

    >>> file.seek(0)
    >>> reader = csv.DictReader(file)
    >>> reader.next()
    {'Sequence': 'AALENTHLL'}

If I use a 2 column file, sniff() works perfectly. It only seems to have a problem with single column .csv files (which are still .csv files in my opinion).

Thanks for looking into this.

--
components: Extension Modules
files: listB2Mforblast.csv
messages: 62319
nosy: jplaverdure
severity: normal
status: open
title: CSV Sniffer does not function properly on single column .csv files
type: behavior
versions: Python 2.5
Added file: http://bugs.python.org/file9416/listB2Mforblast.csv