[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

Attached is a .py file with 32 test cases for the Sniffer class, 18 that fail and 14 that pass. My hope is that these samples can be used to improve the delimiter detection code.

-Tiago

--
Added file: http://bugs.python.org/file40149/testround8.py
___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24787
___

import csv

def test_delimiters():
    delimiter_samples = [
        {
            'delimiter': '\t',
            'sample':  # error:Exception
'''Field Name Definition
RefID Unique (sequential) number assigned to vehicles
IsBadBuyIdentifies if the kicked vehicle was an avoidable purchase
PurchDate The Date the vehicle was Purchased at Auction
Auction Auction provider at which the vehicle was purchased
VehYear The manufacturer's year of the vehicle
VehicleAge The Years elapsed since the manufacturer's year
'''
        },
        {
            'delimiter': '\t',
            'sample':  # error:Exception
'''rulessupport confidence lift
1 {Brushes} = {Nail.Polish} 0.149 1 3.57142857142857
2 {Brushes} = {Bronzer} 0.097 0.651006711409396 2.5738856414
3 {Brushes} = {Concealer}0.092 0.61744966442953 1.39694494214826
4 {Lip.liner} = {Concealer} 0.179 0.764957264957265 1.73067254515218
5 {Bronzer} = {Concealer}0.175 0.627240143369176 1.41909534698909
6 {Blush} = {Concealer} 0.220.606060606060606 1.37117784176608
'''
        },
        {
            'delimiter': ',',
            'sample':  # error:Exception
'''A,B,C,D,E
2000-01-03 00:00:00,0.980268513777,3.68573087906,-0.364216805298,-1.15973806169,foo
2000-01-04 00:00:00,1.04791624281,-0.0412318367011,-0.16181208307,0.212549316967,bar
2000-01-05 00:00:00,0.498580885705,0.731167677815,-0.537677223318,1.34627041952,baz
2000-01-06 00:00:00,1.12020151869,1.56762092543,0.00364077397681,0.67525259227,qux
2000-01-07 00:00:00,-0.487094399463,0.571454623474,-1.6116394093,0.103468562917,foo2
'''
        },
        {
            'delimiter': ',',
            'sample':  # error:Exception
'''1,699,4751,4158
8,1856
12,4059,5716,4299,4967,2128
16,1928,1176
19,1928,2775,4646,1720,3148,2552,5978,3736,3090
22,4059,1856,4103,4739,4865,4769,621,2874,1637,252
28,5321,4059,4952,1856,4103,699,1976
'''
        },
        {
            'delimiter': ',',
            'sample':  # error:Exception
'''Date,From,To,Flight_Number,Airline,Distance,Duration,Seat,Seat_Type,Class,Reason,Plane,Registration,Trip,Note,From_OID,To_OID,Airline_OID,Plane_OID
2004-08-27,YHZ,YYZ,,Air Canada,801,01:56,,A,Y,L,73,193,330
2004-08-01,YYZ,YHZ,,Air Canada,801,01:56,,A,Y,L,193,73,330
2004-07-30,YHZ,YYZ,,Air Canada,801,01:56,,A,Y,L,73,193,330
2004-05-30,ZRH,MUC,,Lufthansa,162,00:47,,,Y,L,1678,346,3320
2004-05-30,MUC,YYZ,,Air Canada,4131,07:53,,,Y,L,346,193,330
2004-05-30,YYZ,YOW,,Unknown,226,00:54,,,Y,L,193,100,-1
'''
        },
        {
            'delimiter': '\t',
            'sample':  # error:Exception
'''Format version Start date End dateSender Recipient Aggregator
5 2010-05-01 2010-05-31 Spotify Udsvxd Udsvxd
Country Label Product CurrencyTotal tracksRightholder's tracks Pro rata share Revenue share Number of users Net revenue Payable USD RateUSD Payable
XV Ipstqx Gjivgmn C JFG 331264067 0.0020.00 87845 851092.49 0.045.6647 0.09
JN Mvcqxv Gjivgmqxd Iv P JFG 368037889 635611 0.01 40.00 472355 639147.36 506.62 5.6647 562.82
IL Mvcqxv Gjivgmn C JFG 35016 0.0420.00 8 31.61 0.055.6647 0.05
DW Mvcqxv C DWO 6283654158448 0.0420.00 84344 330574.21 557.63 5.8230 513.62
'''
        },
        {
            'delimiter': ',',
            'sample':  # error:Exception
'''age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,1iclass
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, Not-in-family, White, Male, 2174, 0, 40, United-States, =50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, =50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, White, Male, 0, 0, 40, United-States, =50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, Black, Male, 0, 0, 40, United-States, =50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, Black, Female, 0, 0, 40, Cuba, =50K
37, Private, 284582, Masters, 14, Married-civ-spouse,
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

I've run the Sniffer against the same data set, but varied the size of the sample given to the code. Feeding it more data actually seems to make the results less accurate. Table attached.

--
Added file: http://bugs.python.org/file40141/csvsniffertest5.txt

human  Sniff      lines3  lines7  lines70  lines700
,      ,             490     487      424       393
       A               1       0        0         0
       Exception       6       8        4         4
       c               1       1        1         1
       g               1       0        0         0
       h               1       0        0         0
       space           0       0        9         7
       y               0       0        1         1
;      ;               1       1        1         1
\t     \t            918     917      929       706
       *               0       0        6         7
       ,               6       3        2         1
       -               0       0        0         5
       :               0       2        2         2
       D               5       0        0         0
       E               0       0       10        10
       Exception      52      91       18        18
       M               1       1        0         0
       c               2       0        0         0
       m               2       0        0         0
       p              61      27       22        22
       s               0       0        2         2
       space           1       6       51       125
bar    bar            33      33       20         9
space  Exception       0       1        1         1
       e               4       4        4         4
       space          10       9        9         9

___
Python-bugs-list mailing list
Unsubscribe: https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

It seems the HTML file did not come through correctly. Trying a text version, please view this in a monospace font:

      | Sniffer
Human |   , |  ; |  \t |  \ | space | Except |  : |  ) |  c |  e |  M |  p | Total | %Error
------+-----+----+-----+----+-------+--------+----+----+----+----+----+----+-------+-------
,     | 498 |    |     |  2 |     1 |     10 |    |    |  1 |    |    |    |   512 |   2.7%
;     |     |  1 |     |    |       |        |    |    |    |    |    |    |     1 |   0.0%
\t    |   3 |    | 922 |    |     6 |     91 |  2 |  1 |    |    |  2 | 27 |  1054 |  12.5%
|     |     |    |     | 33 |       |        |    |    |    |    |    |    |    33 |   0.0%
space |     |    |     |    |     9 |      1 |    |    |    |  4 |    |    |    14 |  35.7%
------+-----+----+-----+----+-------+--------+----+----+----+----+----+----+-------+-------
Total | 501 |  1 | 922 | 35 |    16 |    102 |  2 |  1 |  1 |  4 |  2 | 27 |  1614 |
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
R. David Murray added the comment:

Yes, much better :)
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

I apologize, it seems the text table got line wrapped. This time as a TXT attachment.

-Tiago

--
Added file: http://bugs.python.org/file40140/csvsniffertest3.txt
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
R. David Murray added the comment:

Your best bet is to attach an ASCII text file as an uploaded file.
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

Table attached.

-Tiago

--
Added file: http://bugs.python.org/file40138/csvsniffertest3.htm
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Skip Montanaro added the comment:

Tiago, sorry, but your last post with results is completely unintelligible. Can you toss the table in a file and attach it instead?
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

I've run the Sniffer against 1614 csv files on my computer and compared the delimiter it detects to what I have set manually. Here are the results:

            | Sniffer
Human       |   , |  ; |  \t |  \ | (blank) | Error |  : |  ) |  c |  e |  M |  p | Grand Total | Error rate
------------+-----+----+-----+----+---------+-------+----+----+----+----+----+----+-------------+-----------
,           | 498 |    |     |  2 |       1 |    10 |    |    |  1 |    |    |    |         512 |       2.7%
;           |     |  1 |     |    |         |       |    |    |    |    |    |    |           1 |       0.0%
\t          |   3 |    | 922 |    |       6 |    91 |  2 |  1 |    |    |  2 | 27 |        1054 |      12.5%
|           |     |    |     | 33 |         |       |    |    |    |    |    |    |          33 |       0.0%
space       |     |    |     |    |       9 |     1 |    |    |    |  4 |    |    |          14 |      35.7%
------------+-----+----+-----+----+---------+-------+----+----+----+----+----+----+-------------+-----------
Grand Total | 501 |  1 | 922 | 35 |      16 |   102 |  2 |  1 |  1 |  4 |  2 | 27 |        1614 |

-Tiago
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Peter Otten added the comment:

The sniffer actually changes its mind in the fourth line:

Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
... """).delimiter
','
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
... """).delimiter
'M'

That line has only 5 commas while all others have 6. Unfortunately all lines contain exactly two M...

--
nosy: +peter.otten
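
The observation above can be checked by counting characters line by line. The following is only an illustrative sketch: the helper name per_line_counts is hypothetical and is not part of the csv module, and the sample is the four-line excerpt from the session above.

```python
# Illustrative only: count how often each candidate character occurs
# on every line of the four-line sample from the session above.
SAMPLE = """\
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
"""


def per_line_counts(sample, chars):
    """Return {char: [count on line 1, count on line 2, ...]}.

    Hypothetical helper for inspecting a sample; not a csv API.
    """
    lines = sample.splitlines()
    return {ch: [line.count(ch) for line in lines] for ch in chars}


counts = per_line_counts(SAMPLE, ",M")
print(counts[","])  # per-line comma counts
print(counts["M"])  # per-line counts of uppercase M
```

The comma count drops on the fourth line while the count of uppercase M stays the same on every line, which is consistent with a frequency-consistency heuristic preferring M for this sample.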
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Skip Montanaro added the comment:

I should have probably pointed out that the Sniffer class is the unloved stepchild of the csv module. In my experience it is rarely necessary. You either:

* Are reading CSV files which are about what Excel would produce with its default settings

or

* Know just what your format is, and can define the various parameters easily

It's pretty rare, I think, to get a delimited file in some format which is completely unknown and which thus has to be deduced.

As Peter showed, the Sniffer class is also kind of unreliable. I didn't write it, and there are precious few test cases for it. One of your datasets should probably be added to the mix and bugs fixed.
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
R. David Murray added the comment:

If you look at the algorithm it is doing some fancy things with metrics, but does have a 'preferred delimiters' list that it checks. It is possible things could be improved either by tweaking the threshold or by somehow giving added weight to the metrics when the candidate character is in the preferred delimiter list.

We might have to do this with a feature flag to turn it on, though, since it could change the results for programs that happen to work with the current algorithm.

--
nosy: +r.david.murray
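
One way to picture the "added weight" idea is the sketch below. It is not the algorithm in csv.py and not a proposed patch: the function weighted_delimiter_guess, its scoring rule and the bonus value are all assumptions made for illustration. The only real API it relies on is the Sniffer instance attribute preferred, which holds the module's list of preferred delimiters.

```python
import csv
from collections import Counter

# Four-line excerpt from the report; the last line has one comma fewer.
SAMPLE = """\
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
"""


def weighted_delimiter_guess(sample, bonus=2.0):
    """Hypothetical scoring: consistency of per-line counts, with extra
    weight (bonus, an arbitrary illustrative value) for characters in
    the Sniffer's preferred list."""
    preferred = csv.Sniffer().preferred
    lines = [line for line in sample.splitlines() if line]
    if not lines:
        return None
    # Try preferred characters first, in their listed order, so that
    # they win ties; then every other character seen in the sample.
    others = sorted(set("".join(lines)) - set(preferred))
    best_char, best_score = None, 0.0
    for ch in list(preferred) + others:
        per_line = [line.count(ch) for line in lines]
        mode, freq = Counter(per_line).most_common(1)[0]
        if mode == 0:
            continue  # absent from most lines, not a plausible delimiter
        score = freq / len(lines)  # share of lines that agree on the count
        if ch in preferred:
            score *= bonus
        if score > best_score:
            best_char, best_score = ch, score
    return best_char


print(repr(weighted_delimiter_guess(SAMPLE)))
```

With the bonus applied, the comma (consistent on three of four lines) outscores M (consistent on all four). As the comment notes, any such change could alter results for existing programs, which is why a feature flag was suggested.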
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Tiago Wright added the comment:

I agree that the parameters are easily deduced for any one csv file after a quick inspection. The reason I went searching for a good sniffer was that I have ~2100 csv files of slightly different formats coming from different sources. In some cases, a csv file is sent directly to me, other times it is first opened in Excel and saved, and other times it is copy-pasted from Excel into another location, yielding 3 variations on the formatting from a single source. Multiply that by 8 different sources of data.

For hacking disparate data sources together, it is more interesting to have a sniffer that works really well to distinguish among the most common dialects of csv, than one that tries to deduce the parameters of a rare or unknown format. I agree with you that it would be a rare case that the format is completely unknown -- more likely, you know it is one of two or three possible options and don't want to have to inspect each file to find out which.

Unfortunately, trying to limit delimiters to some of the most common ones using the delimiters parameter just raises an error:

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """).delimiter
'M'
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """, delimiters=",\t|^").delimiter
Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 189, in sniff
    raise Error("Could not determine delimiter")
_csv.Error: Could not determine delimiter
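
For the use case described above, where the delimiter is known to be one of a handful of candidates, a small stand-alone chooser may be enough. The sketch below is an assumption-laden illustration, not a fix for Sniffer: pick_delimiter is a hypothetical helper, and its rule (parse the sample with each candidate using csv.reader and keep the one whose rows most often agree on a field count of at least two) is a simplification chosen for this example.

```python
import csv
from collections import Counter

# Short excerpt of the sample above; the last line has one field fewer.
SAMPLE = """\
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
"""


def pick_delimiter(sample, candidates=",\t|^"):
    """Hypothetical helper: choose among known candidate delimiters.

    Each candidate is tried with csv.reader; the winner is the one whose
    most common row width covers the largest share of rows (ties go to
    the wider table). Candidates that never split a row are ignored.
    """
    lines = sample.splitlines()
    best, best_key = None, None
    for delim in candidates:
        widths = [len(row) for row in csv.reader(lines, delimiter=delim) if row]
        if not widths:
            continue
        width, freq = Counter(widths).most_common(1)[0]
        if width < 2:
            continue  # this character does not split the rows at all
        key = (freq / len(widths), width)
        if best_key is None or key > best_key:
            best, best_key = delim, key
    if best is None:
        raise csv.Error("Could not determine delimiter")
    return best


print(repr(pick_delimiter(SAMPLE)))
```

Because the score is a share of agreeing rows rather than a requirement that every row match, a single short row such as the fourth line does not disqualify the comma.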
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
New submission from Tiago Wright:

csv.Sniffer().sniff() guesses M for the delimiter of the first dataset below. The same error occurs when the , is replaced by \t. However, it correctly guesses , for the second dataset.

---Dataset 1---
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
Sscanner ac15072911261.pdf,CM_14464,1713.6,MX Dutiful,948203,,
Sscanner ac15072911262.pdf,CM 16148,3114,MX Apr,948202,,
Sscanner ac15072911250.pdf,CM_14464,1232.28,MX Jan Feb,948208,,
Sscanner ac15072911251.pdf,CM_17491,15232,MX Unkwon,948207,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Cavalier,,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Dutiful,,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Apr,,,

---Dataset 2---
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,82.07,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,23.29,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,88.55,MX Cavalier,948200,Photos don't match the invoice,
Sscanner ac15072911261.pdf,CM_14464,58.78,MX Dutiful,948203,,
Sscanner ac15072911262.pdf,CM 16148,52,MX Apr,948202,,
Sscanner ac15072911250.pdf,CM_14464,40.40,MX Jan Feb,948208,,
Sscanner ac15072911251.pdf,CM_17491,54.97,MX Unkwon,948207,,
Sscanner ac15072911253.pdf,CM_14464,4.08,MX Cavalier,,,
Sscanner ac15072911253.pdf,CM_14464,49.11,MX Dutiful,,,
Sscanner ac15072911253.pdf,CM_14464,18.28,MX Apr,,,

--
components: Extension Modules
messages: 247967
nosy: Tiago Wright
priority: normal
severity: normal
status: open
title: csv.Sniffer guesses M instead of \t or , as the delimiter
type: behavior
versions: Python 3.4
[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter
Skip Montanaro added the comment:

How are you calling the sniff() method? Note that it takes a sample of the CSV file. For example, this works for me:

>>> f = open("sniff1.csv")
>>> dialect = csv.Sniffer().sniff(next(open("sniff1.csv")))
>>> dialect.delimiter
','
>>> dialect.lineterminator
'\r\n'

where sniff1.csv is your Dataset 1. (I think for reliable operation you really want your sample to be a multiple of whole lines.)

--
nosy: +skip.montanaro
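
The advice about whole lines can be followed by building the sample from complete lines rather than from a fixed number of characters. The sketch below is one possible way to do that; sniff_first_lines and its max_lines default are hypothetical names and values chosen for illustration, and the demo uses an in-memory file with made-up contents rather than sniff1.csv.

```python
import csv
import io
from itertools import islice


def sniff_first_lines(fileobj, max_lines=20):
    """Hypothetical helper: sniff a sample made of whole lines only.

    Reads at most max_lines complete lines from an open text file (or
    any iterable of lines) and passes them to csv.Sniffer().sniff().
    """
    sample = "".join(islice(fileobj, max_lines))
    return csv.Sniffer().sniff(sample)


# Demo on an in-memory file; real code would pass an open file object.
demo = io.StringIO("a,b,c\n1,2,3\n4,5,6\n")
dialect = sniff_first_lines(demo)
print(repr(dialect.delimiter))
```

As the rest of the thread shows, a longer sample does not necessarily give a better guess, so max_lines is worth tuning against real data.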