[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-07 Thread Tiago Wright

Tiago Wright added the comment:

Attached is a .py file with 32 test cases for the Sniffer class: 18 that
fail and 14 that pass.

My hope is that these samples can be used to improve the delimiter
detection code.

-Tiago

--
Added file: http://bugs.python.org/file40149/testround8.py

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue24787
___

import csv

def test_delimiters():

    delimiter_samples = [

{ 'delimiter' : "\t", 'sample' :   # error:Exception
'''Field Name   Definition
RefID   Unique (sequential) number assigned to 
vehicles
IsBadBuyIdentifies if the kicked vehicle was an 
avoidable purchase 
PurchDate   The Date the vehicle was Purchased at 
Auction
Auction Auction provider at which the  vehicle 
was purchased
VehYear The manufacturer's year of the vehicle
VehicleAge  The Years elapsed since the 
manufacturer's year
''' },

{ 'delimiter' : "\t", 'sample' :   # error:Exception
'''rulessupport confidence  lift
1   {Brushes} = {Nail.Polish}  0.149   1   3.57142857142857
2   {Brushes} = {Bronzer}  0.097   0.651006711409396   2.5738856414
3   {Brushes} = {Concealer}0.092   0.61744966442953
1.39694494214826
4   {Lip.liner} = {Concealer}  0.179   0.764957264957265   
1.73067254515218
5   {Bronzer} = {Concealer}0.175   0.627240143369176   
1.41909534698909
6   {Blush} = {Concealer}  0.220.606060606060606   1.37117784176608
''' },

{ 'delimiter' : ",", 'sample' :   # error:Exception
'''A,B,C,D,E
2000-01-03 
00:00:00,0.980268513777,3.68573087906,-0.364216805298,-1.15973806169,foo
2000-01-04 
00:00:00,1.04791624281,-0.0412318367011,-0.16181208307,0.212549316967,bar
2000-01-05 
00:00:00,0.498580885705,0.731167677815,-0.537677223318,1.34627041952,baz
2000-01-06 
00:00:00,1.12020151869,1.56762092543,0.00364077397681,0.67525259227,qux
2000-01-07 
00:00:00,-0.487094399463,0.571454623474,-1.6116394093,0.103468562917,foo2
''' },

{ 'delimiter' : ",", 'sample' :   # error:Exception
'''1,699,4751,4158
8,1856
12,4059,5716,4299,4967,2128
16,1928,1176
19,1928,2775,4646,1720,3148,2552,5978,3736,3090
22,4059,1856,4103,4739,4865,4769,621,2874,1637,252
28,5321,4059,4952,1856,4103,699,1976
''' },

{ 'delimiter' : ",", 'sample' :   # error:Exception
'''Date,From,To,Flight_Number,Airline,Distance,Duration,Seat,Seat_Type,Class,Reason,Plane,Registration,Trip,Note,From_OID,To_OID,Airline_OID,Plane_OID
2004-08-27,YHZ,YYZ,,Air Canada,801,01:56,,A,Y,L,73,193,330
2004-08-01,YYZ,YHZ,,Air Canada,801,01:56,,A,Y,L,193,73,330
2004-07-30,YHZ,YYZ,,Air Canada,801,01:56,,A,Y,L,73,193,330
2004-05-30,ZRH,MUC,,Lufthansa,162,00:47,,,Y,L,1678,346,3320
2004-05-30,MUC,YYZ,,Air Canada,4131,07:53,,,Y,L,346,193,330
2004-05-30,YYZ,YOW,,Unknown,226,00:54,,,Y,L,193,100,-1
''' },

{ 'delimiter' : "\t", 'sample' :   # error:Exception
'''Format version   Start date  End dateSender  Recipient   
Aggregator
5   2010-05-01  2010-05-31  Spotify Udsvxd  Udsvxd
Country Label   Product CurrencyTotal tracksRightholder's tracks
Pro rata share  Revenue share   Number of users Net revenue Payable USD 
RateUSD Payable
XV  Ipstqx Gjivgmn  C   JFG 331264067   0.0020.00   
87845   851092.49   0.045.6647  0.09
JN  Mvcqxv Gjivgmqxd Iv P   JFG 368037889   635611  0.01
40.00   472355  639147.36   506.62  5.6647  562.82
IL  Mvcqxv Gjivgmn  C   JFG 35016   0.0420.00   8   
31.61   0.055.6647  0.05
DW  Mvcqxv  C   DWO 6283654158448   0.0420.00   84344   
330574.21   557.63  5.8230  513.62
''' },

{ 'delimiter' : ",", 'sample' :   # error:Exception
'''age,workclass,education,education-num,marital-status,occupation,relationship,race,sex,capital-gain,capital-loss,hours-per-week,native-country,1iclass
39, State-gov, 77516, Bachelors, 13, Never-married, Adm-clerical, 
Not-in-family, White, Male, 2174, 0, 40, United-States, =50K
50, Self-emp-not-inc, 83311, Bachelors, 13, Married-civ-spouse, 
Exec-managerial, Husband, White, Male, 0, 0, 13, United-States, =50K
38, Private, 215646, HS-grad, 9, Divorced, Handlers-cleaners, Not-in-family, 
White, Male, 0, 0, 40, United-States, =50K
53, Private, 234721, 11th, 7, Married-civ-spouse, Handlers-cleaners, Husband, 
Black, Male, 0, 0, 40, United-States, =50K
28, Private, 338409, Bachelors, 13, Married-civ-spouse, Prof-specialty, Wife, 
Black, Female, 0, 0, 40, Cuba, =50K
37, Private, 284582, Masters, 14, Married-civ-spouse, 

[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

I've run the Sniffer against the same data set, but varied the size of the
sample given to the code. Feeding it more data actually seems to make the
results less accurate. Table attached.
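
An experiment of this kind could be scripted roughly as follows. This is only a sketch, not the script that produced the attached table: the helper names `sniff_first_lines` and `tally` are hypothetical, and the sample sizes of 3, 7, 70 and 700 lines are an assumption read off the column names of the attached table.

```python
import csv
from collections import Counter


def sniff_first_lines(text, n_lines):
    """Sniff the first n_lines lines of text.

    Returns the guessed delimiter, or the string "Exception" when
    csv.Sniffer raises csv.Error.
    """
    sample = "".join(text.splitlines(keepends=True)[:n_lines])
    try:
        return csv.Sniffer().sniff(sample).delimiter
    except csv.Error:
        return "Exception"


def tally(files, sizes=(3, 7, 70, 700)):
    """Count (expected, guessed) pairs for each sample size.

    files is an iterable of (expected_delimiter, file_text) pairs.
    The result maps each sample size to a Counter keyed by
    (expected, guessed).
    """
    results = {size: Counter() for size in sizes}
    for expected, text in files:
        for size in sizes:
            guessed = sniff_first_lines(text, size)
            results[size][(expected, guessed)] += 1
    return results
```

Calling `tally()` with one `(delimiter, text)` pair per file gives one Counter per sample size, which can then be printed as one column of a table like the attached one.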

--
Added file: http://bugs.python.org/file40141/csvsniffertest5.txt

human  Sniff      lines3  lines7  lines70  lines700
,      ,             490     487      424       393
       A               1       0        0         0
       Exception       6       8        4         4
       c               1       1        1         1
       g               1       0        0         0
       h               1       0        0         0
       space           0       0        9         7
       y               0       0        1         1
;      ;               1       1        1         1
\t     \t            918     917      929       706
       *               0       0        6         7
       ,               6       3        2         1
       -               0       0        0         5
       :               0       2        2         2
       D               5       0        0         0
       E               0       0       10        10
       Exception      52      91       18        18
       M               1       1        0         0
       c               2       0        0         0
       m               2       0        0         0
       p              61      27       22        22
       s               0       0        2         2
       space           1       6       51       125
bar    bar            33      33       20         9
space  Exception       0       1        1         1
       e               4       4        4         4
       space          10       9        9         9
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

It seems the HTML file did not come through correctly. Trying a text
version, please view this in a monospace font:

      | Sniffer
Human |   , | ; |  \t |  \ | space | Except | : | ) | c | e | M |  p | Total | %Error
-------------------------------------------------------------------------------------
,     | 498 |   |     |  2 |     1 |     10 |   |   | 1 |   |   |    |   512 |   2.7%
;     |     | 1 |     |    |       |        |   |   |   |   |   |    |     1 |   0.0%
\t    |   3 |   | 922 |    |     6 |     91 | 2 | 1 |   |   | 2 | 27 |  1054 |  12.5%
|     |     |   |     | 33 |       |        |   |   |   |   |   |    |    33 |   0.0%
space |     |   |     |    |     9 |      1 |   |   |   | 4 |   |    |    14 |  35.7%
-------------------------------------------------------------------------------------
Total | 501 | 1 | 922 | 35 |    16 |    102 | 2 | 1 | 1 | 4 | 2 | 27 |  1614 |
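
The %Error column can be re-derived from the counts in the table. The sketch below is only a cross-check of that arithmetic. It assumes that the column printed as "\" stands for the "|" delimiter, which is what the 0.0% error on the "|" row implies; the names `confusion`, `CORRECT` and `error_rate` are chosen here for illustration.

```python
# Counts copied from the table: {human label: {sniffed label: number of files}}.
confusion = {
    ",": {",": 498, "\\": 2, "space": 1, "Except": 10, "c": 1},
    ";": {";": 1},
    "\t": {",": 3, "\t": 922, "space": 6, "Except": 91,
           ":": 2, ")": 1, "M": 2, "p": 27},
    "|": {"\\": 33},
    "space": {"space": 9, "Except": 1, "e": 4},
}

# Which sniffed label counts as a correct guess for each human label.
# ASSUMPTION: the "\" column is the pipe character.
CORRECT = {",": ",", ";": ";", "\t": "\t", "|": "\\", "space": "space"}


def error_rate(row, correct_label):
    """Fraction of files in a row whose sniffed label is not the correct one."""
    total = sum(row.values())
    return 1 - row.get(correct_label, 0) / total


rates = {
    human: round(100 * error_rate(row, CORRECT[human]), 1)
    for human, row in confusion.items()
}
for human, rate in rates.items():
    print(repr(human), rate)
```

Under that assumption the computed rates agree with the %Error column: 2.7, 0.0, 12.5, 0.0 and 35.7.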




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-06 Thread R. David Murray

R. David Murray added the comment:

Yes, much better :)




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

I apologize, it seems the text table got line wrapped. This time as a TXT
attachment.

-Tiago

--
Added file: http://bugs.python.org/file40140/csvsniffertest3.txt

      | Sniffer
Human |   , | ; |  \t |  \ | space | Except | : | ) | c | e | M |  p | Total | %Error
-------------------------------------------------------------------------------------
,     | 498 |   |     |  2 |     1 |     10 |   |   | 1 |   |   |    |   512 |   2.7%
;     |     | 1 |     |    |       |        |   |   |   |   |   |    |     1 |   0.0%
\t    |   3 |   | 922 |    |     6 |     91 | 2 | 1 |   |   | 2 | 27 |  1054 |  12.5%
|     |     |   |     | 33 |       |        |   |   |   |   |   |    |    33 |   0.0%
space |     |   |     |    |     9 |      1 |   |   |   | 4 |   |    |    14 |  35.7%
-------------------------------------------------------------------------------------
Total | 501 | 1 | 922 | 35 |    16 |    102 | 2 | 1 | 1 | 4 | 2 | 27 |  1614 |



[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-06 Thread R. David Murray

R. David Murray added the comment:

Your best bet is to attach an ascii text file as an uploaded file.




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-06 Thread Tiago Wright

Tiago Wright added the comment:

Table attached.

-Tiago


--
Added file: http://bugs.python.org/file40138/csvsniffertest3.htm


[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-05 Thread Skip Montanaro

Skip Montanaro added the comment:

Tiago, sorry, but your last post with results is completely unintelligible. Can 
you toss the table in a file and attach it instead?




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-05 Thread Tiago Wright

Tiago Wright added the comment:

I've run the Sniffer against 1614 csv files on my computer and compared the
delimiter it detects to what I have set manually. Here are the results:

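            | Sniffer
Human       |   , | ; |  \t |  \ | (blank) | Error | : | ) | c | e | M |  p | Grand Total | Error rate
,           | 498 |   |     |  2 |       1 |    10 |   |   | 1 |   |   |    |         512 |       2.7%
;           |     | 1 |     |    |         |       |   |   |   |   |   |    |           1 |       0.0%
\t          |   3 |   | 922 |    |       6 |    91 | 2 | 1 |   |   | 2 | 27 |        1054 |      12.5%
|           |     |   |     | 33 |         |       |   |   |   |   |   |    |          33 |       0.0%
space       |     |   |     |    |       9 |     1 |   |   |   | 4 |   |    |          14 |      35.7%
Grand Total | 501 | 1 | 922 | 35 |      16 |   102 | 2 | 1 | 1 | 4 | 2 | 27 |        1614 |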
-Tiago




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-04 Thread Peter Otten

Peter Otten added the comment:

The sniffer actually changes its mind in the fourth line:

Python 3.4.0 (default, Jun 19 2015, 14:20:21)
[GCC 4.8.2] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
... """).delimiter
','
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
... """).delimiter
'M'

That line has only 5 commas while all the others have 6. Unfortunately, every 
line contains exactly two "M" characters.
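
The per-line counts behind that observation can be reproduced with a few lines of Python. This is only an illustration of the counting, not the Sniffer's own code; `sample_lines` and `per_line_counts` are names chosen here.

```python
# The four-line sample from the session above, one string per line.
sample_lines = [
    "Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,",
    "Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,",
    "Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,",
    "Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,"
    "Photos don't match the invoice",
]


def per_line_counts(lines, char):
    """Return how many times char occurs on each line."""
    return [line.count(char) for line in lines]


comma_counts = per_line_counts(sample_lines, ",")
m_counts = per_line_counts(sample_lines, "M")
print(comma_counts)  # [6, 6, 6, 5]
print(m_counts)      # [2, 2, 2, 2]
```

A character whose count is identical on every line looks more regular than one whose count varies, which is consistent with "M" beating "," once the fourth line is included.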

--
nosy: +peter.otten




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-04 Thread Skip Montanaro

Skip Montanaro added the comment:

I should have probably pointed out that the Sniffer class is the unloved 
stepchild of the csv module. In my experience it is rarely necessary. You 
either:

* Are reading CSV files which are about what Excel would produce with its 
default settings

or

* Know just what your format is, and can define the various parameters easily

It's pretty rare, I think, to get a delimited file in some format which is 
completely unknown and which thus has to be deduced.

As Peter showed, the Sniffer class is also kind of unreliable. I didn't write 
it, and there are precious few test cases for it. One of your datasets should 
probably be added to the mix and bugs fixed.




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-04 Thread R. David Murray

R. David Murray added the comment:

If you look at the algorithm it is doing some fancy things with metrics, but 
does have a 'preferred delimiters' list that it checks.  It is possible things 
could be improved either by tweaking the threshold or by somehow giving added 
weight to the metrics when the candidate character is in the preferred 
delimiter list.

We might have to do this with a feature flag to turn it on, though, since it 
could change the results for programs that happen to work with the current 
algorithm.
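
To make the weighting idea concrete, here is a deliberately simplified sketch. It is not the algorithm in csv.py: the `consistency` metric, the `PREFERRED` list, the `bonus` value and the tie-breaking rule are all assumptions made for illustration.

```python
from collections import Counter

# ASSUMPTION: an illustrative preferred-delimiter list, most preferred first.
PREFERRED = [",", "\t", ";", " ", ":"]

sample_lines = [
    "Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,",
    "Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,",
    "Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,",
    "Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,"
    "Photos don't match the invoice",
]


def consistency(lines, char):
    """Fraction of lines that share the most common non-zero count of char."""
    nonzero = [n for n in (line.count(char) for line in lines) if n > 0]
    if not nonzero:
        return 0.0
    _, freq = Counter(nonzero).most_common(1)[0]
    return freq / len(lines)


def guess_delimiter(lines, preferred=PREFERRED, bonus=0.3):
    """Pick the character with the best consistency plus preference bonus."""
    candidates = {ch for line in lines for ch in line}

    def key(ch):
        is_preferred = ch in preferred
        rank = preferred.index(ch) if is_preferred else len(preferred)
        score = consistency(lines, ch) + (bonus if is_preferred else 0.0)
        # Ties on score go to the character earlier in the preferred list.
        return (score, -rank)

    return max(candidates, key=key)


print(guess_delimiter(sample_lines))
```

On this sample the comma is consistent on 3 of 4 lines while "M" is consistent on all 4, so the bonus is what tips the result towards ",". How large the bonus should be is exactly the kind of threshold that would need tuning against real data.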

--
nosy: +r.david.murray




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-04 Thread Tiago Wright

Tiago Wright added the comment:

I agree that the parameters are easily deduced for any one csv file after a
quick inspection. The reason I went searching for a good sniffer was that I
have ~2100 csv files of slightly different formats coming from different
sources. In some cases, a csv file is sent directly to me, other times it
is first opened in excel and saved, and other times it is copy-pasted from
excel into another location, yielding 3 variations on the formatting from a
single source. Multiply that by 8 different sources of data.

For hacking disparate data sources together, it is more interesting to have
a sniffer that works really well to distinguish among the most common
dialects of csv, than one that tries to deduce the parameters of a rare or
unknown format. I agree with you that it would be a rare case that the
format is completely unknown -- more likely, you know it is one of two or
three possible options and don't want to have to inspect each file to find
out which.

Unfortunately, trying to limit delimiters to some of the most common ones
using the delimiters parameter just raises an error:

Python 3.4.3 (v3.4.3:9b73f1c3e601, Feb 23 2015, 02:52:03)
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import csv
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """).delimiter
'M'
>>> csv.Sniffer().sniff("""\
... Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
... Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,
... Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,
... Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,Photos don't match the invoice
... Sscanner ac15072911261.pdf,CM_14464,0.06,MX Dutiful,948203,,
... Sscanner ac15072911262.pdf,CM 16148,88,MX Apr,948202,,
... Sscanner ac15072911250.pdf,CM_14464,94.08,MX Jan Feb,948208,,
... Sscanner ac15072911251.pdf,CM_17491,39.84,MX Unkwon,948207,,
... Sscanner ac15072911253.pdf,CM_14464,42.07,MX Cavalier,,,
... Sscanner ac15072911253.pdf,CM_14464,2.23,MX Dutiful,,,
... Sscanner ac15072911253.pdf,CM_14464,12.84,MX Apr,,,
... """, delimiters=",\t|^").delimiter
Traceback (most recent call last):
  File "<stdin>", line 13, in <module>
  File "/Library/Frameworks/Python.framework/Versions/3.4/lib/python3.4/csv.py", line 189, in sniff
    raise Error("Could not determine delimiter")
_csv.Error: Could not determine delimiter
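
When the format is known to be one of a few options, one possible workaround is to skip the Sniffer and test each candidate directly with csv.reader. The sketch below is an assumption-laden illustration, not part of the csv module: `pick_delimiter` is a hypothetical helper, and "most rows with the same field count" is just one plausible selection rule.

```python
import csv
from collections import Counter


def pick_delimiter(sample, candidates=(",", "\t", "|", ";")):
    """Return the candidate that yields the most consistent row width.

    Each candidate is used to parse the sample with csv.reader. Candidates
    that never split a row are ignored. Raises csv.Error if none fits.
    """
    best, best_key = None, None
    for delim in candidates:
        rows = csv.reader(sample.splitlines(), delimiter=delim)
        widths = Counter(len(row) for row in rows if row)
        if not widths:
            continue
        width, freq = widths.most_common(1)[0]
        if width < 2:
            continue  # this candidate did not split the typical row
        key = (freq, width)
        if best_key is None or key > best_key:
            best, best_key = delim, key
    if best is None:
        raise csv.Error("none of the candidate delimiters fits the sample")
    return best


sample = (
    "Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,\n"
    "Sscanner ac15072911220.pdf,CM_15203,41.56,MX Jan Feb,948198,,\n"
    "Sscanner ac15072911221.pdf,CM 16148,41.50,MX Unkwon,948199,,\n"
    "Sscanner ac15072911230.pdf,CM 16148,6.42,MX Cavalier,948200,"
    "Photos don't match the invoice\n"
)
print(repr(pick_delimiter(sample)))  # ','
```

Because only the listed candidates are ever considered, a letter such as "M" cannot win, and one short row does not disqualify the comma.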




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-03 Thread Tiago Wright

New submission from Tiago Wright:

csv.Sniffer().sniff() guesses "M" as the delimiter of the first dataset below. 
The same error occurs when the "," is replaced by "\t". However, it correctly 
guesses "," for the second dataset.

--- Dataset 1 ---
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,28714.32,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,15600,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,33488,MX Cavalier,948200,Photos don't match the invoice
Sscanner ac15072911261.pdf,CM_14464,1713.6,MX Dutiful,948203,,
Sscanner ac15072911262.pdf,CM 16148,3114,MX Apr,948202,,
Sscanner ac15072911250.pdf,CM_14464,1232.28,MX Jan Feb,948208,,
Sscanner ac15072911251.pdf,CM_17491,15232,MX Unkwon,948207,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Cavalier,,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Dutiful,,,
Sscanner ac15072911253.pdf,CM_14464,11250,MX Apr,,,

--- Dataset 2---
Invoice File,Credit Memo,Amount Claimed,Description,Invoice,Message,
Sscanner ac15072911220.pdf,CM_15203,82.07,MX Jan Feb,948198,,
Sscanner ac15072911221.pdf,CM 16148,23.29,MX Unkwon,948199,,
Sscanner ac15072911230.pdf,CM 16148,88.55,MX Cavalier,948200,Photos don't match the invoice,
Sscanner ac15072911261.pdf,CM_14464,58.78,MX Dutiful,948203,,
Sscanner ac15072911262.pdf,CM 16148,52,MX Apr,948202,,
Sscanner ac15072911250.pdf,CM_14464,40.40,MX Jan Feb,948208,,
Sscanner ac15072911251.pdf,CM_17491,54.97,MX Unkwon,948207,,
Sscanner ac15072911253.pdf,CM_14464,4.08,MX Cavalier,,,
Sscanner ac15072911253.pdf,CM_14464,49.11,MX Dutiful,,,
Sscanner ac15072911253.pdf,CM_14464,18.28,MX Apr,,,

--
components: Extension Modules
messages: 247967
nosy: Tiago Wright
priority: normal
severity: normal
status: open
title: csv.Sniffer guesses M instead of \t or , as the delimiter
type: behavior
versions: Python 3.4




[issue24787] csv.Sniffer guesses M instead of \t or , as the delimiter

2015-08-03 Thread Skip Montanaro

Skip Montanaro added the comment:

How are you calling the sniff() method? Note that it takes a sample of the CSV 
file. For example, this works for me:

>>> f = open("sniff1.csv")
>>> dialect = csv.Sniffer().sniff(next(open("sniff1.csv")))
>>> dialect.delimiter
','
>>> dialect.lineterminator
'\r\n'

where sniff1.csv is your Dataset 1. (I think for reliable operation you really 
want your sample to be a multiple of whole lines.)
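
One way to build a sample that always ends on a line boundary is to read a fixed number of complete lines rather than a fixed number of bytes. The helper names `sample_whole_lines` and `sniff_file`, the default of 20 lines and the UTF-8 encoding are assumptions made for this sketch.

```python
import csv
from itertools import islice


def sample_whole_lines(path, max_lines=20, encoding="utf-8"):
    """Return the first max_lines complete lines of a file as one string."""
    # newline="" leaves the original line terminators untouched.
    with open(path, newline="", encoding=encoding) as f:
        return "".join(islice(f, max_lines))


def sniff_file(path, max_lines=20):
    """Sniff a dialect from a whole-line sample; may raise csv.Error."""
    return csv.Sniffer().sniff(sample_whole_lines(path, max_lines))
```

With this, `sniff_file("sniff1.csv", max_lines=1)` corresponds to the one-line sample used in the session above.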

--
nosy: +skip.montanaro
