[issue44677] CSV sniffing falsely detects space as a delimiter

2021-07-21 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Changing the sniffer logic is risky because it could break existing code that 
relies on the current predictions.

FWIW, in your example, the sniffer gets the desired result if given a delimiter 
hint:

>>> import pprint
>>> from csv import Sniffer
>>> s = "a|b\nc| 'd\ne|' f"
>>> pprint.pp(dict(vars(Sniffer().sniff(s, '|'))))
{'__module__': 'csv',
 '_name': 'sniffed',
 'lineterminator': '\r\n',
 'quoting': 0,
 '__doc__': None,
 'doublequote': False,
 'delimiter': '|',
 'quotechar': "'",
 'skipinitialspace': False}
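
For readers who want to try the hint end to end, here is a minimal sketch that passes the hint through the `delimiters` argument of `Sniffer.sniff()` and then feeds the sniffed dialect to `csv.reader`. The variable names are illustrative only, and no output is shown because the quoting details may vary between Python versions.

```python
import csv
import io

sample = "a|b\nc| 'd\ne|' f"

# Restrict the candidate delimiters to '|' (the second argument of sniff()).
dialect = csv.Sniffer().sniff(sample, delimiters='|')
print(repr(dialect.delimiter))

# Parse the same text with the sniffed dialect.
rows = list(csv.reader(io.StringIO(sample), dialect))
for row in rows:
    print(row)
```

Note that the output quoted above still reports quotechar "'", so a field that begins with an apostrophe may be treated as quoted. If that is unwanted, `csv.reader` accepts an explicit `quotechar` keyword that overrides the dialect.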

--
nosy: +rhettinger

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue44677] CSV sniffing falsely detects space as a delimiter

2021-07-20 Thread Roundup Robot


Change by Roundup Robot :


--
keywords: +patch
nosy: +python-dev
nosy_count: 1.0 -> 2.0
pull_requests: +25801
stage:  -> patch review
pull_request: https://github.com/python/cpython/pull/27256




[issue44677] CSV sniffing falsely detects space as a delimiter

2021-07-20 Thread Piotr Tokarski


Piotr Tokarski  added the comment:

I think changing `(?P<quote>["\']).*?(?P=quote)` to 
`(?P<quote>["\'])[^\n]*?(?P=quote)` in all regexes does the trick, doesn't it?
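
A quick way to check this suggestion is to run both variants of the four patterns against the sample from the report. The sketch below is only an approximation of the sniffer loop: the patterns are written out with the group names `delim`, `space` and `quote` as I read them in csv.py, and `first_matches` is a hypothetical helper, not part of the csv module.

```python
import re

# The four sniffer patterns, in the order the loop tries them
# (group names are my reading of csv.py, so treat them as an assumption).
ORIGINAL = (
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',
    r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
    r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',
)
# Proposed change: the quoted span may no longer cross a newline.
PROPOSED = tuple(p.replace('.*?', r'[^\n]*?') for p in ORIGINAL)


def first_matches(patterns, data):
    """Return findall() results of the first pattern that matches, else []."""
    for restr in patterns:
        found = re.compile(restr, re.DOTALL | re.MULTILINE).findall(data)
        if found:
            return found
    return []


sample = "a|b\nc| 'd\ne|' f"
quoted = 'a,"b c",d\n'

print(first_matches(ORIGINAL, sample))   # matches, with ' ' captured as delim
print(first_matches(PROPOSED, sample))   # no match on the reported sample
print(first_matches(PROPOSED, quoted))   # ordinary quoted field still matches
```

One consequence, as far as I can tell, is that quoted fields containing embedded newlines would no longer contribute to quote and delimiter detection in these patterns.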




[issue44677] CSV sniffing falsely detects space as a delimiter

2021-07-20 Thread Piotr Tokarski


Piotr Tokarski  added the comment:

Test sample:

```
import csv
from io import StringIO


def csv_text():
    return StringIO("a|b\nc| 'd\ne|' f")


with csv_text() as input_file:
    print('The following text is going to be parsed:')
    print(input_file.read())
    print()


with csv_text() as input_file:
    dialect_params = [
        'delimiter',
        'quotechar',
        'escapechar',
        'lineterminator',
        'quoting',
        'doublequote',
        'skipinitialspace'
    ]
    dialect = csv.Sniffer().sniff(input_file.read())
    print('The following dialect has been detected:')
    for dialect_param in dialect_params:
        print(f'- {dialect_param}: {repr(getattr(dialect, dialect_param))}')
    print()


with csv_text() as input_file:
    print('Parsed csv text:')
    for entry in csv.reader(input_file, dialect=dialect):
        print(f'- {entry}')
    print()
```

Actual output:

```
The following text is going to be parsed:
a|b
c| 'd
e|' f

The following dialect has been detected:
- delimiter: ' '
- quotechar: "'"
- escapechar: None
- lineterminator: '\r\n'
- quoting: 0
- doublequote: False
- skipinitialspace: False

Parsed csv text:
- ['a|b']
- ['c|', 'd\ne|', 'f']

```

Expected output:

```
The following text is going to be parsed:
a|b
c| 'd
e|' f

The following dialect has been detected:
- delimiter: '|'
- quotechar: '"'
- escapechar: None
- lineterminator: '\r\n'
- quoting: 0
- doublequote: False
- skipinitialspace: False

Parsed csv text:
- ['a', 'b']
- ['c', " 'd"]
- ['e', "' f"]

```




[issue44677] CSV sniffing falsely detects space as a delimiter

2021-07-19 Thread Piotr Tokarski


New submission from Piotr Tokarski :

Consider the following CSV content: "a|b\nc| 'd\ne|' f". The real delimiter 
in this case is the '|' character, but ' ' is sniffed instead. A verbose 
example is attached.

The problem lies in csv.py, in the following code:

```
matches = []
for restr in (r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)', # ,".*?",
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?P<delim>[^\w\n"\'])(?P<space> ?)',   #  ".*?",
              r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?:$|\n)',   # ,".*?"
              r'(?:^|\n)(?P<quote>["\']).*?(?P=quote)(?:$|\n)'):                            #  ".*?" (no delim, no space)
    regexp = re.compile(restr, re.DOTALL | re.MULTILINE)
    matches = regexp.findall(data)
    if matches:
        break
```

As a result, matches is non-empty and further processing continues with the 
delimiter falsely set to ' '.
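
To make the false match concrete, the following sketch applies only the first pattern of the loop to the sample text. The group names `delim`, `space` and `quote` are written out as I understand them from csv.py, so treat them as an assumption. Because of `re.DOTALL`, `.*?` is free to run across the newline between the two apostrophes.

```python
import re

# First pattern tried by the loop above (group names assumed from csv.py).
PATTERN = r'(?P<delim>[^\w\n"\'])(?P<space> ?)(?P<quote>["\']).*?(?P=quote)(?P=delim)'

data = "a|b\nc| 'd\ne|' f"

regexp = re.compile(PATTERN, re.DOTALL | re.MULTILINE)
match = regexp.search(data)

# The match starts at the space after "c|" and ends at the space before "f",
# so the space is captured as the delimiter on both sides of the "quoted" span.
print(repr(match.group(0)))    # " 'd\ne|' "
print(match.groupdict())       # {'delim': ' ', 'space': '', 'quote': "'"}
print(regexp.findall(data))    # [(' ', '', "'")]
```

Since this first pattern already matches, the loop breaks before the remaining three patterns are tried.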

--
components: Library (Lib)
messages: 397821
nosy: pt12lol
priority: normal
severity: normal
status: open
title: CSV sniffing falsely detects space as a delimiter
type: behavior
versions: Python 3.8
