Re: [Python-Dev] [Csv] These csv test cases seem incorrect to me...

2007-03-11 Thread John Machin
On 12/03/2007 1:01 PM, [EMAIL PROTECTED] wrote:
> I decided it would be worthwhile to have a csv module written in Python (no
> C underpinnings) for a number of reasons:
> 
> * It will probably be easier to add Unicode support to a Python version
> 
> * More people will be able to read/grok/modify/fix bugs in a Python
>   implementation than in the current mixed Python/C implementation.
> 
> * With alternative implementations of Python available (PyPy,
>   IronPython, Jython) it makes sense to have a Python version they can
>   use.
> 
> I'm far from having anything which will pass the current test suite, but in
> diagnosing some of my current failures I noticed a couple test cases which
> seem wrong.  In the TestDialectExcel class I see these two questionable
> tests:
> 
> def test_quotes_and_more(self):
>     self.readerAssertEqual('"a"b', [['ab']])
> 
> def test_quote_and_quote(self):
>     self.readerAssertEqual('"a" "b"', [['a "b"']])
> 
> It seems to me that if a field starts with a quote it *has* to be a quoted
> field.  Any quotes appearing within a quoted field have to be escaped and
> the field has to end with a quote.  Both of these test cases fail on one or the
> other assumption.  If they are indeed both correct and I'm just looking at
> things crosseyed I think they at least deserve comments explaining why they
> are correct.
> 
> Both test cases date from the first checkin.  I performed the checkin
> because, of the group developing the module, I believe I was the only one
> with checkin privileges at the time, not because I wrote the test cases.
> 
> Any ideas about why these test cases are in there?  I can't imagine Excel
> generating either one.
> 

Hi Skip,

'"a"b' can't be produced by applying minimalist CSV writing rules to 
'ab'. A non-minimalist writer could produce '"ab"', but certainly not 
'"a"b'.

The second case is worse -- it's inconsistent; the reader is supposed to 
remove the quotes from "a" but not from "b"???

IMHO these test cases are *WRONG* and it's a worry that they "work" with 
the current csv module :-(
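
For anyone who wants to see this first-hand, here is a minimal sketch, assuming a current CPython and its stdlib csv module. The helper name parse_line is made up for illustration. It shows the lenient default accepting both inputs, and the documented strict=True format parameter rejecting them:

```python
import csv

def parse_line(line, **fmtparams):
    """Parse one line with the stdlib csv reader and return its single row.

    parse_line is a hypothetical helper for this illustration only.
    """
    return next(csv.reader([line], **fmtparams))

# Default "excel" dialect: both questionable inputs are silently accepted.
print(parse_line('"a"b'))     # ['ab']
print(parse_line('"a" "b"'))  # ['a "b"']

# strict=True makes the reader raise csv.Error on this kind of input.
for bad in ('"a"b', '"a" "b"'):
    try:
        parse_line(bad, strict=True)
    except csv.Error as exc:
        print("rejected:", bad, "->", exc)
```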

Regards,

John

___
Python-Dev mailing list
Python-Dev@python.org
http://mail.python.org/mailman/listinfo/python-dev
Unsubscribe: 
http://mail.python.org/mailman/options/python-dev/archive%40mail-archive.com


Re: [Python-Dev] [Csv] These csv test cases seem incorrect to me...

2007-03-11 Thread John Machin
On 12/03/2007 1:41 PM, Andrew McNamara wrote:
> 
> The point was to produce the same results as Excel. Sure, Excel probably
> doesn't generate crap like this itself, but 3rd parties do, and people
> complain if we don't parse it just like Excel (sigh).

Let's put a little flesh on those a's and b's:

A typical example of the first case is where a database address line 
contains a quoted house name e.g.

"Dunromin", 123 Main Street

and the producer of the CSV file has not done any quoting at all.

An example of the 2nd case is a database address line like this:

C/o Mrs Jones, "Dunromin", 123 Main Street

and the producer of the CSV file has merely wrapped quotes about it 
without doubling the existing quotes, to emit this:

"C/o Mrs Jones, "Dunromin", 123 Main Street"

which Excel and adherents would distort to two fields containing:
'C/o Mrs Jones, Dunromin"' and ' 123 Main Street"' -- aarrgghh!!
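
A quick sketch of that distortion, assuming the stdlib csv module in a current CPython with its default "excel" dialect. The second half shows what a producer that doubles embedded quotes would have written, and that it round-trips:

```python
import csv
import io

# The badly produced line: quotes wrapped around the data, nothing doubled.
raw = '"C/o Mrs Jones, "Dunromin", 123 Main Street"'
fields = next(csv.reader([raw]))
print(fields)
# ['C/o Mrs Jones, Dunromin"', ' 123 Main Street"']

# A careful producer doubles the embedded quotes, and the data survives.
original = 'C/o Mrs Jones, "Dunromin", 123 Main Street'
buf = io.StringIO()
csv.writer(buf).writerow([original])
well_formed = buf.getvalue()
print(well_formed.rstrip("\r\n"))
# "C/o Mrs Jones, ""Dunromin"", 123 Main Street"
round_trip = next(csv.reader([well_formed]))
print(round_trip == [original])  # True
```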

People who complain as described are IMHO misguided; they are accepting 
crap and losing data (yes, the quotes in the above examples are *DATA*). 
Why should we heed their complaints?

Perhaps we could consider a non-default "dopey_like_Excel" option for 
csv :-)

BTW, it is possible to do a reasonable recovery job when the producer's 
protocol was to wrap quotes around the data without doubling existing 
quotes, providing there were an even number of quotes to start with. It 
just requires a quite different finite state machine.
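
One way such a recovery could look, as a rough sketch only and not part of the csv module. The function name split_undoubled and the exact rule are assumptions made for illustration: inside a quoted field, a quote followed by the delimiter or the end of the line is treated as the closing quote only when an even number of embedded quotes has been seen so far.

```python
def split_undoubled(line, delimiter=",", quote='"'):
    """Split one line whose fields were wrapped in quotes without doubling.

    Hypothetical recovery heuristic: embedded quotes are assumed to come in
    pairs, so a quote at a field boundary closes the field only when the
    count of embedded quotes seen so far is even.
    """
    fields = []
    i, n = 0, len(line)
    while True:
        buf = []
        if i < n and line[i] == quote:
            i += 1  # skip the opening quote
            embedded = 0
            while i < n:
                c = line[i]
                at_boundary = i + 1 == n or line[i + 1] == delimiter
                if c == quote and at_boundary and embedded % 2 == 0:
                    i += 1  # skip the closing quote
                    break
                if c == quote:
                    embedded += 1  # keep it: this quote is data
                buf.append(c)
                i += 1
        else:
            # Unquoted field: runs up to the next delimiter.
            while i < n and line[i] != delimiter:
                buf.append(line[i])
                i += 1
        fields.append("".join(buf))
        if i >= n:
            break
        i += 1  # skip the delimiter
    return fields

print(split_undoubled('"C/o Mrs Jones, "Dunromin", 123 Main Street"'))
# ['C/o Mrs Jones, "Dunromin", 123 Main Street']
```

With an odd number of embedded quotes the heuristic has nothing reliable to go on, which is why the even-count proviso matters.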

Cheers,
John




Re: [Python-Dev] [Csv] skipfinalspace

2008-10-20 Thread John Machin

Tom Brown wrote:
(Continuing thread started at 
http://mail.python.org/pipermail/csv/2008-October/000688.html)


On Sun, Oct 19, 2008 at 16:46, Andrew McNamara 
<[EMAIL PROTECTED] > wrote:


 >I downloaded the 2.6 source tar ball, but is it too late for new features to
 >get into versions <3?

Yep.

 >How would you feel about adding the following tests to Lib/test/test_csv.py
 >and getting them to pass?
 >
 >Also http://www.python.org/doc/2.5.2/lib/csv-fmt-params.html says
 >"*skipinitialspace* When True, whitespace immediately following the
 >delimiter is ignored."
 >but my tests show whitespace at the start of any field is ignored, including
 >the first field.

I suspect (but I haven't checked) that it means "after the delimiter and
before any quoted field" (or some variation on that).

I agree that whitespace after the delimiter and before any quoted field 
is skipped. Also whitespace after the start of the line and before any 
quoted field is skipped.
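
A small sketch of the observed behaviour, assuming the stdlib csv module in a current CPython (parse_line is a hypothetical helper name). It shows leading spaces being skipped for the first field as well, and trailing spaces being left alone:

```python
import csv

def parse_line(line, **fmtparams):
    """Parse one line with the stdlib csv reader (illustrative helper)."""
    return next(csv.reader([line], **fmtparams))

# Default: leading spaces are data.
print(parse_line(' a, b'))                                # [' a', ' b']

# skipinitialspace=True: skipped after the delimiter AND at start of line.
print(parse_line(' a, b', skipinitialspace=True))         # ['a', 'b']

# The skipped space also lets a following quote open a quoted field.
print(parse_line(' "a, b", "c"', skipinitialspace=True))  # ['a, b', 'c']

# Trailing spaces are untouched; there is no skipfinalspace parameter.
print(parse_line('a ,b ', skipinitialspace=True))         # ['a ', 'b ']
```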



All of the "dialect" parameters are there to allow parsing of a specific
common form of CSV file. Because there is no formal definition of the
format, the module simply aims to parse such files the same way (and produce
the same results) as common applications such as Excel and Access. Changing
the behaviour in any backwards-incompatible way is sure to get screams of anguish
from many users. Even when the behaviour appears to be a bug, you can
be sure people are counting on it working like that.


skipinitialspace defaults to False and, by the same logic, skipfinalspace 
should default to False to preserve compatibility with the csv module in 
2.6. On the other hand, the switch to version 3 is as good a time as any 
to break backwards compatibility to adopt something that works better 
for new users.


Read Andrew's lips: They don't want "better", they want "the same as MS".

Based on my experience parsing several hundred csv files generated by many 
different people, I think it would be nice to at least have a dialect 
that is excel + skipinitialspace=True + skipfinalspace=True.


Based on my experience extracting data from innumerable csv files (and 
infinite varieties thereof), spreadsheet files, and database tables, in 
99.99% of cases one should automatically apply the following 
transformations to each text field:

   * strip leading whitespace
   * strip trailing whitespace
   * replace embedded runs of whitespace by a single space
and one needs to ensure that the definition of whitespace includes the 
no-break space (NBSP) character.


As this "space normalisation" is needed for all input sources, the csv 
module is IMHO the wrong place to put it. A string method would be a 
better idea.
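
For illustration, the three transformations can be sketched outside the csv module in a couple of lines. This assumes Python 3, where str.split() with no arguments splits on any Unicode whitespace, NBSP (U+00A0) included. The names normalise_space and normalised_rows are made up for this example:

```python
import csv

def normalise_space(text):
    """Strip leading/trailing whitespace and collapse embedded runs.

    Relies on Python 3's str.split() treating NBSP as whitespace.
    """
    return " ".join(text.split())

def normalised_rows(lines, **fmtparams):
    """Yield csv rows with every field space-normalised (illustrative)."""
    for row in csv.reader(lines, **fmtparams):
        yield [normalise_space(field) for field in row]

print(normalise_space('\xa0 C/o   Mrs\tJones \xa0'))   # C/o Mrs Jones
print(list(normalised_rows([' a  b , c\xa0'])))        # [['a b', 'c']]
```

Because normalise_space knows nothing about csv, the same function can be applied to fields from spreadsheets or database tables.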


Cheers,
John