On 26/09/2009, at 12:00 AM, Jay A. Kreibich wrote:

> On Fri, Sep 25, 2009 at 10:24:15AM +1000, BareFeet scratched on the  
> wall:
>
>> In reality, in the thousands of CSV files I've dealt with
>> over the years, they all follow the same standard:
>>
>> 1. Rows are delimited by a new line (return and/or line feed).
>> 2. Columns are delimited by a comma.
>> 3. "Quoted text" is treated as one value item, including any commas  
>> or
>> new lines within it.
>> 4. A double quote "" is used to put a quote within quotes.
>>
>> That's it.
>
> This is more or less the standard put forth by RFC 4180.  And if  
> this is all you've encountered

Yes.

> you're not using very many different applications

I guess "many" is subjective, but I would say I have used many,  
including Tandem mainframe output, SQL Server exports, various  
product list suppliers, Excel exports, electron microscope data, etc.  
I don't doubt that you've dealt with some strange CSV files, but I  
think the number of files defined as "strange" shrinks if you define  
the CSV format more precisely.

Please understand that I am not arguing that CSV should be used. But  
the fact is that CSV is a commonly used data export/import format (I'd  
say around 50%) and therefore supporting it is a significant  
requirement.

> or you're primarily dealing with numbers and simple strings that  
> don't contain quotes or commas.  CSV works very very well if you  
> never get into the question of escapes, but details count.

As I mentioned previously in points 3 and 4 above, I do deal with  
strings that contain quotes and commas. Dealing with them is well  
defined in practical use (or my experience of it).
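To show how little those four points demand of a parser, here is a minimal sketch in Python (standing in for the Objective-C routine mentioned later in this post). It is not any particular library's implementation, just the four rules above written out directly:

```python
def parse_csv(text):
    """Parse CSV per the four rules: newline-delimited rows,
    comma-delimited fields, "quoted text" may contain commas and
    newlines, and a doubled quote "" inside quotes is a literal quote."""
    rows, row, field = [], [], []
    i, n = 0, len(text)
    while i < n:
        c = text[i]
        if c == '"':                        # rule 3: quoted field
            i += 1
            while i < n:
                if text[i] == '"':
                    if i + 1 < n and text[i + 1] == '"':
                        field.append('"')   # rule 4: "" -> "
                        i += 2
                    else:
                        i += 1              # closing quote
                        break
                else:
                    field.append(text[i])
                    i += 1
        elif c == ',':                      # rule 2: field break
            row.append(''.join(field)); field = []
            i += 1
        elif c in '\r\n':                   # rule 1: row break
            row.append(''.join(field)); field = []
            rows.append(row); row = []
            i += 2 if text[i:i + 2] == '\r\n' else 1
        else:
            field.append(c)
            i += 1
    if field or row:                        # file without trailing newline
        row.append(''.join(field))
        rows.append(row)
    return rows
```

A field like `"He said ""hi"""` comes out as `He said "hi"`, and a quoted comma stays inside its field, which covers points 3 and 4.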

> Reading the RFC only proves my point.  SQLite v3 is older than that  
> doc, and it pretty much admits the cat was out of the bag a long  
> time ago.  There are a ton of optional and might/may/could sections  
> that event

I accept that some earlier CSV implementations varied and that the RFC  
was in catch-up mode. But I'd summarise that as a delay in setting the  
standard, and thankfully not so much a problem in everyday use, since  
most (if not all) major current implementations follow the four-point  
specification I outlined above.

> the format they define has a lot of holes in it

I don't think there are holes, within the scope of the data with which  
it deals. For instance, CSV doesn't deal with blob data, pictures,  
related tables, column type (even the distinction between a numeric  
and text value) and therefore has no definition for it.

> (i.e. headers, or no headers?).

Again, as I understand it, headers are outside of the CSV scope.

CSV only defines an array of cells of string data, not column  
definitions. If headers are included within CSV they are just another  
row in the array. CSV doesn't know what they are but an importer  
should import them into the array. After importing the raw data into  
an array, the program can then interpret the data in ways that make  
sense to it. If the program understands headers, it should look for  
them in the array. If it understands column types, it should apply  
them, and so on.

Since we're coming from the SQLite perspective, data means more to us  
than what CSV defines. Concepts such as column type and cell type are  
everyday definitions in SQLite but are meaningless to CSV. In most  
cases, we pre-define a table within SQLite specifically for receiving  
data from a particular CSV file. CSV defines just the array of  
data.
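As a sketch of that workflow, here is the "pre-defined receiving table" idea using Python's built-in csv and sqlite3 modules. The table name, columns, and sample data are all hypothetical:

```python
import csv
import io
import sqlite3

# Hypothetical CSV export: the first row happens to be headers, but to
# CSV it is just another row of strings.
raw = 'name,price\nWidget,1.50\n"Gadget, large",2.75\n'

con = sqlite3.connect(":memory:")
# Pre-define the receiving table with the types we want; CSV carries no
# type information, so typing is the importer's job, not the format's.
con.execute("CREATE TABLE products (name TEXT, price REAL)")

rows = list(csv.reader(io.StringIO(raw)))
headers, data = rows[0], rows[1:]   # our program decides row 0 is headers
con.executemany("INSERT INTO products VALUES (?, ?)",
                [(name, float(price)) for name, price in data])

print(con.execute("SELECT name, price FROM products").fetchall())
# -> [('Widget', 1.5), ('Gadget, large', 2.75)]
```

Note that the quoted comma in "Gadget, large" survives intact, and that the decision to treat row 0 as headers lives in our program, not in the CSV layer.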

>> Everything I've seen uses this.
>
> According to the RFC Excel doesn't use double-quotes for anything.  
> You might not care about Excel, but I'm willing to bet it is one of  
> the most-- if not the most-- common exporters of CSV.  The question  
> of getting data from Excel into SQLite shows up on the list every  
> now and then.

Yes, Excel is a major consideration and I have to deal with data from  
Excel exporters often. In my experience, though, the specification I  
outlined above deals with Excel exports fine. I can't comment on the  
RFC which may or may not match common use.

>> Some don't need delimiters in values, so they don't need quotes,  
>> but the encompassing specification works for all cases.
>
> No, it doesn't.  Working on a large website that provided CSV  
> exports for a number of data sources, I've seen plenty of examples  
> that don't work.  Finding a common format that could be exported  
> into a handful of common desktop apps was so pointless we seriously  
> considered getting rid of CSV all together, because we got tired of  
> our users telling us how simple CSV was, and why couldn't we just do  
> this one thing differently so it would work on *their* application.

This surprises me. Did you cater for escapes as mentioned in points 3  
and 4 above? You mentioned:

> CSV works very very well if you never get into the question of escapes

So I wonder if you did.

Also, perhaps the users were asking for specific data within the CSV,  
such as header names or column types. This doesn't mean that the CSV  
format is ambiguous, just that they want specific information included  
in the array.

>> It's not that big a deal for SQLite to support it, so it should.
>
> If it is so simple, and you know where the code is...

There are numerous code subroutines around to deal with this, in  
various languages. I did a Google search the other day for one to  
incorporate into my Objective-C program, pasted it in and was  
instantly able to import all of my usual CSV data sources into an  
array. I then loop through the array to interpret the cells in a way  
that makes sense to my program. CSV just defines an array of data. My  
program can then interpret it in whatever way makes sense within its  
own context.
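That two-step approach (raw array first, interpretation second) looks like this in Python, again as a stand-in for the Objective-C routine; the sample data and the numeric-or-string rule are illustrative only:

```python
import csv
import io

raw = 'id,reading\n1,3.25\n2,"n/a"\n'

# Step 1: read the raw CSV into an array of string cells. CSV's job
# ends here.
table = list(csv.reader(io.StringIO(raw)))

# Step 2: interpret the cells in a way that makes sense to this
# program; CSV itself says nothing about types.
def interpret(cell):
    try:
        return float(cell)
    except ValueError:
        return cell          # leave non-numeric cells as strings

typed = [[interpret(cell) for cell in row] for row in table[1:]]
print(typed)   # -> [[1.0, 3.25], [2.0, 'n/a']]
```

Any program-specific meaning (headers, units, column types) is layered on in step 2, after the format-level parsing is done.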

> after all, it is such a "simple" format-- and find themselves in a  
> mess of code soon enough.  Seriously, give it a try.

The code is fairly straightforward. I've done it a few times.

> Carlos's Python script (nice!) is a great example.  His comment "I  
> am so grateful I did not have to write a parser for CSV" is dead on.

I would also rather use existing code than roll my own.

> And, as he points out, the reason the Python module is so good is  
> that it is adaptive

There's nothing wrong with being adaptive. But the CSV-to-array  
importer only has to deal with the four spec points above. How that  
array is used is a second step, after the file reading.

> and really reads five or six different variants of CSV (something a  
> reader can do but a writer cannot).

What are the variants? You mentioned with or without escapes, but  
both are within the spec. You mentioned headers, but they are within  
the spec. Are you talking about non-standard escapes, like \"  
instead of "" for quotes within quotes? That would be incorrect  
CSV. Or are you talking about semicolons instead of commas?
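For what it's worth, the semicolon variant is the kind of thing an adaptive reader can absorb without any change to the four rules. A sketch using Python's csv.Sniffer, which guesses the delimiter from a sample (the two sample strings are hypothetical):

```python
import csv
import io

# The same data in two "variants": standard commas, and the
# semicolon-delimited form common in European locales.
samples = ['a,b,c\n1,2,3\n', 'a;b;c\n1;2;3\n']

for sample in samples:
    dialect = csv.Sniffer().sniff(sample)   # guess delimiter from the sample
    rows = list(csv.reader(io.StringIO(sample), dialect))
    print(repr(dialect.delimiter), rows)
```

Both samples decode to the same array of cells; only the reader's guess about the delimiter differs. A writer, by contrast, has to commit to one delimiter, which is the asymmetry being described.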

> that advanced support of this kind of thing is really outside of the  
> scope of SQLite3.  After all, the .import command is part of the  
> shell, not part of the core library.

Given the number of times that CSV import has come up on this list, it  
is obviously important. Since .import already exists in the SQLite  
shell, it might as well embrace the spec properly, including delimiters  
within quotes. The code is trivial compared to SQLite in general.

> CSV is a great quick and dirty format to move data.  But it isn't  
> "simple" and it isn't nearly as universal as many assume.

As above, I think it is pretty simple. The complexity arrives when  
people expect CSV to provide more than a simple array of strings.

> It works great if you're just moving simple numbers and strings

Exactly.

> that don't include commas, but becomes a mess when you get into  
> exceptions.

No, it caters for commas. If your CSV implementation doesn't, then it  
isn't doing full CSV.

Tom
BareFeet

  --
Comparison of SQLite GUI applications:
http://www.tandb.com.au/sqlite/compare/

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users