David Coker wrote:

> Thanks for the suggestions and example Jim!
>
> Of course I'm using sample data to test with, but the real files
> that need to be worked with are pretty massive. To give you an idea
> of just how massive...
>
> I have a subset of data that I've been working with for over a week
> now, massaging text, filling in blank fields (concatenation in Excel)
> and thought I had everything taken care of as of late last night. I
> found out this morning that they needed one more column of data merged
> into my "finished" file.

If you have it in Excel, can you export it as tab-delimited text?

Tabs rarely occur in field data, and if the field values don't contain tabs or returns you'd be able to use normal chunk expressions for orders-of-magnitude better performance.
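
To show why that case is so much simpler, here is a rough sketch in Python rather than Rev (the function name and sample data are made up for illustration). It assumes no field value contains a tab or a return, so plain splitting does all the work and no quote handling is needed:

```python
# Sketch only: assumes no field value contains a tab or a return.
# parse_tab_delimited is a hypothetical helper, not part of any library.
def parse_tab_delimited(text):
    rows = []
    for line in text.splitlines():
        if line == "":
            continue  # skip completely blank lines
        # A comma inside a value needs no special treatment here.
        rows.append(line.split("\t"))
    return rows

sample = "name\tcity\nSmith, John\tAustin\n"
rows = parse_tab_delimited(sample)
```

Each row comes back as a list of field values, with the comma in "Smith, John" left intact.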


>>Just so you know CSV is the second worst format ever invented.
>>They are still searching for worst one, but have not found it yet.
>
> <BIG GRIN> No question about it! </BIG GRIN>

They found it:  it was another form of CSV. ;)

That's one of the many problems with CSV: it isn't a single defined format, but rather a collection of ad hoc variants. I've seen differences in escaping and quotation used among even products from just Microsoft, and in different versions of the same Microsoft products, not to mention the even greater number of variants used by other programs.

Some put quotes around every field value; others quote only textual values but not numbers; others quote only multi-word values but not single words. Some escape quotes inside values with a preceding slash, while others escape them by doubling the quote character (a total Whiskey Tango Foxtrot "solution"). Some also escape returns in values, while many leave returns unescaped, requiring you to figure it out character-by-character. And others do even weirder things....

I've had to write CSV parsers, using flags as Jim outlined. After seeing the loss of productivity and performance from those formats, I now have a policy of never delivering any product with a CSV export option, on moral grounds. :)
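
For anyone who hasn't had the pleasure, this is roughly what a flag-based parser looks like, sketched in Python rather than Rev. It is not Jim's code, and it handles only one variant: quotes escaped by doubling, and returns allowed inside quoted values. The function name is made up for illustration, and other variants would need different rules:

```python
# Sketch only: handles ONE CSV variant (doubled quotes as the escape,
# returns allowed inside quoted values). Hypothetical helper name.
def parse_csv_doubled_quotes(text):
    rows, row, field = [], [], []
    in_quotes = False  # the flag: are we inside a quoted value?
    i, n = 0, len(text)
    while i < n:
        ch = text[i]
        if in_quotes:
            if ch == '"':
                if i + 1 < n and text[i + 1] == '"':
                    field.append('"')  # doubled quote means a literal quote
                    i += 1
                else:
                    in_quotes = False  # closing quote
            else:
                field.append(ch)  # commas and returns are plain data here
        elif ch == '"':
            in_quotes = True
        elif ch == ',':
            row.append("".join(field))
            field = []
        elif ch == '\n':
            row.append("".join(field))
            field = []
            rows.append(row)
            row = []
        elif ch != '\r':
            field.append(ch)
        i += 1
    if field or row:  # last record had no trailing return
        row.append("".join(field))
        rows.append(row)
    return rows

sample = 'a,"b,1","say ""hi"""\n"line1\nline2",x\n'
records = parse_csv_doubled_quotes(sample)
```

Every character has to be examined with the flag in mind, which is exactly the per-character cost that a tab-delimited export avoids.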

Any format whose delimiter is a character as common in field values as the comma is, to be as polite as possible, a stupid invention, almost an anti-invention.

In a just world, whoever first deployed a system that used CSV would be found and put in stocks in the public square with a sign reading:

 "I'm responsible for the loss of several million hours
  of other people's time."

--
 Richard Gaskin
 Fourth World
 Rev training and consulting: http://www.fourthworld.com
 Webzine for Rev developers: http://www.revjournal.com
 revJournal blog: http://revjournal.com/blog.irv