On Mon, 26 Mar 2007 19:58:50 +0530
"Aniesh joseph" <[EMAIL PROTECTED]> wrote:

> Hello
> 
> 
> I have to read large CSV files, up to 10 MB in size. I tried to read each
> line using the getcsv() method, but that did not work well. I have to check
> the contents of the CSV files for problems such as duplicate rows or rows
> with missing contents.
> 
> Can anybody suggest a method for reading large CSV files?

The trick to parsing large files is to process each line completely and
then discard it before reading the next one. The memory for strings that
have already been processed should then be reclaimed, although I'm not
certain that it will be. You might want to run a simple test that reads
and discards lines from a file much bigger than the memory limit.
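
The read-and-discard model described above can be sketched in C as follows.
This is only an illustration under stated assumptions: the names
process_lines, line_fn and count_blank are made up for this example, and the
4096-byte buffer is an arbitrary limit. Every line is read into the same
buffer, so memory use does not grow with the size of the file.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical per-line callback: return nonzero to stop early. */
typedef int (*line_fn)(const char *line, size_t len, void *ctx);

/*
 * Read fp one line at a time into a single fixed buffer, hand each line
 * to fn, then let the next line overwrite it. Memory use is bounded by
 * sizeof(buf) no matter how large the file is.
 * Returns the number of lines processed, or -1 if a line does not fit.
 */
long
process_lines(FILE *fp, line_fn fn, void *ctx)
{
    char buf[4096];
    long n = 0;

    assert(fp != NULL && fn != NULL);

    while (fgets(buf, sizeof(buf), fp) != NULL) {
        size_t len = strlen(buf);

        if (len > 0 && buf[len - 1] == '\n') {
            buf[--len] = '\0';               /* strip lf */
            if (len > 0 && buf[len - 1] == '\r') {
                buf[--len] = '\0';           /* crlf -> lf */
            }
        } else if (!feof(fp)) {
            return -1;                       /* line longer than the buffer */
        }
        n++;
        if (fn(buf, len, ctx)) {
            break;
        }
    }
    return n;
}

/* Example callback: count empty lines; ctx points to a long counter. */
int
count_blank(const char *line, size_t len, void *ctx)
{
    (void)line;
    if (len == 0) {
        ++*(long *)ctx;
    }
    return 0;
}
```

Note that a physical line is not always a CSV record: a quoted field may
contain a newline, so a real checker would feed the stream to a CSV parser
rather than split on lines first.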

If the CSV interface you're using right now doesn't support that model,
then you'll have to write your own CSV parser. The code below is one
that I wrote in C. Fortunately, PHP is very much like C, so it shouldn't
be too hard to translate. It's used in production environments by
major software products, free and otherwise. If you do translate it to
PHP, perhaps you can post it back to the list.

Note that this code looks complicated, but it's actually one of the
smallest CSV parsers you'll find, and it's a lot more correct than just
about anything else you'll find. Parsing quotes and quotes within quotes
is non-trivial.
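
To show why quotes within quotes need care, here is a small, separate sketch
of decoding a single quoted field in which a doubled quote ("") stands for
one literal quote. It is not taken from the parser in this post; the function
name unquote_field and its return convention are assumptions made for the
example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Decode one quoted CSV field that starts at src[0] == '"'.
 * A doubled quote inside the field stands for one literal quote.
 * Writes the NUL-terminated result to dst (capacity dn) and returns the
 * number of source characters consumed, including both enclosing quotes.
 * Returns -1 if the field is not quoted, never closes, or does not fit.
 */
long
unquote_field(const char *src, char *dst, size_t dn)
{
    size_t i = 1, j = 0;

    assert(src != NULL && dst != NULL);

    if (src[0] != '"' || dn == 0) {
        return -1;
    }
    while (src[i] != '\0') {
        char ch = src[i++];

        if (ch == '"') {
            if (src[i] == '"') {
                i++;                /* escaped quote: keep one, skip one */
            } else {
                dst[j] = '\0';      /* closing quote ends the field */
                return (long)i;
            }
        }
        if (j + 1 >= dn) {
            return -1;              /* no room left for ch plus the NUL */
        }
        dst[j++] = ch;
    }
    return -1;                      /* input ended inside the quotes */
}
```

Given the input "say ""hi"", ok",next the function stores say "hi", ok in
dst and stops just before the separator. The decision cannot be made when the
first quote is seen; it depends on the character that follows, which is why a
parser needs a dedicated state for "just saw a quote".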

Obviously you'll need to change the sinput parameter to a file or some
kind of stream source, and return an array instead of having the caller
provide a buffer.

Mike

int
csv_parse_str(struct sinput *in,
            unsigned char *buf,
            size_t bn,
            unsigned char *row[],
            int rn,
            int sep,
            int flags)
{
    int trim, quotes, ch, state, r, j, t, inquotes;

    trim = flags & CSV_TRIM;
    quotes = flags & CSV_QUOTES;
    state = ST_START;
    inquotes = 0;
    ch = r = j = t = 0;

    memset(row, 0, sizeof(unsigned char *) * rn);

    while (rn && bn && (ch = snextch(in)) > 0) {
        switch (state) {
            case ST_START:
                if (ch != '\n' && ch != sep && isspace(ch)) {
                    if (!trim) {
                        buf[j++] = ch; bn--;
                        t = j;
                    }
                    break;
                } else if (quotes && ch == '"') {
                    j = t = 0;
                    state = ST_COLLECT;
                    inquotes = 1;
                    break;
                }
                state = ST_COLLECT;
                /* fall through: this character is the first of the field */
            case ST_COLLECT:
                if (inquotes) {
                    if (ch == '"') {
                        state = ST_END_QUOTE;
                        break;
                    }
                } else if (ch == sep || ch == '\n') {
                    row[r++] = buf; rn--;
                    if (ch == '\n' && t && buf[t - 1] == '\r') {
                        t--; bn++; /* crlf -> lf */
                    }
                    buf[t] = '\0'; bn--;
                    buf += t + 1;
                    j = t = 0;
                    state = ST_START;
                    inquotes = 0;
                    if (ch == '\n') {
                        rn = 0;
                    }
                    break;
                } else if (quotes && ch == '"') {
                    PMNF(errno = EILSEQ, ": unexpected quote in element %d", (r + 1));
                    return -1;
                }
                buf[j++] = ch; bn--;
                if (!trim || isspace(ch) == 0) {
                    t = j;
                }
                break;
            case ST_TAILSPACE:
            case ST_END_QUOTE:
                if (ch == sep || ch == '\n') {
                    row[r++] = buf; rn--;
                    buf[j] = '\0'; bn--;
                    buf += j + 1;
                    j = t =  0;
                    state = ST_START;
                    inquotes = 0;
                    if (ch == '\n') {
                        rn = 0;
                    }
                    break;
                } else if (quotes && ch == '"' && state != ST_TAILSPACE) {
                    buf[j++] = '"'; bn--; /* nope, just an escaped quote */
                    t = j;
                    state = ST_COLLECT;
                    break;
                } else if (isspace(ch)) {
                    state = ST_TAILSPACE;
                    break;
                }
                errno = EILSEQ;
                PMNF(errno, ": bad end quote in element %d", (r + 1));
                return -1;
        }
    }
    if (ch == -1) {
        AMSG("");
        return -1;
    }
    if (bn == 0) {
        PMNO(errno = E2BIG);
        return -1;
    }
    if (rn) {
        if (inquotes && state != ST_END_QUOTE) {
            PMNO(errno = EILSEQ);
            return -1;
        }
        row[r] = buf;
        buf[t] = '\0';
    }

    return in->count;
}
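
The function above depends on struct sinput, snextch(), the ST_* states and
the CSV_* flags, none of which are shown in this post. The following is a
guess at minimal definitions that satisfy what the function visibly expects:
snextch() returns a positive character, 0 at end of input and -1 on error,
and the struct carries a count of characters consumed. The field names, the
constant values and the FILE-based source are all assumptions, not the
library's actual definitions.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed values: the post does not show these definitions. */
#define CSV_TRIM   0x01
#define CSV_QUOTES 0x02

enum { ST_START = 1, ST_COLLECT, ST_TAILSPACE, ST_END_QUOTE };

/* Assumed layout: a stream to read from plus a running character count. */
struct sinput {
    FILE *fp;
    int count;
};

/*
 * Return the next character as a positive int, 0 at end of input,
 * or -1 on a read error. A NUL byte in the input is indistinguishable
 * from end of input under this contract.
 */
int
snextch(struct sinput *in)
{
    int ch;

    assert(in != NULL && in->fp != NULL);

    ch = fgetc(in->fp);
    if (ch == EOF) {
        return ferror(in->fp) ? -1 : 0;
    }
    in->count++;
    return ch;
}
```

With a source like this, the parser can be called once per record on an open
file, and each call picks up where the previous one stopped.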

Note: This code comes from "libmba" and is MIT licensed (like BSD without
the advertising clause).

-- 
Michael B Allen
PHP Active Directory Kerberos SSO
http://www.ioplex.com/
_______________________________________________
New York PHP Community Talk Mailing List
http://lists.nyphp.org/mailman/listinfo/talk
