Hi,

I am currently using sqlite3 version:

3.8.1 2013-10-17 12:57:35 c78be6d786c19073b3a6730dfe3fb1be54f5657a

I found an edge case bug in the sqlite3 shell when importing csv data with
fields containing embedded quotes, CRs and LFs:

When a field contains an embedded quote, and that embedded quote is
immediately followed by an EOL then the current csv parser will consider
the end of field to be reached and fail to import the row correctly.

For example the following csv, a single data row with column headers will
fail to import correctly.

column1,column2,column3,column4
fielddata1,fielddata2,"field ""
data"" 3",fielddata4


The offending code is in shell.c in function

static char *csv_read_one_field(CSVReader *p);

An example fix could be something like the following -- adding a counter
flag 'cQuoteComplete' to track quotes properly.

static char *csv_read_one_field(CSVReader *p){
  int c, pc;
  int cSep = p->cSeparator;
  int cQuoteComplete = 0;
  p->n = 0;
  c = fgetc(p->in);
  if( c==EOF || seenInterrupt ){
    p->cTerm = EOF;
    return 0;
  }
  if( c=='"' ){
    int startLine = p->nLine;
    int cQuote = c;
    cQuoteComplete+=1;
    pc = 0;
    while( 1 ){
      c = fgetc(p->in);
      if( c=='\n' ) p->nLine++;
      if( c==cQuote ){
        cQuoteComplete+=1;
        if( pc==cQuote ){
          pc = 0;
          continue;
        }
      }
      if( (c==cSep && pc==cQuote)
       || (c=='\n' && pc==cQuote)
       || (c=='\n' && pc=='\r' && p->n>=2 && p->z[p->n-2]==cQuote)
       || (c==EOF && pc==cQuote)
      ){
        if (cQuoteComplete%2 == 0) {
          do{ p->n--; }while( p->z[p->n]!=cQuote );
          p->cTerm = c;
          break;
        }
      }
      if( c==EOF ){
        fprintf(stderr, "%s:%d: unterminated %c-quoted field\n",
                p->zFile, startLine, cQuote);
        p->cTerm = EOF;
        break;
      }
      csv_append_char(p, c);
      pc = c;
    }
    if( cQuoteComplete%2 != 0 ){
      fprintf(stderr, "%s:%d: unescaped %c character\n",
              p->zFile, p->nLine, cQuote);
    }
  }else{
    while( c!=EOF && c!=cSep && c!='\n' ){
      csv_append_char(p, c);
      c = fgetc(p->in);
    }
    if( c=='\n' ){
      p->nLine++;
      if( p->n>1 && p->z[p->n-1]=='\r' ) p->n--;
    }
    p->cTerm = c;
  }
  if( p->z ) p->z[p->n] = 0;
  return p->z;
}


I built the sqlite3 shell from the almagamation source and tested the above
change using my import data; about 1Gb of messy,  but rfc4180 compliant,
CSV. It all imported cleanly.

sqlite3 is a wonderful bit of software. I have been using it for some time
now to munge and query multi-gigabyte size data sets and am very impressed
with its performance and capabilities.

Best Regards,
Lindsay
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

Reply via email to