Greetings,
I'm attempting to convert some very large CSV files into Avro format. To this
end, I wrote a csvtoavro converter using the Avro C API, v1.7.5.
The essence of the program is this:
// initialize line counter
lineno = 0;
// make a schema first
avro_schema_from_json_length (...);
// make a generic class from schema
iface = avro_generic_class_from_schema( schema );
// get the number of fields in the record schema and verify that it is 109
avro_schema_record_size (schema);
// get a generic value
avro_generic_value_new (iface, &tuple);
// make me an output file
fp = fopen ( outputfile, "wb" );
// make me a filewriter
avro_file_writer_create_fp (fp, outputfile, 0, schema, &db);
// now for the code to emit the data
while (...)
{
    avro_value_reset (&tuple);
    // get the CSV record into the tuple
    ...
    // write that tuple
    avro_file_writer_append_value (db, &tuple);
    lineno ++;
    // flush the file
    avro_file_writer_flush (db);
}
// close the Avro file writer
avro_file_writer_close (db);
// other cleanup
avro_value_iface_decref (iface);
avro_value_decref (&tuple);
// close the underlying output stream (same FILE * as opened above)
fflush (fp);
fclose (fp);
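The outline above does not show any return-code checks. As a minimal sketch of how each call could be checked, the helper below records every non-zero return code together with the CSV line it belongs to. The names check_rc and failed_calls are hypothetical and not part of the Avro C API; the sketch only assumes that the Avro calls return 0 on success.

```c
#include <assert.h>
#include <stdio.h>

/* Number of failed calls seen so far; can be reported at the end of the run. */
static long failed_calls = 0;

/*
 * Report a non-zero return code along with the CSV line being processed.
 * Returns rc unchanged so the caller can still branch on it.
 * In the real converter the message could also include avro_strerror().
 */
static int check_rc(int rc, const char *what, long lineno)
{
    assert(what != NULL);
    if (rc != 0) {
        failed_calls++;
        fprintf(stderr, "line %ld: %s failed (rc=%d)\n", lineno, what, rc);
    }
    return rc;
}
```

In the loop, the append would then become check_rc(avro_file_writer_append_value(db, &tuple), "append", lineno), and likewise for the flush. A non-zero failed_calls at the end of the run would suggest that some rows never reached the file.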
I read the file back using a modified version of avrocat.c that looks like this:
wschema = avro_file_reader_get_writer_schema(reader);
iface = avro_generic_class_from_schema(wschema);
avro_generic_value_new(iface, &value);
int rval;
lineno = 0;
while ((rval = avro_file_reader_read_value(reader, &value)) == 0) {
    lineno ++;
    avro_value_reset(&value);
}
// If it was not an EOF that caused it to fail,
// print the error.
if (rval != EOF)
{
    fprintf(stderr, "Error: %s\n", avro_strerror());
}
else
{
    printf ( "%s %lld\n", filename, lineno );
}
For many input files, no data is missing from the resulting .AVRO file. Quite
often, however, I get files from which several dozen rows of data are missing.
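One way to pin down how many rows go missing for a given file would be to count the records in the CSV input independently and compare that number with the count printed by the reader. A minimal sketch follows; count_csv_records is a hypothetical helper, and it assumes one record per line with no newlines embedded in quoted fields.

```c
#include <assert.h>
#include <stdio.h>

/*
 * Count records in a CSV stream, one record per line.
 * Assumption: no quoted field contains an embedded newline.
 * Empty lines are counted too, and a final line without a
 * trailing newline still counts as a record.
 */
static long long count_csv_records(FILE *in)
{
    long long count = 0;
    int c;
    int at_line_start = 1;   /* no bytes seen yet on the current line */

    assert(in != NULL);
    while ((c = getc(in)) != EOF) {
        if (c == '\n') {
            count++;
            at_line_start = 1;
        } else {
            at_line_start = 0;
        }
    }
    if (!at_line_start)      /* unterminated last line */
        count++;
    return count;
}
```

Printing this count next to lineno from the writer and from the reader would show whether rows are lost while parsing the CSV, while appending, or while reading the file back.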
I'm certain that I'm doing something wrong, probably something very basic. Any
help debugging this would be most appreciated.
Thanks,
-amrith