Hello everyone,
I wanted to run some tests with a large test data set (a .csv file of more than 5 GB).
Unfortunately it fails pretty early again. I have created an EXTERNAL TABLE as
follows:
CREATE EXTERNAL TABLE dfkklocks_hist
(
validfrom timestamp,
validthru timestamp,
client text,
loobj1 text,
lotyp text,
proid text,
lockr text,
fdate date,
tdate date,
gpart text,
vkont text,
cond_loobj text,
actkey text,
uname text,
adatum date,
azeit text,
protected text,
laufd date,
laufi text
)
using csv with ('csvfile.delimiter'='~') location 'file:path/to/csv/file';
Then I create a table with the suffix *_internal that uses the parquet storage
format, as follows:
CREATE TABLE dfkklocks_hist_internal
(
validfrom timestamp,
validthru timestamp,
client text,
loobj1 text,
lotyp text,
proid text,
lockr text,
fdate date,
tdate date,
gpart text,
vkont text,
cond_loobj text,
actkey text,
uname text,
adatum date,
azeit text,
protected text,
laufd date,
laufi text
) using parquet;
The CSV file contains records such as this one:
2014-08-19 21:03:32.78~9999-12-31 23:59:59.999~200~0000000000530010000053~06~01~5~2005-12-31~9999-12-31~0010000053~000000000053~~~FREITAG~2006-06-01~125611~~1800-01-01~
Now I would like to insert the content of the CSV file into the parquet table as
follows:
contract> INSERT INTO dfkklocks_hist_internal SELECT * FROM dfkklocks_hist;
ERROR: Cannot convert Tajo type: TIMESTAMP
java.lang.RuntimeException: Cannot convert Tajo type: TIMESTAMP
at org.apache.tajo.storage.parquet.TajoSchemaConverter.convertColumn(TajoSchemaConverter.java:191)
at org.apache.tajo.storage.parquet.TajoSchemaConverter.convert(TajoSchemaConverter.java:150)
at org.apache.tajo.storage.parquet.TajoWriteSupport.<init>(TajoWriteSupport.java:54)
at org.apache.tajo.storage.parquet.TajoParquetWriter.<init>(TajoParquetWriter.java:80)
at org.apache.tajo.storage.parquet.ParquetAppender.init(ParquetAppender.java:75)
at org.apache.tajo.engine.planner.physical.StoreTableExec.init(StoreTableExec.java:69)
at org.apache.tajo.worker.Task.run(Task.java:423)
at org.apache.tajo.worker.TaskRunner$1.run(TaskRunner.java:425)
at java.lang.Thread.run(Thread.java:745)
Looking at TajoSchemaConverter.java, it seems that a Tajo TIMESTAMP column cannot
currently be written to parquet. Is this assumption correct?
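In case it is useful for a bug report: since the exception is already thrown while the
parquet schema is built in the appender's init, before any row is read, I would expect
the same error to be reproducible with a single TIMESTAMP column (table and column
names below are just placeholders, I have not actually run this yet):
CREATE TABLE ts_only_parquet (ts timestamp) using parquet;
-- presumably fails the same way while the ParquetAppender is initialized:
INSERT INTO ts_only_parquet SELECT validfrom FROM dfkklocks_hist;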
Changing the timestamp values (see the example record) also did not lead to success.
At first I assumed that the timestamps themselves were not valid, but values such as
1970-00-00 00:00:00.000 or 1971-01-01 01:01:01.000 showed no change in behavior.
Are my conclusions so far correct? Is this a known bug? Or am I perhaps doing
something wrong? Is there another approach that could still get me to my goal that I
have not listed here? (One untested idea is sketched below, after the converter code.)
For reference, here is the convertColumn method from TajoSchemaConverter.java:
private Type convertColumn(Column column) {
  TajoDataTypes.Type type = column.getDataType().getType();
  switch (type) {
    case BOOLEAN:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.BOOLEAN);
    case BIT:
    case INT2:
    case INT4:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.INT32);
    case INT8:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.INT64);
    case FLOAT4:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.FLOAT);
    case FLOAT8:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.DOUBLE);
    case CHAR:
    case TEXT:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.BINARY,
          OriginalType.UTF8);
    case PROTOBUF:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.BINARY);
    case BLOB:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.BINARY);
    case INET4:
    case INET6:
      return primitive(column.getSimpleName(),
          PrimitiveType.PrimitiveTypeName.BINARY);
    default:
      throw new RuntimeException("Cannot convert Tajo type: " + type);
  }
}
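The one workaround idea I have not tried yet (only a sketch, assuming Tajo's CAST can
be used like this in an INSERT ... SELECT): declare the two timestamp columns as TEXT
in the parquet table and cast them explicitly while copying, for example:
CREATE TABLE dfkklocks_hist_internal_txt
(
validfrom text,
validthru text,
client text,
-- remaining columns exactly as in dfkklocks_hist_internal
laufi text
) using parquet;

INSERT INTO dfkklocks_hist_internal_txt
SELECT
CAST(validfrom AS text),
CAST(validthru AS text),
client,
-- remaining columns unchanged
laufi
FROM dfkklocks_hist;
Of course the timestamps would then only be stored as strings in parquet and would
have to be cast back when querying, so a proper TIMESTAMP mapping in
TajoSchemaConverter would be much nicer.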
I'm really thankful that there is a community like you out there that helps to sort
out errors like this together.
Have a nice weekend.
Best regards,
Chris