Hi, community:

I'm implementing supporting string longer than 32000 characters in carbondata 
and have a question about the grammar of this feature. Here I'd like to explain 
it and want to receive your feedbacks.

DESCRIPTION:

In previous implementation, carbondata internally uses a short to store the 
length of a string (char,varchar,string are all treated as string) value. It 
does save memory and space for the regular use case by using short instead of 
int, but will cause problem when the length of string exceeds the range of 
short.

In order to support the above case and save memory/space if possible, we want 
to distinguish which string columns are SHORT and which are LONG. (SHORT: 
lengthOfValue<32000, LONG: lengthOfValue>=32000)


SOLUTION:

Solution1. Add a TableProperty in CreateTableStatement
The grammar looks like below:
For SQL case:
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description STRING, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('LONG_STRING_COLUMNS'='description', 
'SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  .option('LONG_STRING_COLUMNS', 'description')
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a property called 'LONG_STRING_COLUMNS' in the statement, its value 
contains all the columns that are LONG. 
Note: This solution is available as PR2252.


Solution2. Add a new datatype
The grammar looks like below:
For SQL case:
```
    sql(
      s"""
         | CREATE TABLE if not exists $longStringTable(
         | id INT, name STRING, description TEXT, address STRING
         | ) STORED BY 'carbondata'
         | TBLPROPERTIES('SORT_COLUMNS'='name')
         |""".stripMargin)
```

For DataFrame case:
```
df.write.format("carbondata")
  .option("tableName", "longStringTable")
  .option("SORT_COLUMNS", 'name')
  // SparkSQL does not have TEXT datatype, how to specify it?
  .mode(SaveMode.Overwrite)
  .save()
```

In the above example, 'description' is LONG while 'address' is SHORT.
We add a new datatype called 'TEXT' and treat 'description' as TEXT in the 
statement.

I'd prefer to solution1 since it is compatible with hive/sparksql while 
solution2 has problem to migrate with them.

END



At last, how do you think the solutions provided above?

Please give your comments, moreover you can provide other solutions here to 
improve it.

Reply via email to