Hi, community:
I'm implementing supporting string longer than 32000 characters in carbondata
and have a question about the grammar of this feature. Here I'd like to explain
it and want to receive your feedbacks.
DESCRIPTION:
In previous implementation, carbondata internally uses a short to store the
length of a string (char,varchar,string are all treated as string) value. It
does save memory and space for the regular use case by using short instead of
int, but will cause problem when the length of string exceeds the range of
short.
In order to support the above case and save memory/space if possible, we want
to distinguish which string columns are SHORT and which are LONG. (SHORT:
lengthOfValue<32000, LONG: lengthOfValue>=32000)
SOLUTION:
Solution1. Add a TableProperty in CreateTableStatement
The grammar looks like below:
For SQL case:
```
sql(
s"""
| CREATE TABLE if not exists $longStringTable(
| id INT, name STRING, description STRING, address STRING
| ) STORED BY 'carbondata'
| TBLPROPERTIES('LONG_STRING_COLUMNS'='description',
'SORT_COLUMNS'='name')
|""".stripMargin)
```
For DataFrame case:
```
df.write.format("carbondata")
.option("tableName", "longStringTable")
.option("SORT_COLUMNS", 'name')
.option('LONG_STRING_COLUMNS', 'description')
.mode(SaveMode.Overwrite)
.save()
```
In the above example, 'description' is LONG while 'address' is SHORT.
We add a property called 'LONG_STRING_COLUMNS' in the statement, its value
contains all the columns that are LONG.
Note: This solution is available as PR2252.
Solution2. Add a new datatype
The grammar looks like below:
For SQL case:
```
sql(
s"""
| CREATE TABLE if not exists $longStringTable(
| id INT, name STRING, description TEXT, address STRING
| ) STORED BY 'carbondata'
| TBLPROPERTIES('SORT_COLUMNS'='name')
|""".stripMargin)
```
For DataFrame case:
```
df.write.format("carbondata")
.option("tableName", "longStringTable")
.option("SORT_COLUMNS", 'name')
// SparkSQL does not have TEXT datatype, how to specify it?
.mode(SaveMode.Overwrite)
.save()
```
In the above example, 'description' is LONG while 'address' is SHORT.
We add a new datatype called 'TEXT' and treat 'description' as TEXT in the
statement.
I'd prefer to solution1 since it is compatible with hive/sparksql while
solution2 has problem to migrate with them.
END
At last, how do you think the solutions provided above?
Please give your comments, moreover you can provide other solutions here to
improve it.