[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-05-31 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1. Support writing the binary data type with the Carbon Java SDK [Formal]:
1.1 The Java SDK needs to support writing data with specific data types,
such as int, double, and byte[], without converting every value to a string
array. The user reads a binary file as byte[], then the SDK writes the byte[]
into a binary column. => Done
1.2 CarbonData compresses the binary column because the compressor is
currently table level. => Done
=> TODO: support configuring compression on or off for binary, defaulting to
no compression because binary data is usually already compressed (for example,
JPG images), so the binary column would not need to be decompressed on read.
1.5.4 will support column-level compression; after that, we can implement no
compression for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension. => Done
1.4 Support a configurable page size for the binary data type, because binary
values are usually large, such as 200 KB. Otherwise one blocklet (32000 rows)
would be very large. => Done
1.5 Avro and JSON conversion need to be considered
•   AVRO fixed and variable length binary can be supported
=> Avro doesn't support the binary data type => Not needed
•   Support reading binary from JSON => Done
1.6 Binary data type as a child column in Struct, Map
=> support it in the future, but the priority is not very high; not in 1.5.4
1.7 Verify the maximum supported size of a binary value
=> snappy only supports about 1.71 GB; the max data size should be 2 GB, but
this needs confirmation
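
As a rough illustration of item 1.4, the arithmetic below estimates how much
binary data one blocklet would hold at the 32000-row figure mentioned above,
assuming "200k" means 200 KB (200 * 1024 bytes) per value. The class and
method names are hypothetical and are not part of the CarbonData API.

```java
final class BlockletSizeEstimate {
    // Estimated bytes held by a run of rows when every binary value has the same size.
    static long estimateBytes(long rows, long bytesPerValue) {
        return Math.multiplyExact(rows, bytesPerValue);
    }

    public static void main(String[] args) {
        long rows = 32_000L;               // rows per blocklet, from item 1.4
        long bytesPerValue = 200L * 1024;  // assumption: "200k" means 200 KB
        long total = estimateBytes(rows, bytesPerValue);
        System.out.printf("%d rows x %d bytes = %d bytes (%.2f GiB)%n",
                rows, bytesPerValue, total, total / (1024.0 * 1024 * 1024));
    }
}
```

Under these assumed sizes a single blocklet would hold roughly 6.1 GiB of
binary data, which is the motivation for making the page size configurable.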


2. Support reading and managing the binary data type with the Spark Carbon
file format (carbon DataSource) and CarbonSession. [Formal]
2.1 Support reading the binary data type from a non-transactional table:
read the binary column and return it as byte[] => Done
2.2 Support creating a table with a binary column; the table properties
sort_columns, dictionary, COLUMN_META_CACHE, and RANGE_COLUMN are not supported
for a binary column => Done
   => The Carbon DataSource doesn't support dictionary include columns
   => Support carbon.column.compressor = snappy, zstd, gzip for binary;
compression applies to all columns (table level)
2.3 Support CTAS for binary => transactional/non-transactional,
Carbon/Hive/Parquet => Done
2.4 Support external tables for binary => Done
2.5 Support projection for a binary column => Done
2.6 Support desc formatted => Done
   => The Carbon DataSource doesn't support ALTER TABLE add columns SQL
   Support ALTER TABLE (add column, rename, drop column) for the binary data
type in carbon session => Done
   Changing the data type of a binary column with ALTER TABLE is not
supported => Done
2.7 Don’t support BUCKETCOLUMNS for binary => Done
2.8 Support compaction for binary => Done
2.9 DataMap
Support bloomfilter, mv, and pre-aggregate
Don’t support lucene or timeseries datamap; no min/max datamap is needed for
binary
=> Done
2.10 CSDK / Python SDK will support binary in the future. => TODO; the Python
SDK has already been merged into pycarbon
2.11 Support S3 => Done
2.12 Support UDFs: hex, base64, cast => TODO
   select hex(bin) from carbon_table => TODO

2.13 Support configurable decoding for queries; support base64 and hex
decoding. => Done
2.15 How large can binary values be for writing and reading? => TODO
2.16 support filter for binary => Done
2.17 select CAST(s AS BINARY) from carbon_table. => 
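
Item 2.13 mentions base64 and hex decoding for binary values. The sketch below
shows what those two decodings look like using only the JDK. It does not use
CarbonData, and the class and method names are hypothetical; how CarbonData
selects the decoder is not shown here.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.Base64;

final class BinaryDecodeSketch {
    // Decodes standard base64 text into raw bytes.
    static byte[] decodeBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Decodes a hex string (two hex digits per byte) into raw bytes.
    static byte[] decodeHex(String text) {
        if (text.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[text.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(text.charAt(2 * i), 16);
            int lo = Character.digit(text.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("non-hex character");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] fromBase64 = decodeBase64("aGVsbG8=");
        byte[] fromHex = decodeHex("68656c6c6f");
        System.out.println(new String(fromBase64, StandardCharsets.UTF_8));
        System.out.println(Arrays.equals(fromBase64, fromHex));
    }
}
```

Both inputs in main encode the same five bytes, so the two decoders can be
checked against each other.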

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-05-31 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.=>Done
1.2 CarbonData compress binary column because now the compressor is 
table level.=>Done
=>TODO, support configuration for compress  and no compress, 
default no compress because binary usually is already compressed, like jpg 
format image. So no need to uncompress for binary column. 1.5.4 will support 
column level compression, after that, we can implement no compress for binary. 
We can talk with community.
1.3 CarbonData stores binary as dimension. => Done
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows). =>Done
1.5 Avro, JSON convert need consider
•   AVRO fixed and variable length binary can be supported
=> Avro don't support binary data type => No 
need
 Support read binary from JSON  => done.
1.6 Binary data type as a child column in Struct, Map
=> support it in the future, but the priority is not very high; not in 1.5.4
1.7 Verify the maximum supported size of a binary value
=> snappy only supports about 1.71 GB; the max data size should be 2 GB, but
this needs confirmation
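
One possible reading of the 1.71 G figure in item 1.7, offered as an
assumption rather than a confirmed explanation: the Snappy library bounds its
compressed output at 32 + n + n/6 bytes for n input bytes, and a Java byte[]
holds at most Integer.MAX_VALUE bytes. The sketch below searches for the
largest n whose worst-case output still fits in a byte[]. The class and method
names are hypothetical.

```java
final class SnappyInputBound {
    // Worst-case compressed size for n input bytes: 32 + n + n/6.
    static long maxCompressedLength(long n) {
        return 32 + n + n / 6;
    }

    // Largest input size whose worst-case compressed size is <= limit (binary search).
    static long largestInputFitting(long limit) {
        long lo = 0;
        long hi = limit;
        while (lo < hi) {
            long mid = lo + (hi - lo + 1) / 2;
            if (maxCompressedLength(mid) <= limit) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        long n = largestInputFitting(Integer.MAX_VALUE);
        System.out.printf("max input = %d bytes (%.2f GiB)%n",
                n, n / (1024.0 * 1024 * 1024));
    }
}
```

With these assumptions the bound comes out at about 1.71 GiB, below the 2 GB
array limit, which matches the numbers quoted in item 1.7.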


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[] =>Done
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column => Done
   => CARBON Datasource don't support dictionary include column
   =>support  carbon.column.compressor= snappy,zstd,gzip for binary, 
compress is for all columns(table level)
2.3 Support CTAS for binary=> transaction/non-transaction,  
Carbon/Hive/Parquet => Done 
2.4 Support external table for binary=> Done
2.5 Support projection for binary column=> Done
2.6 Support desc formatted=> Done
   => Carbon Datasource don't support  ALTER TABLE add columns 
sql
   support  ALTER TABLE for(add column, rename, drop column) 
binary data type in carbon session=> Done
   Don't support change the data type for binary by alter table 
=> Done
2.7 Don’t support PARTITION, BUCKETCOLUMNS  for binary  => Done
2.8 Support compaction for binary=> Done
2.9 datamap
Support bloomfilter,mv and pre-aggregate
Don’t support lucene, timeseries datamap,  no need min max 
datamap for binary
=>Done
2.10 CSDK / python SDK support binary in the future.=> TODO, python 
sdk already merge to pycarbon
2.11 Support S3=> Done
2.12 support UDF, hex, base64, cast:.=> TODO
   select hex(bin) from carbon_table..=> TODO
  
2.13 support configurable decode for query, support base64 and Hex 
decode.=> Done
2.15 How big data size binary data type can support for writing and 
reading?=> TODO
2.16 support filter for binary => Done
2.17 select CAST(s AS BINARY) 

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-05-09 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.=>Done
1.2 CarbonData compress binary column because now the compressor is 
table level.=>Done
=>TODO, support configuration for compress  and no compress, 
default no compress because binary usually is already compressed, like jpg 
format image. So no need to uncompress for binary column. 1.5.4 will support 
column level compression, after that, we can implement no compress for binary. 
We can talk with community.
1.3 CarbonData stores binary as dimension. => Done
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows). =>Done
1.5 Avro, JSON convert need consider
•   AVRO fixed and variable length binary can be supported
=> Avro don't support binary data type => No 
need
 Support read binary from JSON  => done.
1.6 Binary data type as a child column in Struct, Map
=> support it in the future, but the priority is not very high; not in 1.5.4
1.7 Verify the maximum supported size of a binary value
=> snappy only supports about 1.71 GB; the max data size should be 2 GB, but
this needs confirmation
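
Item 1.1 describes reading a binary file as byte[] and passing it to the
writer with its native type instead of converting it to a string. The sketch
below shows only that preparation step with the JDK; the class and method
names are hypothetical, and the hand-off to the Carbon Java SDK writer is left
as a comment because its exact API is not given in this description.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

final class BinaryRowSketch {
    // Builds one row with native Java types; the last field carries the raw file bytes.
    static Object[] buildRow(int id, double score, Path binaryFile) throws IOException {
        byte[] content = Files.readAllBytes(binaryFile);
        return new Object[] {id, score, content};
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("binary-sketch", ".bin");
        try {
            Files.write(tmp, new byte[] {(byte) 0xFF, (byte) 0xD8, 0x00, 0x10});
            Object[] row = buildRow(1, 0.5, tmp);
            // A real writer would take the row here; that call is not shown.
            System.out.println(((byte[]) row[2]).length + " bytes in binary field");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```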


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[] =>Done
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column => Done
   => CARBON Datasource don't support dictionary include column
   =>support  carbon.column.compressor= snappy,zstd,gzip for binary, 
compress is for all columns(table level)
2.3 Support CTAS for binary=> transaction/non-transaction,  
Carbon/Hive/Parquet => Done 
2.4 Support external table for binary=> Done
2.5 Support projection for binary column=> Done
2.6 Support desc formatted=> Done
   => Carbon Datasource don't support  ALTER TABLE add columns 
sql
   support  ALTER TABLE for(add column, rename, drop column) 
binary data type in carbon session=> Done
   Don't support change the data type for binary by alter table 
=> Done
2.7 Don’t support PARTITION, BUCKETCOLUMNS  for binary  => Done
2.8 Support compaction for binary=> Done
2.9 datamap
Support bloomfilter,mv and pre-aggregate
Don’t support lucene, timeseries datamap,  no need min max 
datamap for binary
=>Done
2.10 CSDK / python SDK support binary in the future.=> TODO
2.11 Support S3=> Done
2.12 support UDF, hex, base64, cast:.=> TODO
   select hex(bin) from carbon_table..=> TODO
  
2.13 support configurable decode for query, support base64 and Hex 
decode.=> Done
2.15 How big data size binary data type can support for writing and 
reading?=> TODO
2.16 support filter for binary => Done
2.17 select CAST(s AS BINARY) from carbon_table. => Done

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-24 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.=>Done
1.2 CarbonData compress binary column because now the compressor is 
table level.=>Done
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension. => Done
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows). =>Done
1.5 Avro, JSON convert need consider
•   AVRO fixed and variable length binary can be supported
=> Avro don't support binary data type => No 
need
 Support read binary from JSON  => done.
1.6 Binary data type as a child column in Struct, Map
=> support it in the future, but the priority is not very high; not in 1.5.4
1.7 Verify the maximum supported size of a binary value
=> snappy only supports about 1.71 GB; the max data size should be 2 GB, but
this needs confirmation


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[] =>Done
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column => Done
   => CARBON Datasource don't support dictionary include column
   =>support  carbon.column.compressor= snappy,zstd,gzip for binary, 
compress is for all columns(table level)
2.3 Support CTAS for binary=> transaction/non-transaction,  
Carbon/Hive/Parquet => Done 
2.4 Support external table for binary=> Done
2.5 Support projection for binary column=> Done
2.6 Support desc formatted=> Done
   => Carbon Datasource don't support  ALTER TABLE add columns 
sql
   support  ALTER TABLE for(add column, rename, drop column) 
binary data type in carbon session=> Done
   Don't support change the data type for binary by alter table 
=> Done
2.7 Don’t support PARTITION, BUCKETCOLUMNS  for binary  => Done
2.8 Support compaction for binary=> Done
2.9 datamap
Support bloomfilter,mv and pre-aggregate
Don’t support lucene, timeseries datamap,  no need min max 
datamap for binary
=>Done
2.10 CSDK / python SDK support binary in the future.=> TODO
2.11 Support S3=> Done
2.12 support UDF, hex, base64, cast:.=> TODO
   select hex(bin) from carbon_table..=> TODO
  
2.13 support configurable decode for query, support base64 and Hex 
decode.=> Done
2.14 Proper Error message for not supported features like SI=> TODO
2.15 How big data size binary data type can support for writing and 
reading?=> TODO
2.16 support filter for binary => Done
2.17 

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-24 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.=>Done
1.2 CarbonData compress binary column because now the compressor is 
table level.=>Done
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension. => Done
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows). =>Done
1.5 Avro, JSON convert need consider
•   AVRO fixed and variable length binary can be supported
=> Avro don't support binary data type => No 
need
 Support read binary from JSON  => done.
1.6 Binary data type as a child column in Struct, Map
=> support it in the future, but the priority is not very high; not in 1.5.4
1.7 Verify the maximum supported size of a binary value
=> snappy only supports about 1.71 GB; the max data size should be 2 GB, but
this needs confirmation


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[] =>Done
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column => Done
   => CARBON Datasource don't support dictionary include column
   =>support  carbon.column.compressor= snappy,zstd,gzip for binary, 
compress is for all columns(table level)
2.3 Support CTAS for binary=> transaction/non-transaction,  
Carbon/Hive/Parquet => Done 
2.4 Support external table for binary=> Done
2.5 Support projection for binary column=> Done
2.6 Support desc formatted=> Done
   => Carbon Datasource don't support  ALTER TABLE add columns 
sql
   support  ALTER TABLE for(add column, rename, drop column) 
binary data type in carbon session=> Done
   Don't support change the data type for binary by alter table 
=> Done
2.7 Don’t support PARTITION, BUCKETCOLUMNS  for binary  => Done
2.8 Support compaction for binary=> Done
2.9 datamap
Support bloomfilter,mv and pre-aggregate
Don’t support lucene, timeseries datamap,  no need min max 
datamap for binary
=>Done
2.10 CSDK / python SDK support binary in the future.=> TODO
2.11 Support S3=> Done
2.12 support UDF, hex, base64, cast:.=> TODO
   select hex(bin) from carbon_table..=> TODO
  
2.13 support configurable decode for query, support base64 and Hex 
decode.=> TODO
2.14 Proper Error message for not supported features like SI=> TODO
2.15 How big data size binary data type can support for writing and 
reading?=> TODO
2.16 support filter for binary => Done
2.17 

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-22 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.=>Done
1.2 CarbonData compress binary column because now the compressor is 
table level.=>Done
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension. => Done
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows). =>Done
1.5 Avro, JSON convert need consider
•   AVRO fixed and variable length binary can be supported
=> Avro don't support binary data type => No 
need
 Support read binary from JSON  => done.
1.6 Binary data type as a child column in Struct, Map
=> support it in the future, but the priority is not very high; not in 1.5.4
1.7 Verify the maximum supported size of a binary value
=> snappy only supports about 1.71 GB; the max data size should be 2 GB, but
this needs confirmation


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[] =>Done
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column => Done
   => CARBON Datasource don't support dictionary include column
   =>support  carbon.column.compressor= snappy,zstd,gzip for binary, 
compress is for all columns(table level)
2.3 Support CTAS for binary=> transaction/non-transaction,  
Carbon/Hive/Parquet => Done 
2.4 Support external table for binary=> Done
2.5 Support projection for binary column=> Done
2.6 Support desc formatted=> Done
   => Carbon Datasource don't support  ALTER TABLE add columns 
sql
   support  ALTER TABLE for(add column, rename, drop column) 
binary data type in carbon session=> Done
   Don't support change the data type for binary by alter table 
=> Done
2.7 Don’t support PARTITION, BUCKETCOLUMNS  for binary  => Done
2.8 Support compaction for binary=> Done
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the 
future=> TODO
2.10 CSDK / python SDK support binary in the future.=> TODO
2.11 Support S3=> Done
2.12 support UDF, hex, base64, cast:.=> TODO
   select hex(bin) from carbon_table..=> TODO
  
2.13 support configurable decode for query, support base64 and Hex 
decode.=> TODO
2.14 Proper Error message for not supported features like 
/mv/SI/bloom/streaming=> TODO
2.15 How big data size binary data type can support for writing and 
reading?=> TODO
2.16 support filter for binary => Done
2.17 select CAST(s AS 

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-16 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version   Changes                                     Owner   Date
0.1       Init doc for supporting binary data type    Xubo    2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so
CarbonData should support it. Downloading data from S3 is slow when a dataset
contains many small binary objects. Most application scenarios involve storing
small binary values in CarbonData, which avoids the small-files problem, speeds
up S3 access, and lowers the cost of accessing OBS by reducing the number of
S3 API calls. Storing structured and unstructured (binary) data together in
CarbonData also makes both easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK.
2. Support reading the binary data type with the Spark Carbon file format
(carbon datasource) and CarbonSession.
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary data with Spark.


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 The Java SDK needs to write data with its specific data types, 
such as int, double, and byte[], without converting every value to a string 
array. The user reads a binary file as byte[], and the SDK writes the byte[] 
into a binary column.
1.2 CarbonData currently compresses the binary column because the 
compressor is configured at table level.
=>TODO: support configuring compression for binary, with no compression as 
the default, because binary data such as a jpg image is usually already 
compressed, so the binary column then needs no decompression either. 1.5.4 
will support column-level compression; after that, we can implement no 
compression for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension.
1.4 Support a configurable page size for the binary data type, because 
binary values are usually big, such as 200k. Otherwise one blocklet 
(32000 rows) will be very big.
  TODO: 1.5 Avro and JSON conversion needs to be considered
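
Item 1.1 can be sketched as follows. This is only an illustration using the JDK: the class and method names are hypothetical, and the actual CarbonData SDK writer call is deliberately left out. It shows a row that keeps int, double, and byte[] values in their own types instead of converting them to strings.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper, not part of the CarbonData SDK.
final class BinaryRowSketch {
    // Reads a whole file into memory as byte[]; suitable for small binaries only.
    static byte[] readBinary(Path file) throws IOException {
        return Files.readAllBytes(file);
    }

    // Builds one typed row: int id, double score, byte[] payload.
    // No value is converted to String on the way.
    static Object[] buildRow(int id, double score, byte[] payload) {
        return new Object[] {id, score, payload};
    }

    public static void main(String[] args) throws IOException {
        Path tmp = Files.createTempFile("sample", ".bin");
        try {
            Files.write(tmp, new byte[] {(byte) 0xFF, (byte) 0xD8, 0x00, 0x7F});
            Object[] row = buildRow(1, 0.5, readBinary(tmp));
            // An SDK writer (assumed, not shown) would receive `row` as is.
            System.out.println(((byte[]) row[2]).length + " payload bytes");
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```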


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
   => CARBON Datasource don't support dictionary include column
   => carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support desc formatted
   => Carbon Datasource doesn't support the ALTER TABLE add column SQL
   =>TODO: ALTER TABLE for binary data type in carbon session
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS  for binary  
2.8 Support compaction for binary(TODO)
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.(TODO)
2.11 Support S3
TODO:
2.12 support UDF, hex, base64, cast:
   select hex(bin) from carbon_table.
   select CAST(s AS BINARY) from carbon_table.
CarbonSession: impact analysis
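
The restriction in 2.2 could be enforced with a check along these lines. This is a hypothetical sketch, not CarbonData code: the class name, the message text, and the reading of "dictionary" as the dictionary_include property are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

// Hypothetical validator, not part of CarbonData.
final class BinaryColumnPropertyCheck {
    // Table properties that, per 2.2, must not name a binary column.
    // "dictionary_include" is an assumed spelling of "dictionary".
    static final List<String> RESTRICTED = Arrays.asList(
        "sort_columns", "dictionary_include", "column_meta_cache", "range_column");

    // binaryColumns must hold lower-case column names.
    // Returns one message per restricted property that names a binary column.
    static List<String> violations(Set<String> binaryColumns,
                                   Map<String, String> tableProperties) {
        List<String> problems = new ArrayList<>();
        for (Map.Entry<String, String> e : tableProperties.entrySet()) {
            String key = e.getKey().toLowerCase(Locale.ROOT);
            if (!RESTRICTED.contains(key)) {
                continue;
            }
            // Property values are comma-separated column lists.
            for (String col : e.getValue().split(",")) {
                String name = col.trim().toLowerCase(Locale.ROOT);
                if (binaryColumns.contains(name)) {
                    problems.add(key + " does not support binary column: " + name);
                }
            }
        }
        return problems;
    }
}
```

An empty result means the CREATE TABLE properties do not touch any binary column.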


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and store it in CSV
4.2 Spark loads the CSV, converts the string to byte[], and stores it in 
CarbonData. Read the binary column and return it as byte[]
4.3 Support insert into (string => binary).  TODO: update, 
delete for binary
4.4 Don’t support stream table.
=> refer to hive and the Spark2.4 image DataSource


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html



[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-15 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Supporting write binary data type by Carbon Java SDK.
2. Supporting read binary data type by Spark Carbon file format(carbon 
datasource) and CarbonSession.
3. Supporting read binary data type by Carbon SDK
4. Supporting write binary by spark


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 The Java SDK needs to write data with its specific data types, 
such as int, double, and byte[], without converting every value to a string 
array. The user reads a binary file as byte[], and the SDK writes the byte[] 
into a binary column.
1.2 CarbonData currently compresses the binary column because the 
compressor is configured at table level.
=>TODO: support configuring compression for binary, with no compression as 
the default, because binary data such as a jpg image is usually already 
compressed, so the binary column then needs no decompression either. 1.5.4 
will support column-level compression; after that, we can implement no 
compression for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension.
1.4 Support a configurable page size for the binary data type, because 
binary values are usually big, such as 200k. Otherwise one blocklet 
(32000 rows) will be very big.
  TODO: 1.5 Avro and JSON conversion needs to be considered
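
The size concern in 1.4 can be worked through with a small calculation. The sketch below reads "200k" as 200 KiB per value, and the 64 MiB budget is an arbitrary figure chosen for illustration; neither is a CarbonData setting.

```java
// Hypothetical arithmetic helper, not part of CarbonData.
final class BlockletSizeEstimate {
    // Estimated bytes held by one blocklet: rows * average value size.
    static long blockletBytes(long rows, long avgValueBytes) {
        return Math.multiplyExact(rows, avgValueBytes);
    }

    // Number of rows that fit into a size budget, never less than 1.
    static long rowsForBudget(long budgetBytes, long avgValueBytes) {
        return Math.max(1L, budgetBytes / avgValueBytes);
    }

    public static void main(String[] args) {
        long avg = 200L * 1024;                  // "200k" read as 200 KiB (assumption)
        long full = blockletBytes(32_000, avg);  // 32000 rows of 200 KiB each
        System.out.println(full + " bytes for 32000 rows");
        // Illustrative 64 MiB budget (assumption).
        System.out.println(rowsForBudget(64L * 1024 * 1024, avg) + " rows fit in 64 MiB");
    }
}
```

With these assumptions, 32000 rows come to 6,553,600,000 bytes (about 6.1 GiB), while a 64 MiB budget holds 327 rows, which is why a configurable page size matters for binary columns.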


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
   => CARBON Datasource don't support dictionary include column
   => carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support desc formatted
   => Carbon Datasource doesn't support the ALTER TABLE add column SQL
   =>TODO: ALTER TABLE for binary data type in carbon session
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS  for binary  
2.8 Support compaction for binary(TODO)
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.(TODO)
2.11 Support S3
TODO:
2.12 support UDF, hex, base64, cast:
   select hex(bin) from carbon_table.
   select CAST(s AS BINARY) from carbon_table.
CarbonSession: impact analysis


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and store it in CSV, encoded as Hex or 
Base64
4.2 Spark loads the CSV, converts the string to binary, and stores it in 
CarbonData. CarbonData internally decodes the Hex to binary.
4.3 Support insert (string => binary, with a configurable 
encode/decode algorithm; the default is Hex and the user can change it to 
base64 or others, is it ok?), update, delete for binary
4.4 Don’t support stream table.
=> refer to hive and the Spark2.4 image DataSource
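
The encode/decode step in 4.1 to 4.3 can be sketched with the JDK alone. The class name and the "hex" / "base64" codec strings are hypothetical and only stand in for the configuration option proposed above; this is not CarbonData's loader.

```java
import java.util.Base64;

// Hypothetical codec helper, not part of CarbonData.
final class BinaryTextCodec {
    private static final char[] HEX = "0123456789abcdef".toCharArray();

    // Encodes bytes as lower-case hex, two characters per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(HEX[(b >> 4) & 0x0F]).append(HEX[b & 0x0F]);
        }
        return sb.toString();
    }

    // Decodes a hex string back to bytes; rejects malformed input.
    static byte[] fromHex(String s) {
        if (s.length() % 2 != 0) {
            throw new IllegalArgumentException("odd-length hex string");
        }
        byte[] out = new byte[s.length() / 2];
        for (int i = 0; i < out.length; i++) {
            int hi = Character.digit(s.charAt(2 * i), 16);
            int lo = Character.digit(s.charAt(2 * i + 1), 16);
            if (hi < 0 || lo < 0) {
                throw new IllegalArgumentException("non-hex character");
            }
            out[i] = (byte) ((hi << 4) | lo);
        }
        return out;
    }

    // Decodes one CSV field with the configured codec ("hex" or "base64").
    static byte[] decode(String text, String codec) {
        switch (codec) {
            case "hex":
                return fromHex(text);
            case "base64":
                return Base64.getDecoder().decode(text);
            default:
                throw new IllegalArgumentException("unknown codec: " + codec);
        }
    }
}
```

Hex doubles the stored size while Base64 adds roughly a third, which is one consideration when choosing the default.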

Formal? How to support write into binary read from images in SQL?
Use spark core code is ok.  


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html


  

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-12 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Supporting write binary data type by Carbon Java SDK.
2. Supporting read binary data type by Spark Carbon file format(carbon 
datasource) and CarbonSession.
3. Supporting read binary data type by Carbon SDK
4. Supporting write binary by spark


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 The Java SDK needs to write data with its specific data types, 
such as int, double, and byte[], without converting every value to a string 
array. The user reads a binary file as byte[], and the SDK writes the byte[] 
into a binary column.
1.2 CarbonData currently compresses the binary column because the 
compressor is configured at table level.
=>TODO: support configuring compression for binary, with no compression as 
the default, because binary data such as a jpg image is usually already 
compressed, so the binary column then needs no decompression either. 1.5.4 
will support column-level compression; after that, we can implement no 
compression for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension.
1.4 Support a configurable page size for the binary data type, because 
binary values are usually big, such as 200k. Otherwise one blocklet 
(32000 rows) will be very big.
  TODO: 1.5 Avro and JSON conversion needs to be considered
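
The reasoning in 1.2, that already-compressed binary gains little from a second compression pass, can be checked with a small experiment. This sketch uses the JDK's DEFLATE implementation as a stand-in for whichever compressor the table uses, and pseudo-random bytes as a stand-in for a jpg payload; both substitutions are assumptions made to keep the example self-contained.

```java
import java.util.Random;
import java.util.zip.Deflater;

// Hypothetical experiment, not part of CarbonData.
final class CompressionGainSketch {
    // Returns the DEFLATE-compressed size of the input in bytes.
    static int deflatedSize(byte[] input) {
        Deflater deflater = new Deflater();
        deflater.setInput(input);
        deflater.finish();
        byte[] buf = new byte[8192];
        int total = 0;
        while (!deflater.finished()) {
            total += deflater.deflate(buf);
        }
        deflater.end();
        return total;
    }

    // High-entropy bytes, similar in compressibility to already-compressed data.
    static byte[] randomBytes(int n, long seed) {
        byte[] out = new byte[n];
        new Random(seed).nextBytes(out);
        return out;
    }

    public static void main(String[] args) {
        int n = 100_000;
        System.out.println("high-entropy input: " + deflatedSize(randomBytes(n, 42L)) + " bytes");
        System.out.println("all-zero input:     " + deflatedSize(new byte[n]) + " bytes");
    }
}
```

The high-entropy input stays close to its original size, so compressing it costs CPU on write and read without saving space.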


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
   => CARBON Datasource don't support dictionary include column
   => carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support desc formatted
   => Carbon Datasource doesn't support the ALTER TABLE add column SQL
   =>TODO: ALTER TABLE for binary data type in carbon session
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS  for binary  
2.8 Support compaction for binary(TODO)
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.(TODO)
2.11 Support S3
 
CarbonSession: impact analysis


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and storage in CSV, encode as Hex, 
Base64
4.2 Spark load CSV and convert string to binary, and storage in 
CarbonData. CarbonData internal will decode Hex to binary.
4.3 Supporting insert (string => binary, configuration for 
encode/decode algorithm, default is Hex, user can change to base64 or others, 
is it ok?), update, delete for binary
4.4 Don’t support stream table.
=> refer hive and Spark2.4 image DataSource

Formal? How to support write into binary read from images in SQL?
Use spark core code is ok.  


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html



[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-12 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Supporting write binary data type by Carbon Java SDK.
2. Supporting read binary data type by Spark Carbon file format(carbon 
datasource) and CarbonSession.
3. Supporting read binary data type by Carbon SDK
4. Supporting write binary by spark


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.  
1.2 CarbonData compress binary column because now the compressor is 
table level.
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension.
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows).
  TODO: 1.5 Avro, JSON convert need consider


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
   => CARBON Datasource don't support dictionary include column
   => carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support show table, desc, ALTER TABLE for binary data type
   => Carbon Datasource doesn't support the ALTER TABLE add column SQL
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS for binary   
2.8 Support compaction for binary
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.
2.11 Support S3
 
CarbonSession: impact analysis


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and storage in CSV, encode as Hex, 
Base64
4.2 Spark load CSV and convert string to binary, and storage in 
CarbonData. CarbonData internal will decode Hex to binary.
4.3 Supporting insert (string => binary, configuration for 
encode/decode algorithm, default is Hex, user can change to base64 or others, 
is it ok?), update, delete for binary
4.4 Don’t support stream table.
=> refer hive and Spark2.4 image DataSource

Formal? How to support write into binary read from images in SQL?
Use spark core code is ok.  


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html




[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-12 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Supporting write binary data type by Carbon Java SDK.
2. Supporting read binary data type by Spark Carbon file format(carbon 
datasource) and CarbonSession.
3. Supporting read binary data type by Carbon SDK
4. Supporting write binary by spark


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.  
1.2 CarbonData compress binary column because now the compressor is 
table level.
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension.
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows).
  TODO: 1.5 Avro, JSON convert need consider


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
   => CARBON Datasource don't support dictionary include column
   => carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support show table, desc, ALTER TABLE for binary data type
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS for binary   
2.8 Support compaction for binary
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.
2.11 Support S3
 
CarbonSession: impact analysis


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and storage in CSV, encode as Hex, 
Base64
4.2 Spark load CSV and convert string to binary, and storage in 
CarbonData. CarbonData internal will decode Hex to binary.
4.3 Supporting insert (string => binary, configuration for 
encode/decode algorithm, default is Hex, user can change to base64 or others, 
is it ok?), update, delete for binary
4.4 Don’t support stream table.
=> refer hive and Spark2.4 image DataSource

Formal? How to support write into binary read from images in SQL?
Use spark core code is ok.  


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html




[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-12 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Supporting write binary data type by Carbon Java SDK.
2. Supporting read binary data type by Spark Carbon file format(carbon 
datasource) and CarbonSession.
3. Supporting read binary data type by Carbon SDK
4. Supporting write binary by spark


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.  
1.2 CarbonData compress binary column because now the compressor is 
table level.
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension.
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows).
  TODO: 1.5 Avro, JSON convert need consider


2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
=> carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support show table, desc, ALTER TABLE for binary data type
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS for binary   
2.8 Support compaction for binary
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.
2.11 Support S3
 
CarbonSession: impact analysis


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and storage in CSV, encode as Hex, 
Base64
4.2 Spark load CSV and convert string to binary, and storage in 
CarbonData. CarbonData internal will decode Hex to binary.
4.3 Supporting insert (string => binary, configuration for 
encode/decode algorithm, default is Hex, user can change to base64 or others, 
is it ok?), update, delete for binary
4.4 Don’t support stream table.
=> refer hive and Spark2.4 image DataSource

Formal? How to support write into binary read from images in SQL?
Use spark core code is ok.  


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html




[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-12 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Supporting write binary data type by Carbon Java SDK.[Formal]
2. Supporting read binary data type by Spark Carbon file format(carbon 
datasource) and CarbonSession.[Formal]
3. Supporting read binary data type by Carbon SDK
4. Supporting write binary by spark


Approach and Detail:
1.Supporting write binary data type by Carbon Java SDK [Formal]:
1.1 Java SDK needs support write data with specific data types, 
like int, double, byte[ ] data type, no need to convert all data type to string 
array. User read binary file as byte[], then SDK writes byte[] into binary 
column.  
1.2 CarbonData compress binary column because now the compressor is 
table level.
=>TODO, support configuration for compress, default is no 
compress because binary usually is already compressed, like jpg format image. 
So no need to uncompress for binary column. 1.5.4 will support column level 
compression, after that, we can implement no compress for binary. We can talk 
with community.
1.3 CarbonData stores binary as dimension.
1.4 Support configure page size for binary data type because binary 
data usually is big, such as 200k. Otherwise it will be very big for one 
blocklet (32000 rows).
TODO: 1.5 Avro, JSON convert need consider  
1.6

2. Supporting read and manage binary data type by Spark Carbon file 
format(carbon DataSource) and CarbonSession.[Formal]
2.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
2.2 Support create table with binary column, table property doesn’t 
support sort_columns, dictionary, COLUMN_META_CACHE, RANGE_COLUMN for binary 
column
=> Evaluate COLUMN_META_CACHE for binary
=> carbon.column.compressor for all columns
2.3 Support CTAS for binary=> transaction/non-transaction
2.4 Support external table for binary
2.5 Support projection for binary column
2.6 Support show table, desc, ALTER TABLE for binary data type
2.7 Don’t support PARTITION, filter, BUCKETCOLUMNS for binary   
2.8 Support compaction for binary
2.9 datamap? Don’t support bloomfilter, lucene, timeseries datamap, 
 no need min max datamap for binary, support mv and pre-aggregate in the future
2.10 CSDK / python SDK support binary in the future.
2.11 Support S3
 
CarbonSession: impact analysis


3. Supporting read binary data type by Carbon SDK
3.1 Supporting read binary data type from non-transaction table, 
read binary column and return as byte[]
3.2 Supporting projection for binary column
3.3 Supporting S3
3.4 no need to support filter.

4. Supporting write binary by spark (carbon file format / 
carbonsession, POC??)
4.1 Convert binary to String and storage in CSV, encode as Hex, 
Base64
4.2 Spark load CSV and convert string to binary, and storage in 
CarbonData. CarbonData internal will decode Hex to binary.
4.3 Supporting insert (string => binary, configuration for 
encode/decode algorithm, default is Hex, user can change to base64 or others, 
is it ok?), update, delete for binary
4.4 Don’t support stream table.
=> refer hive and Spark2.4 image DataSource

Formal? How to support write into binary read from images in SQL?
Use spark core code is ok.  


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html




[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-12 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for Supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type and is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary data. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and decreases the cost of 
accessing OBS by decreasing the number of S3 API calls. Storing structured 
data and unstructured (binary) data together in CarbonData also makes them 
easier to manage.

Goals:
1. Support writing the binary data type with the Carbon Java SDK. [Formal]
2. Support reading the binary data type with the Spark Carbon file format 
(carbon datasource) and CarbonSession. [Formal]
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary with Spark.


Approach and Detail:
1. Support writing the binary data type with the Carbon Java SDK [Formal]:
1.1 The Java SDK needs to support writing data with specific data types, 
such as int, double, and byte[], so there is no need to convert every data 
type to a string array. The user reads a binary file as byte[], then the SDK 
writes the byte[] into a binary column.
1.2 CarbonData compresses the binary column because the compressor is 
currently table level.
=> TODO: support a configuration for compression. The default is no 
compression because binary data is usually already compressed, like JPG 
images, so there is no need to compress and uncompress the binary column. 
1.5.4 will support column level compression; after that, we can implement no 
compression for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension.
1.4 Support configuring the page size for the binary data type because 
binary data is usually big, such as 200k. Otherwise one blocklet (32000 rows) 
will be very big.
TODO: 1.5 Avro and JSON conversion need to be considered.
1.6
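
To make the sizing concern in 1.4 concrete, here is a rough back-of-the-envelope 
sketch. It assumes every binary value is about 200 KB (reading "200k" as 
200 * 1024 bytes) and ignores compression and encoding overhead. The class 
name and the 64 MB target in the second calculation are illustrative 
assumptions, not CarbonData settings.

```java
// Hypothetical sizing helper, not part of CarbonData.
final class BlockletSizeEstimate {
    private BlockletSizeEstimate() { }

    // Raw bytes held by `rows` values of `bytesPerValue` bytes each.
    static long rawBytes(long rows, long bytesPerValue) {
        return Math.multiplyExact(rows, bytesPerValue);
    }

    // Largest row count (at least 1) whose raw size fits in targetBytes.
    static long rowsForTarget(long targetBytes, long bytesPerValue) {
        return Math.max(1L, targetBytes / bytesPerValue);
    }

    public static void main(String[] args) {
        long valueBytes = 200L * 1024;      // assumed ~200 KB per binary value
        long total = rawBytes(32_000, valueBytes);
        System.out.printf("32000 rows x 200 KB = %d bytes (%.2f GiB)%n",
                total, total / (1024.0 * 1024 * 1024));

        long target = 64L * 1024 * 1024;    // assumed 64 MB target, illustration only
        System.out.println("rows that fit in 64 MB: "
                + rowsForTarget(target, valueBytes));
    }
}
```

Under these assumptions 32000 rows hold about 6.5 GB of raw data, which is 
why a smaller, configurable page size is proposed for binary columns.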

2. Support reading and managing the binary data type with the Spark Carbon 
file format (carbon DataSource) and CarbonSession. [Formal]
2.1 Support reading the binary data type from a non-transactional table; 
read the binary column and return it as byte[].
2.2 Support creating a table with a binary column; the table properties 
sort_columns, dictionary, COLUMN_META_CACHE, and RANGE_COLUMN are not 
supported for a binary column.
=> Evaluate COLUMN_META_CACHE for binary
=> carbon.column.compressor for all columns
2.3 Support CTAS for binary => transactional/non-transactional
2.4 Support external tables for binary
2.5 Support projection for binary columns
2.6 Support show table, desc, and ALTER TABLE for the binary data type
2.7 PARTITION, filter, and BUCKETCOLUMNS are not supported for binary
2.8 Support compaction for binary
2.9 Datamap? bloomfilter, lucene, and timeseries datamaps are not supported; 
no min/max datamap is needed for binary; mv and pre-aggregate will be 
supported in the future.
2.10 CSDK / Python SDK will support binary in the future.
2.11 Support S3
 
CarbonSession: impact analysis


3. Support reading the binary data type with the Carbon SDK
3.1 Support reading the binary data type from a non-transactional table; 
read the binary column and return it as byte[].
3.2 Support projection for binary columns
3.3 Support S3
3.4 No need to support filter.

4. Support writing binary with Spark (carbon file format / 
CarbonSession, POC??)
4.1 Convert binary to a string and store it in CSV, encoded as Hex or 
Base64.
4.2 Spark loads the CSV, converts the string to binary, and stores it in 
CarbonData. CarbonData internally decodes Hex to binary.
4.3 Support insert (string => binary; the encode/decode algorithm is 
configurable, default is Hex, user can change it to Base64 or others, 
is it ok?), update, and delete for binary.
4.4 Stream tables are not supported.
=> Refer to Hive and the Spark 2.4 image DataSource.

Formal? How can SQL write binary data that is read from images?
Using Spark core code is OK.


 
mail list: 
http://apache-carbondata-dev-mailing-list-archive.1130556.n5.nabble.com/Discuss-CarbonData-supports-binary-data-type-td76828.html



  was:
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for supporting binary data type  Xubo   2019-4-10

Background :
Binary is basic data type and widely used in various scenarios. So it’s better 
to support binary data type in CarbonData. Download data from S3 will be slow 
when 

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-11 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Description: 
CarbonData supports binary data type



Version  Changes                                   Owner  Date
0.1      Init doc for supporting binary data type  Xubo   2019-4-10

Background:
Binary is a basic data type that is widely used in various scenarios, so it is 
better to support the binary data type in CarbonData. Downloading data from S3 
is slow when a dataset has lots of small binary objects. Most application 
scenarios involve storing small binary data in CarbonData, which avoids the 
small binary files problem, speeds up S3 access, and lowers the cost of 
accessing OBS by reducing the number of S3 API calls. It also makes it easier 
to manage structured data and unstructured (binary) data by storing both in 
CarbonData. 

Goals:
1. Support writing the binary data type with the Carbon Java SDK. [Formal]
2. Support reading the binary data type with the Spark Carbon file format 
(carbon datasource) and CarbonSession. [Formal]
3. Support reading the binary data type with the Carbon SDK.
4. Support writing binary with Spark.


Approach and Detail:
1. Support writing the binary data type with the Carbon Java SDK [Formal]:
1.1 The Java SDK needs to support writing data with specific data types, 
such as int, double, and byte[], so there is no need to convert every data 
type to a string array. The user reads a binary file as byte[], then the SDK 
writes the byte[] into a binary column.
1.2 CarbonData compresses the binary column because the compressor is 
currently table level.
=> TODO: support a configuration for compression. The default is no 
compression because binary data is usually already compressed, like JPG 
images, so there is no need to compress and uncompress the binary column. 
1.5.4 will support column level compression; after that, we can implement no 
compression for binary. We can discuss this with the community.
1.3 CarbonData stores binary as a dimension.
1.4 Support configuring the page size for the binary data type because 
binary data is usually big, such as 200k. Otherwise one blocklet (32000 rows) 
will be very big.
TODO: 1.5 Avro and JSON conversion need to be considered.
1.6
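
A minimal sketch of the typed-row idea in 1.1: the binary file is read as 
byte[] and placed in a row next to int and double values, with no conversion 
to a string array. The class TypedRowExample, the method buildRow, and the 
column layout (id, score, image) are hypothetical; handing the row to the SDK 
writer is intentionally left out because that API is not described here.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical illustration, not part of the CarbonData SDK.
final class TypedRowExample {
    private TypedRowExample() { }

    // Build one row that keeps each value in its own Java type:
    // Integer, Double, and the raw file content as byte[].
    static Object[] buildRow(int id, double score, Path binaryFile) throws IOException {
        byte[] content = Files.readAllBytes(binaryFile);
        return new Object[] {id, score, content};
    }

    public static void main(String[] args) throws IOException {
        // Write a small temporary file to stand in for an image.
        Path tmp = Files.createTempFile("sample", ".bin");
        try {
            Files.write(tmp, new byte[] {(byte) 0xFF, (byte) 0xD8, 0x00, 0x7F});
            Object[] row = buildRow(1, 0.5, tmp);
            System.out.println("binary column length: " + ((byte[]) row[2]).length);
        } finally {
            Files.deleteIfExists(tmp);
        }
    }
}
```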

2. Support reading and managing the binary data type with the Spark Carbon 
file format (carbon DataSource) and CarbonSession. [Formal]
2.1 Support reading the binary data type from a non-transactional table; 
read the binary column and return it as byte[].
2.2 Support creating a table with a binary column; the table properties 
sort_columns, dictionary, COLUMN_META_CACHE, and RANGE_COLUMN are not 
supported for a binary column.
=> Evaluate COLUMN_META_CACHE for binary
=> carbon.column.compressor for all columns
2.3 Support CTAS for binary => transactional/non-transactional
2.4 Support external tables for binary
2.5 Support projection for binary columns
2.6 Support show table, desc, and ALTER TABLE for the binary data type
2.7 PARTITION, filter, and BUCKETCOLUMNS are not supported for binary
2.8 Support compaction for binary
2.9 Datamap? bloomfilter, lucene, and timeseries datamaps are not supported; 
no min/max datamap is needed for binary; mv and pre-aggregate will be 
supported in the future.
2.10 CSDK / Python SDK will support binary in the future.
2.11 Support S3
 
CarbonSession: impact analysis


3. Support reading the binary data type with the Carbon SDK
3.1 Support reading the binary data type from a non-transactional table; 
read the binary column and return it as byte[].
3.2 Support projection for binary columns
3.3 Support S3
3.4 No need to support filter.

4. Support writing binary with Spark (carbon file format / 
CarbonSession, POC??)
4.1 Convert binary to a string and store it in CSV, encoded as Hex or 
Base64.
4.2 Spark loads the CSV, converts the string to binary, and stores it in 
CarbonData. CarbonData internally decodes Hex to binary.
4.3 Support insert (string => binary; the encode/decode algorithm is 
configurable, default is Hex, user can change it to Base64 or others, 
is it ok?), update, and delete for binary.
4.4 Stream tables are not supported.
=> Refer to Hive and the Spark 2.4 image DataSource.

Formal? How can SQL write binary data that is read from images?
Using Spark core code is OK.


 




  was:
Support Binary Data Type:
1. Support write and read binary data type by CarbonData Java SDK
2. Support  read binary data type by Spark Carbon File Format


> Support Binary Data Type
> 
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: xubo245
>Assignee: xubo245
>   

[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-09 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Attachment: (was: CarbonData support binary data type.pdf)

> Support Binary Data Type
> 
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: xubo245
>Assignee: xubo245
>Priority: Major
> Attachments: CarbonData support binary data type V0.1.pdf
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Support Binary Data Type:
> 1. Support write and read binary data type by CarbonData Java SDK
> 2. Support  read binary data type by Spark Carbon File Format



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-09 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Attachment: CarbonData support binary data type v0.1.pdf

> Support Binary Data Type
> 
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: xubo245
>Assignee: xubo245
>Priority: Major
> Attachments: CarbonData support binary data type V0.1.pdf
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Support Binary Data Type:
> 1. Support write and read binary data type by CarbonData Java SDK
> 2. Support  read binary data type by Spark Carbon File Format





[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-09 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Attachment: CarbonData support binary data type V0.1.pdf

> Support Binary Data Type
> 
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: xubo245
>Assignee: xubo245
>Priority: Major
> Attachments: CarbonData support binary data type V0.1.pdf
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Support Binary Data Type:
> 1. Support write and read binary data type by CarbonData Java SDK
> 2. Support  read binary data type by Spark Carbon File Format





[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-09 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Attachment: (was: CarbonData support binary data type v0.1.pdf)

> Support Binary Data Type
> 
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: xubo245
>Assignee: xubo245
>Priority: Major
> Attachments: CarbonData support binary data type V0.1.pdf
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Support Binary Data Type:
> 1. Support write and read binary data type by CarbonData Java SDK
> 2. Support  read binary data type by Spark Carbon File Format





[jira] [Updated] (CARBONDATA-3336) Support Binary Data Type

2019-04-09 Thread xubo245 (JIRA)


 [ 
https://issues.apache.org/jira/browse/CARBONDATA-3336?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

xubo245 updated CARBONDATA-3336:

Attachment: CarbonData support binary data type.pdf

> Support Binary Data Type
> 
>
> Key: CARBONDATA-3336
> URL: https://issues.apache.org/jira/browse/CARBONDATA-3336
> Project: CarbonData
>  Issue Type: New Feature
>Reporter: xubo245
>Assignee: xubo245
>Priority: Major
> Attachments: CarbonData support binary data type.pdf
>
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> Support Binary Data Type:
> 1. Support write and read binary data type by CarbonData Java SDK
> 2. Support  read binary data type by Spark Carbon File Format


