[GitHub] incubator-carbondata pull request #243: [parser][minor] load data not suppor...

2016-10-16 Thread scwf
GitHub user scwf opened a pull request:

https://github.com/apache/incubator-carbondata/pull/243

[parser][minor] load data not support local

It seems we do not support `load data local`, so this fixes it.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/scwf/incubator-carbondata patch-1

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/243.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #243


commit 3f34291350fdc99e24fbc0e506f4d8410720c797
Author: Fei Wang 
Date:   2016-10-16T15:59:26Z

load data new not support local




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-carbondata pull request #237: [CARBONDATA-317] - CSV having only s...

2016-10-16 Thread kumarvishal09
Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/237#discussion_r83549178
  
--- Diff: 
integration/spark/src/main/scala/org/apache/carbondata/spark/csv/CarbonCsvRelation.scala
 ---
@@ -148,6 +150,10 @@ case class CarbonCsvRelation protected[spark] (
   .withSkipHeaderRecord(false)
 CSVParser.parse(firstLine, csvFormat).getRecords.get(0).asScala.toArray
   }
+  if(null == firstRow) {
+    throw new DataLoadingException("Please check your input path and make sure " +
--- End diff --

Maybe the CSV file does not have a header and the user is passing the header
via the load command; in that case, is this a valid message?




[GitHub] incubator-carbondata pull request #194: [CARBONDATA-270] Double data type va...

2016-10-16 Thread kumarvishal09
Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/194#discussion_r83546209
  
--- Diff: 
core/src/main/java/org/apache/carbondata/scan/filter/FilterUtil.java ---
@@ -1426,4 +1423,25 @@ private static void getUnknownExpressionsList(Expression expression,
   getUnknownExpressionsList(child, lst);
 }
   }
+  /**
+   * This method will compare double values it will preserve
+   * the -0.0 and 0.0 equality as per == ,also preserve NaN equality check as per
+   * java.lang.Double.equals()
+   *
+   * @param d1 double value for equality check
+   * @param d2 double value for equality check
+   * @return boolean after comparing two double values.
+   */
+  public static int compare(Double d1, Double d2) {
--- End diff --

Move this method to DataTypeUtil, as it can be used from multiple places, and
change the method name to compareDouble.




[GitHub] incubator-carbondata pull request #194: [CARBONDATA-270] Double data type va...

2016-10-16 Thread kumarvishal09
Github user kumarvishal09 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/194#discussion_r83546183
  
--- Diff: 
core/src/main/java/org/apache/carbondata/scan/filter/FilterUtil.java ---
@@ -1426,4 +1423,25 @@ private static void getUnknownExpressionsList(Expression expression,
   getUnknownExpressionsList(child, lst);
 }
   }
+  /**
+   * This method will compare double values it will preserve
+   * the -0.0 and 0.0 equality as per == ,also preserve NaN equality check as per
+   * java.lang.Double.equals()
+   *
+   * @param d1 double value for equality check
+   * @param d2 double value for equality check
+   * @return boolean after comparing two double values.
+   */
+  public static int compare(Double d1, Double d2) {
+    if ((d1.doubleValue() == d2.doubleValue()) || (Double.isNaN(d1) && Double.isNaN(d2))) {
+      return 0;
+    }
+    if (d1 < d2) {
--- End diff --

Can't we use else-if and else here? Why are three separate if conditions required?




Re: Discussion how to create the CarbonData table with good performance

2016-10-16 Thread Liang Chen
Hi

Thanks for sharing this experience.
Can you put these FAQs on the CWIKI:
https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home

Regards
Liang



bill.zhou wrote
> Discussion how to create the CarbonData table with good performance

> Suggestion to create Carbon table

> Recently we used CarbonData for performance testing in the telecommunication
> field and summarized some suggestions for creating the CarbonData table.
> 
> We have tables which range from 10 thousand rows to 10 billion rows and
> have from 100 columns to 300 columns. Following are some of the columns
> used in the table.

> 
> Column name | Data type     | Cardinality | Attribution
> ----------- | ------------- | ----------- | -----------
> msisdn      | String        | 30 million  | dimension
> BEGIN_TIME  | bigint        | 10 thousand | dimension
> HOST        | String        | 1 million   | dimension
> Dime_1      | String        | 1 thousand  | dimension
> counter_1   | numeric(20,0) | NA          | measure
> ...         | ...           | NA          | ...
> counter_100 | numeric(20,0) | NA          | measure
> 

> We have more than 50 test cases; based on these test cases we summarize
> some suggestions for creating tables that give better query performance.
> 
> 1. Put the most frequently-used filter column first.
> 
> For example, if MSISDN is the filter used in most queries, then put
> MSISDN as the first column. The create table command can be as follows; a
> query which has MSISDN as a filter will perform well (but note that because
> MSISDN is a high-cardinality column, creating the table like this will
> decrease the compression ratio).
> 
> 
> create table carbondata_table(
>   msisdn String,
>   ...
> ) STORED BY 'org.apache.carbondata.format'
> TBLPROPERTIES (
>   'DICTIONARY_EXCLUDE'='MSISDN,..','DICTIONARY_INCLUDE'='...');
> 

> 2. If multiple columns are frequently used in filters, put them at the
> front, ordered from low cardinality to high cardinality.
> 
> For example, if msisdn, host and dime_1 are frequently-used filter columns,
> the table column order can be dime_1 -> host -> msisdn, because dime_1's
> cardinality is the lowest. The create table command can be as follows. This
> will increase the compression ratio and give good performance for filters on
> dime_1, host and msisdn.
> 
> 
> create table carbondata_table(
>   Dime_1 String,
>   HOST String,
>   MSISDN String,
>   ...
> ) STORED BY 'org.apache.carbondata.format'
> TBLPROPERTIES (
>   'DICTIONARY_EXCLUDE'='MSISDN,HOST..','DICTIONARY_INCLUDE'='Dime_1..');
> 

> 3. If no column is frequently used in filters, then order all the dimension
> columns from low cardinality to high cardinality. The create table command
> can be as follows:
> 
> 
> create table carbondata_table(
>   Dime_1 String,
>   BEGIN_TIME bigint,
>   HOST String,
>   MSISDN String,
>   ...
> ) STORED BY 'org.apache.carbondata.format'
> TBLPROPERTIES (
>   'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI..','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME..');
> 

> 4. For measures that do not need high accuracy, there is no need to use the
> numeric(20,0) data type; we suggest using double instead, which will improve
> query performance. In one test case, replacing numeric(20,0) with double
> improved the query time 5x, from 15 seconds to 3 seconds. The create table
> command can be as follows.
> 
> 
> create table carbondata_table(
>   Dime_1 String,
>   BEGIN_TIME bigint,
>   HOST String,
>   MSISDN String,
>   counter_1 double,
>   counter_2 double,
>   ...
>   counter_100 double
> ) STORED BY 'org.apache.carbondata.format'
> TBLPROPERTIES (
>   'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
> 

> 5. For a column whose value is always incremental, like start_time: for
> example, in one scenario we load data into carbon every day, and the
> start_time is incremental for each load. For this scenario you can put the
> start_time column at the end of the dimensions, because an always-incremental
> value can always make good use of the min/max index. The create table command
> can be as follows.
> 
> 
> create table carbondata_table(
>   Dime_1 String,
>   HOST String,
>   MSISDN String,
>   counter_1 double,
>   counter_2 double,
>   BEGIN_TIME bigint,
>   ...
>   counter_100 double
> ) STORED BY 'org.apache.carbondata.format'
> TBLPROPERTIES (
>   'DICTIONARY_EXCLUDE'='MSISDN,HOST,IMSI','DICTIONARY_INCLUDE'='Dime_1,END_TIME,BEGIN_TIME');
> 
> 

> One more point, on whether a dictionary is needed for a dimension: we
> suggest that if the cardinality is higher than 50 thousand, do not make it a
> dictionary column. Making a high-cardinality column a dictionary column will
> impact the load performance.
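The ordering and dictionary heuristics above can be sketched in code. This is a hypothetical illustration, not a CarbonData API: the class and method names are invented, the column cardinalities come from the table earlier in the thread, and the 50-thousand cutoff is the threshold suggested above.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ColumnOrderSketch {
    // Cardinality cutoff suggested in the thread: above this,
    // do not dictionary-encode the column.
    static final long DICTIONARY_CARDINALITY_LIMIT = 50_000L;

    // Order dimension columns from low to high cardinality.
    static List<String> orderByCardinality(Map<String, Long> cardinality) {
        return cardinality.entrySet().stream()
            .sorted(Map.Entry.comparingByValue())
            .map(Map.Entry::getKey)
            .collect(Collectors.toList());
    }

    // Columns cheap enough to dictionary-encode, in the same order.
    static List<String> dictionaryInclude(Map<String, Long> cardinality) {
        return orderByCardinality(cardinality).stream()
            .filter(c -> cardinality.get(c) <= DICTIONARY_CARDINALITY_LIMIT)
            .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // Example cardinalities from the table in this thread.
        Map<String, Long> card = new LinkedHashMap<>();
        card.put("msisdn", 30_000_000L);
        card.put("BEGIN_TIME", 10_000L);
        card.put("HOST", 1_000_000L);
        card.put("Dime_1", 1_000L);

        System.out.println("column order: " + orderByCardinality(card));
        // column order: [Dime_1, BEGIN_TIME, HOST, msisdn]
        System.out.println("DICTIONARY_INCLUDE: " + dictionaryInclude(card));
        // DICTIONARY_INCLUDE: [Dime_1, BEGIN_TIME]
    }
}
```

With this ordering, a DDL generator would put Dime_1 first and msisdn last, and place HOST and msisdn in DICTIONARY_EXCLUDE, matching example 3 above.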





[GitHub] incubator-carbondata pull request #194: [CARBONDATA-270] Double data type va...

2016-10-16 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/194#discussion_r83546963
  
--- Diff: 
core/src/main/java/org/apache/carbondata/scan/filter/FilterUtil.java ---
@@ -1426,4 +1423,25 @@ private static void getUnknownExpressionsList(Expression expression,
   getUnknownExpressionsList(child, lst);
 }
   }
+  /**
+   * This method will compare double values it will preserve
+   * the -0.0 and 0.0 equality as per == ,also preserve NaN equality check as per
+   * java.lang.Double.equals()
+   *
+   * @param d1 double value for equality check
+   * @param d2 double value for equality check
+   * @return boolean after comparing two double values.
+   */
+  public static int compare(Double d1, Double d2) {
+    if ((d1.doubleValue() == d2.doubleValue()) || (Double.isNaN(d1) && Double.isNaN(d2))) {
+      return 0;
+    }
+    if (d1 < d2) {
--- End diff --

Since we return as soon as any condition matches, I think if vs. if-else makes
no difference in this context.




[GitHub] incubator-carbondata pull request #194: [CARBONDATA-270] Double data type va...

2016-10-16 Thread sujith71955
Github user sujith71955 commented on a diff in the pull request:

https://github.com/apache/incubator-carbondata/pull/194#discussion_r83547082
  
--- Diff: 
core/src/main/java/org/apache/carbondata/scan/filter/FilterUtil.java ---
@@ -1426,4 +1423,25 @@ private static void getUnknownExpressionsList(Expression expression,
   getUnknownExpressionsList(child, lst);
 }
   }
+  /**
+   * This method will compare double values it will preserve
+   * the -0.0 and 0.0 equality as per == ,also preserve NaN equality check as per
+   * java.lang.Double.equals()
+   *
+   * @param d1 double value for equality check
+   * @param d2 double value for equality check
+   * @return boolean after comparing two double values.
+   */
+  public static int compare(Double d1, Double d2) {
+    if ((d1.doubleValue() == d2.doubleValue()) || (Double.isNaN(d1) && Double.isNaN(d2))) {
+      return 0;
+    }
+    if (d1 < d2) {
--- End diff --

One condition check can be saved with an else; it's fine, I will update as per
your comments. Thanks for reviewing.
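For illustration, the agreed shape of this helper (renamed compareDouble and using an if / else-if / else chain, as suggested in the review) might look like the following hypothetical sketch. It keeps the semantics discussed in the thread: -0.0 equals 0.0 (as with ==) and NaN equals NaN (as with java.lang.Double.equals()).

```java
public final class DoubleCompareSketch {
    // Hypothetical sketch of the compare method discussed above, not the
    // committed CarbonData code: treats -0.0 and 0.0 as equal (like ==,
    // unlike Double.compare) and NaN as equal to NaN (like
    // java.lang.Double.equals(), unlike ==).
    public static int compareDouble(Double d1, Double d2) {
        if (d1.doubleValue() == d2.doubleValue()
                || (Double.isNaN(d1) && Double.isNaN(d2))) {
            return 0;
        } else if (d1 < d2) {
            return -1;
        } else {
            return 1;
        }
    }

    public static void main(String[] args) {
        System.out.println(compareDouble(-0.0d, 0.0d));            // 0 (Double.compare would give -1)
        System.out.println(compareDouble(Double.NaN, Double.NaN)); // 0 (== would give false)
        System.out.println(compareDouble(1.0d, 2.0d));             // -1
        System.out.println(compareDouble(3.0d, 2.0d));             // 1
    }
}
```

Note that when exactly one argument is NaN, both comparisons are false and the method returns 1, matching the behavior of the original three-if version.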




Re: Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-16 Thread Liang Chen
Hi Vimal

Thank you for starting the discussion.
Since keys of the Map type can only be primitive, can you list the types which
will be supported? (Int, String, Double, ...)

For more convenient discussion, you can go ahead and use Google Docs.
After the design document is finalized, please archive it and upload it to the
cwiki: https://cwiki.apache.org/confluence/display/CARBONDATA/CarbonData+Home

Regards
Liang


Vimal Das Kammath wrote
> Hi All,
> 
> This discussion is regarding support for Map Data type in Carbon Data.
> 
> Carbon Data supports complex and nested data types such as Arrays and
> Structs. However, Carbon Data does not support other complex data types,
> such as Maps and Union, which are generally supported by popular open-source
> file formats.
> 
> 
> Supporting Map data type will require changes/additions to the DDL, Query
> Syntax, Data Loading and Storage.
> 
> 
> I have hosted the design on google docs for review and discussion.
> 
> https://docs.google.com/document/d/1U6wPohvdDHk0B7bONnVHWa6PKG8R9q5-oKMqzMMQHYY/edit?usp=sharing
> 
> 
> Below is the same inline.
> 
> 
> 1.  DDL Changes
> 
> Maps are key->value data types and where the value can be fetched by
> providing the key. Hence we need to restrict keys to primitive data types
> whereas values can be of any data type supported in Carbon(primitive and
> complex).
> 
> Map data types can be defined in the create table DDL as:
> 
> “MAP<primitive_data_type, data_type>”
> 
> For example:
> 
> create table example_table (id Int, name String, salary Int,
> salary_breakup map<String, Int>, city String)
> 
> 
> 2.  Data Loading Changes
> 
> Carbon should be able to support loading data into tables with Map type
> columns from csv files. It should be possible to represent maps in a
> single
> row of csv. This will need carbon to support specifying the delimiters for
> :-
> 
> 1. Between two Key-Value pairs
> 
> 2. Between each Key and Value in a pair
> 
> As Carbon already supports Struct and Array complex types, the data loading
> process already provides support for defining delimiters for complex data
> types. Carbon provides two optional parameters for data loading:
> 
> 1. COMPLEX_DELIMITER_LEVEL_1: will define the delimiter between two
> Key-Value pairs
> 
> OPTIONS('COMPLEX_DELIMITER_LEVEL_1'='$')
> 
> 2. COMPLEX_DELIMITER_LEVEL_2: will define the delimiter between each
> Key and Value in a pair
> 
> OPTIONS('COMPLEX_DELIMITER_LEVEL_2'=':')
> 
> With these delimiter options, the below map can be represented in csv:-
> 
> Fixed->100,000
> 
> Bonus->30,000
> 
> Stock->40,000
> 
> As
> 
> Fixed:100,000$Bonus:30,000$Stock:40,000 in the csv file.
> 
> 
> 
> 3.  Query Capabilities
> 
> A complex datatype like Map will require additional operators to be
> supported in the query language to fully utilize the strength of the data
> type.
> 
> Maps are sequences of key-value pairs, and hence should support looking up
> the value for a given key. Users could use the ColumnName[“key”] syntax to
> look up values in a map column. For example: salary_breakup[“Fixed”] could
> be used to fetch only the Fixed component of the salary breakup.
> 
> In addition, we also need to define how maps can be used in existing
> constructs such as select, where (filter), group by, etc.
> 
> 1. Select: A map column can be selected directly, or only the value for a
> given key can be selected, as required. For example: “Select name, salary,
> salary_breakup” will return the content of the map along with each row.
> “Select name, salary, salary_breakup[“Fixed”]” will return only the one
> value from the map whose key is “Fixed”.
> 
> 2. Filter: A map column cannot be used directly in a where clause, as a
> where clause can operate only on primitive data types. However, the map
> lookup operator can be used in where clauses. For example: “Select name,
> salary where salary_breakup[“Bonus”]>10,000”. *Note: if the value is not of
> primitive type, further accessor operators need to be used, depending on the
> type of the value, to arrive at a primitive type for the filter expression
> to be valid.*
> 
> 3. Group By: Just like with filters, maps cannot be used directly in a
> group by clause; however, the lookup operator can be used.
> 
> 4. Functions: A size() function can be provided for map types to
> determine the number of key-value pairs in a map.
> 4.  Storage changes
> 
> As Carbon is a columnar data store, Map values will be stored using 3
> physical columns:
> 
> 1. One column representing the Map data type. It will store the number of
> fields and the start index, the same way as is done for Structs and Arrays.
> 
> 2. One column for the key.
> 
> 3. One column for the value, if the value is of a primitive data type;
> otherwise the value itself will be multiple physical columns, depending on
> the data type of the value.
> 
> For a Map<String,Int> column:
> 
> Column_1           | Column_2               | Column_3
> ------------------ | ---------------------- | ------------------------
> Map_Salary_Breakup | Map_Salary_Breakup.key | Map_Salary_Breakup.value
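The delimiter scheme proposed above (COMPLEX_DELIMITER_LEVEL_1 between key-value pairs, COMPLEX_DELIMITER_LEVEL_2 within a pair) can be illustrated with a small stand-alone sketch. This is hypothetical illustration code, not CarbonData's actual loader; the class and method names are invented for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

public class MapFieldParseSketch {
    // Parse one map-typed CSV field: pairDelim separates key-value pairs
    // (COMPLEX_DELIMITER_LEVEL_1, e.g. '$'), kvDelim separates a key from
    // its value (COMPLEX_DELIMITER_LEVEL_2, e.g. ':').
    static Map<String, String> parseMapField(String field,
                                             String pairDelim,
                                             String kvDelim) {
        Map<String, String> result = new LinkedHashMap<>();
        // Pattern.quote so that regex metacharacters like '$' are literal.
        for (String pair : field.split(Pattern.quote(pairDelim))) {
            String[] kv = pair.split(Pattern.quote(kvDelim), 2);
            result.put(kv[0], kv.length > 1 ? kv[1] : null);
        }
        return result;
    }

    public static void main(String[] args) {
        // The example field from the design doc above.
        Map<String, String> m =
            parseMapField("Fixed:100,000$Bonus:30,000$Stock:40,000", "$", ":");
        System.out.println(m);
        // {Fixed=100,000, Bonus=30,000, Stock=40,000}
    }
}
```

Note the values here contain commas, which is exactly why separate complex-type delimiters are needed on top of the ordinary CSV field delimiter.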

Re: Discussion(New feature) Support Complex Data Type: Map in Carbon Data

2016-10-16 Thread Ravindra Pesala
Hi Vimal,

Design doc looks clear; can you also add the file format storage design for the
map datatype?

Regards,
Ravi.

On 17 October 2016 at 07:43, Liang Chen  wrote:

> [full quote of the previous two messages trimmed]

[GitHub] incubator-carbondata pull request #244: [CARBONDATA-300] Added Encoder proce...

2016-10-16 Thread ravipesala
GitHub user ravipesala opened a pull request:

https://github.com/apache/incubator-carbondata/pull/244

[CARBONDATA-300] Added Encoder processor for dataloading.

Added an interface implementation for the encode data load processor.



You can merge this pull request into a Git repository by running:

$ git pull https://github.com/ravipesala/incubator-carbondata encode-step

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-carbondata/pull/244.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #244


commit e6eb5d1137bd82beed642e7206ebb566dcb81fa3
Author: ravipesala 
Date:   2016-10-17T05:03:35Z

Added Encoder processor for dataloding.



