Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-19 Thread Kumar Vishal
+1
Now users will have the flexibility to choose the output format, and will get
a performance benefit if dictionary files are already generated.

-Regards
Kumar Vishal




Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread Ravindra Pesala
+1 to having separate output formats; users now have the flexibility to choose
as per their scenario.



RE: [DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread Jihong Ma

It is a great idea to have separate OutputFormats for regular Carbon data files,
index files, and metadata files (for instance: dictionary file, schema file,
global index file, etc.) for writing Carbon-generated files laid out on HDFS,
and it is orthogonal to the actual data load process.

Regards.

Jihong



Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread QiangCai
+1. We should flexibly choose the loading solution according to Scenario 1 and
2, and we will get performance benefits.



--
View this message in context: 
http://apache-carbondata-mailing-list-archive.1130556.n5.nabble.com/DISCUSSION-CarbonData-loading-solution-discussion-tp4490p4520.html
Sent from the Apache CarbonData Mailing List archive mailing list archive at 
Nabble.com.

Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread Liang Chen
Hi Jacky

Thanks for starting a good discussion.

Let me see if I understand your points:
Scenario 1 is like the current data load solution (0.2.0). 1.0.0 will provide
a new "single-pass data loading" option to meet this kind of scenario: for
subsequent data loads, if most of the dictionary codes have already been
built, the user can add the "single-pass data loading" option to the data
load command to reduce scans (which can improve performance).

+1 to adding the "single-pass data loading" solution, if my understanding is
correct.

Regards
Liang









Re: [DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread Jacky Li

Hi community, 

Sorry for the incorrect formatting of the previous post. I have corrected it
in this post.

Since CarbonData has a global dictionary feature, loading data into CarbonData
currently requires two scans of the input data. The first scan generates the
dictionary; the second scan does the actual data encoding and writes the
carbon files. This approach is simple, but it has at least two problems:
1. It involves unnecessary IO reads.
2. It needs two jobs for a MapReduce application to write carbon files.

To solve this, we need a single-pass data loading solution, as discussed
earlier, and the community is now developing it (CARBONDATA-401, PR310).
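
To make the difference between the two approaches concrete, here is a minimal
sketch in Python. It is only an illustration, not CarbonData code: the function
names and the in-memory dict standing in for the global dictionary are
assumptions, and a real global dictionary would be shared across tasks and
persisted.

```python
from typing import Dict, Iterable, List


def two_pass_encode(rows: List[str]) -> List[int]:
    """Two scans: build the full dictionary first, then encode."""
    # Scan 1: collect distinct values and assign each a surrogate key.
    dictionary: Dict[str, int] = {}
    for value in rows:
        if value not in dictionary:
            dictionary[value] = len(dictionary) + 1
    # Scan 2: read the input again and replace each value with its key.
    return [dictionary[value] for value in rows]


def single_pass_encode(rows: Iterable[str], dictionary: Dict[str, int]) -> List[int]:
    """One scan: look up each value, adding missing ones on the fly."""
    encoded = []
    for value in rows:
        # setdefault returns the existing key, or stores and returns a new one.
        key = dictionary.setdefault(value, len(dictionary) + 1)
        encoded.append(key)
    return encoded
```

Note that the two-pass version needs input it can read twice, while the
single-pass version works on a one-shot iterator and benefits from a dictionary
that earlier loads have already filled.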

In this post, I want to discuss the OutputFormat part. I think there will be
two OutputFormats for CarbonData:
1. DictionaryOutputFormat, which is used for global dictionary generation.
(This should be extracted from CarbonColumnDictGeneratRDD.)
2. TableOutputFormat, which is used for writing CarbonData files.

Once carbon has these output formats, it is easier to integrate with compute
frameworks like Spark, Hive, and MapReduce.
And to make data loading faster, users can choose different solutions based
on their scenario, as follows:

Scenario 1: First load is small (cannot cover most of the dictionary)
1) For the first few loads
Run two jobs that use DictionaryOutputFormat and TableOutputFormat
respectively.

2) After some loads
It becomes like Scenario 2, so the user can just run one job that uses
TableOutputFormat with single-pass support.

Scenario 2: First load is big (can cover most of the dictionary)
1) For the first load
If the biggest column cardinality > 10K, run two jobs using the two output
formats. Otherwise, run one job that uses TableOutputFormat with single-pass
support.

2) For subsequent loads
Run one job that uses TableOutputFormat with single-pass support.
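
The scenarios above can be sketched as a small decision function. This is a
hypothetical Python illustration only: the function name, its parameters, and
the `dictionary_mostly_built` flag (standing in for "after some loads") are
assumptions, since the proposal does not say how that point is detected.

```python
CARDINALITY_THRESHOLD = 10_000  # the "10K" from the proposal

TWO_JOBS = ["DictionaryOutputFormat", "TableOutputFormat"]
ONE_JOB = ["TableOutputFormat (single-pass)"]


def choose_jobs(first_load_is_big: bool,
                is_first_load: bool,
                dictionary_mostly_built: bool,
                biggest_cardinality: int) -> list:
    """Return the output formats to run for one load (illustrative only)."""
    if first_load_is_big:
        # Scenario 2: only a high-cardinality first load needs two jobs.
        if is_first_load and biggest_cardinality > CARDINALITY_THRESHOLD:
            return TWO_JOBS
        return ONE_JOB
    # Scenario 1: two jobs until the dictionary covers most values.
    if dictionary_mostly_built:
        return ONE_JOB
    return TWO_JOBS
```

For example, `choose_jobs(True, True, False, 20_000)` selects the two-job path,
while any subsequent load in Scenario 2 selects the single-pass job.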

What do you think of this idea?

Regards, 
Jacky





[DISCUSSION] CarbonData loading solution discussion

2016-12-15 Thread Jacky Li

Hi community,

Since CarbonData has a global dictionary feature, loading data into CarbonData
currently requires two scans of the input data. The first scan generates the
dictionary; the second scan does the actual data encoding and writes the carbon
files. This approach is simple, but it has at least two problems:
1. It involves unnecessary IO reads.
2. It needs two jobs for a MapReduce application to write carbon files.

To solve this, we need a single-pass data loading solution, as discussed earlier,
and the community is now developing it (CARBONDATA-401, PR310).

In this post, I want to discuss the OutputFormat part. I think there will be
two OutputFormats for CarbonData:
1. DictionaryOutputFormat, which is used for global dictionary generation.
(This should be extracted from CarbonColumnDictGeneratRDD.)
2. TableOutputFormat, which is used for writing CarbonData files.

Once carbon has these output formats, it is easier to integrate with compute
frameworks like Spark, Hive, and MapReduce.
And to make data loading faster, users can choose different solutions based on
their scenario, as follows
Scenario 1: First load is small (cannot cover most of the dictionary)

run two jobs that use DictionaryOutputFormat and TableOutputFormat respectively,
in the first few loads
after some loads, it becomes like Scenario 2, run one job that uses
TableOutputFormat with single-pass
Scenario 2: First load is big (can cover most of the dictionary)

for the first load
if the biggest column cardinality > 10K, run two jobs using the two output formats
otherwise, run one job that uses TableOutputFormat with single-pass
for subsequent loads, run one job that uses TableOutputFormat with single-pass
What do you think of this idea?

Regards,
Jacky