Re: [ANNOUNCE] Hexiaoqiao as new Apache CarbonData committer

2017-02-20 Thread Xiaoqiao He
Hi PPMC, Liang,

It is my honor to receive the invitation, and I am very happy to have the
chance to participate in building the CarbonData community as well. I will
keep contributing to Apache CarbonData and continue promoting the practical
application of CarbonData.

Thank you again, and I hope CarbonData develops even better in the future.

Best Regards.
Hexiaoqiao


On Tue, Feb 21, 2017 at 9:26 AM, Liang Chen  wrote:

> Hi all
>
> We are pleased to announce that the PPMC has invited Hexiaoqiao as a new
> Apache CarbonData committer, and the invitation has been accepted!
>
> Congrats to Hexiaoqiao and welcome aboard.
>
> Regards
> Liang
>


Re: Exception throws when I load data using carbondata-1.0.0

2017-02-19 Thread Xiaoqiao He
Hi Ravindra,

Thanks for your suggestions. But I met another problem when creating a table
and loading data.

1. I did follow the README to compile and build CarbonData, via
https://github.com/apache/incubator-carbondata/blob/master/build/README.md :

> mvn -DskipTests -Pspark-1.6 -Dspark.version=1.6.2 clean package


2. I think the exceptions mentioned above (ClassNotFoundException / 'exists
and does not match') are related to the configuration item
'spark.executor.extraClassPath'. When I traced the executor logs, I found
that each executor tries to load classes from the same path as the
spark.executor.extraClassPath config and cannot find them locally (this
local path is valid only for the driver), so it throws the exception. When I
remove this item from the configuration and run the same command with the
--jars parameter, the exception is not thrown again.
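
For illustration, here is a minimal sketch of the two ways of putting the
CarbonData assembly jar on the executor classpath; the paths below are
placeholders, not my real deployment paths. spark.jars behaves roughly like
the --jars option (Spark ships the jar to every executor), while
spark.executor.extraClassPath only prepends a path that must already exist
locally on every executor host:

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder HDFS path for the assembly jar.
val assemblyJar = "hdfs://hacluster/user/hadoop/carbonlib/carbondata-assembly.jar"

val conf = new SparkConf()
  .setAppName("carbondata-classpath-check")
  .setMaster("yarn-client")
  // Like `--jars`: Spark distributes the jar to every executor and adds it to
  // the executor classpath, so the class can be resolved on any node.
  .set("spark.jars", assemblyJar)
  // By contrast, this only prepends a local path on each executor host; if the
  // path exists only on the driver, executors fail with ClassNotFoundException.
  // .set("spark.executor.extraClassPath", "/opt/spark/carbonlib/carbondata-assembly.jar")

val sc = new SparkContext(conf)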

3. But when I create the table following the quick start as below:

> scala> cc.sql("CREATE TABLE IF NOT EXISTS sample (id string, name string,
> city string, age Int) STORED BY 'carbondata'")


there are some info logs such as:

> INFO  20-02 12:00:35,690 - main Query [CREATE TABLE TEST.SAMPLE USING
> CARBONDATA OPTIONS (TABLENAME "TEST.SAMPLE", TABLEPATH
> "/HOME/PATH/HEXIAOQIAO/CARBON.STORE/TEST/SAMPLE") ]

and *TABLEPATH does not look like the proper path (I have no idea why this
path is not an HDFS path)*. I then load data as below, but another exception is thrown.

> scala> cc.sql("LOAD DATA INPATH
> 'hdfs://hacluster/user/hadoop-data/sample.csv' INTO TABLE sample")


there are some info logs such as:

> INFO  20-02 12:01:27,608 - main HDFS lock
> path:hdfs://hacluster/home/path/hexiaoqiao/carbon.store/test/sample/meta.lock

*This lock path is not the expected HDFS path; it looks like [hdfs
scheme://authority] + the local setup path of CarbonData. (Is storelocation
not taking effect?)*
Then the following exception is thrown:

> INFO  20-02 12:01:42,668 - Table MetaData Unlocked Successfully after data
> load
> java.lang.RuntimeException: Table is locked for updation. Please try after
> some time
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.execution.command.LoadTable.run(carbonTableSchema.scala:360)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:58)
> at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:56)
> at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:70)

 ..


CarbonData Configuration:
carbon.storelocation=hdfs://hacluster/tmp/carbondata/carbon.store
carbon.lock.type=HDFSLOCK
FYI.
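
For reference, a minimal sketch (following the quick-start guide, not taken
from my actual session) of constructing the CarbonContext for Spark 1.6 with
an explicit HDFS store path; my assumption is that when the store path is not
passed (or points to a local directory), TABLEPATH and the meta.lock path end
up under a local-style path as seen above:

import org.apache.spark.sql.CarbonContext

// Matches the carbon.storelocation shown above; carbon.lock.type=HDFSLOCK
// stays in conf/carbon.properties.
val storePath = "hdfs://hacluster/tmp/carbondata/carbon.store"

// `sc` is the SparkContext provided by spark-shell.
val cc = new CarbonContext(sc, storePath)

cc.sql("CREATE TABLE IF NOT EXISTS sample (id string, name string, city string, age Int) STORED BY 'carbondata'")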

Regards,
Hexiaoqiao


On Sat, Feb 18, 2017 at 3:26 PM, Ravindra Pesala <ravi.pes...@gmail.com>
wrote:

> Hi Xiaoqiao,
>
> Does the problem still exist?
> Can you try a clean build with the "mvn clean -DskipTests -Pspark-1.6
> package" command?
>
> Regards,
> Ravindra.
>
> On 16 February 2017 at 08:36, Xiaoqiao He <xq.he2...@gmail.com> wrote:
>
> > hi Liang Chen,
> >
> > Thank for your help. It is true that i install and configure carbondata
> on
> > "spark on yarn" cluster following installation guide (
> > https://github.com/apache/incubator-carbondata/blob/
> > master/docs/installation-guide.md#installing-and-
> > configuring-carbondata-on-spark-on-yarn-cluster
> > ).
> >
> > Best Regards,
> > Heixaoqiao
> >
> >
> > On Thu, Feb 16, 2017 at 7:47 AM, Liang Chen <chenliang6...@gmail.com>
> > wrote:
> >
> > > Hi He xiaoqiao
> > >
> > > Quick start is local model spark.
> > > Your case is yarn cluster , please check :
> > > https://github.com/apache/incubator-carbondata/blob/
> > > master/docs/installation-guide.md
> > >
> > > Regards
> > > Liang
> > >
> > > 2017-02-15 3:29 GMT-08:00 Xiaoqiao He <xq.he2...@gmail.com>:
> > >
> > > > hi Manish Gupta,
> > > >
> > > > Thanks for you focus, actually i try to load data following
> > > > https://github.com/apache/incubator-carbondata/blob/
> > > > master/docs/quick-start-guide.md
> > > > for deploying carbondata-1.0.0.
> > > >
> > > > 1.when i execute carbondata by `bin/spark-shell`, it throws as above.
> > > > 2.when i execute carbondata by `bin/spark-shell --jars
> > > > carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`,
> it
> > > > throws another exception as below:
> > > >
> > > > org.apache.spark.SparkException: Job aborted due to stage failure:
> > Task
> > > 0
> > > > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> > > stage
> > > > &

Re: Exception throws when I load data using carbondata-1.0.0

2017-02-15 Thread Xiaoqiao He
hi Liang Chen,

Thanks for your help. It is true that I installed and configured CarbonData on
a "Spark on YARN" cluster following the installation guide (
https://github.com/apache/incubator-carbondata/blob/master/docs/installation-guide.md#installing-and-configuring-carbondata-on-spark-on-yarn-cluster
).

Best Regards,
Hexiaoqiao


On Thu, Feb 16, 2017 at 7:47 AM, Liang Chen <chenliang6...@gmail.com> wrote:

> Hi He xiaoqiao
>
> The quick start uses Spark in local mode.
> Your case is a YARN cluster, so please check:
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/installation-guide.md
>
> Regards
> Liang
>
> 2017-02-15 3:29 GMT-08:00 Xiaoqiao He <xq.he2...@gmail.com>:
>
> > hi Manish Gupta,
> >
> > Thanks for you focus, actually i try to load data following
> > https://github.com/apache/incubator-carbondata/blob/
> > master/docs/quick-start-guide.md
> > for deploying carbondata-1.0.0.
> >
> > 1.when i execute carbondata by `bin/spark-shell`, it throws as above.
> > 2.when i execute carbondata by `bin/spark-shell --jars
> > carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`, it
> > throws another exception as below:
> >
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0
> > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> stage
> > > 0.0 (TID 3, [task hostname]): org.apache.spark.SparkException: File
> > > ./carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar exists and
> does
> > > not match contents of
> > > http://master:50843/jars/carbondata_2.10-1.0.0-
> > incubating-shade-hadoop2.7.1.jar
> >
> >
> > I check the assembly jar and CarbonBlockDistinctValuesCombineRDD is
> > present
> > actually.
> >
> > anyone who meet the same problem?
> >
> > Best Regards,
> > Hexiaoqiao
> >
> >
> > On Wed, Feb 15, 2017 at 12:56 AM, manish gupta <
> tomanishgupt...@gmail.com>
> > wrote:
> >
> > > Hi,
> > >
> > > I think the carbon jar is compiled properly. Can you use any decompiler
> > and
> > > decompile carbondata-spark-common-1.1.0-incubating-SNAPSHOT.jar
> present
> > in
> > > spark-common module target folder and check whether the required class
> > file
> > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD is
> > > present or not.
> > >
> > > If you are using only the assembly jar then decompile and check in
> > assembly
> > > jar.
> > >
> > > Regards
> > > Manish Gupta
> > >
> > > On Tue, Feb 14, 2017 at 11:19 AM, Xiaoqiao He <xq.he2...@gmail.com>
> > wrote:
> > >
> > > >  hi, dev,
> > > >
> > > > The latest release version apache-carbondata-1.0.0-incubating-rc2
> > which
> > > > takes Spark-1.6.2 to build throws exception `
> > > > java.lang.ClassNotFoundException:
> > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD`
> > > when
> > > > i
> > > > load data following Quick Start Guide.
> > > >
> > > > Env:
> > > > a. CarbonData-1.0.0-incubating-rc2
> > > > b. Spark-1.6.2
> > > > c. Hadoop-2.7.1
> > > > d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.
> > > >
> > > > any suggestions? Thank you.
> > > >
> > > > The exception stack trace as below:
> > > >
> > > > 
> > > > ERROR 14-02 12:21:02,005 - main generate global dictionary failed
> > > > org.apache.spark.SparkException: Job aborted due to stage failure:
> > Task
> > > 0
> > > > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in
> > stage
> > > > 0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
> > > > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
> > > >  at
> > > > org.apache.spark.repl.ExecutorClassLoader.findClass(
> > > > ExecutorClassLoader.scala:84)
> > > >
> > > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> > > >  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> > > >  at java.lang.Class.forName0(Native Method)
> > > >  at java.lang.Class.forName(Class.java:274)
> > > >  at
> > > > org.apache.spark.serializer.JavaDeserial

Re: Exception throws when I load data using carbondata-1.0.0

2017-02-15 Thread Xiaoqiao He
hi Manish Gupta,

Thanks for your attention. Actually I am trying to load data following
https://github.com/apache/incubator-carbondata/blob/master/docs/quick-start-guide.md
to deploy carbondata-1.0.0.

1. When I execute CarbonData via `bin/spark-shell`, it throws the exception above.
2. When I execute CarbonData via `bin/spark-shell --jars
carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar`, it
throws another exception as below:

org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
> in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> 0.0 (TID 3, [task hostname]): org.apache.spark.SparkException: File
> ./carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar exists and does
> not match contents of
> http://master:50843/jars/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar


I checked the assembly jar, and CarbonBlockDistinctValuesCombineRDD is
actually present.
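
In case it helps others reproduce the check, here is a minimal sketch of how
the jar can be inspected for the class entry (the jar path is the local
carbonlib copy used above):

import java.util.jar.JarFile
import scala.collection.JavaConverters._

val jarPath = "carbonlib/carbondata_2.10-1.0.0-incubating-shade-hadoop2.7.1.jar"
val className = "org/apache/carbondata/spark/rdd/CarbonBlockDistinctValuesCombineRDD.class"

val jar = new JarFile(jarPath)
try {
  // Scan all jar entries for the compiled class file.
  val found = jar.entries().asScala.exists(_.getName == className)
  println(s"$className present in $jarPath: $found")
} finally {
  jar.close()
}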

Has anyone met the same problem?

Best Regards,
Hexiaoqiao


On Wed, Feb 15, 2017 at 12:56 AM, manish gupta <tomanishgupt...@gmail.com>
wrote:

> Hi,
>
> I think the carbon jar is compiled properly. Can you use any decompiler and
> decompile carbondata-spark-common-1.1.0-incubating-SNAPSHOT.jar present in
> spark-common module target folder and check whether the required class file
> org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD is
> present or not.
>
> If you are using only the assembly jar then decompile and check in assembly
> jar.
>
> Regards
> Manish Gupta
>
> On Tue, Feb 14, 2017 at 11:19 AM, Xiaoqiao He <xq.he2...@gmail.com> wrote:
>
> >  hi, dev,
> >
> > The latest release version apache-carbondata-1.0.0-incubating-rc2 which
> > takes Spark-1.6.2 to build throws exception `
> > java.lang.ClassNotFoundException:
> > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD`
> when
> > i
> > load data following Quick Start Guide.
> >
> > Env:
> > a. CarbonData-1.0.0-incubating-rc2
> > b. Spark-1.6.2
> > c. Hadoop-2.7.1
> > d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.
> >
> > any suggestions? Thank you.
> >
> > The exception stack trace as below:
> >
> > 
> > ERROR 14-02 12:21:02,005 - main generate global dictionary failed
> > org.apache.spark.SparkException: Job aborted due to stage failure: Task
> 0
> > in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
> > 0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
> > org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
> >  at
> > org.apache.spark.repl.ExecutorClassLoader.findClass(
> > ExecutorClassLoader.scala:84)
> >
> >  at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
> >  at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
> >  at java.lang.Class.forName0(Native Method)
> >  at java.lang.Class.forName(Class.java:274)
> >  at
> > org.apache.spark.serializer.JavaDeserializationStream$$
> > anon$1.resolveClass(JavaSerializer.scala:68)
> >
> >  at
> > java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
> >  at
> > java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
> >  at
> > java.io.ObjectInputStream.readOrdinaryObject(
> ObjectInputStream.java:1771)
> >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> java:1350)
> >  at
> > java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
> >  at
> > java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
> >  at
> > java.io.ObjectInputStream.readOrdinaryObject(
> ObjectInputStream.java:1798)
> >  at java.io.ObjectInputStream.readObject0(ObjectInputStream.
> java:1350)
> >  at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
> >  at
> > org.apache.spark.serializer.JavaDeserializationStream.
> > readObject(JavaSerializer.scala:76)
> >
> >  at
> > org.apache.spark.serializer.JavaSerializerInstance.
> > deserialize(JavaSerializer.scala:115)
> >
> >  at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:64)
> >  at
> > org.apache.spark.scheduler.ShuffleMapTask.runTask(
> ShuffleMapTask.scala:41)
> >  at org.apache.spark.scheduler.Task.run(Task.scala:89)
> >  at
> > org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
> >  at
> > java.util.concurrent.ThreadPoolExecutor.runWorker(
> > ThreadPoolExecutor.java:1145)
> >
> >

Exception throws when I load data using carbondata-1.0.0

2017-02-13 Thread Xiaoqiao He
 hi dev,

The latest release, apache-carbondata-1.0.0-incubating-rc2, built against
Spark-1.6.2, throws the exception
`java.lang.ClassNotFoundException:
org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD` when I
load data following the Quick Start Guide.

Env:
a. CarbonData-1.0.0-incubating-rc2
b. Spark-1.6.2
c. Hadoop-2.7.1
d. CarbonData on "Spark on YARN" Cluster and run yarn-client mode.

Any suggestions? Thank you.

The exception stack trace is as below:


ERROR 14-02 12:21:02,005 - main generate global dictionary failed
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage
0.0 (TID 3, nodemanger): java.lang.ClassNotFoundException:
org.apache.carbondata.spark.rdd.CarbonBlockDistinctValuesCombineRDD
 at
org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:84)

 at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
 at java.lang.Class.forName0(Native Method)
 at java.lang.Class.forName(Class.java:274)
 at
org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:68)

 at
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1612)
 at
java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1517)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1771)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1990)
 at
java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1915)
 at
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1798)
 at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1350)
 at java.io.ObjectInputStream.readObject(ObjectInputStream.java:370)
 at
org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)

 at
org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:115)

 at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:64)
 at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
 at org.apache.spark.scheduler.Task.run(Task.scala:89)
 at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
 at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

 at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)

 at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
 at 
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1431)

 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1419)

 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1418)

 at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)

 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1418)
 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)

 at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:799)

 at scala.Option.foreach(Option.scala:236)
 at
org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:799)

 at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1640)

 at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1599)

 at
org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1588)

 at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
 at
org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:620)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1832)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1845)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1858)
 at org.apache.spark.SparkContext.runJob(SparkContext.scala:1929)
 at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:927)
 at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)

 at
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)

 at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
 at org.apache.spark.rdd.RDD.collect(RDD.scala:926)
 at
org.apache.carbondata.spark.util.GlobalDictionaryUtil$.generateGlobalDictionary(GlobalDictionaryUtil.scala:742)

 at
org.apache.spark.sql.execution.command.LoadTable.run(carbonTableSchema.scala:577)

 at

Re: [ANNOUNCE] Apache CarbonData 1.0.0-incubating released

2017-02-05 Thread Xiaoqiao He
 Firstly, congratulations on the *Apache CarbonData 1.0.0-incubating* release,
and thanks for the great work.

Our tests of CarbonData 1.0.0-incubating found that this version is better in
availability, reliability, and performance than previous
ones. In particular, data loading performance improved significantly.

In addition, new features such as update/delete support, integration with
Spark 2.x, and the removal of Kettle from the data loading path are really
amazing. To improve the dictionary module's performance, I will keep working
on continuous improvement and optimization of the "Double Array Trie".

I am from MEITUAN, which is one of the biggest O2O internet companies in
China. The query scenarios we face are very complex and diverse, and
CarbonData matches some of them well, so we are doing thorough research and
plan to deploy CarbonData in our production environment.


On Mon, Jan 30, 2017 at 12:01 PM, Jacky Li  wrote:

> Hi All,
>
> The Apache CarbonData PMC team is happy to announce the release of Apache
> CarbonData version 1.0.0-incubating.
>
> Apache CarbonData (incubating) is an indexed columnar data format for fast
> analytics on big data platforms, e.g. Apache Hadoop, Apache Spark, etc.
>
> The release notes is available at:
> https://cwiki.apache.org/confluence/display/CARBONDATA/
> Apache+CarbonData+1.0.0-incubating
>
> The release artifacts are available at:
> https://www.apache.org/dyn/closer.lua/incubator/
> carbondata/1.0.0-incubating
>
> You can follow this document to use these artifacts:
> https://github.com/apache/incubator-carbondata/blob/
> master/docs/quick-start-guide.md
>
> You can find the latest CarbonData document and learn more at:
> http://carbondata.incubator.apache.org
>
> Thanks
> The Apache CarbonData team
>
> 
>
> DISCLAIMER
>
> Apache CarbonData is an effort undergoing incubation at the Apache
> Software Foundation (ASF), sponsored by the Apache Incubator PMC.
>
> Incubation is required of all newly accepted projects until a further
> review indicates that the infrastructure, communications, and decision
> making process have stabilized in a manner consistent with other
> successful ASF projects.
>
> While incubation status is not necessarily a reflection of the
> completeness or stability of the code, it does indicate that the
> project has yet to be fully endorsed by the ASF.
>


Re: [UT Fail Report] UT can not pass when run with branch master

2017-01-04 Thread Xiaoqiao He
I checked it and it works well:
1. pulled the master branch,
2. compiled and ran the UTs; all UTs pass.
Thanks for the timely fix.

On Thu, Jan 5, 2017 at 11:37 AM, Liang Chen  wrote:

> Hi
>
> It is fixed; the master branch now passes compilation. Thanks for pointing
> it out.
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > UT fails when run with branch master of carbondata (
> > https://github.com/apache/incubator-carbondata/tree/master).
> >
> > exception as following:
> >
> >> GrtLtFilterProcessorTestCase:
> >> *** RUN ABORTED ***
> >>   java.lang.Exception: DataLoad failure: Due to internal errors, please
> >> check logs for more details.
> >>   at
> >> org.apache.carbondata.spark.rdd.CarbonDataRDDFactory$.loadCarbonData(
> CarbonDataRDDFactory.scala:742)
> >>   at
> >> org.apache.spark.sql.execution.command.LoadTable.
> run(carbonTableSchema.scala:470)
> >>   at
> >> org.apache.spark.sql.execution.ExecutedCommand.
> sideEffectResult$lzycompute(commands.scala:57)
> >>   at
> >> org.apache.spark.sql.execution.ExecutedCommand.
> sideEffectResult(commands.scala:57)
> >>   at
> >> org.apache.spark.sql.execution.ExecutedCommand.
> doExecute(commands.scala:69)
> >>   at
> >> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$5.apply(SparkPlan.scala:140)
> >>   at
> >> org.apache.spark.sql.execution.SparkPlan$$anonfun$
> execute$5.apply(SparkPlan.scala:138)
> >>   at
> >> org.apache.spark.rdd.RDDOperationScope$.withScope(
> RDDOperationScope.scala:147)
> >>   at
> >> org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
> >>   at
> >> org.apache.spark.sql.SQLContext$QueryExecution.
> toRdd$lzycompute(SQLContext.scala:933)
> >>   ...
> >
> >
> > Branch: master
> > OS: Darwin Kernel Version 16.3.0
> > JRE: java version "1.8.0_91" Java HotSpot(TM) 64-Bit Server VM (build
> > 25.91-b14, mixed mode)
> > MAVEN: 3.2.5
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-
> mailing-list-archive.1130556.n5.nabble.com/UT-Fail-Report-
> UT-can-not-pass-when-run-with-branch-master-tp5530p5542.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>


[UT Fail Report] UT can not pass when run with branch master

2017-01-04 Thread Xiaoqiao He
The UTs fail when run with the master branch of CarbonData (
https://github.com/apache/incubator-carbondata/tree/master).

The exception is as follows:

> GrtLtFilterProcessorTestCase:
> *** RUN ABORTED ***
>   java.lang.Exception: DataLoad failure: Due to internal errors, please
> check logs for more details.
>   at
> org.apache.carbondata.spark.rdd.CarbonDataRDDFactory$.loadCarbonData(CarbonDataRDDFactory.scala:742)
>   at
> org.apache.spark.sql.execution.command.LoadTable.run(carbonTableSchema.scala:470)
>   at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult$lzycompute(commands.scala:57)
>   at
> org.apache.spark.sql.execution.ExecutedCommand.sideEffectResult(commands.scala:57)
>   at
> org.apache.spark.sql.execution.ExecutedCommand.doExecute(commands.scala:69)
>   at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:140)
>   at
> org.apache.spark.sql.execution.SparkPlan$$anonfun$execute$5.apply(SparkPlan.scala:138)
>   at
> org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
>   at org.apache.spark.sql.execution.SparkPlan.execute(SparkPlan.scala:138)
>   at
> org.apache.spark.sql.SQLContext$QueryExecution.toRdd$lzycompute(SQLContext.scala:933)
>   ...


Branch: master
OS: Darwin Kernel Version 16.3.0
JRE: java version "1.8.0_91" Java HotSpot(TM) 64-Bit Server VM (build
25.91-b14, mixed mode)
MAVEN: 3.2.5


Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-28 Thread Xiaoqiao He
Hi Jihong,

Thanks for your attention and reply.
1. Actually I have run the benchmark with English/Chinese dictionary sizes of
{100K, 200K, 300K, 400K, 500K, 600K} separately, and the results are basically
the same as mentioned earlier in this mail thread. I will publish the
benchmark code and dictionary source on GitHub
<https://github.com/Hexiaoqiao/bigdata_algorithm_benchmark> as soon as
possible.
2. I also noticed the license of DAT
<https://github.com/komiya-atsushi/darts-java>, and I think it is necessary
to re-implement another DAT following this paper:
https://linux.thai.net/~thep/datrie/datrie.html.
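
To make the structure concrete, here is a minimal sketch of the double-array
trie lookup described in that paper; the base/check convention and the leaf
map are illustrative only, not the darts-java implementation and not the code
used for the benchmark in this thread:

// base/check encode the transition table: from state s on input byte c,
// the next state is t = base(s) + c (+1 offset), valid iff check(t) == s.
final class DoubleArrayTrie(base: Array[Int], check: Array[Int], leaf: Map[Int, Int]) {

  /** Returns the surrogate key for `word`, or -1 if the word is absent. */
  def lookup(word: Array[Byte]): Int = {
    var state = 1 // conventional root state in this sketch
    var i = 0
    while (i < word.length) {
      val next = base(state) + (word(i) & 0xFF) + 1
      if (next >= check.length || check(next) != state) return -1
      state = next
      i += 1
    }
    leaf.getOrElse(state, -1) // surrogate key stored at the terminal state
  }
}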

All kinds of suggestions are welcomed.

Regards,
He Xiaoqiao


On Tue, Nov 29, 2016 at 5:17 AM, Jihong Ma <jihong...@huawei.com> wrote:

> Thank you Xiaoqiao for looking into this issue and sharing your result!
>
> Have you tried varied dictionary size for comparison among all the
> alternatives?
>
> And please pay closer attention to the license of DAT implementation, as
> they are under LGPL, generally speaking, it is not legally allowed to be
> included.
>
> Jihong
>
> -Original Message-
> From: Xiaoqiao He [mailto:xq.he2...@gmail.com]
> Sent: Friday, November 25, 2016 9:52 AM
> To: dev@carbondata.incubator.apache.org
> Subject: Re: [Improvement] Use Trie in place of HashMap to reduce memory
> footprint of Dictionary
>
> Hi Liang, Kumar Vishal,
>
> I has done a standard benchmark about multiply data structures for
> Dictionary following your suggestions. Based on the test results, I think
> DAT may be the best choice for CarbonData.
>
> *1. Here are 2 test results:*
> ---
> Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
>   HashMap :   java.util.HashMap
>   DAT (Double Array Trie):
> https://github.com/komiya-atsushi/darts-java
>   RadixTree:
> https://github.com/npgall/concurrent-trees
>   TrieDict (Dictionary in Kylin):
> http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> Dictionary Source (Traditional Chinese):
> https://raw.githubusercontent.com/fxsjy/jieba/master/extra_d
> ict/dict.txt.big
> Test Result
> a. Dictionary Size:584429
> 
> b. Build Time (ms) :
>DAT   : 5714
>HashMap   : 110
>RadixTree : 22044
>TrieDict  : 855
> 
> c. Memory footprint in 64-bit JVM (bytes) :
>DAT   : 16779752
>HashMap   : 32196592
>RadixTree : 46130584
>TrieDict  : 10443608
> 
> d. Retrieval Performance for 9935293 query times (ms) :
>DAT   : 585
>HashMap   : 1010
>RadixTree : 417639
>TrieDict  : 8664
> Test Result
>
> Test Result
> a. Dictionary Size:584429
> 
> b. Build Time (ms) :
>DAT   : 5867
>HashMap   : 100
>RadixTree : 22082
>TrieDict  : 840
> 
> c. Memory footprint in 64-bit JVM (bytes) :
>DAT   : 16779752
>HashMap   : 32196592
>RadixTree : 46130584
>TrieDict  : 10443608
> 
> d. Retrieval Performance for 9935293 query times (ms) :
>DAT   : 593
>HashMap   : 821
>RadixTree : 422297
>TrieDict  : 8752
> Test Result
>
> *2. Conclusion:*
> a. TrieDict is good for building tree and less memory footprint overhead,
> but worst retrieval performance,
> b. DAT is a good tradeoff between memory footprint and retrieval
> performance,
> c. RadixTree has the worst performance in different aspects.
>
> *3. Result Analysis:*
> a. With Trie the memory footprint of the TrieDict mapping is kinda
> minimized if compared to HashMap, in order to improve performance there is
> a cache layer overlays on top of Trie.
> b. Because a large number of duplicate prefix data, the total memory
> footprint is more than trie, meanwhile i think calculating string hash code
> of traditional Chinese consume considerable time overhead, so the
> performance is not the best.
> c. DAT is a better tradeoff.
> d. I have no idea why RadixTree has the worst performance in terms of
> memory, retrieval and building tree.
>
>
> On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <chenliang6...@gmail.com>
> wrote:
>
> > Hi xiaoqiao
> >
> > ok, look forward to seeing your test result.
> > Can you take this task for this improvement? Please let me know if you
> need
> > any support :)
> >
> > Regards
> > Liang
> >
> >
> > hexiaoqiao wrote
> > > Hi Kumar Vishal,
> > >
> > > Thanks for your suggestions. As you said, choose Trie 

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-27 Thread Xiaoqiao He
Hi Kumar Vishal,

I'll create a task to track this issue.
Thanks for your suggestions.

Regards,
He Xiaoqiao


On Sun, Nov 27, 2016 at 1:41 AM, Kumar Vishal <kumarvishal1...@gmail.com>
wrote:

> Hi Xiaoqiao He,
>
> You can go ahead with the DAT implementation, based on the results.
> I look forward to your PR.
>
> Please let me know if you need any support :).
>
> -Regards
> KUmar Vishal
>
> On Fri, Nov 25, 2016 at 11:22 PM, Xiaoqiao He <xq.he2...@gmail.com> wrote:
>
> > Hi Liang, Kumar Vishal,
> >
> > I has done a standard benchmark about multiply data structures for
> > Dictionary following your suggestions. Based on the test results, I think
> > DAT may be the best choice for CarbonData.
> >
> > *1. Here are 2 test results:*
> > ---
> > Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for
> Dictionary
> >   HashMap :   java.util.HashMap
> >   DAT (Double Array Trie):
> > https://github.com/komiya-atsushi/darts-java
> >   RadixTree:
> > https://github.com/npgall/concurrent-trees
> >   TrieDict (Dictionary in Kylin):
> > http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
> > Dictionary Source (Traditional Chinese):
> > https://raw.githubusercontent.com/fxsjy/jieba/master/extra_
> > dict/dict.txt.big
> > Test Result
> > a. Dictionary Size:584429
> > 
> > b. Build Time (ms) :
> >DAT   : 5714
> >HashMap   : 110
> >RadixTree : 22044
> >TrieDict  : 855
> > 
> > c. Memory footprint in 64-bit JVM (bytes) :
> >DAT   : 16779752
> >HashMap   : 32196592
> >RadixTree : 46130584
> >TrieDict  : 10443608
> > 
> > d. Retrieval Performance for 9935293 query times (ms) :
> >DAT   : 585
> >HashMap   : 1010
> >RadixTree : 417639
> >TrieDict  : 8664
> > Test Result
> >
> > Test Result
> > a. Dictionary Size:584429
> > 
> > b. Build Time (ms) :
> >DAT   : 5867
> >HashMap   : 100
> >RadixTree : 22082
> >TrieDict  : 840
> > 
> > c. Memory footprint in 64-bit JVM (bytes) :
> >DAT   : 16779752
> >HashMap   : 32196592
> >RadixTree : 46130584
> >TrieDict  : 10443608
> > 
> > d. Retrieval Performance for 9935293 query times (ms) :
> >DAT   : 593
> >HashMap   : 821
> >RadixTree : 422297
> >TrieDict  : 8752
> > Test Result
> >
> > *2. Conclusion:*
> > a. TrieDict is good for building tree and less memory footprint overhead,
> > but worst retrieval performance,
> > b. DAT is a good tradeoff between memory footprint and retrieval
> > performance,
> > c. RadixTree has the worst performance in different aspects.
> >
> > *3. Result Analysis:*
> > a. With Trie the memory footprint of the TrieDict mapping is kinda
> > minimized if compared to HashMap, in order to improve performance there
> is
> > a cache layer overlays on top of Trie.
> > b. Because a large number of duplicate prefix data, the total memory
> > footprint is more than trie, meanwhile i think calculating string hash
> code
> > of traditional Chinese consume considerable time overhead, so the
> > performance is not the best.
> > c. DAT is a better tradeoff.
> > d. I have no idea why RadixTree has the worst performance in terms of
> > memory, retrieval and building tree.
> >
> >
> > On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <chenliang6...@gmail.com>
> > wrote:
> >
> > > Hi xiaoqiao
> > >
> > > ok, look forward to seeing your test result.
> > > Can you take this task for this improvement? Please let me know if you
> > need
> > > any support :)
> > >
> > > Regards
> > > Liang
> > >
> > >
> > > hexiaoqiao wrote
> > > > Hi Kumar Vishal,
> > > >
> > > > Thanks for your suggestions. As you said, choose Trie replace HashMap
> > we
> > > > can get better memory footprint and also good performance. Of course,
> > DAT
> > > > is not only choice, and I will do test about DAT vs Radix Trie and
> > > release
> > > > the test result as soon as possible. Thanks your suggestions again.
> > > >
> > >

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-25 Thread Xiaoqiao He
Hi Liang, Kumar Vishal,

I have run a standard benchmark of multiple data structures for the
Dictionary, following your suggestions. Based on the test results, I think
DAT may be the best choice for CarbonData.

*1. Here are 2 test results:*
---
Benchmark about {HashMap,DAT,RadixTree,TrieDict} Structures for Dictionary
  HashMap :   java.util.HashMap
  DAT (Double Array Trie):
https://github.com/komiya-atsushi/darts-java
  RadixTree:
https://github.com/npgall/concurrent-trees
  TrieDict (Dictionary in Kylin):
http://kylin.apache.org/blog/2015/08/13/kylin-dictionary
Dictionary Source (Traditional Chinese):
https://raw.githubusercontent.com/fxsjy/jieba/master/extra_dict/dict.txt.big
Test Result (run 1):
a. Dictionary Size:584429

b. Build Time (ms) :
   DAT   : 5714
   HashMap   : 110
   RadixTree : 22044
   TrieDict  : 855

c. Memory footprint in 64-bit JVM (bytes) :
   DAT   : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608

d. Retrieval Performance for 9935293 query times (ms) :
   DAT   : 585
   HashMap   : 1010
   RadixTree : 417639
   TrieDict  : 8664

Test Result (run 2):
a. Dictionary Size:584429

b. Build Time (ms) :
   DAT   : 5867
   HashMap   : 100
   RadixTree : 22082
   TrieDict  : 840

c. Memory footprint in 64-bit JVM (bytes) :
   DAT   : 16779752
   HashMap   : 32196592
   RadixTree : 46130584
   TrieDict  : 10443608

d. Retrieval Performance for 9935293 query times (ms) :
   DAT   : 593
   HashMap   : 821
   RadixTree : 422297
   TrieDict  : 8752

*2. Conclusion:*
a. TrieDict is good for building the tree and has the lowest memory footprint
overhead, but the worst retrieval performance.
b. DAT is a good tradeoff between memory footprint and retrieval
performance.
c. RadixTree has the worst performance across the different aspects.

*3. Result Analysis:*
a. With a Trie, the memory footprint of the TrieDict mapping is more or less
minimized compared to the HashMap; to improve performance, a cache layer is
overlaid on top of the Trie.
b. Because of the large amount of duplicated prefix data, the HashMap's total
memory footprint is larger than the trie's; meanwhile, I think calculating
the string hash codes of Traditional Chinese terms consumes considerable
time, so its performance is not the best.
c. DAT is the better tradeoff.
d. I have no idea why RadixTree has the worst performance in terms of
memory, retrieval, and tree building.
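
For reference, a rough sketch of how the retrieval numbers above are measured;
the structure under test is abstracted behind a lookup function, since the
darts-java / Kylin APIs are not reproduced here, and the file parsing assumes
the dict.txt.big format (term first on each line):

import scala.io.Source

// Measure total lookup time over `queryTimes` queries drawn from the dictionary.
def benchmarkRetrieval(dictFile: String, queryTimes: Int)(lookup: String => Int): Long = {
  val words = Source.fromFile(dictFile, "UTF-8").getLines()
    .map(_.split("\\s+").head)
    .toArray

  val start = System.currentTimeMillis()
  var i = 0
  while (i < queryTimes) {
    lookup(words(i % words.length))
    i += 1
  }
  System.currentTimeMillis() - start
}

// HashMap baseline (dictionary value -> surrogate key):
// val dict: Map[String, Int] = Source.fromFile("dict.txt.big", "UTF-8").getLines()
//   .map(_.split("\\s+").head).zipWithIndex.toMap
// val elapsedMs = benchmarkRetrieval("dict.txt.big", 9935293)(w => dict.getOrElse(w, -1))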


On Fri, Nov 25, 2016 at 11:28 AM, Liang Chen <chenliang6...@gmail.com>
wrote:

> Hi xiaoqiao
>
> ok, look forward to seeing your test result.
> Can you take this task for this improvement? Please let me know if you need
> any support :)
>
> Regards
> Liang
>
>
> hexiaoqiao wrote
> > Hi Kumar Vishal,
> >
> > Thanks for your suggestions. As you said, choose Trie replace HashMap we
> > can get better memory footprint and also good performance. Of course, DAT
> > is not only choice, and I will do test about DAT vs Radix Trie and
> release
> > the test result as soon as possible. Thanks your suggestions again.
> >
> > Regards,
> > Xiaoqiao
> >
> > On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal 
>
> > kumarvishal1802@
>
> > 
> > wrote:
> >
> >> Hi XIaoqiao He,
> >> +1,
> >> For forward dictionary case it will be very good optimisation, as our
> >> case
> >> is very specific storing byte array to int mapping[data to surrogate key
> >> mapping], I think we will get much better memory footprint and
> >> performance
> >> will be also good(2x). We can also try radix tree(radix trie), it is
> more
> >> optimise for storage.
> >>
> >> -Regards
> >> Kumar Vishal
> >>
> >> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen 
>
> > chenliang6136@
>
> > 
> >> wrote:
> >>
> >> > Hi xiaoqiao
> >> >
> >> > For the below example, 600K dictionary data:
> >> > It is to say that using "DAT" can save 36M memory against
> >> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >> >
> >> > One more question:if increases the dictionary data size, what's the
> >> > comparison results "ConcurrentHashMap" VS "DAT"
> >> >
> >> > Regards
> >> > Liang
> >> > 
> >> > --
> >> > a. memory footprint (approxima

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-24 Thread Xiaoqiao He
Hi Kumar Vishal,

Thanks for your suggestions. As you said, by choosing a Trie to replace the
HashMap we can get a better memory footprint and still good performance. Of
course, DAT is not the only choice; I will run a test of DAT vs Radix Trie and
release the test results as soon as possible. Thanks again for your suggestions.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 4:48 PM, Kumar Vishal <kumarvishal1...@gmail.com>
wrote:

> Hi Xiaoqiao He,
> +1,
> For the forward dictionary case it will be a very good optimisation, as our
> case is very specific: storing a byte array to int mapping [data to surrogate
> key mapping]. I think we will get a much better memory footprint and
> performance will also be good (2x). We can also try a radix tree (radix
> trie); it is more optimised for storage.
>
> -Regards
> Kumar Vishal
>
> On Thu, Nov 24, 2016 at 12:12 PM, Liang Chen <chenliang6...@gmail.com>
> wrote:
>
> > Hi xiaoqiao
> >
> > For the below example, 600K dictionary data:
> > It is to say that using "DAT" can save 36M memory against
> > "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
> >
> > One more question:if increases the dictionary data size, what's the
> > comparison results "ConcurrentHashMap" VS "DAT"
> >
> > Regards
> > Liang
> > 
> > --
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards
> > Liang
> >
> > hexiaoqiao wrote
> > > hi Liang,
> > >
> > > Thanks for your reply, i need to correct the experiment result because
> > > it's
> > > wrong order NO.1 column of result data table.
> > >
> > > In order to compare performance between Trie and HashMap, Two different
> > > structures are constructed using the same dictionary data which size is
> > > 600K and each item's length is between 2 and 50 bytes.
> > >
> > > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > > Double
> > > Array Trie (one implementation of Trie Structures)
> > >
> > > a. memory footprint (approximate quantity) in 64-bit JVM:
> > > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> > >
> > > b. retrieval performance: total time(ms) of 500 million query:
> > > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> > >
> > > Regards,
> > > He Xiaoqiao
> > >
> > >
> > > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen 
> >
> > > chenliang6136@
> >
> > >  wrote:
> > >
> > >> Hi xiaoqiao
> > >>
> > >> This improvement looks great!
> > >> Can you please explain the below data, what does it mean?
> > >> --
> > >> ConcurrentHashMap
> > >> ~68MB 14543
> > >> Double Array Trie
> > >> ~104MB 12825
> > >>
> > >> Regards
> > >> Liang
> > >>
> > >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He 
> >
> > > xq.he2009@
> >
> > > :
> > >>
> > >> >  Hi All,
> > >> >
> > >> > I would like to propose Dictionary improvement which using Trie in
> > >> place
> > >> of
> > >> > HashMap.
> > >> >
> > >> > In order to speedup aggregation, reduce run-time memory footprint,
> > >> enable
> > >> > fast
> > >> > distinct count etc, CarbonData encodes data using dictionary at file
> > >> level
> > >> > or table level based on cardinality. It is a general and efficient
> way
> > >> in
> > >> > many big data systems, but when apply ConcurrentHashMap
> > >> > to maintain Dictionary in CarbonData currently, memory overhead of
> > >> > Driver is very huge since it has to load whole Dictionary to decode
> > >> actual
> > >> > data value, especially column cardinality is a large number. and
> > >> CarbonData
> > >> > will not do dictionary if cardinality > 1 million at default
> behavior.
> > >> >
> > >> > I propose using Trie in place of HashMap for the following three
> > >> reasons:
> > >> > (1) Trie is a proper structure for 

Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-24 Thread Xiaoqiao He
Hi Liang,

Generally, yes: the shared prefixes of dictionary items do not need to be
repeated in the DAT, so the more data there is, the better the result.

Actually the main cost of DAT is building the tree, and I don't think we need
to worry about that, since this cost is paid only once, when the data is loaded.

FYI.

Regards,
Xiaoqiao

On Thu, Nov 24, 2016 at 2:42 PM, Liang Chen <chenliang6...@gmail.com> wrote:

> Hi xiaoqiao
>
> For the below example, 600K dictionary data:
> It is to say that using "DAT" can save 36M memory against
> "ConcurrentHashMap", whereas the performance just lost less (1718ms) ?
>
> One more question:if increases the dictionary data size, what's the
> comparison results "ConcurrentHashMap" VS "DAT"
>
> Regards
> Liang
> 
> --
> a. memory footprint (approximate quantity) in 64-bit JVM:
> ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
>
> b. retrieval performance: total time(ms) of 500 million query:
> 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
>
> Regards
> Liang
>
> hexiaoqiao wrote
> > hi Liang,
> >
> > Thanks for your reply, i need to correct the experiment result because
> > it's
> > wrong order NO.1 column of result data table.
> >
> > In order to compare performance between Trie and HashMap, Two different
> > structures are constructed using the same dictionary data which size is
> > 600K and each item's length is between 2 and 50 bytes.
> >
> > ConcurrentHashMap (structure which is used in CarbonData currently) vs
> > Double
> > Array Trie (one implementation of Trie Structures)
> >
> > a. memory footprint (approximate quantity) in 64-bit JVM:
> > ~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)
> >
> > b. retrieval performance: total time(ms) of 500 million query:
> > 12825 ms(*ConcurrentHashMap*) vs 14543 ms(*DAT*)
> >
> > Regards,
> > He Xiaoqiao
> >
> >
> > On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen 
>
> > chenliang6136@
>
> >  wrote:
> >
> >> Hi xiaoqiao
> >>
> >> This improvement looks great!
> >> Can you please explain the below data, what does it mean?
> >> --
> >> ConcurrentHashMap
> >> ~68MB 14543
> >> Double Array Trie
> >> ~104MB 12825
> >>
> >> Regards
> >> Liang
> >>
> >> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He 
>
> > xq.he2009@
>
> > :
> >>
> >> >  Hi All,
> >> >
> >> > I would like to propose Dictionary improvement which using Trie in
> >> place
> >> of
> >> > HashMap.
> >> >
> >> > In order to speedup aggregation, reduce run-time memory footprint,
> >> enable
> >> > fast
> >> > distinct count etc, CarbonData encodes data using dictionary at file
> >> level
> >> > or table level based on cardinality. It is a general and efficient way
> >> in
> >> > many big data systems, but when apply ConcurrentHashMap
> >> > to maintain Dictionary in CarbonData currently, memory overhead of
> >> > Driver is very huge since it has to load whole Dictionary to decode
> >> actual
> >> > data value, especially column cardinality is a large number. and
> >> CarbonData
> >> > will not do dictionary if cardinality > 1 million at default behavior.
> >> >
> >> > I propose using Trie in place of HashMap for the following three
> >> reasons:
> >> > (1) Trie is a proper structure for Dictionary,
> >> > (2) Reduce memory footprint,
> >> > (3) Not impact retrieval performance
> >> >
> >> > The experimental results show that Trie is able to meet the
> >> requirement.
> >> > a. ConcurrentHashMap vs Double Array Trie
> >> > https://linux.thai.net/~thep/datrie/datrie.html;(one
> >> implementation of
> >> > Trie Structures)
> >> > b. Dictionary size: 600K
> >> > c. Memory footprint and query time
> >> > - memory footprint (64-bit JVM) 500 million query time(ms)
> >> > ConcurrentHashMap
> >> > ~68MB 14543
> >> > Double Array Trie
> >> > ~104MB 12825
> >> >
> >> > Please share your suggestions about the proposed improvement of
> >> Dictionary.
> >> >
> >> > Regards
> >> > He Xiaoqiao
> >> >
> >>
> >>
> >>
> >> --
> >> Regards
> >> Liang
> >>
>
>
>
>
>
> --
> View this message in context: http://apache-carbondata-maili
> ng-list-archive.1130556.n5.nabble.com/Improvement-Use-Trie-
> in-place-of-HashMap-to-reduce-memory-footprint-of-Dictionary
> -tp3132p3143.html
> Sent from the Apache CarbonData Mailing List archive mailing list archive
> at Nabble.com.
>


Re: [Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-23 Thread Xiaoqiao He
hi Liang,

Thanks for your reply. I need to correct the experiment results, because the
rows in the first column of the result table were in the wrong order.

To compare the performance of the Trie and the HashMap, two different
structures were built from the same dictionary data, which contains 600K
items, each between 2 and 50 bytes long.

ConcurrentHashMap (the structure currently used in CarbonData) vs Double
Array Trie (one implementation of Trie structures):

a. memory footprint (approximate) in a 64-bit JVM:
~104MB (*ConcurrentHashMap*) vs ~68MB (*DAT*)

b. retrieval performance: total time (ms) of 500 million queries:
12825 ms (*ConcurrentHashMap*) vs 14543 ms (*DAT*)

Regards,
He Xiaoqiao


On Thu, Nov 24, 2016 at 7:48 AM, Liang Chen <chenliang6...@gmail.com> wrote:

> Hi xiaoqiao
>
> This improvement looks great!
> Can you please explain the below data, what does it mean?
> --
> ConcurrentHashMap
> ~68MB 14543
> Double Array Trie
> ~104MB 12825
>
> Regards
> Liang
>
> 2016-11-24 2:04 GMT+08:00 Xiaoqiao He <xq.he2...@gmail.com>:
>
> >  Hi All,
> >
> > I would like to propose Dictionary improvement which using Trie in place
> of
> > HashMap.
> >
> > In order to speedup aggregation, reduce run-time memory footprint, enable
> > fast
> > distinct count etc, CarbonData encodes data using dictionary at file
> level
> > or table level based on cardinality. It is a general and efficient way in
> > many big data systems, but when apply ConcurrentHashMap
> > to maintain Dictionary in CarbonData currently, memory overhead of
> > Driver is very huge since it has to load whole Dictionary to decode
> actual
> > data value, especially column cardinality is a large number. and
> CarbonData
> > will not do dictionary if cardinality > 1 million at default behavior.
> >
> > I propose using Trie in place of HashMap for the following three reasons:
> > (1) Trie is a proper structure for Dictionary,
> > (2) Reduce memory footprint,
> > (3) Not impact retrieval performance
> >
> > The experimental results show that Trie is able to meet the requirement.
> > a. ConcurrentHashMap vs Double Array Trie
> > <https://linux.thai.net/~thep/datrie/datrie.html>(one implementation of
> > Trie Structures)
> > b. Dictionary size: 600K
> > c. Memory footprint and query time
> > - memory footprint (64-bit JVM) 500 million query time(ms)
> > ConcurrentHashMap
> > ~68MB 14543
> > Double Array Trie
> > ~104MB 12825
> >
> > Please share your suggestions about the proposed improvement of
> Dictionary.
> >
> > Regards
> > He Xiaoqiao
> >
>
>
>
> --
> Regards
> Liang
>


[Improvement] Use Trie in place of HashMap to reduce memory footprint of Dictionary

2016-11-23 Thread Xiaoqiao He
 Hi All,

I would like to propose a Dictionary improvement: using a Trie in place of a
HashMap.

To speed up aggregation, reduce run-time memory footprint, enable fast
distinct count, etc., CarbonData encodes data using a dictionary at the file
level or table level based on cardinality. This is a general and efficient
approach in many big data systems, but with the ConcurrentHashMap currently
used to maintain the Dictionary in CarbonData, the memory overhead of the
Driver is very high, since it has to load the whole Dictionary to decode
actual data values, especially when column cardinality is large. By default,
CarbonData does not build a dictionary at all if cardinality > 1 million.
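
For context, here is a minimal sketch of the forward-dictionary shape being
discussed (dictionary value -> surrogate key for encoding, plus the reverse
mapping for decoding); it only illustrates why driver memory grows with
cardinality, and is not CarbonData's actual Dictionary interface (which keeps
this mapping in a ConcurrentHashMap):

import scala.collection.mutable

final class SimpleForwardDictionary {
  // Every distinct column value is held in memory as a key,
  // so the footprint grows with column cardinality.
  private val valueToKey = mutable.HashMap.empty[String, Int]
  private val keyToValue = mutable.ArrayBuffer.empty[String]

  /** Encode: return the surrogate key, assigning a new one for unseen values. */
  def getOrAssign(value: String): Int = synchronized {
    valueToKey.getOrElseUpdate(value, { keyToValue += value; keyToValue.length - 1 })
  }

  /** Decode: map a surrogate key back to the actual column value. */
  def getValue(surrogateKey: Int): String = keyToValue(surrogateKey)
}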

I propose using Trie in place of HashMap for the following three reasons:
(1) Trie is a proper structure for Dictionary,
(2) Reduce memory footprint,
(3) Not impact retrieval performance

The experimental results show that Trie is able to meet the requirement.
a. ConcurrentHashMap vs Double Array Trie (one implementation of Trie structures)
b. Dictionary size: 600K
c. Memory footprint and query time:

                      memory footprint (64-bit JVM)   500 million query time (ms)
   ConcurrentHashMap   ~68MB                           14543
   Double Array Trie   ~104MB                          12825

Please share your suggestions about the proposed improvement of Dictionary.

Regards
He Xiaoqiao


Re: [Feature ]Design Document for Update/Delete support in CarbonData

2016-11-20 Thread Xiaoqiao He
hi Aniket Adnaik,

It is a great design document about update/delete, and a very useful feature
for CarbonData.

For the solution you proposed, I think the most difficult challenge is
compaction. Without careful attention, rewriting data over and over can lead
to serious network and disk over-subscription. In other words, compaction is
about trading some disk IO now for fewer seeks later; HBase and LevelDB face
the same issue.

The following compaction solutions for LevelDB/HBase could serve as references
for the detailed design. FYI.

   - FIFO Compaction (HBASE-14468)
   - Tier-Based Compaction (HBASE-7055, HBASE-14477)
   - Level Compaction (LevelDB implementation notes) / Stripe Compaction (HBASE-7667)
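
Just to illustrate the trade-off (some write IO now in exchange for fewer
files and seeks later), a rough sketch of a size-tiered selection rule; the
threshold and the selection logic are made up for illustration and are not
CarbonData's or HBase's actual policy:

final case class Segment(id: String, sizeInBytes: Long)

// Pick a leading run of small segments to merge once enough of them accumulate.
def pickMinorCompactionCandidates(segments: Seq[Segment],
                                  smallSegmentThreshold: Long,
                                  minFilesToCompact: Int): Seq[Segment] = {
  val smallRun = segments
    .sortBy(_.id)                                       // keep segments in creation order
    .takeWhile(_.sizeInBytes <= smallSegmentThreshold)  // only consider small segments
  if (smallRun.size >= minFilesToCompact) smallRun else Seq.empty
}

// e.g. pickMinorCompactionCandidates(segments, smallSegmentThreshold = 64L << 20, minFilesToCompact = 4)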

Please correct me if I am wrong.

Regards,
He Xiaoqiao


On Sun, Nov 20, 2016 at 11:54 PM, Aniket Adnaik 
wrote:

> Hi All,
>
> Please find a design doc for Update/Delete support in CarbonData.
>
> https://drive.google.com/file/d/0B71_EuXTdDi8S2dxVjN6Z1RhWlU/view?
> usp=sharing
>
> Best Regards,
> Aniket
>


Re: [Feature] proposal for update and delete support in Carbon data

2016-11-15 Thread Xiaoqiao He
hi Vinod,

This is a feature many people have been expecting, as Jacky mentioned. I think
Update/Delete should be a basic module for CarbonData; at the same time, it is
a complex problem for a distributed storage system. The solution you proposed
is based on the traditional 'Base + Delta' approach, which has been applied
successfully in Bigtable/HBase/Kudu/etc. Following your proposed solution for
CarbonData, I have some questions, in addition to the transaction and index
doubts Jacky mentioned:

1. How do we trade off the IO overhead of adding delta files? I think there
may be two query approaches for delta files: (1) load the whole delta data and
replace the base query result whenever a row also exists in the delta file; in
this case it may increase the IO overhead that CarbonData tries hard to reduce
(a minimal sketch of this approach follows this list); (2) build a separate
index over all delta files, or label delta records and upgrade the file
format. Right?
2. When and how do we run minor/major compaction on (base + delta) or (delta +
delta)?
3. Are there any issues with updating or deleting a Directory item?
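
A minimal sketch of query approach (1) from item 1 above, applying a delete
delta (set of deleted row ids) and an update delta (row id -> replacement row)
on top of the base rows of one segment; the names and shapes are illustrative
only, not the format proposed in the design document:

final case class Row(rowId: Long, values: Map[String, String])

def readWithDeltas(baseRows: Iterator[Row],
                   deletedRowIds: Set[Long],
                   updatedRows: Map[Long, Row]): Iterator[Row] = {
  baseRows
    .filterNot(r => deletedRowIds.contains(r.rowId))  // delete delta: drop deleted rows
    .map(r => updatedRows.getOrElse(r.rowId, r))      // update delta: use the newer version if present
}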

I look forward to the detailed design of your solution.

Please correct me if I am wrong.

Best Regards,
He Xiaoqiao


On Tue, Nov 15, 2016 at 5:39 PM, Jacky Li  wrote:

> Hi Vinod,
>
> It is great to have this feature, as there were many people asking for
> data update during the CarbonData meetup earlier. I believe it will be
> useful for many big data applications.
>
> For the solution you proposed, I have following doubts:
> 1. Data update is complex as if transaction is involved, so what kind of
> ACID level support are you thinking about?
> 2. If I understand correctly, you are proposing to do data update via base
> + delta file approach, right? So in this case, new file format needs to be
> added in CarbonData project.
> 3. As CarbonData has builtin support for index, any idea what is the
> impact on the B-tree index already in driver and executor memory?
>
> Regards,
> Jacky
>
> > 在 2016年11月15日,下午12:25,Vinod KC  写道:
> >
> > Hi All
> > I would like to propose following new features in Carbon data
> > 1) Update statement to support modifying existing records in carbon data
> > table
> > 2) Delete statement to remove records from carbon data table
> >
> > A) Update operation: 'Update' features can be added to CarbonData using
> > intermediate Delta files [delete/update delta files] support with lesser
> > impact on existing code.
> > Update can be considered as a ‘delete’ followed by an‘insert’ operation.
> > Once an update is done on carbon data file, on select query operation,
> > Carbondata store reader can make use of delete delta data cache to
> exclude
> > deleted records in that segment and then include records from newly added
> > update delta files.
> >
> > B) Delete operation: In the case of delete operation, a delete delta file
> > will be added to each segment matching the records. During select query
> > operation Carbon data reader will exclude those deleted records from the
> > result set.
> >
> > Please share your suggestions and thoughts about design and functional
> > aspects on this feature. I’ll share a detailed design document about
> above
> > thoughts later.
> >
> > Regards
> > Vinod
>
>
>
>


Re: [Discussion] Please vote and comment for carbon data file format change

2016-11-01 Thread Xiaoqiao He
Hi Kumar Vishal,

I couldn't get the figures of the file format; could you re-upload them?
Thanks.

Best Regards

On Tue, Nov 1, 2016 at 7:12 PM, Kumar Vishal 
wrote:

>
> Hello All,
>
> Improving CarbonData first-time query performance
>
> Reason:
> 1. When the file system cache is cleared, reading the files again is slower
> until they are cached.
> 2. For the first query, Carbon has to read the footer from the data file to
> build the B-tree.
> 3. Carbon reads more footer data than it requires (data chunk).
> 4. There are lots of random seeks in Carbon, as column data (data page, RLE,
> inverted index) are not stored together.
>
> Solution:
> 1. Improve block loading time. This can be done by removing the data chunk
> from blockletInfo and storing only the offset and length of the data chunk.
> 2. Compress the presence-meta bitset stored for null values of measure
> columns using Snappy.
> 3. Store the metadata and data of a column together and read them together;
> this reduces random seeks and improves IO.
>
> For this I am planning to change the CarbonData thrift format.
>
> *Old format* [figure attachment not preserved in the archive]
>
> *New format* [figure attachment not preserved in the archive]
>
> Please vote and comment for this new format change
>
> -Regards
> Kumar Vishal
>
>
>
>


Re: Beijing Apache CarbonData meetup: https://www.meetup.com/Apache-Carbondata-Meetup/events/235013117/

2016-10-31 Thread Xiaoqiao He
This meetup is really interesting; it helps in understanding the architecture
and some details of CarbonData:
1. What the advantages of CarbonData are and which scenarios it is good for;
2. How CarbonData reaches its goal, including architecture and implementation,
some differences between CarbonData and ORC/Parquet/HBase/Kylin etc., and the
integration with Hadoop & Spark;
3. Future and roadmap.
I think the slides shared by C. Liang and L. Kun are a good choice for
getting started and going deeper.


On Fri, Oct 21, 2016 at 3:18 PM, Liang Chen  wrote:

> Hi all
>
> Saturday, October 29, 2016 1:30 PM to 5:30 PM
> You can apply through this link :
> https://www.meetup.com/Apache-Carbondata-Meetup/events/235013117/
>
>
> Regards
> Liang
>