The referenced benchmark is in Chinese. Please provide an English version
so that more people can understand it.

For item 7, it looks like the ingest speed is much slower compared to
using Parquet.

Cheers

On Mon, Feb 22, 2016 at 6:12 AM, 开心延年 <muyann...@qq.com> wrote:

> 1. ya100 is not only the inverted index; it also includes TOP N sort with
> lazy read, and labels.
> 2. Our tests of ya100 against Parquet are at this link:
> https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100%E6%80%A7%E8%83%BD%E6%B5%8B%E8%AF%95%E6%8A%A5%E5%91%8A.docx?raw=true
>
> 3. You are right, loading into ya100 takes longer than Parquet, but the
> query time is very fast.
> 4. The code is already available, but I'm sorry, ya100 and ydb are now
> commercial products of our company (called ycloud); but I think if Apache
> likes it, the source code is not a problem.
>
> Our inverted index is indeed Apache Lucene, but with quite a different
> usage than in Solr or ES.
> We can run a test on ya100 (we provide a Java jar); it is really much
> faster than Parquet, especially on sort, group by, filter, and so on.
>
>
> Provide more information? What other information do you want? I can
> provide it tomorrow.
>
> ------------------ Original Message ------------------
> *From:* "Gavin Yue";<yue.yuany...@gmail.com>;
> *Sent:* Monday, February 22, 2016, 9:33 PM
> *To:* "开心延年"<muyann...@qq.com>;
> *Cc:* "Akhil Das"<ak...@sigmoidanalytics.com>; "user"<
> u...@spark.apache.org>; "dev"<dev@spark.apache.org>;
> *Subject:* Re: Reply: a new FileFormat 5x~100x faster than parquet
>
> I recommend you provide more information. Using an inverted index
> certainly speeds up query time when the index is hit, but it takes longer
> to create and insert.
>
> Is the source code not available at this moment?
>
> Thanks
> Gavin
>
> On Feb 22, 2016, at 20:27, 开心延年 <muyann...@qq.com> wrote:
>
> If Apache likes this project, of course we will provide the source code.
>
> But if Apache dislikes the project, we will continue to improve it
> ourselves.
>
> ya100 and ydb process up to 180 billion rows of data per day for near
> real-time import.
>
> Because of the index, a search over 1800 billion rows (10 days of data)
> returns in 10 seconds.
>
>
> ------------------ Original Message ------------------
> *From:* "Akhil Das";<ak...@sigmoidanalytics.com>;
> *Sent:* Monday, February 22, 2016, 8:42 PM
> *To:* "开心延年"<muyann...@qq.com>;
> *Cc:* "user"<u...@spark.apache.org>; "dev"<dev@spark.apache.org>;
> *Subject:* Re: a new FileFormat 5x~100x faster than parquet
>
> Would be good to see the source code and the documentation in English.
>
> Thanks
> Best Regards
>
> On Mon, Feb 22, 2016 at 4:44 PM, 开心延年 <muyann...@qq.com> wrote:
>
>> Ya100 is a FileFormat 5x~100x faster than Parquet.
>> You can get ya100 from this link:
>> https://github.com/ycloudnet/ya100/tree/master/v1.0.8
>>
>>
>>
>> 1. We use an inverted index, so we skip the rows we don't need.
>>
>>   For example, this trade-log search SQL:
>>
>>         select
>>
>> (1)
>> phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
>>
>>         from spark_txt where
>>
>> (2) tradeid='2014012213870282671'
>>
>>         limit 10;
>>
>>      This SQL is composed of two parts:
>>
>>      (1) part 1 returns the result, which has 9 columns
>>
>>      (2) part 2 is the filter condition, filtering by tradeid
>>
>>      Guess which plan is faster:
>>
>>      plan A: first read all 9 columns, then filter by tradeid
>>
>>      plan B: first filter by tradeid, then read the 9 columns only for
>> the matching rows
>>
>>     Ya100 chooses plan B.
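>>
>>      Below is a minimal toy sketch in Java of the plan B idea (this is
>> not the actual ya100 code; the class, data, and column layout are made up
>> for illustration):
>>
>> import java.util.*;
>>
>> public class PlanBDemo {
>>     public static void main(String[] args) {
>>         // toy column store: column name -> values indexed by row id
>>         Map<String, String[]> columns = new HashMap<>();
>>         columns.put("tradeid",  new String[]{"t1", "t2", "t1", "t3"});
>>         columns.put("phonenum", new String[]{"p0", "p1", "p2", "p3"});
>>
>>         // inverted index over tradeid: term -> posting list of row ids
>>         Map<String, List<Integer>> index = new HashMap<>();
>>         String[] tradeids = columns.get("tradeid");
>>         for (int row = 0; row < tradeids.length; row++) {
>>             index.computeIfAbsent(tradeids[row],
>>                     k -> new ArrayList<>()).add(row);
>>         }
>>
>>         // plan B: consult the index first, then read the other columns
>>         // only for the matching row ids (rows 0 and 2 here)
>>         for (int row : index.getOrDefault("t1",
>>                 Collections.emptyList())) {
>>             System.out.println(columns.get("phonenum")[row]);
>>         }
>>     }
>> }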
>>
>>
>>      Performance comparison of Ya100's index with Parquet:
>>
>>
>>
>>
>>
>> 2. TOP N sort: we don't read the non-sort columns until the very end.
>>
>>   For example, we sort by logtime:
>>
>>     select
>>
>> (1)
>> phonenum,usernick,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_day,amtlong
>>
>>     from spark_txt
>>
>> (2) order by logtime desc
>>
>>     limit 10;
>>
>>   This SQL is composed of two parts:
>>
>>      (1) part 1 returns the result, which has 9 columns
>>
>>      (2) part 2 is the column to sort by
>>
>>      Guess which plan is faster:
>>
>>      plan A: first read all 9 columns, then sort by logtime
>>
>>      plan B: first sort by logtime, then read the 9 columns only for the
>> matching rows
>>
>>        Ya100 chooses plan B.
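>>
>>        A toy Java sketch of this lazy-read idea (again, not the real
>> ya100 implementation; the arrays stand in for on-disk columns):
>>
>> import java.util.*;
>>
>> public class LazyTopN {
>>     public static void main(String[] args) {
>>         long[] logtime = {5L, 9L, 1L, 7L, 3L};          // sort column
>>         String[] usernick = {"a", "b", "c", "d", "e"};  // non-sort column
>>
>>         // phase 1: rank row ids by logtime desc; no other column is read
>>         Integer[] rowIds = {0, 1, 2, 3, 4};
>>         Arrays.sort(rowIds, (x, y) -> Long.compare(logtime[y], logtime[x]));
>>
>>         // phase 2: lazily read the remaining columns, but only for the
>>         // top N rows that survived the sort
>>         int n = Math.min(2, rowIds.length);
>>         for (int i = 0; i < n; i++) {
>>             int row = rowIds[i];
>>             System.out.println(logtime[row] + " " + usernick[row]);
>>         }
>>     }
>> }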
>>
>>
>>      Performance comparison of Ya100's lazy read with Parquet:
>>
>> 3. We use labels instead of the original values for grouping and sorting.
>>
>>
>> 1) In general, the data has many repeated values, for example the sex
>> field or the age field.
>> 2) If we stored the original values, that would waste a lot of storage,
>> so we make a small modification: alongside the original value we add a
>> new field called the label.
>> We sort the unique values of a field, then give each term a unique
>> number from beginning to end.
>> 3) We use the number value (which we call the label) instead of the
>> original value. Labels are stored with a fixed length, so the file can be
>> read randomly.
>> 4) The label order is the same as dictionary order, so a calculation
>> like order by or group by only needs to read the label; we don't need to
>> read the original value.
>> 5) Some fields, like the sex field, have only 2 distinct values, so we
>> use only 2 bits (not 2 bytes) to store the label, which saves a lot of
>> disk I/O. When all of the calculation is finished, we translate labels
>> back to original values through a dictionary.
>> 6) If many rows share the same original value, we store and read that
>> original value only once.
>> Problems this solves (a toy sketch follows below):
>> 1) ya100's data is quite big; we don't have enough memory to load all
>> values into memory.
>> 2) In real-time mode, data changes frequently; a cache is invalidated
>> frequently by appends or updates, and rebuilding it takes a lot of time
>> and I/O.
>> 3) The original value is a string type; when sorting or grouping,
>> string values need a lot of memory and a lot of CPU time to compute
>> hashcode/compare/equals, but a label is a number, which is fast.
>> 4) The label is a number; its type may be short, byte, or integer,
>> depending on the maximum label number.
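>>
>> A toy Java sketch of the label idea (not the real ya100 code): sort the
>> distinct values, number them, store only fixed-width numbers, and pack a
>> 2-value field into 2 bits per row.
>>
>> import java.util.*;
>>
>> public class LabelEncoding {
>>     public static void main(String[] args) {
>>         String[] sex = {"F", "M", "F", "F", "M", "M", "F", "M"};
>>
>>         // dictionary: distinct values in sorted order -> label number
>>         List<String> dict = new ArrayList<>(new TreeSet<>(Arrays.asList(sex)));
>>         Map<String, Integer> labelOf = new HashMap<>();
>>         for (int i = 0; i < dict.size(); i++) labelOf.put(dict.get(i), i);
>>
>>         // pack 2-bit labels: 4 rows per byte instead of a string per row
>>         byte[] packed = new byte[(sex.length + 3) / 4];
>>         for (int row = 0; row < sex.length; row++) {
>>             packed[row / 4] |= labelOf.get(sex[row]) << ((row % 4) * 2);
>>         }
>>
>>         // random read of row 5: unpack the label, then consult the
>>         // dictionary only once the original value is really needed
>>         int label = (packed[5 / 4] >> ((5 % 4) * 2)) & 0b11;
>>         System.out.println(dict.get(label));   // prints "M"
>>     }
>> }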
>>
>> Two-phase search
>> Original approach:
>> 1) group by / order by use the original value; the real value may be a
>> string type and much larger, so reading it may need a lot of I/O
>> 2) comparing strings is slower than comparing integers
>> Our improvement:
>> 1) we split one search into a multi-phase search
>> 2) the first phase searches only the fields used for order by / group by
>> 3) the first phase doesn't need to read the original (real) values; it
>> only needs to read the docid and label for order by / group by
>> 4) when all of the order by and group by is finished, we may only need
>> to return the top N records, so we start a second search to get the
>> original values of the top N records
>> Problems this solves (a toy sketch follows below):
>> 5) reduced I/O: reading original values takes a lot of disk I/O
>> 6) reduced network I/O (for the merger)
>> 7) most fields have repeated values; a repeated value only needs to be
>> read once, and a group by field only needs to read the original value
>> once per label when displaying to the user
>> 8) most searches only need to display the top N (N<=100) results, so
>> most of the original values can be skipped
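>>
>> A toy Java sketch of two-phase search (not the real ya100 code): phase 1
>> groups on small integer labels only; phase 2 resolves original strings
>> just for the labels that survive into the result.
>>
>> import java.util.*;
>>
>> public class TwoPhaseSearch {
>>     public static void main(String[] args) {
>>         int[] provinceLabel = {2, 0, 1, 2, 0, 2};            // label column
>>         String[] dict = {"Beijing", "Liaoning", "Shanghai"}; // label -> value
>>
>>         // phase 1: group by label, comparing cheap ints, not strings
>>         Map<Integer, Integer> countByLabel = new TreeMap<>();
>>         for (int l : provinceLabel) countByLabel.merge(l, 1, Integer::sum);
>>
>>         // phase 2: translate only the surviving labels back to strings;
>>         // each repeated original value is read from the dictionary once
>>         countByLabel.forEach((label, cnt) ->
>>             System.out.println(dict[label] + " = " + cnt));
>>     }
>> }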
>>
>>
>> How to install ya100
>>
>> 1) Add the dependency jar:
>>
>> export SPARK_CLASSPATH=$SPARK_CLASSPATH:/data/ycloud/spark_ydb/ya100.jar
>>
>> You can get the ya100 jar from this link
>>
>>
>> https://github.com/ycloudnet/ya100/blob/master/v1.0.8/ya100-1.0.8.jar
>>
>>
>> 2) Start the thriftserver:
>>
>> ./start-thriftserver.sh     --master yarn-client  --executor-memory 990m
>> --executor-cores 2 --num-executors 16
>>
>> We suggest Spark version 1.6.
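>>
>> Once the thriftserver is up, here is a minimal Java sketch of querying
>> it over Hive JDBC, using the example table and filter shown later in this
>> message (host, port, and credentials are assumptions; 10000 is the
>> default thriftserver port, and hive-jdbc must be on the classpath):
>>
>> import java.sql.*;
>>
>> public class Ya100JdbcDemo {
>>     public static void main(String[] args) throws Exception {
>>         Class.forName("org.apache.hive.jdbc.HiveDriver");
>>         try (Connection conn = DriverManager.getConnection(
>>                  "jdbc:hive2://localhost:10000/default", "", "");
>>              Statement stmt = conn.createStatement()) {
>>             // push a ya100 filter down before running the query
>>             stmt.execute("set ya100.spark.filter.ydb_example_shu="
>>                        + "ydb_day>='20151217'");
>>             try (ResultSet rs = stmt.executeQuery(
>>                      "select * from spark_ya100 where dt='1200million'"
>>                    + " limit 10")) {
>>                 while (rs.next()) {
>>                     System.out.println(rs.getString(1));
>>                 }
>>             }
>>         }
>>     }
>> }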
>>
>> 3) Register the ya100 functions:
>>
>> create  function Yfilter as 'cn.net.ycloud.ydb.handle.fun.Yfilter';
>>
>> create  function Ytop10000 as 'cn.net.ycloud.ydb.handle.fun.Ytop10000';
>>
>> create  function Ycombine as 'cn.net.ycloud.ydb.handle.fun.Ycombine';
>>
>> create  function Ycount as 'cn.net.ycloud.ydb.handle.fun.Ycount';
>>
>> create  function Ymax as 'cn.net.ycloud.ydb.handle.fun.Ymax';
>>
>> create  function Ymin as 'cn.net.ycloud.ydb.handle.fun.Ymin';
>>
>> create  function Yavg as 'cn.net.ycloud.ydb.handle.fun.Yavg';
>>
>> create  function Ysum as 'cn.net.ycloud.ydb.handle.fun.Ysum';
>>
>> create  function Ymaxstring as 'cn.net.ycloud.ydb.handle.fun.Ymaxstring';
>>
>> create  function Yminstring as 'cn.net.ycloud.ydb.handle.fun.Yminstring';
>>
>> YA100 supported data types:
>>
>>
>> How to create a ya100 table in Spark SQL
>>
>> CREATE external  table spark_ya100(
>>
>>  phonenum bigint, usernick string, ydb_sex string,  ydb_province string,
>> ydb_grade string, ydb_age string, ydb_blood string, ydb_zhiye string,
>> ydb_earn string, ydb_prefer string, ydb_consume string, ydb_day string,
>> amtdouble double,amtlong int, content string, ydbpartion string,
>> ya100_pipe string
>>
>> ) partitioned by (dt string)
>>
>> STORED BY 'cn.net.ycloud.ydb.handle.Ya100StorageHandler'
>>
>> LOCATION  '/data/ydb/shu_ya100'
>>
>> TBLPROPERTIES(
>>
>> "ya100.handler.table.name"="ydb_example_shu",
>>
>> "ya100.handler.schema"="phonenum long,usernick string,ydb_sex
>> string,ydb_province string,ydb_grade string,ydb_age string,ydb_blood
>> string,ydb_zhiye string,ydb_earn string,ydb_prefer string,ydb_consume
>> string,ydb_day string, amtdouble double,amtlong int,content
>> textcjk,ydbpartion string,ya100_pipe string"
>>
>> )
>>
>> // Note: ydbpartion is ydb's partition field and ya100_pipe is ya100's
>> pipe field; both must be created, otherwise some features will be
>> restricted.
>>
>> ya100.handler.table.name is a reserved key and must be distinct for
>> different tables.
>>
>> ya100.handler.schema gives the data types of the built-in index; the
>> names must match the Spark column names.
>>
>>
>> How to import data into a ya100 table
>>
>>
>> 1) Set the import parameters:
>>
>> set hive.mapred.supports.subdirectories=true;
>>
>> set mapred.min.split.size=2147483648;
>>
>> Note:
>>
>> ya100 ultimately generates a directory (which contains the index) rather
>> than a single file, so the subdirectories parameter must be set to make
>> Spark support subdirectories; otherwise importing data into the table
>> will fail with an error.
>>
>> set mapred.min.split.size=2147483648 controls the number of map tasks,
>> to prevent generating too many index files.
>>
>> 2) Run the import:
>>
>> insert into table spark_ya100 partition (dt='1200million')
>>
>>  select
>> phonenum,usernick,ydb_sex,ydb_province,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtdouble,amtlong,content,2,1
>>
>> from spark_txt where dt='1200million';
>>
>> How to use ya100 to filter data
>>
>> 1) Basic example 1
>>
>> set ya100.spark.filter.ydb_example_shu=ydb_sex='女' or ydb_province='辽宁'
>> or ydb_day>='20151217';
>>
>> Sometimes you will run into escaping problems; the value here can be
>> URL-encoded.
>>
>> For example:
>>
>> set
>> ya100.spark.filter.ydb_example_shu=ydbpartion%3D%2720151110%27+and+%28ydb_sex%3D%27%E5%A5%B3%27+or+ydb_province%3D%27%E8%BE%BD%E5%AE%81%27+or+ydb_day%3E%3D%2720151217%27%29;
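>>
>> A small Java sketch of producing such a URL-encoded value with the
>> standard java.net.URLEncoder (the filter text is the one from the example
>> above; URLEncoder emits '+' for spaces, matching the encoded form):
>>
>> import java.net.URLEncoder;
>>
>> public class EncodeFilter {
>>     public static void main(String[] args) throws Exception {
>>         String filter = "ydbpartion='20151110' and (ydb_sex='女' or "
>>                       + "ydb_province='辽宁' or ydb_day>='20151217')";
>>         // percent-encode quotes, '=' and non-ASCII so they survive "set"
>>         System.out.println(URLEncoder.encode(filter, "UTF-8"));
>>     }
>> }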
>>
>> 2) Filter example 2
>>
>> set ya100.spark.filter.ydb_example_shu=ydb_sex='女' or ydb_province='辽宁'
>> or ydb_day>='20151217';
>>
>> select * from spark_ya100 where dt='1200million' limit 10;
>>
>>
>> Other filter example use case
>>
>> 1) Equal
>>
>> qq='165162897'
>>
>> 2) In
>>
>> e.g.: indexnum in (1,2,3)
>>
>> 3) >, <, >=, <=
>>
>> clickcount >=10 and clickcount <=11
>>
>> 4) Range
>>
>> indexnum like '({0 TO 11}) '      (boundary values excluded)
>>
>> indexnum like '([10 TO 11] ) '    (boundary values included)
>>
>> 5) Unequal
>>
>> label<>'l_14' and label<>'l_15'
>>
>> 6) Combined conditions
>>
>> indexnum='1' or indexnum='2' or (clickcount >=5 and clickcount <=11)
>>
>>
>>
>>
>>
>> How to use ya100 for a top N sort
>>
>> set ya100.spark.top10000.ydb_example_shu=ydb_age desc limit 10;
>>
>> set ya100.spark.top10000.ydb_example_shu=ydb_sex desc,ydb_province limit
>> 100;
>>
>> set ya100.spark.top10000.ydb_example_shu=* limit 10;
>>
>> Note: in the current version the maximum limit is 10000 (each index
>> returns at most 1000).
>>
>> Writing * as the column name means no sorting is needed; just return the
>> first N rows.
>>
>> SQL example
>>
>> set ya100.spark.top10000.ydb_example_shu=phonenum desc limit 10;
>>
>> select * from spark_ya100 where dt='1200million' order by phonenum desc
>> limit 10;
>>
>>
>>
>> How to do group by and aggregation with ya100
>>
>> Example 1
>>
>> set ya100.spark.combine.ydb_example_shu=*,ydb_age;
>>
>> select ydb_province,Ycount('*',ya100_pipe), Ycount('ydb_age',ya100_pipe),
>> Ymaxstring('ydb_age',ya100_pipe), Yminstring('ydb_age',ya100_pipe)
>>
>>  from spark_ya100  group by ydb_province limit 10
>>
>>
>>
>> Example 2
>>
>> set ya100.spark.combine.ydb_example_shu=amtlong;
>>
>> select ydb_sex, ydb_province,Ysum('amtlong',ya100_pipe) as cnt from
>> spark_ya100  group by ydb_sex, ydb_province  limit 10
>>
>>
>>
>> Other SQL EXAMPLE
>>
>>
>> • COUNT(*) example
>>
>> set ya100.spark.filter.ydb_example_shu=phonenum='13870282671'  and
>> usernick='江峻熙';
>>
>> set ya100.spark.combine.ydb_example_shu=*;
>>
>> set ya100.spark.top10000.ydb_example_shu=;
>>
>> select Ycount('*',ya100_pipe) from spark_ya100 where dt='100million'
>> limit 10;
>> • TOP N sort example
>>
>> set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100]
>> )' and amtdouble like '([1090 TO 1100] )';
>>
>> set ya100.spark.combine.ydb_example_shu=;
>>
>> set ya100.spark.top10000.ydb_example_shu=phonenum desc limit 10;
>>
>> select * from spark_ya100 where dt='100million' order by phonenum desc
>> limit 10;
>>
>> • SUM(amtlong) example
>>
>> set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )'
>> and amtdouble like '([1090 TO 1100] )';
>>
>> set ya100.spark.combine.ydb_example_shu=amtlong;
>>
>> set ya100.spark.top10000.ydb_example_shu=;
>>
>> select Ycount('amtlong',ya100_pipe), Ysum('amtlong',ya100_pipe),
>> Yavg('amtlong',ya100_pipe),
>> Ymax('amtlong',ya100_pipe), Ymin('amtlong',ya100_pipe)  from spark_ya100
>> where dt='100million'  limit 10;
>>
>> • Top N sort
>>
>> set ya100.spark.filter.ydb_example_shu= amtlong like '([1090 TO 1100] )'
>> and amtdouble like '([1090 TO 1100] )';
>>
>> set ya100.spark.combine.ydb_example_shu=;
>>
>> set ya100.spark.top10000.ydb_example_shu= ydb_age desc, ydb_sex;
>>
>> select ydb_sex, ydb_age from spark_ya100 where dt='100million'  order  by
>> ydb_age desc, ydb_sex  limit 10;
>> • Group by
>>
>> set ya100.spark.filter.ydb_example_shu=;
>>
>> set
>> ya100.spark.combine.ydb_example_shu=ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtlong;
>>
>> set ya100.spark.top10000.ydb_example_shu=;
>>
>> select ydb_sex,ydb_province,ydb_grade, Ymaxstring('ydb_age',ya100_pipe),
>> Ymaxstring('ydb_blood',ya100_pipe),Ymaxstring('ydb_zhiye',ya100_pipe),
>> Ymaxstring('ydb_earn',ya100_pipe),Ymaxstring('ydb_prefer',ya100_pipe),
>> Ymaxstring('ydb_consume',ya100_pipe),Ymaxstring('ydb_day',ya100_pipe),
>> Ysum('amtlong',ya100_pipe) as cnt from spark_ya100 where dt='100million'
>> group by ydb_sex,ydb_province,ydb_grade order by cnt desc limit 10
>>
>>
>> Connect Ya100 to ydb for real-time data
>> CREATE external  table spark_ydb(
>>  phonenum bigint, usernick string, ydb_sex string,  ydb_province string,
>> ydb_grade string, ydb_age string, ydb_blood string, ydb_zhiye string,
>> ydb_earn string, ydb_prefer string, ydb_consume string, ydb_day string,
>> amtdouble double,amtlong int, content string, ydbpartion string, ya100_pipe
>> string
>> )
>> STORED BY 'cn.net.ycloud.ydb.handle.Ya100StorageHandler'
>> LOCATION  '/data/ydb/shu_ydb'
>> TBLPROPERTIES(
>> "ya100.handler.table.name"="ydb_example_shu",
>> "ya100.handler.master"="101.200.130.48:8080",
>>
>> "ya100.handler.columns.mapping"="phonenum,usernick,ydb_sex,ydb_province,ydb_grade,ydb_age,ydb_blood,ydb_zhiye,ydb_earn,ydb_prefer,ydb_consume,ydb_day,amtdouble,amtlong,content,ydbpartion,ya100_pipe")
>>
>> Note:
>>
>> ydbpartion is ydb's partition field; queries must specify ydbpartion,
>> while the Spark table has no partition field.
>>
>> ya100_pipe is ya100's pipe field; both fields must be created, otherwise
>> some features will be restricted.
>>
>> ya100.handler.table.name is the name of the table inside the ydb system.
>>
>> ya100.handler.columns.mapping is the mapping between the table inside
>> ydb and the Spark table.
>>
>>
>>
>>
>>
>>
>
