Re: Reply: kylin nonsupport Multi-value dimensions?

2017-05-13 Thread Li Yang
> java.lang.IllegalStateException: The table: DIM_XXX Dup key found, key=[1446], value1=[1446,29,1,1], value2=[1446,28,0,0]

This error reports a duplicate key in a dimension table. The primary key
of a dimension table must be unique across all rows, and in this case the
key "1446" appears twice.

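A quick way to confirm which keys are affected is to group the dimension rows by key before building the cube. The following is a minimal Python sketch of that check, not Kylin code: the function name `find_duplicate_keys` and the row for key 1445 are made up for illustration, while the two rows for key 1446 are copied from the error message.

```python
from collections import defaultdict


def find_duplicate_keys(rows, key_index=0):
    """Group rows by primary key and return only the keys seen more than once."""
    by_key = defaultdict(list)
    for row in rows:
        by_key[row[key_index]].append(row)
    return {key: dupes for key, dupes in by_key.items() if len(dupes) > 1}


# Hypothetical sample of the dimension table; the 1446 rows come from the error.
dim_rows = [
    (1445, 30, 1, 0),
    (1446, 29, 1, 1),
    (1446, 28, 0, 0),
]

duplicates = find_duplicate_keys(dim_rows)
print(duplicates)  # {1446: [(1446, 29, 1, 1), (1446, 28, 0, 0)]}
```

Any key reported here has to be deduplicated, or the table remodeled, before it can serve as a lookup table.
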
On Wed, May 10, 2017 at 6:59 PM, Alberto Ramón wrote:

> You can convert this dim to a string and check the performance using
> LIKE filters.
>
> With Hive, you can duplicate the rows in the fact table, one for each
> dim value.
>
> Another, more complex, solution would be to extend the dictionary
> encoding of the dimension so that it understands multi-values.
>
> No more ideas :)
>
> On 10 May 2017 8:51 a.m., "jianhui.yi" wrote:
>
> Sorry, I wrote it wrongly; this problem is about a multi-value dimension.
>
> Example: I have a fact table named fact_order and a dimension table named
> dim_sales.
>
> In the fact_order table, one order can contain multiple salespeople.
>
> When I join fact_order with dim_sales, it reports the error: Dup key found.
>
> How can I solve it?
>
> *From:* Alberto Ramón [mailto:a.ramonporto...@gmail.com]
> *Sent:* May 10, 2017 15:29
> *To:* user
> *Subject:* Re: kylin nonsupport Multi-value dimensions?
>
> Hi,
>
> Not all Hive types are supported.
>
> Check these lines:
> https://github.com/apache/kylin/blob/5d4982e247a2172d97d44c85309cef4b3dbfce09/core-metadata/src/main/java/org/apache/kylin/dimension/DimensionEncodingFactory.java#L76
>
> On 10 May 2017 at 08:10, jianhui.yi wrote:
>
> I encountered a multi-dimensional dimension problem and tried to solve it
> with a bridge table, but when building a cube it reports an error:
>
> java.lang.IllegalStateException: The table: DIM_XXX Dup key found, key=[1446], value1=[1446,29,1,1], value2=[1446,28,0,0]
>  at org.apache.kylin.dict.lookup.LookupTable.initRow(LookupTable.java:86)
>  at org.apache.kylin.dict.lookup.LookupTable.init(LookupTable.java:69)
>  at org.apache.kylin.dict.lookup.LookupStringTable.init(LookupStringTable.java:79)
>  at org.apache.kylin.dict.lookup.LookupTable.<init>(LookupTable.java:57)
>  at org.apache.kylin.dict.lookup.LookupStringTable.<init>(LookupStringTable.java:65)
>  at org.apache.kylin.cube.CubeManager.getLookupTable(CubeManager.java:644)
>  at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:98)
>  at org.apache.kylin.cube.cli.DictionaryGeneratorCLI.processSegment(DictionaryGeneratorCLI.java:54)
>  at org.apache.kylin.engine.mr.steps.CreateDictionaryJob.run(CreateDictionaryJob.java:66)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
>  at org.apache.kylin.engine.mr.common.HadoopShellExecutable.doWork(HadoopShellExecutable.java:63)
>  at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
>  at org.apache.kylin.job.execution.DefaultChainedExecutable.doWork(DefaultChainedExecutable.java:64)
>  at org.apache.kylin.job.execution.AbstractExecutable.execute(AbstractExecutable.java:124)
>  at org.apache.kylin.job.impl.threadpool.DefaultScheduler$JobRunner.run(DefaultScheduler.java:142)
>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>  at java.lang.Thread.run(Thread.java:745)
>
> result code:2

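The "one fact row per dim value" workaround suggested earlier in this thread can be sketched as follows. This is an illustrative Python sketch only, not Hive or Kylin code: the function name `explode_salespeople`, the column layout, and the sample orders are all assumptions made for the example.

```python
def explode_salespeople(orders):
    """Yield one (order_id, sales_id, amount) row per salesperson on each order."""
    for order_id, sales_ids, amount in orders:
        for sales_id in sales_ids:
            yield (order_id, sales_id, amount)


# Hypothetical orders: (order_id, list of salesperson ids, order amount).
orders = [
    (1001, [29, 28], 50.0),
    (1002, [29], 20.0),
]

flat_rows = list(explode_salespeople(orders))
# Each flattened row now joins to exactly one dim_sales row.
original_total = sum(amount for _, _, amount in orders)
flattened_total = sum(amount for _, _, amount in flat_rows)
```

Note that duplicating fact rows also duplicates their measures: in this sample the flattened total is 120.0 while the original total is 70.0, so sums over the flattened table need care.
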

Re: multiple column distinct count

2017-05-13 Thread Li Yang
You are right. The GUI still does not accept multiple columns for the
count-distinct measure. A JIRA has been created:
https://issues.apache.org/jira/browse/KYLIN-2616
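
For readers unfamiliar with the feature, a multi-column distinct count counts distinct combinations of the columns rather than distinct values of each column separately. The Python sketch below only illustrates that semantics on made-up rows. It is not how Kylin computes the measure, and `count_distinct` is a hypothetical helper.

```python
def count_distinct(rows, columns):
    """Count distinct combinations of the given columns across all rows."""
    return len({tuple(row[c] for c in columns) for row in rows})


# Hypothetical sample data with two columns, A and B.
rows = [
    {"A": 1, "B": "x"},
    {"A": 1, "B": "y"},
    {"A": 1, "B": "x"},
    {"A": 2, "B": "x"},
]

pairs = count_distinct(rows, ["A", "B"])  # distinct (A, B) pairs: 3
only_a = count_distinct(rows, ["A"])      # distinct values of A alone: 2
```
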



On Thu, May 4, 2017 at 6:04 PM, 市场中心-ZHANGDA32698 wrote:

> Hi,
>
> As stated in the release notes and in JIRA
> https://issues.apache.org/jira/browse/KYLIN-490 , multiple-column distinct
> count is supported in v2.0. So in order to run ‘select count(distinct A, B)
> from table’, I assume I need to specify a count-distinct measure that
> includes both A and B at the cube design stage. However, I notice that in
> the ‘edit measure’ UI, the ‘Param Value’ is a drop-down list where I can’t
> input more than one column. How does Kylin answer a multi-column count
> distinct query without a multi-column count-distinct measure being defined?
>


Re: Questions about SUM behavior when rewritten as TOPN

2017-05-13 Thread Billy Liu
Thanks Tingmao for the report.

Could you show us the complete SQL? In your SQL there is no ORDER BY
clause. Without an ORDER BY, the query should not be rewritten to use the
TopN measure.

2017-05-12 23:52 GMT+08:00 Tingmao Lin:

> Hi,
>
> We found that a SUM() query on a cardinality-1 dimension is not accurate
> (or "not correct") when it is automatically rewritten as TOPN.
> Is that the expected behavior of Kylin, or is there some other issue?
>
> We built a cube on a table ( measure1: bigint, dim1_id:varchar,
> dim2_id:varchar, ... ) using kylin 1.6.0 (Kafka streaming source)
>
> The cube has two measures: SUM(measure1) and
> TOPN(10,sum-orderby(measure1),group by dim2_id) (other measures omitted),
> and two dimensions: dim1_id, dim2_id (other dims omitted).
>
> About the source table data:
> The cardinality of dim1_id  is 1 (same dim1_id for all rows in the source
> table)
> The cardinality of dim2_id  is 1 (same dim2_id for all rows in the source
> table)
> The possible values of measure1 are [1,0,-1]
>
> When we query
> "select SUM(measure1) FROM table GROUP BY dim2_id"
>  => the result has one row: "sum=7".
> From the Kylin logs we found that the query had been automatically
> rewritten as TOPN(measure1,sum-orderby(measure1),group by dim2_id)
>
> When we write another query to prevent the TOPN rewrite, for example:
>
> "select SUM(measure1),count(*) FROM table GROUP BY dim2_id"
>  => one row -- "sum=-2,count=24576"
>
> "select SUM(measure1),count(*) FROM table"
>  => one row -- "sum=-2,count=24576"
>
> The results differ (7 vs. -2) depending on whether the query is rewritten
> to TOPN.
>
> My question is: do the following behaviors "work as expected", does the
> TOPN algorithm not handle negative counter values well, or is there some
> other issue?
>
> 1. A SUM() query is automatically rewritten as TOPN and gives an
> approximate result even though no TOPN is present in the query.
>
> 2. When the cardinality is 1, TOPN does not give an accurate result.
>
> Thanks.