Re: AsterixDB Performance Tuning

Rana Alotaibi Fri, 26 Jan 2018 13:59:25 -0800

Hi all,
Thanks for your immediate and quick reply.

@Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
AsterixDB setting, it seems that navigating deeply in the nested fields and
create an index is not supported.
@Taewoo Adding  /*+indexnl */ didn't change the plan.
@Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins on
average without including the warm-up time. The plan has changed and it is
attached.


Can you please points me where I can find this in the documentation? And
here is my question: Why do I need to add some physical details in the
query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
broadcast_exchange?
@Michael: The data size is ~7GB for PATEINTS dataset and 1MB for LABITEMS
dataset (Both datasets have an open schema). After data ingestion, in my
setting the data in MongoDB is ~3GB(It seems MongoDB does some compression)
and in AsterixDB is ~10GB. (I have 4 partitions,and I checked the size of
files in each partition and the total is ~10GB!)

Thanks!
--Rana




On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <[email protected]>
wrote:

> One thing I noticed is that the "large" unnested arrays are hash
> partitioned to the probably "small" index-filtered dataset.
> Since the data can fit in memory (7 GB in total), I think
> broadcast_exchange may do better in this particular case.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> Note: I reordered the FROM clause...
>
> Another thing is that I think it's a CPU bound query ... and I'm not sure
> how MongoDB utilizes CPU resources compared with AsterixDB.
>
>
>
> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <[email protected]> wrote:
>
>> PS: UNNEST doc
>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>>
>> Best,
>> Taewoo
>>
>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]> wrote:
>>
>>> Hi Rana,
>>>
>>> Thank you for attaching your plan. It seems that the selections are
>>> correctly made before each join. If your query predicate is selective
>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>
>>> USE mimiciii;
>>> SET `compiler.parallelism` "5";
>>> SET `compiler.sortmemory` "128MB";
>>> SET `compiler.joinmemory` "265MB";
>>> SELECT P.SUBJECT_ID
>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>              E.FLAG = 'abnormal' AND
>>>              I.FLUID='Blood' AND
>>>              I.LABEL='Haptoglobin'
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote:
>>>
>>>> Hi Rana,
>>>>
>>>> I think the performance issue might related to the access of nested
>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>
>>>> @Wail, do you have any clue on nested fields access?
>>>>
>>>> Best regards,
>>>> Chen Luo
>>>>
>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Taewoo,
>>>>>
>>>>>    -
>>>>>
>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>
>>>>>    -
>>>>>
>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an 
>>>>> index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this 
>>>>> message when I tried to create the index : "msg": "ASX0001: Field type 
>>>>> array can't be promoted to type object"
>>>>>
>>>>>    - Can you switch the predicate order? -- It seems for me that the
>>>>>    plan remains the same even if I changed the order of the predicates.
>>>>>    (Attached the plan after changing the order of the predicates 
>>>>> Plan_02.txt)
>>>>>
>>>>> Thanks
>>>>>
>>>>> Rana
>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]
>>>>> > wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>>
>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>> (\"format\"=\"json\"))).
>>>>>>
>>>>>>
>>>>>> *Which version of AsterixDB are you using?*
>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Rana,
>>>>>>>
>>>>>>> Nice to see you again! You may post to [email protected] as
>>>>>>> well to get more feedbacks from our developers.
>>>>>>>
>>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? 
>>>>>>> But
>>>>>>> any way in your case it seems the join takes a lot of time, and your 
>>>>>>> data
>>>>>>> is pretty much cached into the memory...
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Chen Luo
>>>>>>>
>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>> didn't miss any performance tuning parameters ( I have run the same 
>>>>>>>> query
>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>
>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>> haptoglobin blood test result. (The query result can have duplicate 
>>>>>>>> values).
>>>>>>>>
>>>>>>>> *Query:*
>>>>>>>> USE mimiciii;
>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>
>>>>>>>> *Datasets Schema:*
>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>
>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>> - 4 partitions (iodevices)
>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>> - The data size is 7GB
>>>>>>>>
>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>
>>>>>>>> --Rana
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
>
> *Regards,*
> Wail Alkowaileet
>

distribute result [$$49]
-- DISTRIBUTE_RESULT  |PARTITIONED|
  exchange
  -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
    project ([$$49])
    -- STREAM_PROJECT  |PARTITIONED|
      assign [$$49] <- [{"SUBJECT_ID": $$50}]
      -- ASSIGN  |PARTITIONED|
        project ([$$50])
        -- STREAM_PROJECT  |PARTITIONED|
          exchange
          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
            join (eq($$52, $$51))
            -- HYBRID_HASH_JOIN [$$52][$$51]  |PARTITIONED|
              exchange
              -- RANDOM_PARTITION_EXCHANGE  |PARTITIONED|
                project ([$$50, $$52])
                -- STREAM_PROJECT  |PARTITIONED|
                  select (eq($$E.getField("FLAG"), "abnormal"))
                  -- STREAM_SELECT  |PARTITIONED|
                    assign [$$52] <- [$$E.getField("ITEMID")]
                    -- ASSIGN  |PARTITIONED|
                      project ([$$50, $$E])
                      -- STREAM_PROJECT  |PARTITIONED|
                        unnest $$E <- scan-collection($$55)
                        -- UNNEST  |PARTITIONED|
                          project ([$$50, $$55])
                          -- STREAM_PROJECT  |PARTITIONED|
                            assign [$$55] <- [$$A.getField("LABEVENTS")]
                            -- ASSIGN  |PARTITIONED|
                              project ([$$50, $$A])
                              -- STREAM_PROJECT  |PARTITIONED|
                                unnest $$A <- scan-collection($$54)
                                -- UNNEST  |PARTITIONED|
                                  project ([$$50, $$54])
                                  -- STREAM_PROJECT  |PARTITIONED|
                                    assign [$$54] <- 
[$$P.getField("ADMISSIONS")]
                                    -- ASSIGN  |PARTITIONED|
                                      exchange
                                      -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                        data-scan []<-[$$50, $$P] <- 
mimiciii.mimic
                                        -- DATASOURCE_SCAN  |PARTITIONED|
                                          exchange
                                          -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                            empty-tuple-source
                                            -- EMPTY_TUPLE_SOURCE  |PARTITIONED|
              exchange
              -- BROADCAST_EXCHANGE  |PARTITIONED|
                project ([$$51])
                -- STREAM_PROJECT  |PARTITIONED|
                  select (and(eq($$I.getField("FLUID"), "Blood"), 
eq($$I.getField("LABEL"), "Haptoglobin")))
                  -- STREAM_SELECT  |PARTITIONED|
                    exchange
                    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                      unnest-map [$$51, $$I] <- index-search("LABITEMS", 0, 
"mimiciii", "LABITEMS", FALSE, FALSE, 1, $$63, 1, $$63, TRUE, TRUE, TRUE)
                      -- BTREE_SEARCH  |PARTITIONED|
                        exchange
                        -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                          intersect ([$$63] <- [[$$63], [$$67]])
                          -- INTERSECT  |PARTITIONED|
                            exchange
                            -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                              order (ASC, $$63) 
                              -- STABLE_SORT [$$63(ASC)]  |PARTITIONED|
                                exchange
                                -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                  project ([$$63])
                                  -- STREAM_PROJECT  |PARTITIONED|
                                    exchange
                                    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                      unnest-map [$$62, $$63] <- 
index-search("LABLEIndx", 0, "mimiciii", "LABITEMS", FALSE, FALSE, 1, $$60, 1, 
$$61, TRUE, TRUE, TRUE)
                                      -- BTREE_SEARCH  |PARTITIONED|
                                        exchange
                                        -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                          assign [$$60, $$61] <- 
["Haptoglobin", "Haptoglobin"]
                                          -- ASSIGN  |PARTITIONED|
                                            empty-tuple-source
                                            -- EMPTY_TUPLE_SOURCE  |PARTITIONED|
                            exchange
                            -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                              order (ASC, $$67) 
                              -- STABLE_SORT [$$67(ASC)]  |PARTITIONED|
                                exchange
                                -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                  project ([$$67])
                                  -- STREAM_PROJECT  |PARTITIONED|
                                    exchange
                                    -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                      unnest-map [$$66, $$67] <- 
index-search("FLUIDIndx", 0, "mimiciii", "LABITEMS", FALSE, FALSE, 1, $$64, 1, 
$$65, TRUE, TRUE, TRUE)
                                      -- BTREE_SEARCH  |PARTITIONED|
                                        exchange
                                        -- ONE_TO_ONE_EXCHANGE  |PARTITIONED|
                                          assign [$$64, $$65] <- ["Blood", 
"Blood"]
                                          -- ASSIGN  |PARTITIONED|
                                            empty-tuple-source
                                            -- EMPTY_TUPLE_SOURCE  |PARTITIONED|

Re: AsterixDB Performance Tuning

Reply via email to