One thing I noticed is that the "large" unnested arrays are hash
partitioned to the probably "small" index-filtered dataset.
Since the data can fit in memory (7 GB in total), I think
broadcast_exchange may do better in this particular case.

USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
             E.FLAG = 'abnormal' AND
             I.FLUID='Blood' AND
             I.LABEL='Haptoglobin'

Note: I reordered the FROM clause...

Another thing is that I think it's a CPU bound query ... and I'm not sure
how MongoDB utilizes CPU resources compared with AsterixDB.



On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <[email protected]> wrote:

> PS: UNNEST doc
> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>
> Best,
> Taewoo
>
> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]> wrote:
>
>> Hi Rana,
>>
>> Thank you for attaching your plan. It seems that the selections are
>> correctly made before each join. If your query predicate is selective
>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>> the result), I suggest you could try an index-nested-loop-join. Changes are
>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>
>> USE mimiciii;
>> SET `compiler.parallelism` "5";
>> SET `compiler.sortmemory` "128MB";
>> SET `compiler.joinmemory` "265MB";
>> SELECT P.SUBJECT_ID
>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>              E.FLAG = 'abnormal' AND
>>              I.FLUID='Blood' AND
>>              I.LABEL='Haptoglobin'
>>
>> Best,
>> Taewoo
>>
>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote:
>>
>>> Hi Rana,
>>>
>>> I think the performance issue might related to the access of nested
>>> fields, since the rest performance hot spots (index search, hash join etc
>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports
>>> array index, did you build an index on L.FLAG using MongoDB?
>>>
>>> @Wail, do you have any clue on nested fields access?
>>>
>>> Best regards,
>>> Chen Luo
>>>
>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]>
>>> wrote:
>>>
>>>> Hi Taewoo,
>>>>
>>>>    -
>>>>
>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>
>>>>    -
>>>>
>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an 
>>>> index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this 
>>>> message when I tried to create the index : "msg": "ASX0001: Field type 
>>>> array can't be promoted to type object"
>>>>
>>>>    - Can you switch the predicate order? -- It seems for me that the
>>>>    plan remains the same even if I changed the order of the predicates.
>>>>    (Attached the plan after changing the order of the predicates 
>>>> Plan_02.txt)
>>>>
>>>> Thanks
>>>>
>>>> Rana
>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Chen,
>>>>>
>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>> (\"format\"=\"json\"))).
>>>>>
>>>>>
>>>>> *Which version of AsterixDB are you using?*
>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>
>>>>> Thanks!
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote:
>>>>>
>>>>>> Hi Rana,
>>>>>>
>>>>>> Nice to see you again! You may post to [email protected] as
>>>>>> well to get more feedbacks from our developers.
>>>>>>
>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>>> is pretty much cached into the memory...
>>>>>>
>>>>>> Best regards,
>>>>>> Chen Luo
>>>>>>
>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <[email protected]
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi there,
>>>>>>>
>>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>> didn't miss any performance tuning parameters ( I have run the same 
>>>>>>> query
>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>
>>>>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>>>>> blood test result. (The query result can have duplicate values).
>>>>>>>
>>>>>>> *Query:*
>>>>>>> USE mimiciii;
>>>>>>> SET `compiler.parallelism` "5";
>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>> SELECT P.SUBJECT_ID
>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>              I.FLUID='Blood' AND
>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>
>>>>>>> *Datasets Schema:*
>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>> - The DDL for both datasets is attached
>>>>>>>
>>>>>>> *Performance Tuning Parameters:*
>>>>>>> - 4 partitions (iodevices)
>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>> - The data size is 7GB
>>>>>>>
>>>>>>> Your feedback is highly appreciated!
>>>>>>>
>>>>>>> --Rana
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


-- 

*Regards,*
Wail Alkowaileet

Reply via email to