Hi Rana,
Thank you for attaching your plan. It seems that the selections are
correctly made before each join. If your query predicate is selective
enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
the result), I suggest you could try an index-nested-loop-join. Changes are
highlighted. And one more question: if LABEVENTS.FLAG is an array, you
can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
E.FLAG = 'abnormal' AND
I.FLUID='Blood' AND
I.LABEL='Haptoglobin'
Best,
Taewoo
On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote:
> Hi Rana,
>
> I think the performance issue might related to the access of nested
> fields, since the rest performance hot spots (index search, hash join etc
> looks normal to me), and I assume " I.FLUID='Blood' AND
> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports
> array index, did you build an index on L.FLAG using MongoDB?
>
> @Wail, do you have any clue on nested fields access?
>
> Best regards,
> Chen Luo
>
> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]>
> wrote:
>
>> Hi Taewoo,
>>
>> -
>>
>> Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>
>> -
>>
>> Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index
>> on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when
>> I tried to create the index : "msg": "ASX0001: Field type array can't be
>> promoted to type object"
>>
>> - Can you switch the predicate order? -- It seems for me that the
>> plan remains the same even if I changed the order of the predicates.
>> (Attached the plan after changing the order of the predicates Plan_02.txt)
>>
>> Thanks
>>
>> Rana
>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]>
>> wrote:
>>
>>> Hi Chen,
>>>
>>> *How did you import data into the dataset? using "load" or "feed"?*
>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>> (\"format\"=\"json\"))).
>>>
>>>
>>> *Which version of AsterixDB are you using?*
>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>
>>> Thanks!
>>>
>>>
>>>
>>>
>>>
>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote:
>>>
>>>> Hi Rana,
>>>>
>>>> Nice to see you again! You may post to [email protected] as
>>>> well to get more feedbacks from our developers.
>>>>
>>>> Just clarify two things: how did you import data into the dataset?
>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>> any way in your case it seems the join takes a lot of time, and your data
>>>> is pretty much cached into the memory...
>>>>
>>>> Best regards,
>>>> Chen Luo
>>>>
>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>> on MongoDB, and it took ~2mins).
>>>>>
>>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>>> blood test result. (The query result can have duplicate values).
>>>>>
>>>>> *Query:*
>>>>> USE mimiciii;
>>>>> SET `compiler.parallelism` "5";
>>>>> SET `compiler.sortmemory` "128MB";
>>>>> SET `compiler.joinmemory` "265MB";
>>>>> SELECT P.SUBJECT_ID
>>>>> FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>> E.FLAG = 'abnormal' AND
>>>>> I.FLUID='Blood' AND
>>>>> I.LABEL='Haptoglobin'
>>>>>
>>>>> *Datasets Schema:*
>>>>> - PATIENTS and LABITEMS datasets have an open schema.
>>>>> - LABITEMS's primary key is ITEMID
>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>> - The JSON schema for both datasets is attached.
>>>>> - The DDL for both datasets is attached
>>>>>
>>>>> *Performance Tuning Parameters:*
>>>>> - 4 partitions (iodevices)
>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>>>>> buffercache (storage.buffercache.size).
>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>> - The data size is 7GB
>>>>>
>>>>> Your feedback is highly appreciated!
>>>>>
>>>>> --Rana
>>>>>
>>>>
>>>>
>>>
>>
>