PS: UNNEST doc
https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses

Best,
Taewoo

On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]> wrote:

> Hi Rana,
>
> Thank you for attaching your plan. It seems that the selections are
> correctly made before each join. If your query predicate is selective
> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
> the result), I suggest you could try an index-nested-loop-join. Changes are
> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>              E.FLAG = 'abnormal' AND
>              I.FLUID='Blood' AND
>              I.LABEL='Haptoglobin'
>
> Best,
> Taewoo
>
> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote:
>
>> Hi Rana,
>>
>> I think the performance issue might related to the access of nested
>> fields, since the rest performance hot spots (index search, hash join etc
>> looks normal to me), and I assume " I.FLUID='Blood' AND
>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports
>> array index, did you build an index on L.FLAG using MongoDB?
>>
>> @Wail, do you have any clue on nested fields access?
>>
>> Best regards,
>> Chen Luo
>>
>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]>
>> wrote:
>>
>>> Hi Taewoo,
>>>
>>>    -
>>>
>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>
>>>    -
>>>
>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index 
>>> on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when 
>>> I tried to create the index : "msg": "ASX0001: Field type array can't be 
>>> promoted to type object"
>>>
>>>    - Can you switch the predicate order? -- It seems for me that the
>>>    plan remains the same even if I changed the order of the predicates.
>>>    (Attached the plan after changing the order of the predicates 
>>> Plan_02.txt)
>>>
>>> Thanks
>>>
>>> Rana
>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]>
>>> wrote:
>>>
>>>> Hi Chen,
>>>>
>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>> (\"format\"=\"json\"))).
>>>>
>>>>
>>>> *Which version of AsterixDB are you using?*
>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>
>>>> Thanks!
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote:
>>>>
>>>>> Hi Rana,
>>>>>
>>>>> Nice to see you again! You may post to [email protected] as
>>>>> well to get more feedbacks from our developers.
>>>>>
>>>>> Just clarify two things: how did you import data into the dataset?
>>>>> using "load" or "feed"? And which version of AsterixDB are you using? But
>>>>> any way in your case it seems the join takes a lot of time, and your data
>>>>> is pretty much cached into the memory...
>>>>>
>>>>> Best regards,
>>>>> Chen Luo
>>>>>
>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <[email protected]>
>>>>> wrote:
>>>>>
>>>>>> Hi there,
>>>>>>
>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>> didn't miss any performance tuning parameters ( I have run the same query
>>>>>> on MongoDB, and it took ~2mins).
>>>>>>
>>>>>> The query asks to find all patients that have 'abnormal' haptoglobin
>>>>>> blood test result. (The query result can have duplicate values).
>>>>>>
>>>>>> *Query:*
>>>>>> USE mimiciii;
>>>>>> SET `compiler.parallelism` "5";
>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>> SELECT P.SUBJECT_ID
>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>              I.FLUID='Blood' AND
>>>>>>              I.LABEL='Haptoglobin'
>>>>>>
>>>>>> *Datasets Schema:*
>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>> - The JSON schema for both datasets is attached.
>>>>>> - The DDL for both datasets is attached
>>>>>>
>>>>>> *Performance Tuning Parameters:*
>>>>>> - 4 partitions (iodevices)
>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the
>>>>>> buffercache (storage.buffercache.size).
>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>> - The data size is 7GB
>>>>>>
>>>>>> Your feedback is highly appreciated!
>>>>>>
>>>>>> --Rana
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>

Reply via email to