PS: UNNEST doc https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
Best, Taewoo On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]> wrote: > Hi Rana, > > Thank you for attaching your plan. It seems that the selections are > correctly made before each join. If your query predicate is selective > enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as > the result), I suggest you could try an index-nested-loop-join. Changes are > highlighted. And one more question: if LABEVENTS.FLAG is an array, you > can't just use "E.FLAG="abnormal". I think you need to use UNNEST. > > USE mimiciii; > SET `compiler.parallelism` "5"; > SET `compiler.sortmemory` "128MB"; > SET `compiler.joinmemory` "265MB"; > SELECT P.SUBJECT_ID > FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E > WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND > E.FLAG = 'abnormal' AND > I.FLUID='Blood' AND > I.LABEL='Haptoglobin' > > Best, > Taewoo > > On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote: > >> Hi Rana, >> >> I think the performance issue might related to the access of nested >> fields, since the rest performance hot spots (index search, hash join etc >> looks normal to me), and I assume " I.FLUID='Blood' AND >> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports >> array index, did you build an index on L.FLAG using MongoDB? >> >> @Wail, do you have any clue on nested fields access? >> >> Best regards, >> Chen Luo >> >> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]> >> wrote: >> >>> Hi Taewoo, >>> >>> - >>> >>> Can you paste the optimized plan? -- Attached the plan (Plan_01.txt) >>> >>> - >>> >>> Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index >>> on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when >>> I tried to create the index : "msg": "ASX0001: Field type array can't be >>> promoted to type object" >>> >>> - Can you switch the predicate order? -- It seems for me that the >>> plan remains the same even if I changed the order of the predicates. >>> (Attached the plan after changing the order of the predicates >>> Plan_02.txt) >>> >>> Thanks >>> >>> Rana >>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]> >>> wrote: >>> >>>> Hi Chen, >>>> >>>> *How did you import data into the dataset? using "load" or "feed"?* >>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs >>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"), >>>> (\"format\"=\"json\"))). >>>> >>>> >>>> *Which version of AsterixDB are you using?* >>>> AsterixDB Master (0.9.3-SNAPSHOT) >>>> >>>> Thanks! >>>> >>>> >>>> >>>> >>>> >>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote: >>>> >>>>> Hi Rana, >>>>> >>>>> Nice to see you again! You may post to [email protected] as >>>>> well to get more feedbacks from our developers. >>>>> >>>>> Just clarify two things: how did you import data into the dataset? >>>>> using "load" or "feed"? And which version of AsterixDB are you using? But >>>>> any way in your case it seems the join takes a lot of time, and your data >>>>> is pretty much cached into the memory... >>>>> >>>>> Best regards, >>>>> Chen Luo >>>>> >>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <[email protected]> >>>>> wrote: >>>>> >>>>>> Hi there, >>>>>> >>>>>> I have a query that takes ~12.7mins on average (I have excluded the >>>>>> warm-up time which was 30mins)!, and I would like to make sure that I >>>>>> didn't miss any performance tuning parameters ( I have run the same query >>>>>> on MongoDB, and it took ~2mins). >>>>>> >>>>>> The query asks to find all patients that have 'abnormal' haptoglobin >>>>>> blood test result. (The query result can have duplicate values). >>>>>> >>>>>> *Query:* >>>>>> USE mimiciii; >>>>>> SET `compiler.parallelism` "5"; >>>>>> SET `compiler.sortmemory` "128MB"; >>>>>> SET `compiler.joinmemory` "265MB"; >>>>>> SELECT P.SUBJECT_ID >>>>>> FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E >>>>>> WHERE I.ITEMID=E.ITEMID AND >>>>>> E.FLAG = 'abnormal' AND >>>>>> I.FLUID='Blood' AND >>>>>> I.LABEL='Haptoglobin' >>>>>> >>>>>> *Datasets Schema:* >>>>>> - PATIENTS and LABITEMS datasets have an open schema. >>>>>> - LABITEMS's primary key is ITEMID >>>>>> - PATIENTS 's primary key is SUBJECT_ID >>>>>> - The JSON schema for both datasets is attached. >>>>>> - The DDL for both datasets is attached >>>>>> >>>>>> *Performance Tuning Parameters:* >>>>>> - 4 partitions (iodevices) >>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the >>>>>> buffercache (storage.buffercache.size). >>>>>> - As you can see from the query, I set the parallelism to 5, >>>>>> sort-memory to 128MB, join-memory to 265MB. >>>>>> - The data size is 7GB >>>>>> >>>>>> Your feedback is highly appreciated! >>>>>> >>>>>> --Rana >>>>>> >>>>> >>>>> >>>> >>> >> >
