Hi Rana, I think the performance issue might related to the access of nested fields, since the rest performance hot spots (index search, hash join etc looks normal to me), and I assume " I.FLUID='Blood' AND I.LABEL='Haptoglobin'" should be very selective. Since MongoDB supports array index, did you build an index on L.FLAG using MongoDB?
@Wail, do you have any clue on nested fields access? Best regards, Chen Luo On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]> wrote: > Hi Taewoo, > > - > > Can you paste the optimized plan? -- Attached the plan (Plan_01.txt) > > - > > Can you create an index on LABEVENTS.FLAG? -- I couldn't create an index > on LABEVENTS.FLAG since LABEVENTS is of type array. I got this message when I > tried to create the index : "msg": "ASX0001: Field type array can't be > promoted to type object" > > - Can you switch the predicate order? -- It seems for me that the plan > remains the same even if I changed the order of the predicates. (Attached > the plan after changing the order of the predicates Plan_02.txt) > > Thanks > > Rana > On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]> > wrote: > >> Hi Chen, >> >> *How did you import data into the dataset? using "load" or "feed"?* >> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs >> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"), >> (\"format\"=\"json\"))). >> >> >> *Which version of AsterixDB are you using?* >> AsterixDB Master (0.9.3-SNAPSHOT) >> >> Thanks! >> >> >> >> >> >> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote: >> >>> Hi Rana, >>> >>> Nice to see you again! You may post to [email protected] as well >>> to get more feedbacks from our developers. >>> >>> Just clarify two things: how did you import data into the dataset? using >>> "load" or "feed"? And which version of AsterixDB are you using? But any way >>> in your case it seems the join takes a lot of time, and your data is pretty >>> much cached into the memory... >>> >>> Best regards, >>> Chen Luo >>> >>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <[email protected]> >>> wrote: >>> >>>> Hi there, >>>> >>>> I have a query that takes ~12.7mins on average (I have excluded the >>>> warm-up time which was 30mins)!, and I would like to make sure that I >>>> didn't miss any performance tuning parameters ( I have run the same query >>>> on MongoDB, and it took ~2mins). >>>> >>>> The query asks to find all patients that have 'abnormal' haptoglobin >>>> blood test result. (The query result can have duplicate values). >>>> >>>> *Query:* >>>> USE mimiciii; >>>> SET `compiler.parallelism` "5"; >>>> SET `compiler.sortmemory` "128MB"; >>>> SET `compiler.joinmemory` "265MB"; >>>> SELECT P.SUBJECT_ID >>>> FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E >>>> WHERE I.ITEMID=E.ITEMID AND >>>> E.FLAG = 'abnormal' AND >>>> I.FLUID='Blood' AND >>>> I.LABEL='Haptoglobin' >>>> >>>> *Datasets Schema:* >>>> - PATIENTS and LABITEMS datasets have an open schema. >>>> - LABITEMS's primary key is ITEMID >>>> - PATIENTS 's primary key is SUBJECT_ID >>>> - The JSON schema for both datasets is attached. >>>> - The DDL for both datasets is attached >>>> >>>> *Performance Tuning Parameters:* >>>> - 4 partitions (iodevices) >>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to the >>>> buffercache (storage.buffercache.size). >>>> - As you can see from the query, I set the parallelism to 5, >>>> sort-memory to 128MB, join-memory to 265MB. >>>> - The data size is 7GB >>>> >>>> Your feedback is highly appreciated! >>>> >>>> --Rana >>>> >>> >>> >> >
