Hi all,
Thanks for your immediate and quick reply.
@Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
AsterixDB setting, it seems that navigating deeply in the nested fields and
create an index is not supported.
@Taewoo Adding /*+indexnl */ didn't change the plan.
@Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins on
average without including the warm-up time. The plan has changed and it is
attached.
Can you please points me where I can find this in the documentation? And
here is my question: Why do I need to add some physical details in the
query like adding /*+indexnl */ index-nested-loop-join or /*+bcast/
broadcast_exchange?
@Michael: The data size is ~7GB for PATEINTS dataset and 1MB for LABITEMS
dataset (Both datasets have an open schema). After data ingestion, in my
setting the data in MongoDB is ~3GB(It seems MongoDB does some compression)
and in AsterixDB is ~10GB. (I have 4 partitions,and I checked the size of
files in each partition and the total is ~10GB!)
Thanks!
--Rana
On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <[email protected]>
wrote:
> One thing I noticed is that the "large" unnested arrays are hash
> partitioned to the probably "small" index-filtered dataset.
> Since the data can fit in memory (7 GB in total), I think
> broadcast_exchange may do better in this particular case.
>
> USE mimiciii;
> SET `compiler.parallelism` "5";
> SET `compiler.sortmemory` "128MB";
> SET `compiler.joinmemory` "265MB";
> SELECT P.SUBJECT_ID
> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
> WHERE E.ITEMID/*+bcast*/ = I.ITEMID AND
> E.FLAG = 'abnormal' AND
> I.FLUID='Blood' AND
> I.LABEL='Haptoglobin'
>
> Note: I reordered the FROM clause...
>
> Another thing is that I think it's a CPU bound query ... and I'm not sure
> how MongoDB utilizes CPU resources compared with AsterixDB.
>
>
>
> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <[email protected]> wrote:
>
>> PS: UNNEST doc
>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
>>
>> Best,
>> Taewoo
>>
>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]> wrote:
>>
>>> Hi Rana,
>>>
>>> Thank you for attaching your plan. It seems that the selections are
>>> correctly made before each join. If your query predicate is selective
>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>> the result), I suggest you could try an index-nested-loop-join. Changes are
>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>
>>> USE mimiciii;
>>> SET `compiler.parallelism` "5";
>>> SET `compiler.sortmemory` "128MB";
>>> SET `compiler.joinmemory` "265MB";
>>> SELECT P.SUBJECT_ID
>>> FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>> E.FLAG = 'abnormal' AND
>>> I.FLUID='Blood' AND
>>> I.LABEL='Haptoglobin'
>>>
>>> Best,
>>> Taewoo
>>>
>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote:
>>>
>>>> Hi Rana,
>>>>
>>>> I think the performance issue might related to the access of nested
>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>
>>>> @Wail, do you have any clue on nested fields access?
>>>>
>>>> Best regards,
>>>> Chen Luo
>>>>
>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Taewoo,
>>>>>
>>>>> -
>>>>>
>>>>> Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>
>>>>> -
>>>>>
>>>>> Can you create an index on LABEVENTS.FLAG? -- I couldn't create an
>>>>> index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this
>>>>> message when I tried to create the index : "msg": "ASX0001: Field type
>>>>> array can't be promoted to type object"
>>>>>
>>>>> - Can you switch the predicate order? -- It seems for me that the
>>>>> plan remains the same even if I changed the order of the predicates.
>>>>> (Attached the plan after changing the order of the predicates
>>>>> Plan_02.txt)
>>>>>
>>>>> Thanks
>>>>>
>>>>> Rana
>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <[email protected]
>>>>> > wrote:
>>>>>
>>>>>> Hi Chen,
>>>>>>
>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING localfs
>>>>>> ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>> (\"format\"=\"json\"))).
>>>>>>
>>>>>>
>>>>>> *Which version of AsterixDB are you using?*
>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>
>>>>>> Thanks!
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote:
>>>>>>
>>>>>>> Hi Rana,
>>>>>>>
>>>>>>> Nice to see you again! You may post to [email protected] as
>>>>>>> well to get more feedbacks from our developers.
>>>>>>>
>>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>>> using "load" or "feed"? And which version of AsterixDB are you using?
>>>>>>> But
>>>>>>> any way in your case it seems the join takes a lot of time, and your
>>>>>>> data
>>>>>>> is pretty much cached into the memory...
>>>>>>>
>>>>>>> Best regards,
>>>>>>> Chen Luo
>>>>>>>
>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi there,
>>>>>>>>
>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded the
>>>>>>>> warm-up time which was 30mins)!, and I would like to make sure that I
>>>>>>>> didn't miss any performance tuning parameters ( I have run the same
>>>>>>>> query
>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>
>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>> haptoglobin blood test result. (The query result can have duplicate
>>>>>>>> values).
>>>>>>>>
>>>>>>>> *Query:*
>>>>>>>> USE mimiciii;
>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>> FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>> E.FLAG = 'abnormal' AND
>>>>>>>> I.FLUID='Blood' AND
>>>>>>>> I.LABEL='Haptoglobin'
>>>>>>>>
>>>>>>>> *Datasets Schema:*
>>>>>>>> - PATIENTS and LABITEMS datasets have an open schema.
>>>>>>>> - LABITEMS's primary key is ITEMID
>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>
>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>> - 4 partitions (iodevices)
>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>> - The data size is 7GB
>>>>>>>>
>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>
>>>>>>>> --Rana
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
> --
>
> *Regards,*
> Wail Alkowaileet
>
distribute result [$$49]
-- DISTRIBUTE_RESULT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
project ([$$49])
-- STREAM_PROJECT |PARTITIONED|
assign [$$49] <- [{"SUBJECT_ID": $$50}]
-- ASSIGN |PARTITIONED|
project ([$$50])
-- STREAM_PROJECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
join (eq($$52, $$51))
-- HYBRID_HASH_JOIN [$$52][$$51] |PARTITIONED|
exchange
-- RANDOM_PARTITION_EXCHANGE |PARTITIONED|
project ([$$50, $$52])
-- STREAM_PROJECT |PARTITIONED|
select (eq($$E.getField("FLAG"), "abnormal"))
-- STREAM_SELECT |PARTITIONED|
assign [$$52] <- [$$E.getField("ITEMID")]
-- ASSIGN |PARTITIONED|
project ([$$50, $$E])
-- STREAM_PROJECT |PARTITIONED|
unnest $$E <- scan-collection($$55)
-- UNNEST |PARTITIONED|
project ([$$50, $$55])
-- STREAM_PROJECT |PARTITIONED|
assign [$$55] <- [$$A.getField("LABEVENTS")]
-- ASSIGN |PARTITIONED|
project ([$$50, $$A])
-- STREAM_PROJECT |PARTITIONED|
unnest $$A <- scan-collection($$54)
-- UNNEST |PARTITIONED|
project ([$$50, $$54])
-- STREAM_PROJECT |PARTITIONED|
assign [$$54] <-
[$$P.getField("ADMISSIONS")]
-- ASSIGN |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
data-scan []<-[$$50, $$P] <-
mimiciii.mimic
-- DATASOURCE_SCAN |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
empty-tuple-source
-- EMPTY_TUPLE_SOURCE |PARTITIONED|
exchange
-- BROADCAST_EXCHANGE |PARTITIONED|
project ([$$51])
-- STREAM_PROJECT |PARTITIONED|
select (and(eq($$I.getField("FLUID"), "Blood"),
eq($$I.getField("LABEL"), "Haptoglobin")))
-- STREAM_SELECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
unnest-map [$$51, $$I] <- index-search("LABITEMS", 0,
"mimiciii", "LABITEMS", FALSE, FALSE, 1, $$63, 1, $$63, TRUE, TRUE, TRUE)
-- BTREE_SEARCH |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
intersect ([$$63] <- [[$$63], [$$67]])
-- INTERSECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
order (ASC, $$63)
-- STABLE_SORT [$$63(ASC)] |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
project ([$$63])
-- STREAM_PROJECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
unnest-map [$$62, $$63] <-
index-search("LABLEIndx", 0, "mimiciii", "LABITEMS", FALSE, FALSE, 1, $$60, 1,
$$61, TRUE, TRUE, TRUE)
-- BTREE_SEARCH |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
assign [$$60, $$61] <-
["Haptoglobin", "Haptoglobin"]
-- ASSIGN |PARTITIONED|
empty-tuple-source
-- EMPTY_TUPLE_SOURCE |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
order (ASC, $$67)
-- STABLE_SORT [$$67(ASC)] |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
project ([$$67])
-- STREAM_PROJECT |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
unnest-map [$$66, $$67] <-
index-search("FLUIDIndx", 0, "mimiciii", "LABITEMS", FALSE, FALSE, 1, $$64, 1,
$$65, TRUE, TRUE, TRUE)
-- BTREE_SEARCH |PARTITIONED|
exchange
-- ONE_TO_ONE_EXCHANGE |PARTITIONED|
assign [$$64, $$65] <- ["Blood",
"Blood"]
-- ASSIGN |PARTITIONED|
empty-tuple-source
-- EMPTY_TUPLE_SOURCE |PARTITIONED|