Hi all,

I have a follow-up issue using the same query. Let's forget about the join
and selection predicates. I have the following query (same as previous
one), but without the join and selection predicates:

*AsterixDB Query: *
USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT COUNT(*) AS cnt
FROM   PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E

Result : {cnt:22237108}

Please note I have changed the page-size default configuration from 128 KB
to 1MB, and I have the buffercache stays that same ( 57GB)

*MongoDB Equivalent Query:*
db.patients.aggregate(
   [
       {
           "$unwind":"$ADMISSIONS",

       },
       {
           "$unwind":"$ADMISSIONS.LABEVENTS",

       },

       { $count: "cnt }

     ]
  )

Result : {cnt:22237108}

The query takes ~7min in AsterixDB, and 30sec in MongoDB given that MongoDB
is running on a single core. I don't think that MongoDB storage compression
techniques play some factor here (All the data is cached in memory) unless
MongoDB does some in-memory compression (Which I need to investigate more
about that).

Does this explain that navigating deeply into nested fields is an expensive
operation in AsterixDB? or I do still have some issues with AsterixDB
configuration parameters?.

Thanks
Rana


On Sat, Jan 27, 2018 at 3:45 PM, Rana Alotaibi <[email protected]>
wrote:

> Hi Mike,
>
> Here is some results:
> 1) Non-reordered FROM clause+ no bcast added ~12mins
> 2) Reordered FROM clause + no bcast added ~12mins (same as (1) )
> 3) Non-reordered FROM clause+ bcast added ~6mins
> 4) Reordered FROM clause+bacst added ~6mins
>
> It seems the FROM clause datasets order has no impact. But in both cases,
> the bcast reduced the execution time.
>
> As for querying MongoDB, I'm almost writing a "logical" plan for that
> query (It took me days to understand MongoDB query operators). I totally
> prefer SQL++ using hints. However, think about data scientists who are
> mostly familiar with SQL queries, I don't expect them to spend time and
> determine for example the predicates selectivity and accordingly decide
> what's the appropriate join algorithms to use and specify this in their
> query (i.e /*indexnl*/) (Basically they end-up doing the cost-based
> optimizer job :) ).
>
> Thanks,
> --Rana
>
> On Fri, Jan 26, 2018 at 2:15 PM, Mike Carey <[email protected]> wrote:
>
>> Rana,
>>
>> We need the physical hints because we have a conservative cost-minded
>> rule set rather than an actual cost-based optimizer - so it always picks
>> partitioned hash joins when doing joins.  (I am curious as to how much the
>> bcast hint helps vs. the reordered from clause - what fraction does each
>> contribute to the win? - it would be cool to have the numbers without and
>> with that hint if you felt like trying that - but don't feel obligated).
>>
>> Question: In MongoDB, didn't you end up essentially writing a
>> query-plan-like program to solve this query - and isn't the SQL++ with
>> hints a lot smaller/simpler?  (Just asking - I'm curious as to your
>> feedback on that.)  We'd argue that you can write a mostly declarative
>> familiar query and then mess with it and annotate it a little to tune it -
>> which isn't as good as a great cost-based optimizer, but is better than
>> writing/maintaining a program.  Thoughts?
>>
>> In terms of how good we can get - the size answer is telling.  In past
>> days, when we were normally either on par with (smaller things) or beating
>> (larger things) MongoDB, they hadn't yet acquired their new storage engine
>> company (WiredTiger) with its compression.  Now they're running 3x+
>> smaller, I would not be surprised if that's now the explanation for the
>> remaining difference, which is indeed about 3x.  (We are going to
>> experiment with compression as well.)
>>
>> Cheers,
>>
>> Mike
>>
>> On 1/26/18 1:58 PM, Rana Alotaibi wrote:
>>
>> Hi all,
>> Thanks for your immediate and quick reply.
>>
>> @Chen Luo : Yes. I have crated an index on FLAG in MongoDB. However, in
>> AsterixDB setting, it seems that navigating deeply in the nested fields and
>> create an index is not supported.
>> @Taewoo Adding  /*+indexnl */ didn't change the plan.
>> @Wail : Adding /*+bcast/ did the magic. Now, it takes roughly ~6.34 mins
>> on average without including the warm-up time. The plan has changed and it
>> is attached.
>>
>> Can you please points me where I can find this in the documentation? And
>> here is my question: Why do I need to add some physical details in the
>> query like adding  /*+indexnl */  index-nested-loop-join or /*+bcast/
>> broadcast_exchange?
>> @Michael: The data size is ~7GB for PATEINTS dataset and 1MB for LABITEMS
>> dataset (Both datasets have an open schema). After data ingestion, in my
>> setting the data in MongoDB is ~3GB(It seems MongoDB does some compression)
>> and in AsterixDB is ~10GB. (I have 4 partitions,and I checked the size of
>> files in each partition and the total is ~10GB!)
>>
>> Thanks!
>> --Rana
>>
>>
>>
>>
>> On Fri, Jan 26, 2018 at 12:52 PM, Wail Alkowaileet <[email protected]>
>> wrote:
>>
>>> One thing I noticed is that the "large" unnested arrays are hash
>>> partitioned to the probably "small" index-filtered dataset.
>>> Since the data can fit in memory (7 GB in total), I think
>>> broadcast_exchange may do better in this particular case.
>>>
>>> USE mimiciii;
>>> SET `compiler.parallelism` "5";
>>> SET `compiler.sortmemory` "128MB";
>>> SET `compiler.joinmemory` "265MB";
>>> SELECT P.SUBJECT_ID
>>> FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
>>> WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
>>>              E.FLAG = 'abnormal' AND
>>>              I.FLUID='Blood' AND
>>>              I.LABEL='Haptoglobin'
>>>
>>> Note: I reordered the FROM clause...
>>>
>>> Another thing is that I think it's a CPU bound query ... and I'm not
>>> sure how MongoDB utilizes CPU resources compared with AsterixDB.
>>>
>>>
>>>
>>> On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <[email protected]> wrote:
>>>
>>>> PS: UNNEST doc
>>>> https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#U
>>>> nnest_clauses
>>>>
>>>> Best,
>>>> Taewoo
>>>>
>>>> On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]>
>>>> wrote:
>>>>
>>>>> Hi Rana,
>>>>>
>>>>> Thank you for attaching your plan. It seems that the selections are
>>>>> correctly made before each join. If your query predicate is selective
>>>>> enough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% of records as
>>>>> the result), I suggest you could try an index-nested-loop-join. Changes 
>>>>> are
>>>>> highlighted. And one more question: if LABEVENTS.FLAG is an array, you
>>>>> can't just use "E.FLAG="abnormal". I think you need to use UNNEST.
>>>>>
>>>>> USE mimiciii;
>>>>> SET `compiler.parallelism` "5";
>>>>> SET `compiler.sortmemory` "128MB";
>>>>> SET `compiler.joinmemory` "265MB";
>>>>> SELECT P.SUBJECT_ID
>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>> WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
>>>>>              E.FLAG = 'abnormal' AND
>>>>>              I.FLUID='Blood' AND
>>>>>              I.LABEL='Haptoglobin'
>>>>>
>>>>> Best,
>>>>> Taewoo
>>>>>
>>>>> On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]> wrote:
>>>>>
>>>>>> Hi Rana,
>>>>>>
>>>>>> I think the performance issue might related to the access of nested
>>>>>> fields, since the rest performance hot spots (index search, hash join etc
>>>>>> looks normal to me), and I assume " I.FLUID='Blood' AND
>>>>>> I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
>>>>>> supports array index, did you build an index on L.FLAG using MongoDB?
>>>>>>
>>>>>> @Wail, do you have any clue on nested fields access?
>>>>>>
>>>>>> Best regards,
>>>>>> Chen Luo
>>>>>>
>>>>>> On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi <[email protected]
>>>>>> > wrote:
>>>>>>
>>>>>>> Hi Taewoo,
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    Can you paste the optimized plan? -- Attached the plan (Plan_01.txt)
>>>>>>>
>>>>>>>    -
>>>>>>>
>>>>>>>    Can you create an index on LABEVENTS.FLAG? -- I couldn't create an 
>>>>>>> index on LABEVENTS.FLAG since LABEVENTS is of type array. I got this 
>>>>>>> message when I tried to create the index : "msg": "ASX0001: Field type 
>>>>>>> array can't be promoted to type object"
>>>>>>>
>>>>>>>    - Can you switch the predicate order? -- It seems for me that
>>>>>>>    the plan remains the same even if I changed the order of the 
>>>>>>> predicates.
>>>>>>>    (Attached the plan after changing the order of the predicates 
>>>>>>> Plan_02.txt)
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>> Rana
>>>>>>> On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi <
>>>>>>> [email protected]> wrote:
>>>>>>>
>>>>>>>> Hi Chen,
>>>>>>>>
>>>>>>>> *How did you import data into the dataset? using "load" or "feed"?*
>>>>>>>> I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS USING
>>>>>>>> localfs ((\"path\"=\"127.0.0.1:///data/ralotaib/patients.json\"),
>>>>>>>> (\"format\"=\"json\"))).
>>>>>>>>
>>>>>>>>
>>>>>>>> *Which version of AsterixDB are you using? *
>>>>>>>> AsterixDB Master (0.9.3-SNAPSHOT)
>>>>>>>>
>>>>>>>> Thanks!
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]> wrote:
>>>>>>>>
>>>>>>>>> Hi Rana,
>>>>>>>>>
>>>>>>>>> Nice to see you again! You may post to [email protected]
>>>>>>>>> as well to get more feedbacks from our developers.
>>>>>>>>>
>>>>>>>>> Just clarify two things: how did you import data into the dataset?
>>>>>>>>> using "load" or "feed"? And which version of AsterixDB are you using? 
>>>>>>>>> But
>>>>>>>>> any way in your case it seems the join takes a lot of time, and your 
>>>>>>>>> data
>>>>>>>>> is pretty much cached into the memory...
>>>>>>>>>
>>>>>>>>> Best regards,
>>>>>>>>> Chen Luo
>>>>>>>>>
>>>>>>>>> On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi <
>>>>>>>>> [email protected]> wrote:
>>>>>>>>>
>>>>>>>>>> Hi there,
>>>>>>>>>>
>>>>>>>>>> I have a query that takes ~12.7mins on average (I have excluded
>>>>>>>>>> the warm-up time which was 30mins)!, and I would like to make sure 
>>>>>>>>>> that I
>>>>>>>>>> didn't miss any performance tuning parameters ( I have run the same 
>>>>>>>>>> query
>>>>>>>>>> on MongoDB, and it took ~2mins).
>>>>>>>>>>
>>>>>>>>>> The query asks to find all patients that have 'abnormal'
>>>>>>>>>> haptoglobin blood test result. (The query result can have duplicate 
>>>>>>>>>> values).
>>>>>>>>>>
>>>>>>>>>> *Query:*
>>>>>>>>>> USE mimiciii;
>>>>>>>>>> SET `compiler.parallelism` "5";
>>>>>>>>>> SET `compiler.sortmemory` "128MB";
>>>>>>>>>> SET `compiler.joinmemory` "265MB";
>>>>>>>>>> SELECT P.SUBJECT_ID
>>>>>>>>>> FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
>>>>>>>>>> WHERE I.ITEMID=E.ITEMID AND
>>>>>>>>>>              E.FLAG = 'abnormal' AND
>>>>>>>>>>              I.FLUID='Blood' AND
>>>>>>>>>>              I.LABEL='Haptoglobin'
>>>>>>>>>>
>>>>>>>>>> *Datasets Schema:*
>>>>>>>>>> - PATIENTS  and LABITEMS datasets have an open schema.
>>>>>>>>>> - LABITEMS's  primary key is ITEMID
>>>>>>>>>> - PATIENTS 's primary key is SUBJECT_ID
>>>>>>>>>> - The JSON schema for both datasets is attached.
>>>>>>>>>> - The DDL for both datasets is attached
>>>>>>>>>>
>>>>>>>>>> *Performance Tuning Parameters:*
>>>>>>>>>> - 4 partitions (iodevices)
>>>>>>>>>> - The total memory size is : 125GB, and I have assigned ~ 57GB to
>>>>>>>>>> the buffercache (storage.buffercache.size).
>>>>>>>>>> - As you can see from the query, I set the parallelism to 5,
>>>>>>>>>> sort-memory to 128MB, join-memory to 265MB.
>>>>>>>>>> - The data size is 7GB
>>>>>>>>>>
>>>>>>>>>> Your feedback is highly appreciated!
>>>>>>>>>>
>>>>>>>>>> --Rana
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>>
>>> *Regards,*
>>> Wail Alkowaileet
>>>
>>
>>
>>
>

Reply via email to