It would be interesting to see what the "MongoDB query plan" (program :-)) is in this case, and how the AsterixDB plan compares.

On 1/26/18 12:52 PM, Wail Alkowaileet wrote:
One thing I noticed is that the "large" unnested arrays are hash partitioned to the probably "small" index-filtered dataset. Since the data can fit in memory (7 GB in total), I think broadcast_exchange may do better in this particular case.

USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E, LABITEMS I
WHERE  E.ITEMID/*+bcast*/ = I.ITEMID AND
             E.FLAG = 'abnormal' AND
             I.FLUID='Blood' AND
             I.LABEL='Haptoglobin'

Note: I reordered the FROM clause...

Another thing is that I think it's a CPU bound query ... and I'm not sure how MongoDB utilizes CPU resources compared with AsterixDB.



On Fri, Jan 26, 2018 at 10:36 AM, Taewoo Kim <[email protected] <mailto:[email protected]>> wrote:

    PS: UNNEST doc
    https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses
    <https://ci.apache.org/projects/asterixdb/sqlpp/manual.html#Unnest_clauses>

    Best,
    Taewoo

    On Fri, Jan 26, 2018 at 10:00 AM, Taewoo Kim <[email protected]
    <mailto:[email protected]>> wrote:

        Hi Rana,

        Thank you for attaching your plan. It seems that the
        selections are correctly made before each join. If your query
        predicate is selective enough (e.g., I.LABEL = 'Haptoglobin'
        generates less than 1% of records as the result), I suggest
        you could try an index-nested-loop-join. Changes are
        highlighted. And one more question: if LABEVENTS.FLAG is an
        array, you can't just use "E.FLAG="abnormal". I think you need
        to use UNNEST.

        USE mimiciii;
        SET `compiler.parallelism` "5";
        SET `compiler.sortmemory` "128MB";
        SET `compiler.joinmemory` "265MB";
        SELECT P.SUBJECT_ID
        FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
        WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
        E.FLAG = 'abnormal' AND
        I.FLUID='Blood' AND
        I.LABEL='Haptoglobin'

        Best,
        Taewoo

        On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]
        <mailto:[email protected]>> wrote:

            Hi Rana,

            I think the performance issue might related to the access
            of nested fields, since the rest performance hot spots
            (index search, hash join etc looks normal to me), and I
            assume " I.FLUID='Blood' AND I.LABEL='Haptoglobin'" should
            be very selective. Since MongoDB supports array index, did
            you build an index on L.FLAG using MongoDB?

            @Wail, do you have any clue on nested fields access?

            Best regards,
            Chen Luo

            On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi
            <[email protected] <mailto:[email protected]>> wrote:

                Hi Taewoo,

                 *

                    Can you paste the optimized plan? -- Attached the
                    plan (Plan_01.txt)

                 *

                    Can you create an index on LABEVENTS.FLAG? -- I
                    couldn't create an index on LABEVENTS.FLAG since
                    LABEVENTS is of type array. I got this message
                    when I tried to create the index : "msg":
                    "ASX0001: Field type array can't be promoted to
                    type object"

                  * Can you switch the predicate order? -- It seems
                    for me that the plan remains the same even if I
                    changed the order of the predicates. (Attached the
                    plan after changing the order of the predicates
                    Plan_02.txt)

                Thanks

                Rana

                On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi
                <[email protected] <mailto:[email protected]>>
                wrote:

                    Hi Chen,

                    *How did you import data into the dataset? using
                    "load" or "feed"?*
                    I used "LOAD" (i.e USE mimiciii; LOAD DATASET
                    PATIENTS USING localfs ((\"path\"=\"127.0.0.1
                    <http://127.0.0.1>:///data/ralotaib/patients.json\"),
                    (\"format\"=\"json\"))).

                    *Which version of AsterixDB are you using?
                    *
                    AsterixDB Master (0.9.3-SNAPSHOT)

                    Thanks!





                    On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo
                    <[email protected] <mailto:[email protected]>> wrote:

                        Hi Rana,

                        Nice to see you again! You may post to
                        [email protected]
                        <mailto:[email protected]> as well to
                        get more feedbacks from our developers.

                        Just clarify two things: how did you import
                        data into the dataset? using "load" or "feed"?
                        And which version of AsterixDB are you using?
                        But any way in your case it seems the join
                        takes a lot of time, and your data is pretty
                        much cached into the memory...

                        Best regards,
                        Chen Luo

                        On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi
                        <[email protected]
                        <mailto:[email protected]>> wrote:

                            Hi there,

                            I have a query that takes ~12.7mins on
                            average (I have excluded the warm-up time
                            which was 30mins)!, and I would like to
                            make sure that I didn't miss any
                            performance tuning parameters ( I have run
                            the same query on MongoDB, and it took
                            ~2mins).

                            The query asks to find all patients that
                            have 'abnormal' haptoglobin blood test
                            result. (The query result can have
                            duplicate values).

                            *Query:*
                            USE mimiciii;
                            SET `compiler.parallelism` "5";
                            SET `compiler.sortmemory` "128MB";
                            SET `compiler.joinmemory` "265MB";
                            SELECT P.SUBJECT_ID
                            FROM LABITEMS I, PATIENTS P, P.ADMISSIONS
                            A, A.LABEVENTS E
                            WHERE I.ITEMID=E.ITEMID AND
                            E.FLAG = 'abnormal' AND
                            I.FLUID='Blood' AND
                            I.LABEL='Haptoglobin'

                            *Datasets Schema:*
                            - PATIENTS and LABITEMS datasets have an
                            open schema.
                            - LABITEMS's primary key is ITEMID
                            - PATIENTS 's primary key is SUBJECT_ID
                            - The JSON schema for both datasets is
                            attached.
                            - The DDL for both datasets is attached

                            *Performance Tuning Parameters:*
                            - 4 partitions (iodevices)
                            - The total memory size is : 125GB, and I
                            have assigned ~ 57GB to the buffercache
                            (storage.buffercache.size).
                            - As you can see from the query, I set the
                            parallelism to 5, sort-memory to 128MB,
                            join-memory to 265MB.
                            - The data size is 7GB

                            Your feedback is highly appreciated!

                            --Rana










--

*Regards,*
Wail Alkowaileet

Reply via email to