Re: AsterixDB Performance Tuning

Michael Carey Fri, 26 Jan 2018 10:08:32 -0800

Also:  What are the data sizes in the two systems?


On 1/26/18 10:00 AM, Taewoo Kim wrote:

Hi Rana,

Thank you for attaching your plan. It seems that the selections arecorrectly made before each join. If your query predicate is selectiveenough (e.g., I.LABEL = 'Haptoglobin' generates less than 1% ofrecords as the result), I suggest you could try anindex-nested-loop-join. Changes are highlighted. And one morequestion: if LABEVENTS.FLAG is an array, you can't just use"E.FLAG="abnormal". I think you need to use UNNEST.


USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM   LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
WHERE I.ITEMID */* +indexnl */ *=E.ITEMID AND
             E.FLAG = 'abnormal' AND
I.FLUID='Blood' AND
I.LABEL='Haptoglobin'

Best,
Taewoo

On Fri, Jan 26, 2018 at 9:16 AM, Chen Luo <[email protected]<mailto:[email protected]>> wrote:


    Hi Rana,

    I think the performance issue might related to the access of
    nested fields, since the rest performance hot spots (index search,
    hash join etc looks normal to me), and I assume " I.FLUID='Blood'
    AND I.LABEL='Haptoglobin'" should be very selective. Since MongoDB
    supports array index, did you build an index on L.FLAG using MongoDB?

    @Wail, do you have any clue on nested fields access?

    Best regards,
    Chen Luo

    On Fri, Jan 26, 2018 at 1:47 AM, Rana Alotaibi
    <[email protected] <mailto:[email protected]>> wrote:

        Hi Taewoo,

         *

            Can you paste the optimized plan? -- Attached the plan
            (Plan_01.txt)

         *

            Can you create an index on LABEVENTS.FLAG? -- I couldn't
            create an index on LABEVENTS.FLAG since LABEVENTS is of
            type array. I got this message when I tried to create the
            index : "msg": "ASX0001: Field type array can't be
            promoted to type object"

          * Can you switch the predicate order? -- It seems for me
            that the plan remains the same even if I changed the order
            of the predicates. (Attached the plan after changing the
            order of the predicates Plan_02.txt)

        Thanks

        Rana

        On Thu, Jan 25, 2018 at 11:24 PM, Rana Alotaibi
        <[email protected] <mailto:[email protected]>> wrote:

            Hi Chen,

            *How did you import data into the dataset? using "load" or
            "feed"?*
            I used "LOAD" (i.e USE mimiciii; LOAD DATASET PATIENTS
            USING localfs ((\"path\"=\"*MailScanner warning: numerical
            links are often malicious:* 127.0.0.1
            <http://127.0.0.1>:///data/ralotaib/patients.json\"),
            (\"format\"=\"json\"))).

            *Which version of AsterixDB are you using?
            *
            AsterixDB Master (0.9.3-SNAPSHOT)

            Thanks!





            On Thu, Jan 25, 2018 at 10:39 PM, Chen Luo <[email protected]
            <mailto:[email protected]>> wrote:

                Hi Rana,

                Nice to see you again! You may post to
                [email protected]
                <mailto:[email protected]> as well to get more
                feedbacks from our developers.

                Just clarify two things: how did you import data into
                the dataset? using "load" or "feed"? And which version
                of AsterixDB are you using? But any way in your case
                it seems the join takes a lot of time, and your data
                is pretty much cached into the memory...

                Best regards,
                Chen Luo

                On Thu, Jan 25, 2018 at 8:46 PM, Rana Alotaibi
                <[email protected] <mailto:[email protected]>>
                wrote:

                    Hi there,

                    I have a query that takes ~12.7mins on average (I
                    have excluded the warm-up time which was 30mins)!,
                    and I would like to make sure that I didn't miss
                    any performance tuning parameters ( I have run the
                    same query on MongoDB, and it took ~2mins).

                    The query asks to find all patients that have
                    'abnormal' haptoglobin blood test result. (The
                    query result can have duplicate values).

                    *Query:*
                    USE mimiciii;
                    SET `compiler.parallelism` "5";
                    SET `compiler.sortmemory` "128MB";
                    SET `compiler.joinmemory` "265MB";
                    SELECT P.SUBJECT_ID
                    FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A,
                    A.LABEVENTS E
                    WHERE I.ITEMID=E.ITEMID AND
                    E.FLAG = 'abnormal' AND
                    I.FLUID='Blood' AND
                    I.LABEL='Haptoglobin'

                    *Datasets Schema:*
                    - PATIENTS and LABITEMS datasets have an open schema.
                    - LABITEMS's primary key is ITEMID
                    - PATIENTS 's primary key is SUBJECT_ID
                    - The JSON schema for both datasets is attached.
                    - The DDL for both datasets is attached

                    *Performance Tuning Parameters:*
                    - 4 partitions (iodevices)
                    - The total memory size is : 125GB, and I have
                    assigned ~ 57GB to the buffercache
                    (storage.buffercache.size).
                    - As you can see from the query, I set the
                    parallelism to 5, sort-memory to 128MB,
                    join-memory to 265MB.
                    - The data size is 7GB

                    Your feedback is highly appreciated!

                    --Rana

Re: AsterixDB Performance Tuning

Reply via email to