Hi there, I have a query that takes ~12.7mins on average (I have excluded the warm-up time which was 30mins)!, and I would like to make sure that I didn't miss any performance tuning parameters ( I have run the same query on MongoDB, and it took ~2mins).
The query asks to find all patients that have 'abnormal' haptoglobin blood
test result. (The query result can have duplicate values).
*Query:*
USE mimiciii;
SET `compiler.parallelism` "5";
SET `compiler.sortmemory` "128MB";
SET `compiler.joinmemory` "265MB";
SELECT P.SUBJECT_ID
FROM LABITEMS I, PATIENTS P, P.ADMISSIONS A, A.LABEVENTS E
WHERE I.ITEMID=E.ITEMID AND
E.FLAG = 'abnormal' AND
I.FLUID='Blood' AND
I.LABEL='Haptoglobin'
*Datasets Schema:*
- PATIENTS and LABITEMS datasets have an open schema.
- LABITEMS's primary key is ITEMID
- PATIENTS 's primary key is SUBJECT_ID
- The JSON schema for both datasets is attached.
- The DDL for both datasets is attached
*Performance Tuning Parameters:*
- 4 partitions (iodevices)
- The total memory size is : 125GB, and I have assigned ~ 57GB to the
buffercache (storage.buffercache.size).
- As you can see from the query, I set the parallelism to 5, sort-memory
to 128MB, join-memory to 265MB.
- The data size is 7GB
Your feedback is highly appreciated!
--Rana
--LABITEMS
USE mimiciii;
CREATE TYPE lType AS { ITEMID: int };
CREATE DATASET LABITEMS(lType) PRIMARY KEY ITEMID;
--PATIENTS
USE mimiciii;
CREATE TYPE pType AS { SUBJECT_ID: int };
CREATE DATASET PATIENTS(pType)
PRIMARY KEY SUBJECT_ID;
--Indexes
USE mimiciii;
CREATE INDEX FLUIDIndx on LABITEMS (FLUID: string?) enforced;
CREATE INDEX LABLEIndx on LABITEMS (LABEL: string?) enforced;
MIMIC-SchemaJSON.json
Description: application/json
