Hi Ayan
Thanks a lot for the reply. What is GROUPING SETS? I did try GROUP BY with a UDAF, but it doesn't perform well: one combination takes 1.5 minutes, and my use case has 400 combinations, which would take ~400 minutes. I am looking for a solution that scales with the number of combinations.

Thank you
Anil Langote
+1-425-633-9747

From: ayan guha <guha.a...@gmail.com>
Date: Sunday, January 8, 2017 at 10:26 PM
To: Anil Langote <anillangote0...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, user <user@spark.apache.org>
Subject: Re: Efficient look up in Key Pair RDD

Have you tried something like GROUPING SETS? That seems to be the exact thing you are looking for...

On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Sure, let me explain my requirement. I have an input file with 25 attribute columns; the last column is an array of doubles (14,500 elements in the original file). A small example:

Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
5            3            5            3            0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
3            2            1            3            0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
5            3            5            2            0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3            2            1            3            0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
2            2            0            5            0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504

Now I have to compute the element-wise sum of the double arrays for any given attribute combination. For the file above the possible combinations are:

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3

If we process the (Attribute_0, Attribute_1) combination we want the output below; all the other combinations have to be processed in the same way:

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

Solution tried

I created a Parquet file whose schema has the attributes and, as the last column, the array of doubles. The Parquet file is 276 GB and holds 2.65 M records. I implemented a UDAF with

Input schema  : array of doubles
Buffer schema : array of doubles
Return schema : array of doubles

I load the data from the Parquet file and register the UDAF to use it in the query below (note that SUM here is the UDAF):

SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
FROM RAW_TABLE
GROUP BY Attribute_0, Attribute_1
HAVING COUNT(*) > 1

This works fine, but it takes 1.2 minutes for one combination; my use case has 400 combinations, which means about 8 hours. That does not meet the SLA; we want this below 1 hour. What is the best way to implement this use case?

Best Regards,
Anil Langote
+1-425-633-9747

On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

To start with, caching and having a known partitioner will help a bit; there is also the IndexedRDD project. But in general Spark might not be the best tool for the job. Have you considered having Spark output to something like memcache? What is the goal you are trying to accomplish?
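A minimal sketch of how the two ideas above could fit together, assuming Spark 2.x, the RAW_TABLE / DoubleArray names from the query above, and a placeholder Parquet path; ArraySum is an illustrative UDAF equivalent to the SUM described above, and the three grouping sets listed stand in for whichever of the combinations are needed:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Element-wise sum over a column of equal-length double arrays.
class ArraySum(size: Int) extends UserDefinedAggregateFunction {
  def inputSchema: StructType  = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sums", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType       = ArrayType(DoubleType)
  def deterministic: Boolean   = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.fill(size)(0.0)

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = buffer.getSeq[Double](0).zip(input.getSeq[Double](0)).map { case (a, b) => a + b }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Double](0).zip(buffer2.getSeq[Double](0)).map { case (a, b) => a + b }

  def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}

object ArraySumAllCombinations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ArraySumGroupingSets").getOrCreate()
    spark.udf.register("ARRAY_SUM", new ArraySum(14500))   // 14,500 doubles per row

    // Placeholder path; the real table is the 276 GB Parquet file described above.
    spark.read.parquet("/path/to/raw_table").createOrReplaceTempView("RAW_TABLE")

    // One shuffle covers several combinations; GROUPING_ID() tells them apart.
    val result = spark.sql("""
      SELECT Attribute_0, Attribute_1, Attribute_2, Attribute_3,
             GROUPING_ID() AS combo_id,
             COUNT(*)      AS matches,
             ARRAY_SUM(DoubleArray) AS sums
      FROM RAW_TABLE
      GROUP BY Attribute_0, Attribute_1, Attribute_2, Attribute_3
      GROUPING SETS ((Attribute_0, Attribute_1),
                     (Attribute_0, Attribute_2),
                     (Attribute_1, Attribute_2, Attribute_3))
      HAVING COUNT(*) > 1
    """)
    result.show(truncate = false)
    spark.stop()
  }
}

Each output row carries GROUPING_ID(), so rows belonging to different combinations can be split apart afterwards; the point is that GROUPING SETS computes many combinations in one pass over the table instead of one scan per combination, which is where the time saving would come from.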
On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com> wrote:

Hi All,

I have a requirement where I want to build a distributed HashMap that holds 10 M key-value pairs and provides very efficient lookups for each key. I tried loading the file into a JavaPairRDD and calling the lookup method, but it is very slow. How can I achieve very fast lookup by a given key?

Thank you
Anil Langote

--
Best Regards,
Ayan Guha
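A minimal sketch of the caching-plus-known-partitioner idea from Holden's reply, assuming string keys and values, a placeholder tab-separated input path, and an arbitrary partition count; with a partitioner attached, lookup() only scans the one partition that can hold the requested key instead of all 10 M pairs:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PartitionedLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionedLookup"))

    // Placeholder input: one "key<TAB>value" line per pair.
    val pairs = sc.textFile("/path/to/pairs.tsv")
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1)))

    // Hash-partition once and cache, so every later lookup() touches only the
    // single partition that can contain the requested key.
    val indexed = pairs
      .partitionBy(new HashPartitioner(200))   // partition count is arbitrary here
      .persist(StorageLevel.MEMORY_AND_DISK)

    indexed.count()                            // materialize the cache up front

    println(indexed.lookup("someKey"))         // returns the Seq of values for that key
    sc.stop()
  }
}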