Have you tried something like GROUPING SETS? That seems to be exactly what you are looking for.
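For reference, GROUPING SETS computes several groupings from a single scan of the data, e.g. `GROUP BY GROUPING SETS ((Attribute_0, Attribute_1), (Attribute_0, Attribute_2), ...)` in Spark SQL. A minimal pure-Python sketch of that single-scan idea, using made-up rows and combinations (not the poster's real data):

```python
# Sketch: one pass over the rows feeds every grouping set at once,
# instead of one full scan per attribute combination.
# Rows and grouping sets below are illustrative only.
from collections import defaultdict

rows = [  # ((Attribute_0..Attribute_3), DoubleArray)
    ((5, 3, 5, 3), [0.25, 0.5]),
    ((3, 2, 1, 3), [0.5, 0.25]),
    ((5, 3, 5, 2), [0.75, 0.5]),
]
grouping_sets = [(0, 1), (0, 2), (1, 2, 3)]  # attribute indices per set

sums = {}                  # (grouping set, key values) -> running sum
counts = defaultdict(int)  # same key -> COUNT(*)

for attrs, vec in rows:    # single scan of the data
    for gs in grouping_sets:
        key = (gs, tuple(attrs[i] for i in gs))
        counts[key] += 1
        if key in sums:
            sums[key] = [a + b for a, b in zip(sums[key], vec)]
        else:
            sums[key] = list(vec)
```

After the loop, `sums[((0, 1), (5, 3))]` holds the element-wise sum for the Attribute_0/Attribute_1 group `5_3`; a `HAVING COUNT(*) > 1` filter corresponds to dropping keys with `counts[key] == 1`.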
On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <anillangote0...@gmail.com> wrote:

> Sure, let me explain my requirement. I have an input file which has 25
> attributes, and the last column is an array of doubles (14,500 elements in
> the original file):
>
> Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
> 5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
> 3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
> 5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
> 3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
> 2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504
>
> Now I have to compute the element-wise sum of the double arrays for any
> given combination of attributes. For example, the file above has the
> following possible combinations:
>
> 1. Attribute_0, Attribute_1
> 2. Attribute_0, Attribute_2
> 3. Attribute_0, Attribute_3
> 4. Attribute_1, Attribute_2
> 5. Attribute_2, Attribute_3
> 6. Attribute_1, Attribute_3
> 7. Attribute_0, Attribute_1, Attribute_2
> 8. Attribute_0, Attribute_1, Attribute_3
> 9. Attribute_0, Attribute_2, Attribute_3
> 10. Attribute_1, Attribute_2, Attribute_3
> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
>
> Now, if we process the Attribute_0, Attribute_1 combination we want the
> output below. We have to process all of the combinations above in the same
> way:
>
> 5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
>
> Solution tried
>
> I have created a parquet file which has this schema, where the last column
> is the array of doubles.
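For clarity, the expected output quoted above can be reproduced from the five sample rows with a small plain-Python sketch of the semantics (element-wise sum per group key, keeping only groups that match more than one row). This only illustrates the requirement, not the Spark implementation:

```python
# Group the sample rows by "Attribute_0_Attribute_1", sum the double
# arrays element-wise per group, and keep groups with COUNT(*) > 1.
from collections import defaultdict

rows = [  # (Attribute_0..Attribute_3, DoubleArray)
    (5, 3, 5, 3, [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]),
    (3, 2, 1, 3, [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]),
    (5, 3, 5, 2, [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]),
    (3, 2, 1, 3, [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]),
    (2, 2, 0, 5, [0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504]),
]

groups = defaultdict(list)
for a0, a1, a2, a3, vec in rows:
    groups[f"{a0}_{a1}"].append(vec)

# element-wise sum per group, keeping only groups with more than one row
result = {
    key: [sum(col) for col in zip(*vecs)]
    for key, vecs in groups.items()
    if len(vecs) > 1
}
# result["5_3"][0] ≈ 1.1741997045363952, matching the quoted output
```

The `2_2` group matches only one row, so it is dropped, mirroring the `HAVING COUNT(*) > 1` clause in the query later in the thread.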
> The parquet file is 276 GB and has 2.65 M records.
>
> I have implemented a UDAF with:
>
> Input schema: array of doubles
> Buffer schema: array of doubles
> Return schema: array of doubles
>
> I load the data from the parquet file and then register the UDAF for use in
> the query below (note that SUM here is the UDAF):
>
> SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
> FROM RAW_TABLE GROUP BY Attribute_0, Attribute_1 HAVING COUNT(*) > 1
>
> This works fine, but it takes 1.2 minutes for one combination. My use case
> has 400 combinations, which means about 8 hours; that does not meet the
> SLA, and we want this to run in under 1 hour. What is the best way to
> implement this use case?
>
> Best Regards,
>
> Anil Langote
>
> +1-425-633-9747
>
> On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> To start with, caching and having a known partitioner will help a bit;
> there is also the IndexedRDD project, but in general Spark might not be
> the best tool for the job. Have you considered having Spark output to
> something like memcache?
>
> What is the goal you are trying to accomplish?
>
> On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have a requirement where I want to build a distributed HashMap which
>> holds 10M key-value pairs and provides very efficient lookups for each
>> key. I tried loading the file into a JavaPairRDD and calling the lookup
>> method; it is very slow.
>>
>> How can I achieve very fast lookup by a given key?
>>
>> Thank you
>> Anil Langote

-- 
Best Regards,
Ayan Guha
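One direction for staying under the SLA (a sketch, not a tested solution): instead of running 400 separate GROUP BY queries, explode each row into one record per attribute combination and reduce by key in a single pass over the data. In Spark this pattern would map onto flatMap plus reduceByKey (or a single query with GROUPING SETS). A plain-Python sketch of the pattern for the 4-attribute example, where combination sizes 2 through 4 give exactly the 11 combinations listed earlier in the thread:

```python
# Sketch: one pass over the data serves every attribute combination.
# explode() emits one (combination, key values) record per combination
# for each row; reduce_rows() accumulates element-wise sums and applies
# the COUNT(*) > 1 filter.
from itertools import combinations

N_ATTRS = 4  # the real file has 25 attributes and ~400 combinations

def explode(row):
    attrs, vec = row[:N_ATTRS], row[N_ATTRS]
    for r in range(2, N_ATTRS + 1):              # combination sizes 2..4
        for combo in combinations(range(N_ATTRS), r):
            key = (combo, tuple(attrs[i] for i in combo))
            yield key, vec

def reduce_rows(rows):
    sums, counts = {}, {}
    for row in rows:
        for key, vec in explode(row):
            counts[key] = counts.get(key, 0) + 1
            if key in sums:
                sums[key] = [a + b for a, b in zip(sums[key], vec)]
            else:
                sums[key] = list(vec)
    return {k: v for k, v in sums.items() if counts[k] > 1}
```

Whether this beats 400 sequential UDAF queries depends on shuffle volume: each row is replicated once per combination, so with 400 combinations and 14,500-element arrays the exploded data is large, and map-side partial aggregation (as reduceByKey performs) would be essential.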