It is a Hive construct, supported since Hive 0.10, so I would be very surprised if Spark does not support it. I can't speak for Spark 2.0, though (haven't got a chance to touch it yet :) ).
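For your example, the query shape would be something like the sketch below (untested, and only a sketch: it assumes RAW_TABLE and the array-summing SUM UDAF from your mail below are already registered under those names):

import org.apache.spark.sql.SparkSession

// Untested sketch: RAW_TABLE and the UDAF registered as SUM are taken
// from the thread below; everything else here is illustrative.
val spark = SparkSession.builder()
  .appName("grouping-sets-sketch")
  .enableHiveSupport()
  .getOrCreate()

val result = spark.sql("""
  SELECT Attribute_0, Attribute_1, COUNT(*) AS MATCHES, SUM(DoubleArray) AS TOTALS
  FROM RAW_TABLE
  GROUP BY Attribute_0, Attribute_1
  GROUPING SETS ((Attribute_0, Attribute_1), (Attribute_0), (Attribute_1))
  HAVING COUNT(*) > 1
""")
result.show()

The point is that one pass over the data produces the aggregates for all the listed groupings: you pay the scan once instead of once per combination.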
On Mon, Jan 9, 2017 at 2:33 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Does it support in Spark Dataset 2.0?

Thank you
Anil Langote
+1-425-633-9747

From: ayan guha <guha.a...@gmail.com>
Date: Sunday, January 8, 2017 at 10:32 PM
To: Anil Langote <anillangote0...@gmail.com>
Subject: Re: Efficient look up in Key Pair RDD

Hi

Please have a look at this wiki
<https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup>.
GROUPING SETS is a variation of GROUP BY where you can specify the combinations in one go.

For example, if you have 2 attributes, you can roll up (ATT1), (ATT1, ATT2), (ATT2) by specifying the groups using grouping sets.

Best
Ayan

On Mon, Jan 9, 2017 at 2:29 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Hi Ayan

Thanks a lot for the reply; what is GROUPING SETS? I did try GROUP BY with a UDAF but it doesn't perform well: one combination takes 1.5 mins, and my use case has 400 combinations, which would take ~600 mins. I am looking for a solution which will scale with the number of combinations.

Thank you
Anil Langote
+1-425-633-9747

From: ayan guha <guha.a...@gmail.com>
Date: Sunday, January 8, 2017 at 10:26 PM
To: Anil Langote <anillangote0...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, user <user@spark.apache.org>
Subject: Re: Efficient look up in Key Pair RDD

Have you tried something like GROUPING SETS? That seems to be the exact thing you are looking for....

On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Sure. Let me explain my requirement: I have an input file with 25 attributes, and the last column is an array of doubles (14500 elements in the original file).

Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504

Now I have to compute the element-wise addition of the double arrays for any given combination of attributes. For example, the file above has the following possible combinations (see the sketch after the list for generating them):

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
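For illustration, these are just all subsets of size 2 or more, so they can be generated rather than hand-listed (plain Scala, untested; attribute names from the sample above):

// All subsets of size >= 2 of the four sample attributes.
val attrs = List("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
val combos = (2 to attrs.length).flatMap(attrs.combinations)
combos.foreach(c => println(c.mkString(", ")))
// C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1 = 11 combinations, matching the list;
// the real file has 25 attributes, of which ~400 combinations are needed.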
Now if we process the Attribute_0, Attribute_1 combination we want the output below; in a similar way we have to process all the above combinations.

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

Solution tried

I have created a Parquet file which has this schema, with the last column an array of doubles. The Parquet file I have is 276 GB and has 2.65M records.

I have implemented a UDAF with:

Input schema: array of doubles
Buffer schema: array of doubles
Return schema: array of doubles

I load the data from the Parquet file and then register the UDAF to use in the query below (note that SUM is the UDAF):

SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1 FROM RAW_TABLE GROUP BY Attribute_0, Attribute_1 HAVING COUNT(*) > 1

This works fine, but it takes 1.2 mins for one combination; my use case has 400 combinations, which means 8 hours, and that does not meet the SLA: we want this below 1 hour. What is the best way to implement this use case?

Best Regards,
Anil Langote
+1-425-633-9747

On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

To start with, caching and having a known partitioner will help a bit; then there is also the IndexedRDD project, but in general Spark might not be the best tool for the job. Have you considered having Spark output to something like memcache?

What's the goal you are trying to accomplish?

On Sun, Jan 8, 2017 at 5:04 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Hi All,

I have a requirement where I want to build a distributed HashMap which holds 10M key-value pairs and provides very efficient lookups for each key. I tried loading the file into a JavaPairRDD and calling the lookup method, but it is very slow.

How can I achieve very fast lookup by a given key?

Thank you
Anil Langote

--
Best Regards,
Ayan Guha
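P.S. On the original lookup question: when ~10M key-value pairs fit in memory, the usual alternative to RDD.lookup() is to collect the map once, broadcast it, and do local hash lookups inside each task. A rough, untested sketch (file path, parsing, and keys are made up):

import org.apache.spark.sql.SparkSession

// Assumes the whole map fits comfortably in each executor's memory.
val spark = SparkSession.builder().appName("broadcast-lookup").getOrCreate()
val sc = spark.sparkContext

val kv: Map[String, String] = sc.textFile("hdfs:///path/to/pairs.csv") // hypothetical path
  .map(_.split(",", 2))
  .map(a => a(0) -> a(1))
  .collect()
  .toMap

val lookup = sc.broadcast(kv) // shipped to every executor once

// Any subsequent job can now do O(1) local lookups, with no shuffle:
val keys = sc.parallelize(Seq("k1", "k2")) // hypothetical keys
keys.map(k => k -> lookup.value.get(k)).collect().foreach(println)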