It is a Hive construct, supported since Hive 0.10, so I would be very surprised if Spark does not support it. I can't speak for Spark 2.0, though (haven't got a chance to touch it yet :) ).
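For your example, the query shape would be something like the sketch below (untested, and only a sketch: it assumes RAW_TABLE and the array-summing SUM UDAF from your mail below are already registered under those names):

import org.apache.spark.sql.SparkSession

// Untested sketch: RAW_TABLE and the UDAF registered as SUM are taken
// from the thread below; everything else here is illustrative.
val spark = SparkSession.builder()
  .appName("grouping-sets-sketch")
  .enableHiveSupport()
  .getOrCreate()

val result = spark.sql("""
  SELECT Attribute_0, Attribute_1, COUNT(*) AS MATCHES, SUM(DoubleArray) AS TOTALS
  FROM RAW_TABLE
  GROUP BY Attribute_0, Attribute_1
  GROUPING SETS ((Attribute_0, Attribute_1), (Attribute_0), (Attribute_1))
  HAVING COUNT(*) > 1
""")
result.show()

The point is that one pass over the data produces the aggregates for all the listed groupings: you pay the scan once instead of once per combination.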
On Mon, Jan 9, 2017 at 2:33 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Does it support in Spark Dataset 2.0?

Thank you
Anil Langote
+1-425-633-9747

From: ayan guha <guha.a...@gmail.com>
Date: Sunday, January 8, 2017 at 10:32 PM
To: Anil Langote <anillangote0...@gmail.com>
Subject: Re: Efficient look up in Key Pair RDD

Hi

Please have a look at this wiki
<https://cwiki.apache.org/confluence/display/Hive/Enhanced+Aggregation,+Cube,+Grouping+and+Rollup>.
GROUPING SETS is a variation of GROUP BY where you can specify the combinations in one go.

For example, if you have 2 attributes, you can roll up (ATT1), (ATT1, ATT2), (ATT2) by specifying the groups using grouping sets.

Best
Ayan

On Mon, Jan 9, 2017 at 2:29 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Hi Ayan

Thanks a lot for the reply; what is GROUPING SETS? I did try GROUP BY with a UDAF but it doesn't perform well: one combination takes 1.5 mins, and my use case has 400 combinations, which would take ~600 mins. I am looking for a solution which will scale with the number of combinations.

Thank you
Anil Langote
+1-425-633-9747

From: ayan guha <guha.a...@gmail.com>
Date: Sunday, January 8, 2017 at 10:26 PM
To: Anil Langote <anillangote0...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, user <user@spark.apache.org>
Subject: Re: Efficient look up in Key Pair RDD

Have you tried something like GROUPING SETS? That seems to be the exact thing you are looking for....

On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Sure. Let me explain my requirement: I have an input file with 25 attributes, and the last column is an array of doubles (14500 elements in the original file).

Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
5 | 3 | 5 | 3 | 0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
3 | 2 | 1 | 3 | 0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
5 | 3 | 5 | 2 | 0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3 | 2 | 1 | 3 | 0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
2 | 2 | 0 | 5 | 0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504

Now I have to compute the element-wise addition of the double arrays for any given combination of attributes. For example, the file above has the following possible combinations (see the sketch after the list for generating them):

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
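For illustration, these are just all subsets of size 2 or more, so they can be generated rather than hand-listed (plain Scala, untested; attribute names from the sample above):

// All subsets of size >= 2 of the four sample attributes.
val attrs = List("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
val combos = (2 to attrs.length).flatMap(attrs.combinations)
combos.foreach(c => println(c.mkString(", ")))
// C(4,2) + C(4,3) + C(4,4) = 6 + 4 + 1 = 11 combinations, matching the list;
// the real file has 25 attributes, of which ~400 combinations are needed.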
Now if we process the Attribute_0, Attribute_1 combination we want the output below; in a similar way we have to process all the above combinations.

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

Solution tried

I have created a Parquet file which has this schema, with the last column an array of doubles. The Parquet file I have is 276 GB and has 2.65M records.

I have implemented a UDAF with:

Input schema: array of doubles
Buffer schema: array of doubles
Return schema: array of doubles

I load the data from the Parquet file and then register the UDAF to use in the query below (note that SUM is the UDAF):

SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1 FROM RAW_TABLE GROUP BY Attribute_0, Attribute_1 HAVING COUNT(*) > 1

This works fine, but it takes 1.2 mins for one combination; my use case has 400 combinations, which means 8 hours, and that does not meet the SLA: we want this below 1 hour. What is the best way to implement this use case?

Best Regards,
Anil Langote
+1-425-633-9747

On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

To start with, caching and having a known partitioner will help a bit; then there is also the IndexedRDD project, but in general Spark might not be the best tool for the job. Have you considered having Spark output to something like memcache?

What's the goal you are trying to accomplish?

On Sun, Jan 8, 2017 at 5:04 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Hi All,

I have a requirement where I want to build a distributed HashMap which holds 10M key-value pairs and provides very efficient lookups for each key. I tried loading the file into a JavaPairRDD and calling the lookup method, but it is very slow.

How can I achieve very fast lookup by a given key?

Thank you
Anil Langote

--
Best Regards,
Ayan Guha
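P.S. On the original lookup question: when ~10M key-value pairs fit in memory, the usual alternative to RDD.lookup() is to collect the map once, broadcast it, and do local hash lookups inside each task. A rough, untested sketch (file path, parsing, and keys are made up):

import org.apache.spark.sql.SparkSession

// Assumes the whole map fits comfortably in each executor's memory.
val spark = SparkSession.builder().appName("broadcast-lookup").getOrCreate()
val sc = spark.sparkContext

val kv: Map[String, String] = sc.textFile("hdfs:///path/to/pairs.csv") // hypothetical path
  .map(_.split(",", 2))
  .map(a => a(0) -> a(1))
  .collect()
  .toMap

val lookup = sc.broadcast(kv) // shipped to every executor once

// Any subsequent job can now do O(1) local lookups, with no shuffle:
val keys = sc.parallelize(Seq("k1", "k2")) // hypothetical keys
keys.map(k => k -> lookup.value.get(k)).collect().foreach(println)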