Have you tried something like GROUPING SETS? That seems to be exactly the
thing you are looking for.
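
A single GROUPING SETS query computes several of your groupings in one pass
over the table instead of one job per combination. A minimal Scala sketch,
assuming an active SparkSession named spark and your array-sum UDAF registered
as SUM_ARRAY (renamed here so it does not shadow the built-in SUM; both names
are placeholders, not from your mail):

    val result = spark.sql("""
      SELECT Attribute_0, Attribute_1, Attribute_2,
             COUNT(*) AS MATCHES, SUM_ARRAY(DOUBLEARRAY) AS TOTALS
      FROM RAW_TABLE
      GROUP BY Attribute_0, Attribute_1, Attribute_2
      GROUPING SETS ((Attribute_0, Attribute_1),
                     (Attribute_0, Attribute_2),
                     (Attribute_1, Attribute_2))
      HAVING COUNT(*) > 1
    """)

Attributes outside a row's grouping set come back as NULL, and grouping_id()
tells you which set each output row belongs to.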

On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <anillangote0...@gmail.com>
wrote:

> Sure, let me explain my requirement. I have an input file with 25
> attribute columns, where the last column is an array of doubles (14,500
> elements in the original file):
>
>
>
> Attribute_0 | Attribute_1 | Attribute_2 | Attribute_3 | DoubleArray
> 5 | 3 | 5 | 3 | [0.2938933463658645, 0.0437040427073041, 0.23002681025029648, 0.18003221216680454]
> 3 | 2 | 1 | 3 | [0.5353599620508771, 0.026777650111232787, 0.31473082754161674, 0.2647786522276575]
> 5 | 3 | 5 | 2 | [0.8803063581705307, 0.8101324740101096, 0.48523937757683544, 0.5897714618376072]
> 3 | 2 | 1 | 3 | [0.33960064683141955, 0.46537001358164043, 0.543428826489435, 0.42653939565053034]
> 2 | 2 | 0 | 5 | [0.5108235777360906, 0.4368119043922922, 0.8651556676944931, 0.7451477943975504]
>
>
>
> Now I have to compute the element-wise sum of the double arrays for every
> group of rows that share the same values of a given attribute combination.
> For the file above the possible combinations are listed below (they are just
> the attribute subsets of size 2 or more; see the sketch after the list):
>
>
>
> 1.  Attribute_0, Attribute_1
>
> 2.  Attribute_0, Attribute_2
>
> 3.  Attribute_0, Attribute_3
>
> 4.  Attribute_1, Attribute_2
>
> 5.  Attribute_2, Attribute_3
>
> 6.  Attribute_1, Attribute_3
>
> 7.  Attribute_0, Attribute_1, Attribute_2
>
> 8.  Attribute_0, Attribute_1, Attribute_3
>
> 9.  Attribute_0, Attribute_2, Attribute_3
>
> 10. Attribute_1, Attribute_2, Attribute_3
>
> 11. Attribute_0, Attribute_1, Attribute_2, Attribute_3
>
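> These combinations are just the subsets of size 2 or more of the four
> attributes, so the list can be generated rather than maintained by hand; a
> small Scala sketch (attribute names hard-coded here for illustration):
>
>     val attrs = Seq("Attribute_0", "Attribute_1", "Attribute_2", "Attribute_3")
>     // all subsets of size 2..attrs.size, smallest first
>     val combos = (2 to attrs.size).flatMap(k => attrs.combinations(k).map(_.toList))
>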
>
>
> Now if we process the *Attribute_0, Attribute_1* combination we want the
> output below: each key is the underscore-joined attribute values, and each
> value is the element-wise sum of the arrays of the matching rows (e.g. key
> 5_3 sums rows 1 and 3, so its first element is 0.2938933463658645 +
> 0.8803063581705307 = 1.1741997045363952). All the other combinations have to
> be processed the same way.
>
>
>
> 5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
>
> 3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]
>
>
>
> Solution tried
>
>
>
> I have created a Parquet file with this schema, whose last column is the
> array of doubles. The Parquet file is 276 GB and holds 2.65M records.
>
>
>
> I have implemented a UDAF with the following signature:
>
>
>
> Input schema : array of doubles
>
> Buffer schema : array of doubles
>
> Return schema : array of doubles
>
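> To make that concrete, a minimal sketch of such an element-wise array-sum
> UDAF against the Spark 2.x UserDefinedAggregateFunction API (class name and
> the fixed array length are illustrative, not my exact code):
>
>     import org.apache.spark.sql.Row
>     import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
>     import org.apache.spark.sql.types._
>
>     class ArraySum(size: Int) extends UserDefinedAggregateFunction {
>       def inputSchema: StructType  = new StructType().add("arr", ArrayType(DoubleType))
>       def bufferSchema: StructType = new StructType().add("acc", ArrayType(DoubleType))
>       def dataType: DataType       = ArrayType(DoubleType)
>       def deterministic: Boolean   = true
>
>       def initialize(buffer: MutableAggregationBuffer): Unit =
>         buffer(0) = Array.fill(size)(0.0)
>
>       // fold one input row into the accumulator
>       def update(buffer: MutableAggregationBuffer, input: Row): Unit =
>         combine(buffer, input.getSeq[Double](0))
>
>       // merge two partial accumulators
>       def merge(buffer: MutableAggregationBuffer, other: Row): Unit =
>         combine(buffer, other.getSeq[Double](0))
>
>       private def combine(buffer: MutableAggregationBuffer, in: Seq[Double]): Unit = {
>         val acc = buffer.getSeq[Double](0).toArray
>         var i = 0
>         while (i < size) { acc(i) += in(i); i += 1 }
>         buffer(0) = acc
>       }
>
>       def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
>     }
>
> It is registered with something like spark.udf.register("SUM", new
> ArraySum(14500)) so the query below can call it.
>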
>
>
> I load the data from the Parquet file and then register the UDAF for use in
> the query below; note that SUM here is the UDAF, not the built-in:
>
>
>
> SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
> FROM RAW_TABLE GROUP BY Attribute_0, Attribute_1 HAVING COUNT(*) > 1
>
>
>
> This works fine, but it takes 1.2 minutes for one combination. My use case
> has 400 combinations, which means about 8 hours; that misses our SLA, which
> requires the whole run to finish in under 1 hour. What is the best way to
> implement this use case?
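>
> For reference, one variant worth sketching: cache the table once and submit
> the per-combination queries from concurrent threads, so all 400 jobs share a
> single in-memory copy instead of rescanning 276 GB each time (the paths and
> the combinations list are placeholders):
>
>     import scala.concurrent.{Await, Future}
>     import scala.concurrent.duration.Duration
>     import scala.concurrent.ExecutionContext.Implicits.global
>
>     val raw = spark.read.parquet("/data/raw_table")   // placeholder path
>     raw.createOrReplaceTempView("RAW_TABLE")
>     spark.catalog.cacheTable("RAW_TABLE")
>     raw.count()                                       // materialize the cache
>
>     val jobs = combinations.map { cols =>             // combinations: Seq[Seq[String]]
>       Future {
>         val keys = cols.mkString(", ")
>         spark.sql(s"SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), $keys " +
>                   s"FROM RAW_TABLE GROUP BY $keys HAVING COUNT(*) > 1")
>           .write.mode("overwrite").parquet(s"/out/${cols.mkString("_")}")  // placeholder output
>       }
>     }
>     jobs.foreach(Await.result(_, Duration.Inf))
>
> Whether 276 GB actually fits in cluster memory (or needs MEMORY_AND_DISK or
> a trimmed column set) is of course an open question.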
>
> Best Regards,
>
> Anil Langote
>
> +1-425-633-9747
>
> On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:
>
> To start with, caching and having a known partitioner will help a bit, and
> there is also the IndexedRDD project, but in general Spark might not be the
> best tool for the job. Have you considered having Spark output to something
> like memcache?
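>
> Concretely, by a known partitioner I mean something like this (file path
> and partition count made up): after partitionBy, lookup() only scans the
> single partition the key hashes to instead of the whole RDD.
>
>     import org.apache.spark.HashPartitioner
>
>     val pairs = sc.textFile("/data/kv.tsv")           // placeholder input
>       .map { line => val Array(k, v) = line.split("\t", 2); (k, v) }
>       .partitionBy(new HashPartitioner(200))
>       .cache()
>     pairs.count()                          // force the cache to materialize
>     val hits = pairs.lookup("someKey")     // now touches only one partition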
>
> What's the goal you are trying to accomplish?
>
> On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com>
> wrote:
>
>> Hi All,
>>
>> I have a requirement where I want to build a distributed HashMap that
>> holds 10M key-value pairs and provides very efficient lookups for each
>> key. I tried loading the file into a JavaPairRDD and calling the lookup
>> method, but it is very slow.
>>
>> How can I achieve very fast lookups by a given key?
>>
>> Thank you
>> Anil Langote
>>
>


-- 
Best Regards,
Ayan Guha
