Hi Ayan
Thanks a lot for the reply. What is GROUPING SETS? I did try GROUP BY with a UDAF, but it doesn't perform well: one combination takes 1.5 minutes, and my use case has 400 combinations, which would take ~400 minutes. I am looking for a solution that scales with the number of combinations.

Thank you
Anil Langote
+1-425-633-9747

From: ayan guha <guha.a...@gmail.com>
Date: Sunday, January 8, 2017 at 10:26 PM
To: Anil Langote <anillangote0...@gmail.com>
Cc: Holden Karau <hol...@pigscanfly.ca>, user <user@spark.apache.org>
Subject: Re: Efficient look up in Key Pair RDD

Have you tried something like GROUPING SETS? That seems to be the exact thing you are looking for...

On Mon, Jan 9, 2017 at 12:37 PM, Anil Langote <anillangote0...@gmail.com> wrote:

Sure, let me explain my requirement. I have an input file with 25 attribute columns; the last column is an array of doubles (14,500 elements in the original file). A small example:

Attribute_0  Attribute_1  Attribute_2  Attribute_3  DoubleArray
5            3            5            3            0.2938933463658645 0.0437040427073041 0.23002681025029648 0.18003221216680454
3            2            1            3            0.5353599620508771 0.026777650111232787 0.31473082754161674 0.2647786522276575
5            3            5            2            0.8803063581705307 0.8101324740101096 0.48523937757683544 0.5897714618376072
3            2            1            3            0.33960064683141955 0.46537001358164043 0.543428826489435 0.42653939565053034
2            2            0            5            0.5108235777360906 0.4368119043922922 0.8651556676944931 0.7451477943975504

Now I have to compute the element-wise sum of the double arrays for any given attribute combination. For the file above the possible combinations are:

1. Attribute_0, Attribute_1
2. Attribute_0, Attribute_2
3. Attribute_0, Attribute_3
4. Attribute_1, Attribute_2
5. Attribute_2, Attribute_3
6. Attribute_1, Attribute_3
7. Attribute_0, Attribute_1, Attribute_2
8. Attribute_0, Attribute_1, Attribute_3
9. Attribute_0, Attribute_2, Attribute_3
10. Attribute_1, Attribute_2, Attribute_3
11. Attribute_0, Attribute_1, Attribute_2, Attribute_3

If we process the (Attribute_0, Attribute_1) combination we want the output below; all the other combinations have to be processed in the same way:

5_3 ==> [1.1741997045363952, 0.8538365167174137, 0.7152661878271319, 0.7698036740044117]
3_2 ==> [0.8749606088822967, 0.4921476636928732, 0.8581596540310518, 0.6913180478781878]

Solution tried

I created a Parquet file whose schema has the attributes and, as the last column, the array of doubles. The Parquet file is 276 GB and holds 2.65 M records. I implemented a UDAF with

Input schema  : array of doubles
Buffer schema : array of doubles
Return schema : array of doubles

I load the data from the Parquet file and register the UDAF to use it in the query below (note that SUM here is the UDAF):

SELECT COUNT(*) AS MATCHES, SUM(DOUBLEARRAY), Attribute_0, Attribute_1
FROM RAW_TABLE
GROUP BY Attribute_0, Attribute_1
HAVING COUNT(*) > 1

This works fine, but it takes 1.2 minutes for one combination; my use case has 400 combinations, which means about 8 hours. That does not meet the SLA; we want this below 1 hour. What is the best way to implement this use case?

Best Regards,
Anil Langote
+1-425-633-9747

On Jan 8, 2017, at 8:17 PM, Holden Karau <hol...@pigscanfly.ca> wrote:

To start with, caching and having a known partitioner will help a bit; there is also the IndexedRDD project. But in general Spark might not be the best tool for the job. Have you considered having Spark output to something like memcache? What is the goal you are trying to accomplish?
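A minimal sketch of how the two ideas above could fit together, assuming Spark 2.x, the RAW_TABLE / DoubleArray names from the query above, and a placeholder Parquet path; ArraySum is an illustrative UDAF equivalent to the SUM described above, and the three grouping sets listed stand in for whichever of the combinations are needed:

import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Element-wise sum over a column of equal-length double arrays.
class ArraySum(size: Int) extends UserDefinedAggregateFunction {
  def inputSchema: StructType  = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  def bufferSchema: StructType = StructType(StructField("sums", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType       = ArrayType(DoubleType)
  def deterministic: Boolean   = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.fill(size)(0.0)

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = buffer.getSeq[Double](0).zip(input.getSeq[Double](0)).map { case (a, b) => a + b }

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Double](0).zip(buffer2.getSeq[Double](0)).map { case (a, b) => a + b }

  def evaluate(buffer: Row): Any = buffer.getSeq[Double](0)
}

object ArraySumAllCombinations {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("ArraySumGroupingSets").getOrCreate()
    spark.udf.register("ARRAY_SUM", new ArraySum(14500))   // 14,500 doubles per row

    // Placeholder path; the real table is the 276 GB Parquet file described above.
    spark.read.parquet("/path/to/raw_table").createOrReplaceTempView("RAW_TABLE")

    // One shuffle covers several combinations; GROUPING_ID() tells them apart.
    val result = spark.sql("""
      SELECT Attribute_0, Attribute_1, Attribute_2, Attribute_3,
             GROUPING_ID() AS combo_id,
             COUNT(*)      AS matches,
             ARRAY_SUM(DoubleArray) AS sums
      FROM RAW_TABLE
      GROUP BY Attribute_0, Attribute_1, Attribute_2, Attribute_3
      GROUPING SETS ((Attribute_0, Attribute_1),
                     (Attribute_0, Attribute_2),
                     (Attribute_1, Attribute_2, Attribute_3))
      HAVING COUNT(*) > 1
    """)
    result.show(truncate = false)
    spark.stop()
  }
}

Each output row carries GROUPING_ID(), so rows belonging to different combinations can be split apart afterwards; the point is that GROUPING SETS computes many combinations in one pass over the table instead of one scan per combination, which is where the time saving would come from.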
On Sun, Jan 8, 2017 at 5:04 PM Anil Langote <anillangote0...@gmail.com> wrote:

Hi All,

I have a requirement where I want to build a distributed HashMap that holds 10 M key-value pairs and provides very efficient lookups for each key. I tried loading the file into a JavaPairRDD and calling the lookup method, but it is very slow. How can I achieve very fast lookup by a given key?

Thank you
Anil Langote

--
Best Regards,
Ayan Guha
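A minimal sketch of the caching-plus-known-partitioner idea from Holden's reply, assuming string keys and values, a placeholder tab-separated input path, and an arbitrary partition count; with a partitioner attached, lookup() only scans the one partition that can hold the requested key instead of all 10 M pairs:

import org.apache.spark.{HashPartitioner, SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object PartitionedLookup {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("PartitionedLookup"))

    // Placeholder input: one "key<TAB>value" line per pair.
    val pairs = sc.textFile("/path/to/pairs.tsv")
      .map(_.split("\t"))
      .map(fields => (fields(0), fields(1)))

    // Hash-partition once and cache, so every later lookup() touches only the
    // single partition that can contain the requested key.
    val indexed = pairs
      .partitionBy(new HashPartitioner(200))   // partition count is arbitrary here
      .persist(StorageLevel.MEMORY_AND_DISK)

    indexed.count()                            // materialize the cache up front

    println(indexed.lookup("someKey"))         // returns the Seq of values for that key
    sc.stop()
  }
}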