RE: Implement customized Join for SparkSQL

2015-01-09 Thread Dai, Kevin
Hi Rishi,

You are right. But there may be tens of thousands of ids, and B is a database 
with an index on id, which means querying by id is very fast.

In fact, we already load A and B as separate SchemaRDDs, as you suggested. But 
we hope to extend the join implementation so that this is handled in the 
parsing stage.
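
With tens of thousands of ids, one query per id can be replaced by a few batched lookups that still go through B's index. The following is a rough sketch in plain Python, not Spark or Data Source API code: it assumes an in-memory SQLite table `b` standing in for database B, and the names `fetch_b_by_ids`, `payload`, and the batch size of 500 are hypothetical choices made for illustration.

```python
import sqlite3


def fetch_b_by_ids(conn, ids, batch_size=500):
    """Fetch rows of B for the given ids, one IN-query per batch of ids."""
    unique_ids = sorted(set(ids))  # probe each id only once
    rows = {}
    for start in range(0, len(unique_ids), batch_size):
        batch = unique_ids[start:start + batch_size]
        # One "?" placeholder per id keeps the query parameterized.
        placeholders = ",".join("?" * len(batch))
        query = "SELECT id, payload FROM b WHERE id IN (%s)" % placeholders
        for b_id, payload in conn.execute(query, batch):
            rows[b_id] = payload
    return rows


# Hypothetical stand-in for database B: the primary key provides the index on id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO b (id, payload) VALUES (?, ?)",
    [(i, "b-%d" % i) for i in range(100000)],
)

a_ids = list(range(0, 40000, 2))  # 20,000 ids taken from A
b_rows = fetch_b_by_ids(conn, a_ids)
```

The batch size is kept small here because databases typically cap the number of bound parameters per statement; the right value depends on the database actually behind B.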

Best Regards,
Kevin

From: Rishi Yadav [mailto:ri...@infoobjects.com]
Sent: January 9, 2015 6:52
To: Dai, Kevin
Cc: user@spark.apache.org
Subject: Re: Implement customized Join for SparkSQL

Hi Kevin,

Say A has 10 ids, so you are pulling data from B's data source only for these 
10 ids?

What if you load A and B as separate SchemaRDDs and then join them? Spark will 
optimize the path anyway when an action is fired.

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin <yun...@ebay.com> wrote:
Hi, All

Suppose I want to join two tables A and B as follows:

Select * from A join B on A.id = B.id

A is a file, while B is a database that is indexed by id, which I wrapped with 
the Data Source API.
The desired join flow is:

1.   Generate A’s RDD[Row]

2.   Generate B’s RDD[Row] from A, using A’s ids and B’s Data Source API 
to fetch the matching rows from the database

3.   Merge these two RDDs to the final RDD[Row]
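
The three steps above can be sketched outside Spark as an index-driven lookup join. This is a minimal illustration only, not Spark SQL's join strategy or the Data Source API: it assumes an in-memory SQLite table standing in for database B and a small list standing in for file A, and the names `build_b`, `lookup_join`, `payload`, and the sample rows are hypothetical.

```python
import sqlite3


def build_b():
    """Create an in-memory stand-in for database B, indexed by id."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE b (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany(
        "INSERT INTO b (id, payload) VALUES (?, ?)",
        [(i, "b-%d" % i) for i in range(1000)],
    )
    return conn


def lookup_join(a_rows, conn):
    """Inner-join A to B by probing B's index once per row of A."""
    joined = []
    for a_id, a_name in a_rows:  # step 1: iterate over A's rows
        hit = conn.execute(
            "SELECT id, payload FROM b WHERE id = ?", (a_id,)
        ).fetchone()  # step 2: indexed lookup in B by A's id
        if hit is not None:  # inner join: ids missing from B are dropped
            joined.append((a_id, a_name, hit[1]))  # step 3: merge both sides
    return joined


a_rows = [(3, "x"), (42, "y"), (5000, "z")]  # id 5000 has no match in B
conn = build_b()
result = lookup_join(a_rows, conn)
print(result)
```

Only the ids present in A are ever requested from B, which is the point of the desired flow: B is never scanned in full.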

However, it seems the existing join strategies don’t support this.

Is there any way to achieve it?

Best Regards,
Kevin.



Re: Implement customized Join for SparkSQL

2015-01-08 Thread Rishi Yadav
Hi Kevin,

Say A has 10 ids, so you are pulling data from B's data source only for
these 10 ids?

What if you load A and B as separate SchemaRDDs and then join them? Spark
will optimize the path anyway when an action is fired.



RE: Implement customized Join for SparkSQL

2015-01-05 Thread Cheng, Hao
Can you paste the error log?
