RE: Implement customized Join for SparkSQL
Hi Rishi,

You are right, but the ids may number in the tens of thousands, and B is a database with an index on id, which means querying by id is very fast. In fact, we load A and B as separate SchemaRDDs as you suggested, but we hope we can extend the join implementation to achieve this in the parsing stage.

Best Regards,
Kevin

From: Rishi Yadav [mailto:ri...@infoobjects.com]
Sent: January 9, 2015 6:52
To: Dai, Kevin
Cc: user@spark.apache.org
Subject: Re: Implement customized Join for SparkSQL

Hi Kevin,

Say A has 10 ids, so you are pulling data from B's data source only for those 10 ids? What if you load A and B as separate SchemaRDDs and then do the join? Spark will optimize the path anyway when an action is fired.

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin <yun...@ebay.com> wrote:

> Hi, All
>
> Suppose I want to join two tables A and B as follows:
>
> SELECT * FROM A JOIN B ON A.id = B.id
>
> A is a file, while B is a database indexed by id that I wrapped with the Data Source API.
>
> The desired join flow is:
>
> 1. Generate A's RDD[Row]
> 2. Generate B's RDD[Row] from A by using A's ids and B's data source API to fetch rows from the database
> 3. Merge these two RDDs into the final RDD[Row]
>
> However, it seems the existing join strategies don't support this. Is there any way to achieve it?
>
> Best Regards,
> Kevin
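The three-step flow described above is essentially a per-row (or per-partition) lookup join. A minimal plain-Scala sketch of the idea, outside Spark, where the indexed database behind B is stood in for by a hypothetical `lookupById` backed by a `Map` (in a real job this would be a batched query issued per partition):

```scala
// Sketch of the lookup-join flow: for each row of A, fetch the matching
// row of B by id and merge. `bById` and `lookupById` are stand-ins for
// the id-indexed database behind B's data source.
object LookupJoinSketch {
  type Row = Seq[Any]

  // Stand-in for table B: an id-indexed store, so lookups are cheap.
  val bById: Map[Int, Row] = Map(
    1 -> Seq(1, "b-one"),
    2 -> Seq(2, "b-two")
  )

  def lookupById(id: Int): Option[Row] = bById.get(id)

  // Step 1: A's rows (first column is the join key).
  val aRows: Seq[Row] = Seq(Seq(1, "a-one"), Seq(3, "a-three"))

  // Steps 2 and 3: fetch B's row for each id and merge; rows of A with
  // no match in B are dropped, as in an inner join.
  def join(a: Seq[Row]): Seq[Row] =
    a.flatMap { row =>
      val id = row.head.asInstanceOf[Int]
      lookupById(id).map(bRow => row ++ bRow)
    }
}
```

Here `LookupJoinSketch.join(LookupJoinSketch.aRows)` touches B only for the ids that actually appear in A, which is the property the thread is after.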
Re: Implement customized Join for SparkSQL
Hi Kevin,

Say A has 10 ids, so you are pulling data from B's data source only for those 10 ids? What if you load A and B as separate SchemaRDDs and then do the join? Spark will optimize the path anyway when an action is fired.

On Mon, Jan 5, 2015 at 2:28 AM, Dai, Kevin wrote:

> Hi, All
>
> Suppose I want to join two tables A and B as follows:
>
> SELECT * FROM A JOIN B ON A.id = B.id
>
> A is a file, while B is a database indexed by id that I wrapped with the Data Source API.
>
> The desired join flow is:
>
> 1. Generate A's RDD[Row]
> 2. Generate B's RDD[Row] from A by using A's ids and B's data source API to fetch rows from the database
> 3. Merge these two RDDs into the final RDD[Row]
>
> However, it seems the existing join strategies don't support this. Is there any way to achieve it?
>
> Best Regards,
> Kevin
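Rishi's suggestion, sketched against the Spark 1.2-era API (this thread predates DataFrames). The file path, data source class, and connection string are placeholders, and `sc` is assumed to be an existing SparkContext as in the spark-shell:

```scala
import org.apache.spark.sql.SQLContext

// Load A and B as separate SchemaRDDs and let Spark SQL plan the join.
val sqlContext = new SQLContext(sc)

// A is a file (assumed here to be JSON lines with an `id` column).
val a = sqlContext.jsonFile("hdfs:///path/to/A.json")
a.registerTempTable("A")

// B is exposed through its Data Source API wrapper; in Spark 1.2 an
// external source is registered with CREATE TEMPORARY TABLE ... USING.
// `com.example.BSource` and the url are hypothetical placeholders.
sqlContext.sql(
  """CREATE TEMPORARY TABLE B
     USING com.example.BSource
     OPTIONS (url 'jdbc:...')""")

// The join itself; nothing executes until an action fires on the result.
val joined = sqlContext.sql("SELECT * FROM A JOIN B ON A.id = B.id")
joined.collect()
```

Note that with this approach the planner picks an ordinary join strategy (e.g. a shuffled hash join), which is what motivates Kevin's follow-up: the per-id index on B is not exploited.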
RE: Implement customized Join for SparkSQL
Can you paste the error log?

From: Dai, Kevin [mailto:yun...@ebay.com]
Sent: Monday, January 5, 2015 6:29 PM
To: user@spark.apache.org
Subject: Implement customized Join for SparkSQL

Hi, All

Suppose I want to join two tables A and B as follows:

SELECT * FROM A JOIN B ON A.id = B.id

A is a file, while B is a database indexed by id that I wrapped with the Data Source API.

The desired join flow is:

1. Generate A's RDD[Row]
2. Generate B's RDD[Row] from A by using A's ids and B's data source API to fetch rows from the database
3. Merge these two RDDs into the final RDD[Row]

However, it seems the existing join strategies don't support this. Is there any way to achieve it?

Best Regards,
Kevin
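For reference, the closest hook the Spark 1.2 Data Source API offers for B's wrapper is `PrunedFilteredScan`, which pushes column pruning and filters (including `id IN (...)`) down to the relation. The names `BRelation` and `queryByIds` below are hypothetical, and the unimplemented parts are left as `???`; the key caveat is that a plain `A JOIN B ON A.id = B.id` does not push the join keys through this path, which is exactly the limitation the thread is about:

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SQLContext, StructType}
import org.apache.spark.sql.sources.{Filter, In, PrunedFilteredScan}

// Hedged sketch (Spark 1.2-era API): a relation over the indexed
// database B that answers pushed-down `id IN (...)` filters with
// indexed lookups instead of a full scan.
class BRelation(val sqlContext: SQLContext) extends PrunedFilteredScan {

  // B's schema, as read from the database catalog (omitted here).
  override def schema: StructType = ???

  override def buildScan(requiredColumns: Array[String],
                         filters: Array[Filter]): RDD[Row] = {
    // Collect the ids Spark pushed down; fall back to a full scan
    // if no id filter was pushed (e.g. for a join on id).
    val ids = filters.collect { case In("id", values) => values }.flatten
    queryByIds(ids, requiredColumns)
  }

  // Hypothetical helper: one indexed query per id.
  private def queryByIds(ids: Array[Any],
                         columns: Array[String]): RDD[Row] = ???
}
```

Getting the join itself to drive these lookups, as Kevin asks, would require a custom join strategy rather than anything this interface provides.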