Check your executor logs for the output, or, if your data is not big, collect it in the driver and print it there.
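For example, something like this (a minimal sketch, assuming the matched result is small enough to fit in driver memory):

    // Hypothetical debugging snippet: pull the joined rows back to the
    // driver and print them there. collect() loads the whole result into
    // driver memory, so only do this for small data.
    val rows = matched.collect()
    rows.foreach(println)
    println("Found " + rows.length + " matching records")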
> On Jul 16, 2014, at 9:21 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>
> Hi All,
>
> I'm trying to do a simple record matching between 2 files and wrote the following code:
>
> import org.apache.spark.sql.SQLContext
> import org.apache.spark.rdd.RDD
>
> object SqlTest {
>   case class Test(fld1: String, fld2: String, fld3: String, fld4: String,
>     fld4a: String, fld5: Double, fld6: String)
>
>   sc.addJar("test1-0.1.jar")
>
>   val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv")
>   val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv")
>
>   val sq = new SQLContext(sc)
>
>   val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l =>
>     Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)))
>   val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s =>
>     Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)))
>
>   val file1_schema = sq.createSchemaRDD(file1_recs)
>   val file2_schema = sq.createSchemaRDD(file2_recs)
>   file1_schema.registerAsTable("file1_tab")
>   file2_schema.registerAsTable("file2_tab")
>
>   val matched = sq.sql("select * from file1_tab l join file2_tab s on " +
>     "l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and " +
>     "l.fld5=s.fld5 and l.fld2=s.fld2")
>
>   val count = matched.count()
>   println("Found " + count + " matching records")
> }
>
> When I run this program on a standalone Spark cluster, it keeps running for a long time with no output or error. After waiting for a few minutes I forcibly kill it.
> But the same program works fine when executed from a spark-shell.
>
> What is going wrong? What am I missing?
>
> ~Sarath