What if you just run something like:

    sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
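For instance, a minimal self-contained job along those lines might look roughly like this (just a sketch, assuming a Spark 1.0-style standalone setup; the object name and master URL below are placeholders, not values from your setup):

    import org.apache.spark.{SparkConf, SparkContext}

    // Do nothing but count one input file, to confirm that submission,
    // the cluster and the HDFS path are all working.
    object CountOnly {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("CountOnly")                // placeholder app name
          .setMaster("spark://localhost:7077")    // placeholder standalone master URL
        val sc = new SparkContext(conf)
        val count = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
        println("file1.csv has " + count + " lines")
        sc.stop()
      }
    }

If even this hangs, the problem is likely with the cluster setup or the submission itself rather than with the Spark SQL code.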
On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:

> Yes Soumya, I did it.
>
> First I tried the example from the documentation (the one using the people table to find teenagers). After running it successfully, I moved on to this one, which is the starting point of a bigger requirement for which I'm evaluating Spark SQL.
>
> On Wed, Jul 16, 2014 at 7:59 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>
>> Can you try submitting a very simple job to the cluster?
>>
>> On Jul 16, 2014, at 10:25 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>>
>> Yes, it appears on the Spark UI, and it stays there with state "RUNNING" until I press Ctrl+C in the terminal to kill the execution.
>>
>> Barring the statements that create the Spark context, if I copy-paste the lines of my code into the Spark shell, they run perfectly and give the desired output.
>>
>> ~Sarath
>>
>> On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>>
>>> When you submit your job, it should appear on the Spark UI. Same with the REPL. Make sure your job is submitted to the cluster properly.
>>>
>>> On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>
>>>> Hi Soumya,
>>>>
>>>> The data is very small, 500+ lines in each file.
>>>>
>>>> I removed the last 2 lines and placed "matched.collect().foreach(println);" at the end. Still no luck. It's been more than 5 minutes and the execution is still running.
>>>>
>>>> I checked the logs; there is nothing in stdout. In stderr I don't see anything going wrong, all are INFO messages.
>>>>
>>>> What else do I need to check?
>>>>
>>>> ~Sarath
>>>>
>>>> On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>>>>
>>>>> Check your executor logs for the output, or, if your data is not big, collect it in the driver and print it.
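>>>>> For example, something along these lines at the end of your program, instead of the two count lines (just a sketch, and only reasonable here because your files are tiny):
>>>>>
>>>>>     // Collect the joined rows back to the driver and print them there.
>>>>>     val rows = matched.collect()
>>>>>     rows.foreach(println)
>>>>>     println("Found " + rows.length + " matching records")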
>>>>>
>>>>> On Jul 16, 2014, at 9:21 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm trying to do a simple record matching between 2 files and wrote the following code -
>>>>>
>>>>>     import org.apache.spark.sql.SQLContext;
>>>>>     import org.apache.spark.rdd.RDD
>>>>>     object SqlTest {
>>>>>       case class Test(fld1:String, fld2:String, fld3:String, fld4:String, fld4:String, fld5:Double, fld6:String);
>>>>>       sc.addJar("test1-0.1.jar");
>>>>>       val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");
>>>>>       val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");
>>>>>       val sq = new SQLContext(sc);
>>>>>       val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
>>>>>       val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
>>>>>       val file1_schema = sq.createSchemaRDD(file1_recs);
>>>>>       val file2_schema = sq.createSchemaRDD(file2_recs);
>>>>>       file1_schema.registerAsTable("file1_tab");
>>>>>       file2_schema.registerAsTable("file2_tab");
>>>>>       val matched = sq.sql("select * from file1_tab l join file2_tab s on l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and l.fld2=s.fld2");
>>>>>       val count = matched.count();
>>>>>       System.out.println("Found " + matched.count() + " matching records");
>>>>>     }
>>>>>
>>>>> When I run this program on a standalone Spark cluster, it keeps running for a long time with no output or error. After waiting a few minutes, I kill it forcibly.
>>>>> But the same program works well when executed from the Spark shell.
>>>>>
>>>>> What is going wrong? What am I missing?
>>>>>
>>>>> ~Sarath
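For reference, the context-creation statements you mention leaving out of the snippet would, on a Spark 1.0-era standalone cluster, look roughly like the following. The master URL and app name here are placeholders rather than your actual values; the jar name is the one passed to sc.addJar above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch of the omitted SparkContext setup (placeholder values).
    val conf = new SparkConf()
      .setAppName("SqlTest")                  // placeholder app name
      .setMaster("spark://localhost:7077")    // placeholder standalone master URL
      .setJars(Seq("test1-0.1.jar"))          // same jar as the sc.addJar call above
    val sc = new SparkContext(conf)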