What if you just run something like:

    sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
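For instance, a minimal self-contained job along those lines might look roughly like this (just a sketch, assuming a Spark 1.0-style standalone setup; the object name and master URL below are placeholders, not values from your setup):

    import org.apache.spark.{SparkConf, SparkContext}

    // Do nothing but count one input file, to confirm that submission,
    // the cluster and the HDFS path are all working.
    object CountOnly {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("CountOnly")                // placeholder app name
          .setMaster("spark://localhost:7077")    // placeholder standalone master URL
        val sc = new SparkContext(conf)
        val count = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv").count()
        println("file1.csv has " + count + " lines")
        sc.stop()
      }
    }

If even this hangs, the problem is likely with the cluster setup or the submission itself rather than with the Spark SQL code.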
On Wed, Jul 16, 2014 at 10:37 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:

> Yes Soumya, I did it.
>
> First I tried the example from the documentation (the one using the people table to find teenagers). After running it successfully, I moved on to this one, which is the starting point of a bigger requirement for which I'm evaluating Spark SQL.
>
> On Wed, Jul 16, 2014 at 7:59 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>
>> Can you try submitting a very simple job to the cluster?
>>
>> On Jul 16, 2014, at 10:25 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>>
>> Yes, it appears on the Spark UI, and it stays there with state "RUNNING" until I press Ctrl+C in the terminal to kill the execution.
>>
>> Barring the statements that create the Spark context, if I copy-paste the lines of my code into the Spark shell, they run perfectly and give the desired output.
>>
>> ~Sarath
>>
>> On Wed, Jul 16, 2014 at 7:48 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>>
>>> When you submit your job, it should appear on the Spark UI. Same with the REPL. Make sure your job is submitted to the cluster properly.
>>>
>>> On Wed, Jul 16, 2014 at 10:08 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>
>>>> Hi Soumya,
>>>>
>>>> The data is very small, 500+ lines in each file.
>>>>
>>>> I removed the last 2 lines and placed "matched.collect().foreach(println);" at the end. Still no luck. It's been more than 5 minutes and the execution is still running.
>>>>
>>>> I checked the logs; there is nothing in stdout. In stderr I don't see anything going wrong, all are INFO messages.
>>>>
>>>> What else do I need to check?
>>>>
>>>> ~Sarath
>>>>
>>>> On Wed, Jul 16, 2014 at 7:23 PM, Soumya Simanta <soumya.sima...@gmail.com> wrote:
>>>>>
>>>>> Check your executor logs for the output, or, if your data is not big, collect it in the driver and print it.
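>>>>> For example, something along these lines at the end of your program, instead of the two count lines (just a sketch, and only reasonable here because your files are tiny):
>>>>>
>>>>>     // Collect the joined rows back to the driver and print them there.
>>>>>     val rows = matched.collect()
>>>>>     rows.foreach(println)
>>>>>     println("Found " + rows.length + " matching records")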
>>>>>
>>>>> On Jul 16, 2014, at 9:21 AM, Sarath Chandra <sarathchandra.jos...@algofusiontech.com> wrote:
>>>>>
>>>>> Hi All,
>>>>>
>>>>> I'm trying to do a simple record matching between 2 files and wrote the following code -
>>>>>
>>>>>     import org.apache.spark.sql.SQLContext;
>>>>>     import org.apache.spark.rdd.RDD
>>>>>     object SqlTest {
>>>>>       case class Test(fld1:String, fld2:String, fld3:String, fld4:String, fld4:String, fld5:Double, fld6:String);
>>>>>       sc.addJar("test1-0.1.jar");
>>>>>       val file1 = sc.textFile("hdfs://localhost:54310/user/hduser/file1.csv");
>>>>>       val file2 = sc.textFile("hdfs://localhost:54310/user/hduser/file2.csv");
>>>>>       val sq = new SQLContext(sc);
>>>>>       val file1_recs: RDD[Test] = file1.map(_.split(",")).map(l => Test(l(0), l(1), l(2), l(3), l(4), l(5).toDouble, l(6)));
>>>>>       val file2_recs: RDD[Test] = file2.map(_.split(",")).map(s => Test(s(0), s(1), s(2), s(3), s(4), s(5).toDouble, s(6)));
>>>>>       val file1_schema = sq.createSchemaRDD(file1_recs);
>>>>>       val file2_schema = sq.createSchemaRDD(file2_recs);
>>>>>       file1_schema.registerAsTable("file1_tab");
>>>>>       file2_schema.registerAsTable("file2_tab");
>>>>>       val matched = sq.sql("select * from file1_tab l join file2_tab s on l.fld6=s.fld6 where l.fld3=s.fld3 and l.fld4=s.fld4 and l.fld5=s.fld5 and l.fld2=s.fld2");
>>>>>       val count = matched.count();
>>>>>       System.out.println("Found " + matched.count() + " matching records");
>>>>>     }
>>>>>
>>>>> When I run this program on a standalone Spark cluster, it keeps running for a long time with no output or error. After waiting a few minutes, I kill it forcibly.
>>>>> But the same program works well when executed from the Spark shell.
>>>>>
>>>>> What is going wrong? What am I missing?
>>>>>
>>>>> ~Sarath
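For reference, the context-creation statements you mention leaving out of the snippet would, on a Spark 1.0-era standalone cluster, look roughly like the following. The master URL and app name here are placeholders rather than your actual values; the jar name is the one passed to sc.addJar above:

    import org.apache.spark.{SparkConf, SparkContext}

    // Rough sketch of the omitted SparkContext setup (placeholder values).
    val conf = new SparkConf()
      .setAppName("SqlTest")                  // placeholder app name
      .setMaster("spark://localhost:7077")    // placeholder standalone master URL
      .setJars(Seq("test1-0.1.jar"))          // same jar as the sc.addJar call above
    val sc = new SparkContext(conf)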