Re: Spark SQL taking long time to print records from a table

2015-02-04 Thread Imran Rashid
Many operations in Spark are lazy -- most likely your collect() statement
is actually forcing evaluation of several steps earlier in the pipeline.
The logs and the UI might give you some info about all the stages that are
being run when you get to collect().
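
For example, a minimal sketch (Spark 1.x Scala API; the query is taken from
your snippet) of why a timer placed before collect() can look small:

    // sql() returns immediately -- it only builds the query plan
    val scored = sqlContext.sql(
      "SELECT party_id, sum(npi_int)/sum(device_priority_new) as npi_score " +
      "FROM AcctNPIScoreTemp group by party_id")

    // the file scans, joins and aggregations all execute here, at the action
    val rows = scored.collect()

So a timer wrapped around the sql(...) call alone mostly measures query
planning, not execution; the real work shows up in whichever action runs
first.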

I think collect() is just fine if you are trying to pull just one record
to the driver; that shouldn't be a bottleneck.
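
If you want the print step itself to be cheap, you could also materialize
the result once and reuse it. A sketch using standard RDD calls (cache(),
count() and take() all exist on a SchemaRDD in Spark 1.x):

    acctNPIScr.cache()                    // keep the computed rows in memory
    acctNPIScr.count()                    // action: runs the pipeline once
    acctNPIScr.take(1).foreach(println)   // served from the cache; pulls only
                                          // the single row you expect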

On Wed, Feb 4, 2015 at 1:32 AM, jguliani <jasminkguli...@gmail.com> wrote:

 I have 3 text files in HDFS which I am reading using Spark SQL and
 registering them as tables. After that I am doing almost 5-6 operations,
 including joins, group by, etc., and this whole process is taking hardly
 6-7 secs. (Source file size - 3 GB, with almost 20 million rows.)
 As a final step of my computation, I am expecting only 1 record in my
 final RDD, named acctNPIScr in the code snippet below.

 My question here is that when I am trying to print this RDD, either by
 registering it as a table and printing records from the table, or by this
 method - acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
 - it is taking a very long time, almost 1.5 minutes, to print 1 record.

 Can someone please tell me if I am doing something wrong in printing?
 What is the best way to print the final result from a SchemaRDD?

 .
 val acctNPIScr = sqlContext.sql("SELECT party_id, " +
   "sum(npi_int)/sum(device_priority_new) as npi_score " +
   "FROM AcctNPIScoreTemp group by party_id")
 acctNPIScr.registerTempTable("AcctNPIScore")

 val endtime = System.currentTimeMillis()
 logger.info("Total sql Time : " + (endtime - st))  // this time is hardly 5 secs

 println("start printing")

 val result = sqlContext.sql("SELECT * FROM AcctNPIScore")
   .collect().foreach(println)

 // acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)

 logger.info("Total printing Time : " + (System.currentTimeMillis() - endtime))
 // printing one record is taking almost 1.5 minutes







Spark SQL taking long time to print records from a table

2015-02-03 Thread jguliani
I have 3 text files in HDFS which I am reading using Spark SQL and
registering them as tables. After that I am doing almost 5-6 operations,
including joins, group by, etc., and this whole process is taking hardly
6-7 secs. (Source file size - 3 GB, with almost 20 million rows.)
As a final step of my computation, I am expecting only 1 record in my
final RDD, named acctNPIScr in the code snippet below.

My question here is that when I am trying to print this RDD, either by
registering it as a table and printing records from the table, or by this
method - acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)
- it is taking a very long time, almost 1.5 minutes, to print 1 record.

Can someone please tell me if I am doing something wrong in printing?
What is the best way to print the final result from a SchemaRDD?

.
val acctNPIScr = sqlContext.sql("SELECT party_id, " +
  "sum(npi_int)/sum(device_priority_new) as npi_score " +
  "FROM AcctNPIScoreTemp group by party_id")
acctNPIScr.registerTempTable("AcctNPIScore")

val endtime = System.currentTimeMillis()
logger.info("Total sql Time : " + (endtime - st))  // this time is hardly 5 secs

println("start printing")

val result = sqlContext.sql("SELECT * FROM AcctNPIScore")
  .collect().foreach(println)

// acctNPIScr.map(t => "Score: " + t(1)).collect().foreach(println)

logger.info("Total printing Time : " + (System.currentTimeMillis() - endtime))
// printing one record is taking almost 1.5 minutes



