Re: Spark SQL takes unexpected time

2014-11-04 Thread Corey Nolet
Michael,

I should probably take a closer look myself at the design of 1.2 vs. 1.1,
but I've been curious why Spark's in-memory data lives on the heap instead
of being put off-heap. Was this the optimization done in 1.2 to alleviate
GC pressure?
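
(As a minimal sketch of the on-heap vs. off-heap distinction being asked
about, assuming a live SparkContext named sc; StorageLevel.OFF_HEAP was the
experimental, Tachyon-backed option in the Spark 1.x line, and the input
path below is illustrative only:)

    import org.apache.spark.storage.StorageLevel

    // Default caching: deserialized records live on the JVM heap, so every
    // cached object is something the garbage collector must trace.
    val onHeap = sc.textFile("hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2")
      .persist(StorageLevel.MEMORY_ONLY)

    // Serialized caching: still on-heap, but stored as a few large byte
    // arrays, which cuts GC pressure at the cost of deserializing on read.
    val serialized = onHeap.map(identity).persist(StorageLevel.MEMORY_ONLY_SER)

    // Off-heap caching (experimental in Spark 1.x, backed by Tachyon):
    // blocks live outside the heap, invisible to the GC entirely.
    val offHeap = onHeap.map(identity).persist(StorageLevel.OFF_HEAP)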

On Mon, Nov 3, 2014 at 8:52 PM, Shailesh Birari <sbir...@wynyardgroup.com>
wrote:

 Yes, I am using Spark 1.1.0 and have used rdd.registerTempTable().
 I tried adding sqlContext.cacheTable(), but the query then took 59 seconds
 (more than earlier).

 I also tried changing the schema to use the Long data type in some fields,
 but the conversion seems to take more time.
 Is there any way to specify an index? I checked and didn't find one, but
 just want to confirm.




Re: Spark SQL takes unexpected time

2014-11-03 Thread Shailesh Birari
Yes, I am using Spark 1.1.0 and have used rdd.registerTempTable().
I tried adding sqlContext.cacheTable(), but the query then took 59 seconds
(more than earlier).

I also tried changing the schema to use the Long data type in some fields,
but the conversion seems to take more time.
Is there any way to specify an index? I checked and didn't find one, but
just want to confirm.

For your reference here is the snippet of code.

-----------------------------------------------------------------
case class EventDataTbl(EventUID: Long,
        ONum: Long,
        RNum: Long,
        Timestamp: java.sql.Timestamp,
        Duration: String,
        Type: String,
        Source: String,
        OName: String,
        RName: String)

val format = new java.text.SimpleDateFormat("yyyy-MM-dd hh:mm:ss")
val cedFileName = "hdfs://hadoophost:8020/demo/poc/JoinCsv/output_2"
val cedRdd = sc.textFile(cedFileName).map(_.split(",", -1)).map(p =>
        EventDataTbl(p(0).toLong, p(1).toLong, p(2).toLong,
                new java.sql.Timestamp(format.parse(p(3)).getTime()),
                p(4), p(5), p(6), p(7), p(8)))

cedRdd.registerTempTable("EventDataTbl")
sqlCntxt.cacheTable("EventDataTbl")

val t1 = System.nanoTime()
println("\n\n10 Most frequent conversations between the Originators and Recipients\n")
sql("SELECT COUNT(*) AS Frequency, ONum, OName, RNum, RName FROM EventDataTbl " +
    "GROUP BY ONum, OName, RNum, RName ORDER BY Frequency DESC LIMIT 10")
  .collect().foreach(println)
val t2 = System.nanoTime()
println("Time taken " + (t2 - t1) / 1e9 + " Seconds")
-----------------------------------------------------------------
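
A hedged note on the 59 seconds: in Spark 1.1, sqlCntxt.cacheTable() is
lazy, so the in-memory columnar buffers are only built the first time the
table is scanned, and that first query pays the whole materialization cost.
Timing a second run of the same query should show the cached speed. As far
as I know there is also no way to declare an index in Spark SQL at this
point; caching and partitioning are the available levers. A sketch, reusing
the names from the snippet above:

    sqlCntxt.cacheTable("EventDataTbl")

    val query = "SELECT COUNT(*) AS Frequency, ONum, OName, RNum, RName " +
                "FROM EventDataTbl GROUP BY ONum, OName, RNum, RName " +
                "ORDER BY Frequency DESC LIMIT 10"

    sql(query).collect()                   // first run builds the columnar cache

    val t1 = System.nanoTime()
    sql(query).collect().foreach(println)  // second run reads the cached columns
    val t2 = System.nanoTime()
    println("Time taken " + (t2 - t1) / 1e9 + " Seconds")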

Thanks,
  Shailesh






Spark SQL takes unexpected time

2014-11-02 Thread Shailesh Birari
Hello,

I have written a Spark SQL application which reads data from HDFS and
queries it.
The data size is around 2 GB (30 million records). The schema and query I
am running are below.
The query takes around 5+ seconds to execute.
I tried adding
   rdd.persist(StorageLevel.MEMORY_AND_DISK)
and
   rdd.cache()
but in both cases it takes extra time, even if I run the query below a
second time on the data (assuming Spark will have cached it after the
first query).

case class EventDataTbl(ID: String,
        ONum: String,
        RNum: String,
        Timestamp: String,
        Duration: String,
        Type: String,
        Source: String,
        OName: String,
        RName: String)

sql("SELECT COUNT(*) AS Frequency, ONum, OName, RNum, RName FROM EventDataTbl " +
    "GROUP BY ONum, OName, RNum, RName ORDER BY Frequency DESC LIMIT 10")
  .collect().foreach(println)
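
(One possible explanation, offered as a sketch rather than a confirmed
diagnosis: rdd.persist()/rdd.cache() only keep the parsed row objects in
memory, and each SQL run still evaluates the query over those objects,
whereas sqlContext.cacheTable() stores the table in Spark SQL's compressed
in-memory columnar format, which is built for exactly this kind of scan.
Assuming a SQLContext named sqlContext and an RDD of EventDataTbl records
called eventRdd, both stand-ins for whatever the real code uses:)

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc)
    import sqlContext._  // brings sql() and the RDD-to-SchemaRDD conversion into scope

    // Cache the table itself rather than the raw RDD, so queries hit the
    // columnar buffers instead of re-evaluating over Java objects:
    eventRdd.registerTempTable("EventDataTbl")
    sqlContext.cacheTable("EventDataTbl")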

Can you let me know if I am missing anything?

Thanks,
  Shailesh



