Re: Help required on exercise Data Exploration using Spark SQL

2014-10-17 Thread neeraj
Hi,

When I run the Spark SQL commands given in the exercise, they return
unexpected results. I'm explaining the results below for quick reference:
1. The output of the query wikiData.count() shows a count of the records in the file.

2. After running the following query:
sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE
username <> '' GROUP BY username ORDER BY cnt DESC LIMIT
10").collect().foreach(println)

I get output like the one below (only the last couple of lines are shown
here). It doesn't show the actual results of the query. I tried increasing
the driver memory as suggested in the exercise; however, it doesn't work
and the output is almost the same.
14/10/17 15:29:39 INFO executor.Executor: Finished task 199.0 in stage 2.0
(TID 401). 2170 bytes result sent to driver
14/10/17 15:29:39 INFO executor.Executor: Finished task 198.0 in stage 2.0
(TID 400). 2170 bytes result sent to driver
14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 198.0 in
stage 2.0 (TID 400) in 13 ms on localhost (199/200)
14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 199.0 in
stage 2.0 (TID 401) in 10 ms on localhost (200/200)
14/10/17 15:29:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0,
whose tasks have all completed, from pool
14/10/17 15:29:39 INFO scheduler.DAGScheduler: Stage 2 (takeOrdered at
basicOperators.scala:171) finished in 1.296 s
14/10/17 15:29:39 INFO spark.SparkContext: Job finished: takeOrdered at
basicOperators.scala:171, took 3.150021634 s

3. I tried some other Spark SQL commands, as below:
sqlContext.sql("SELECT username FROM wikiData LIMIT
10").collect().foreach(println)

Output is:
[[B@787cf559]
[[B@53cfe3db]
[[B@757869d9]
[[B@346d61cf]
[[B@793077ec]
[[B@5d11651c]
[[B@21054100]
[[B@5fee77ef]
[[B@21041d1d]
[[B@15136bda]


sqlContext.sql("SELECT * FROM wikiData LIMIT
10").collect().foreach(println)

Output is:
[12140913,[B@1d74e696,1394582048,[B@65ce90f5,[B@5c8ef90a]
[12154508,[B@2e802eff,1393177457,[B@618d7f32,[B@1099dda7]
[12165267,[B@65a70774,1398418319,[B@38da84cf,[B@12454f32]
[12184073,[B@45264fd,1395243737,[B@3d642042,[B@7881ec8a]
[12194348,[B@19d095d5,1372914018,[B@4d1ce030,[B@22c296dd]
[12212394,[B@153e98ff,1389794332,[B@40ae983e,[B@68d2f9f]
[12224899,[B@1f317315,1396830262,[B@677a77b2,[B@19487c31]
[12240745,[B@65d181ee,1389890826,[B@1da9647b,[B@5c03d673]
[12258034,[B@7ff44736,1385050943,[B@7e6f6bda,[B@4511f60f]
[12279301,[B@1e317636,1382277991,[B@4147e2b6,[B@56753c35]

I'm sure the above output of the queries is not the correct content of the
Parquet file. I'm not able to read the content of the Parquet file
directly.

How do I validate the output of these queries against the actual content
of the Parquet file?
What is the workaround for this issue?
How do I read the file through Spark SQL?
Is there a need to change the queries? What changes can be made to the
queries to get the exact result?
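
For what it's worth, values like [B@787cf559 are Java's default toString
for a byte array, so the columns appear to come back as raw bytes rather
than strings. A minimal diagnostic sketch for decoding one column by hand,
just to inspect the data (assuming the Spark 1.1 Row API, where a row is
indexable like a Seq), would be:

// Decode the username column manually; assumes each value is an
// Array[Byte] holding UTF-8 text.
sqlContext.sql("SELECT username FROM wikiData LIMIT 10")
  .collect()
  .foreach { row =>
    val bytes = row(0).asInstanceOf[Array[Byte]]
    println(new String(bytes, "UTF-8"))
  }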

Regards,
Neeraj






Re: Help required on exercise Data Exploration using Spark SQL

2014-10-17 Thread Michael Armbrust
Looks like this data was encoded with an old version of Spark SQL.  You'll
need to set the flag to interpret binary data as a string.  More info on
configuration can be found here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

sqlContext.sql("set spark.sql.parquet.binaryAsString=true")
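
A minimal end-to-end sketch of the fix (a sketch, not tested here; it
assumes the wikiData table from the exercise, and sets the flag before
reloading the Parquet file so the schema conversion picks it up;
registerTempTable is the Spark 1.1 name, registerAsTable on 1.0):

// Enable the compatibility flag, then reload and re-register the table.
sqlContext.sql("set spark.sql.parquet.binaryAsString=true")
val wikiData = sqlContext.parquetFile("data/wiki_parquet")
wikiData.registerTempTable("wikiData")

// The username column should now print as readable strings
// instead of [B@... byte-array references.
sqlContext.sql("SELECT username FROM wikiData LIMIT 10")
  .collect().foreach(println)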

Michael




Help required on exercise Data Exploration using Spark SQL

2014-10-16 Thread neeraj
Hi,

I'm exploring the exercise "Data Exploration Using Spark SQL" from Spark
Summit 2014. While running the command val wikiData =
sqlContext.parquetFile("data/wiki_parquet"), I'm getting the following
output, which doesn't match the expected output.

Output I'm getting:
val wikiData1 =
sqlContext.parquetFile("/data/wiki_parquet/part-r-1.parquet")
14/10/16 19:26:49 INFO parquet.ParquetTypesConverter: Falling back to schema
conversion from Parquet types; result: ArrayBuffer(id#5, title#6,
modified#7L, text#8, username#9)
wikiData1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[1] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
ParquetTableScan [id#5,title#6,modified#7L,text#8,username#9],
(ParquetRelation /data/wiki_parquet/part-r-1.parquet, Some(Configuration:
core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml,
yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml),
org.apache.spark.sql.SQLContext@27a5dac0, []), []

Expected output:
wikiData: org.apache.spark.sql.SchemaRDD = 
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ParquetTableScan [id#0,title#1,modified#2L,text#3,username#4],
(ParquetRelation data/wiki_parquet), []

Please help me identify the possible issue.

I'm using the pre-built package of Spark with Hadoop 2.4.

Please let me know in case more information is required.

Regards,
Neeraj






Re: Help required on exercise Data Exploration using Spark SQL

2014-10-16 Thread Cheng Lian

Hi Neeraj,

The Spark Summit 2014 tutorial uses Spark 1.0. I guess you're using
Spark 1.1? Parquet support has been polished quite a bit since then, which
changed the string representation of the query plan, but this output
should be OK :)
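
If you want to convince yourself that the data itself loaded correctly
despite the different plan string, a quick sanity check along these lines
(a sketch, assuming Spark 1.1, where registerTempTable replaced 1.0's
registerAsTable) should do:

// The schema should list the five columns from the exercise:
// id, title, modified, text, username.
wikiData1.printSchema()

// Register the table and run a trivial aggregate to confirm rows load.
wikiData1.registerTempTable("wikiData")
sqlContext.sql("SELECT COUNT(*) FROM wikiData").collect().foreach(println)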


Cheng


