Re: Help required on exercise Data Exploration using Spark SQL
Hi,

When I run the given Spark SQL commands in the exercise, they return unexpected results. I'm explaining the results below for quick reference:

1. The output of the query wikiData.count() shows some count in the file.

2. After running the following query:

sqlContext.sql("SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' GROUP BY username ORDER BY cnt DESC LIMIT 10").collect().foreach(println)

I get output like below (only the last couple of lines of the output are shown here). It doesn't show the actual results of the query. I tried increasing the driver memory as suggested in the exercise; however, it doesn't work. The output is almost the same.

14/10/17 15:29:39 INFO executor.Executor: Finished task 199.0 in stage 2.0 (TID 401). 2170 bytes result sent to driver
14/10/17 15:29:39 INFO executor.Executor: Finished task 198.0 in stage 2.0 (TID 400). 2170 bytes result sent to driver
14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 198.0 in stage 2.0 (TID 400) in 13 ms on localhost (199/200)
14/10/17 15:29:39 INFO scheduler.TaskSetManager: Finished task 199.0 in stage 2.0 (TID 401) in 10 ms on localhost (200/200)
14/10/17 15:29:39 INFO scheduler.TaskSchedulerImpl: Removed TaskSet 2.0, whose tasks have all completed, from pool
14/10/17 15:29:39 INFO scheduler.DAGScheduler: Stage 2 (takeOrdered at basicOperators.scala:171) finished in 1.296 s
14/10/17 15:29:39 INFO spark.SparkContext: Job finished: takeOrdered at basicOperators.scala:171, took 3.150021634 s

3.
I tried some other Spark SQL commands, as below:

sqlContext.sql("SELECT username FROM wikiData LIMIT 10").collect().foreach(println)

Output:

[[B@787cf559]
[[B@53cfe3db]
[[B@757869d9]
[[B@346d61cf]
[[B@793077ec]
[[B@5d11651c]
[[B@21054100]
[[B@5fee77ef]
[[B@21041d1d]
[[B@15136bda]

sqlContext.sql("SELECT * FROM wikiData LIMIT 10").collect().foreach(println)

Output:

[12140913,[B@1d74e696,1394582048,[B@65ce90f5,[B@5c8ef90a]
[12154508,[B@2e802eff,1393177457,[B@618d7f32,[B@1099dda7]
[12165267,[B@65a70774,1398418319,[B@38da84cf,[B@12454f32]
[12184073,[B@45264fd,1395243737,[B@3d642042,[B@7881ec8a]
[12194348,[B@19d095d5,1372914018,[B@4d1ce030,[B@22c296dd]
[12212394,[B@153e98ff,1389794332,[B@40ae983e,[B@68d2f9f]
[12224899,[B@1f317315,1396830262,[B@677a77b2,[B@19487c31]
[12240745,[B@65d181ee,1389890826,[B@1da9647b,[B@5c03d673]
[12258034,[B@7ff44736,1385050943,[B@7e6f6bda,[B@4511f60f]
[12279301,[B@1e317636,1382277991,[B@4147e2b6,[B@56753c35]

I'm sure the above output of the queries is not the correct content of the parquet file. I'm not able to read the content of the parquet file directly. How can I validate the output of these queries against the actual content of the parquet file? What is the workaround for this issue? How can I read the file through Spark SQL? Is there a need to change the queries? What changes can be made in the queries to get the exact result?

Regards,
Neeraj

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569p16673.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
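[Editorial note, not part of the original thread: the `[[B@...]` rows above are not corrupted file content; `[B@787cf559` is just the JVM's default `toString` for an `Array[Byte]`, i.e. the string columns are coming back as raw byte arrays. As a sanity check, the bytes can be decoded by hand. This is a sketch only, assuming a spark-shell where `sqlContext` and the exercise's `wikiData` table are already set up:]

```scala
// Sketch: decode the byte-array column manually to confirm the Parquet
// file's contents are intact. Assumes a running spark-shell with
// `sqlContext` and the exercise's `wikiData` table already registered.
sqlContext.sql("SELECT username FROM wikiData LIMIT 10")
  .collect()
  .foreach { row =>
    // row(0) is an Array[Byte]; decode it as UTF-8 to see the real value.
    println(new String(row(0).asInstanceOf[Array[Byte]], "UTF-8"))
  }
```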
Re: Help required on exercise Data Exploration using Spark SQL
Looks like this data was encoded with an old version of Spark SQL. You'll need to set the flag to interpret binary data as a string. More info on configuration can be found here:
http://spark.apache.org/docs/latest/sql-programming-guide.html#configuration

sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")

Michael

On Fri, Oct 17, 2014 at 6:32 AM, neeraj neeraj_gar...@infosys.com wrote:
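[Editorial note, not part of the original thread: putting the flag together with the exercise's queries, a minimal spark-shell session might look like the following. This is a sketch, assuming Spark 1.1 with the exercise's `data/wiki_parquet` directory; the flag must be set before the queries are re-run:]

```scala
// Sketch based on the suggestion above: read Parquet BINARY columns as
// strings (files written by older Spark SQL versions carry no UTF-8
// annotation on string columns).
sqlContext.sql("SET spark.sql.parquet.binaryAsString=true")
// equivalently: sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")

val wikiData = sqlContext.parquetFile("data/wiki_parquet")
wikiData.registerTempTable("wikiData")  // registerAsTable in Spark 1.0

sqlContext.sql(
  "SELECT username, COUNT(*) AS cnt FROM wikiData WHERE username <> '' " +
  "GROUP BY username ORDER BY cnt DESC LIMIT 10"
).collect().foreach(println)
```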
Help required on exercise Data Exploration using Spark SQL
Hi,

I'm exploring the exercise Data Exploration Using Spark SQL from Spark Summit 2014. While running the command

val wikiData = sqlContext.parquetFile("data/wiki_parquet")

I'm getting the following output, which doesn't match the expected output.

Output I'm getting:

val wikiData1 = sqlContext.parquetFile("/data/wiki_parquet/part-r-1.parquet")
14/10/16 19:26:49 INFO parquet.ParquetTypesConverter: Falling back to schema conversion from Parquet types; result: ArrayBuffer(id#5, title#6, modified#7L, text#8, username#9)
wikiData1: org.apache.spark.sql.SchemaRDD =
SchemaRDD[1] at RDD at SchemaRDD.scala:103
== Query Plan ==
== Physical Plan ==
ParquetTableScan [id#5,title#6,modified#7L,text#8,username#9], (ParquetRelation /data/wiki_parquet/part-r-1.parquet, Some(Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml), org.apache.spark.sql.SQLContext@27a5dac0, []), []

Expected output:

wikiData: org.apache.spark.sql.SchemaRDD =
SchemaRDD[0] at RDD at SchemaRDD.scala:98
== Query Plan ==
ParquetTableScan [id#0,title#1,modified#2L,text#3,username#4], (ParquetRelation data/wiki_parquet), []

Please help with the possible issue. I'm using the pre-built package of Spark with Hadoop 2.4. Please let me know in case more information is required.

Regards,
Neeraj

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Help-required-on-exercise-Data-Exploratin-using-Spark-SQL-tp16569.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
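[Editorial note, not part of the original thread: note that the command actually run above points at an absolute path to a single part file, while the exercise's own load step points at the whole `data/wiki_parquet` directory, relative to where the spark-shell was started. A minimal sketch of the handout's version:]

```scala
// Sketch: load the whole Parquet directory, as in the exercise handout,
// rather than a single part-r-* file under an absolute path.
val wikiData = sqlContext.parquetFile("data/wiki_parquet")
wikiData.count()  // quick sanity check that the rows are readable
```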
Re: Help required on exercise Data Exploration using Spark SQL
Hi Neeraj,

The Spark Summit 2014 tutorial uses Spark 1.0. I guess you're using Spark 1.1? Parquet support got polished quite a bit since then, which changed the string representation of the query plan, but this output should be OK :)

Cheng

On 10/16/14 10:45 PM, neeraj wrote: