RE: [SparkR] creating dataframe from json file

Sun, Rui Wed, 15 Jul 2015 05:44:35 -0700

Suppose df <- jsonFile(sqlContext, "<json file>")

You can extract hashtags.text as a Column object using the following command:
    t <- getField(df$hashtags, "text")
and then you can perform operations on the column.


You can extract hashtags.text as a DataFrame using the following command:
    t <- select(df, getField(df$hashtags, "text"))
    showDF(t)

Or you can use a SQL query to extract the field:
    hiveContext <- sparkRHive.init()
    df <- jsonFile(hiveContext, "<json file>")
    registerTempTable(df, "table")
    t <- sql(hiveContext, "select hashtags.text from table")
    showDF(t)
________________________________________
From: jianshu [jian...@gmail.com]
Sent: Wednesday, July 15, 2015 4:42 PM
To: user@spark.apache.org
Subject: [SparkR] creating dataframe from json file

hi all,

Not sure whether this is the right venue to ask. If not, please point me to
the right group, if there is one.

I'm trying to create a Spark DataFrame from a JSON file using jsonFile(). The
call was successful, and I can see the DataFrame created. The JSON file
contains a number of tweets obtained from the Twitter API. I am particularly
interested in pulling the hashtags contained in the tweets. If I use
printSchema(), the schema is something like:

root
 |-- id_str: string (nullable = true)
 |-- hashtags: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- indices: array (nullable = true)
 |    |    |    |-- element: long (containsNull = true)
 |    |    |-- text: string (nullable = true)

showDF() shows something like this:

+--------------------+
|            hashtags|
+--------------------+
|              List()|
|List([List(125, 1...|
|              List()|
|List([List(0, 3),...|
|List([List(76, 86...|
|              List()|
|List([List(74, 84...|
|              List()|
|              List()|
|              List()|
|List([List(85, 96...|
|List([List(125, 1...|
|              List()|
|              List()|
|              List()|
|              List()|
|List([List(14, 17...|
|              List()|
|              List()|
|List([List(14, 17...|
+--------------------+

The question now is how to extract the text of the hashtags for each tweet.
I'm still new to SparkR. I am thinking maybe I need to loop through the
DataFrame to extract the text for each tweet, but it seems that lapply does
not really apply to a Spark DataFrame any more. Any thoughts on how to
extract the text, given that it sits inside a JSON array?


Thanks,


-JS




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SparkR-creating-dataframe-from-json-file-tp23849.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

