Andrew Davidson created SPARK-13009:
---------------------------------------

             Summary: spark-streaming-twitter_2.10 does not make it possible to 
access the raw twitter json
                 Key: SPARK-13009
                 URL: https://issues.apache.org/jira/browse/SPARK-13009
             Project: Spark
          Issue Type: Improvement
          Components: Streaming
    Affects Versions: 1.6.0
            Reporter: Andrew Davidson
            Priority: Blocker


The Streaming-twitter package makes it easy for Java programmers to work with 
Twitter. The implementation returns the raw Twitter data, which arrives in JSON 
format, as a Twitter4J StatusJSONImpl object:

JavaDStream<Status> tweets = TwitterUtils.createStream(ssc, twitterAuth);

The Status class is different from the raw JSON; i.e., serializing the Status 
object will not produce the same JSON as the original. I have downstream 
systems that can only process raw tweets, not Twitter4J Status objects.

Here is the bug/RFE request I made to Twitter4J <twitte...@googlegroups.com>. 
They asked that I create a Spark tracking issue.


On Thursday, January 21, 2016 at 6:27:25 PM UTC, Andy Davidson wrote:
Hi All

Quick problem summary:

My system uses the Status objects to do some analysis; however, I need to store 
the raw JSON. There are other systems that process that data that are not 
written in Java.
Currently we are serializing the Status object. The resulting JSON is going to 
break downstream systems.
I am using the Apache Spark Streaming spark-streaming-twitter_2.10 package:
http://spark.apache.org/docs/latest/streaming-programming-guide.html#advanced-sources

Request For Enhancement:
I imagine easy access to the raw JSON is a common requirement. Would it be 
possible to add a member function getRawJson() to StatusJSONImpl? By default 
the returned value would be null unless jsonStoreEnabled=True is set in the 
config.


Alternative implementations:
 

It should be possible to modify spark-streaming-twitter_2.10 to provide 
this support. The solution is not very clean:

It would require Apache Spark to define its own Status POJO. The current 
StatusJSONImpl class is marked final.
The wrapper is not going to work nicely with existing code.
spark-streaming-twitter_2.10 does not expose all of the Twitter streaming API, 
so many developers are writing their own implementations of 
org.apache.spark.streaming.twitter.TwitterInputDStream. This makes maintenance 
difficult. It's not easy to know when the Spark implementation for Twitter has 
changed.
Code listing for 
spark-1.6.0/external/twitter/src/main/scala/org/apache/spark/streaming/twitter/TwitterInputDStream.scala

private[streaming]
class TwitterReceiver(
    twitterAuth: Authorization,
    filters: Seq[String],
    storageLevel: StorageLevel
  ) extends Receiver[Status](storageLevel) with Logging {

  @volatile private var twitterStream: TwitterStream = _
  @volatile private var stopped = false

  def onStart() {
    try {
      val newTwitterStream = new TwitterStreamFactory().getInstance(twitterAuth)
      newTwitterStream.addListener(new StatusListener {
        def onStatus(status: Status): Unit = {
          store(status)
        }

Ref: https://forum.processing.org/one/topic/saving-json-data-from-twitter4j.html

What do people think?

Kind regards

Andy

From: <twit...@googlegroups.com> on behalf of Igor Brigadir 
<igor.b...@ucdconnect.ie>
Reply-To: <twit...@googlegroups.com>
Date: Tuesday, January 19, 2016 at 5:55 AM
To: Twitter4J <twit...@googlegroups.com>
Subject: Re: [Twitter4J] trouble writing unit test

The main issue is that the JSON object is in the wrong JSON format.

e.g. "createdAt": 1449775664000 should be "created_at": "Thu Dec 10 19:27:44 
+0000 2015", ...

It looks like the JSON you have was serialized from a Java Status object, which 
produces JSON objects different from what you get from the API. 
TwitterObjectFactory expects JSON from Twitter (I haven't had any problems using 
TwitterObjectFactory instead of the deprecated DataObjectFactory).

You could "fix" it by matching the keys & values you have with the correct 
Twitter API JSON. It should look like the example here: 
https://dev.twitter.com/rest/reference/get/statuses/show/%3Aid
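
To make the key/value mismatch concrete for the one field quoted above, here is 
a minimal Java sketch that uses only the JDK. The class name CreatedAtFormat 
and the method toTwitterDate are made up for illustration and are not part of 
Twitter4J or Spark. It converts the epoch-millisecond "createdAt" value into the 
string form used by "created_at"; a hand-written fix would need similar 
treatment for every other field that differs.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.util.Locale;

// Hypothetical helper, for illustration only.
final class CreatedAtFormat {

    // Date layout seen in the example above: "Thu Dec 10 19:27:44 +0000 2015".
    private static final DateTimeFormatter TWITTER_DATE =
            DateTimeFormatter.ofPattern("EEE MMM dd HH:mm:ss Z yyyy", Locale.ENGLISH);

    // Convert the epoch-millisecond value found in a serialized Status object
    // into the string form that the raw API JSON uses for "created_at".
    static String toTwitterDate(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis)
                .atOffset(ZoneOffset.UTC)
                .format(TWITTER_DATE);
    }

    public static void main(String[] args) {
        // Value taken from the example in this message.
        System.out.println(toTwitterDate(1449775664000L));
    }
}
```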

But it might be easier to download the tweets again, this time using 
TwitterObjectFactory.getRawJSON(status) to get the original JSON from the 
Twitter API, and save that for later. (You must have jsonStoreEnabled=True in 
your config, and call getRawJSON in the same thread as .showStatus() or 
lookup() or whatever you're using to load tweets.)
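
The same-thread requirement is the part that matters for Spark, where a Status 
stored by the receiver is normally processed later on a different thread. The 
sketch below is not Twitter4J code; it assumes the raw JSON is kept in a 
per-thread cache, and the names RawJsonThreadDemo, remember, lastRawJson and 
readFromOtherThread are hypothetical. Under that assumption it shows why a 
lookup from another thread would come back null.

```java
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a per-thread raw JSON cache; not Twitter4J code.
final class RawJsonThreadDemo {

    // Each thread sees only the value it stored itself.
    private static final ThreadLocal<String> LAST_RAW_JSON = new ThreadLocal<>();

    static void remember(String rawJson) {
        LAST_RAW_JSON.set(rawJson);
    }

    static String lastRawJson() {
        return LAST_RAW_JSON.get();
    }

    // Read the cache from a freshly started thread and report what it saw.
    static String readFromOtherThread() {
        AtomicReference<String> seen = new AtomicReference<>();
        Thread reader = new Thread(() -> seen.set(lastRawJson()));
        reader.start();
        try {
            reader.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return seen.get();
    }

    public static void main(String[] args) {
        remember("{\"created_at\":\"Thu Dec 10 19:27:44 +0000 2015\"}");
        System.out.println("same thread:  " + lastRawJson());
        System.out.println("other thread: " + readFromOtherThread());
    }
}
```

If that assumption holds, the raw JSON would have to be captured inside the 
listener callback, before the Status is handed to store().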







--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
