Re: Spark Structured Streaming for Twitter Streaming data

Divya Gehlot Wed, 31 Jan 2018 19:39:38 -0800

Got it. Thanks for the clarification, TD!

On Thu, 1 Feb 2018 at 11:36 AM, Tathagata Das <tathagata.das1...@gmail.com>
wrote:


> The code uses the format "socket", which is only for text sent over a
> simple socket and is completely different from how the Twitter APIs work.
> So this won't work at all.
> Fundamentally, for Structured Streaming, we have focused only on those
> streaming sources that support record-level offset tracking
> (e.g. Kafka offsets) and replayability, in order to give strong exactly-once
> fault-tolerance guarantees. Hence we have focused on files, Kafka, and Kinesis
> (socket is just for testing, as is documented). The Twitter APIs as a source
> do not provide those, hence we have not focused on building one. In
> general, for such sources (ones that are not perfectly replayable), there
> are two possible solutions.
>
> 1. Build your own source: A quick Google search shows that others in the
> community have attempted to build Structured Streaming sources for Twitter.
> It won't provide the same fault-tolerance guarantees as Kafka, etc. However,
> I don't recommend this now because the DataSource APIs to build streaming
> sources are not public yet, and are in flux.
>
> 2. Use Kafka/Kinesis as an intermediate system: Write something simple
> that uses Twitter APIs directly to read tweets and write them into
> Kafka/Kinesis. And then just read from Kafka/Kinesis.
>
> Hope this helps.
>
> TD
>
> On Wed, Jan 31, 2018 at 7:18 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Hi,
>> I see. Does that mean Spark Structured Streaming doesn't work with
>> Twitter streams?
>> I could see that people have used Kafka or other streaming tools, and then
>> used Spark to process the data with Structured Streaming.
>>
>> So the code below won't work directly with a Twitter stream until I set up
>> Kafka?
>>
>>> import org.apache.spark.sql.SparkSession
>>>
>>> val spark = SparkSession
>>>   .builder()
>>>   .appName("Spark SQL basic example")
>>>   .config("spark.some.config.option", "some-value")
>>>   .getOrCreate()
>>>
>>> // For implicit conversions like converting RDDs to DataFrames
>>> import spark.implicits._
>>>
>>> // Read text from socket
>>> val socketDF = spark
>>>   .readStream
>>>   .format("socket")
>>>   .option("host", "localhost")
>>>   .option("port", 9999)
>>>   .load()
>>>
>>> socketDF.isStreaming    // Returns true for DataFrames that have streaming sources
>>>
>>> socketDF.printSchema
>>
>>
>> Thanks,
>> Divya
>>
>> On 1 February 2018 at 10:30, Tathagata Das <tathagata.das1...@gmail.com>
>> wrote:
>>
>>> Hello Divya,
>>>
>>> To add further clarification, Apache Bahir does not have any
>>> Structured Streaming support for Twitter. It only has support for Twitter +
>>> DStreams.
>>>
>>> TD
>>>
>>>
>>>
>>> On Wed, Jan 31, 2018 at 2:44 AM, vermanurag <
>>> anurag.ve...@fnmathlogic.com> wrote:
>>>
>>>> Twitter functionality is not part of core Spark. We have successfully
>>>> used the following package from Maven Central in the past:
>>>>
>>>> org.apache.bahir:spark-streaming-twitter_2.11:2.2.0
>>>>
>>>> Earlier there used to be a Twitter package under Spark, but I find that
>>>> it has not been updated beyond Spark 1.6:
>>>> org.apache.spark:spark-streaming-twitter_2.11:1.6.0
>>>>
>>>> Anurag
>>>> www.fnmathlogic.com
>>>
>>
>
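
For reference, the read side of option 2 in TD's reply (Kafka as an intermediate system) might look roughly like the sketch below. It is not from the thread. It assumes that some separate process is already writing tweets as text into a Kafka topic, that the spark-sql-kafka-0-10 connector is on the classpath, and that the broker address (localhost:9092) and topic name (tweets) are placeholders to be replaced with real values.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object TweetsFromKafka {
  // Assumed values, not from the thread: broker address and topic name.
  val kafkaOptions: Map[String, String] = Map(
    "kafka.bootstrap.servers" -> "localhost:9092",
    "subscribe" -> "tweets"
  )

  def main(args: Array[String]): Unit = {
    val spark = SparkSession
      .builder()
      .appName("TweetsFromKafka")
      .getOrCreate()

    // The Kafka source exposes key and value as binary columns,
    // so cast the value to a string to get the tweet text.
    val tweets: DataFrame = spark
      .readStream
      .format("kafka")
      .options(kafkaOptions)
      .load()
      .selectExpr("CAST(value AS STRING) AS tweet")

    // Print each micro-batch to the console; useful only for a quick check.
    val query = tweets
      .writeStream
      .format("console")
      .outputMode("append")
      .start()

    query.awaitTermination()
  }
}
```

Because Kafka tracks offsets and can replay records, this keeps the Structured Streaming job on a source of the kind TD describes, while the Twitter-specific code stays in the small program that feeds the topic.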
