Hi All,

I've been reading the new material about Structured Streaming in Spark, and it
looks super great.

Resources that I've looked at
- https://spark.apache.org/docs/latest/streaming-programming-guide.html
- https://databricks.com/blog/2016/07/28/structured-streamin
- https://spark.apache.org/docs/latest/streaming-custom-receivers.html
- http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.

+ YouTube videos from Spark Summit 2016/2017

So finally getting to my question:

I have Python code structured as a Python generator... this is a great
streaming approach within Python. I've used it for network packet
processing and a bunch of other stuff. I'd love to simply hook up this
generator (which yields Python dictionaries) along with a schema definition
to create an 'unbounded DataFrame' as discussed in https://databricks.com/
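
For concreteness, here's a toy stand-in for the kind of generator I mean
(the field names here are invented, not my real schema):

```python
# Toy version of my generator: yields one dict per record, and the
# schema (field names/types) is known up front. Names are made up.
def flow_records():
    for i in range(3):
        yield {"ts": float(i), "src": "10.0.0.%d" % i, "bytes_out": i * 100}

records = list(flow_records())
print(records[0])
```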

Possible approaches:
- Make a custom receiver in Python: https://spark.apache.o
- Use Kafka (this is definitely possible and good, but overkill for my use
case)
- Send data out a socket and use socketTextStream to pull back in (seems a
bit silly to me)
- Other???

Since Python generators fit so naturally into streaming pipelines, I'd think
it would be straightforward to couple a Python generator into a
Spark Structured Streaming pipeline.
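
What I'd ideally write, minus the Spark part, is something like pulling
fixed-size micro-batches off the generator. Each batch of dicts is what I'd
hand to something like spark.createDataFrame(batch, schema) -- though that
gives a series of bounded DataFrames, not the unbounded DataFrame the docs
describe, which is exactly the gap I'm asking about:

```python
from itertools import islice

def micro_batches(generator, size):
    """Yield lists of up to `size` records pulled off the generator."""
    while True:
        batch = list(islice(generator, size))
        if not batch:
            return
        yield batch

# toy input standing in for my real generator of dicts
records = ({"n": i} for i in range(5))
batches = list(micro_batches(records, 2))
print([len(b) for b in batches])  # [2, 2, 1]
```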

I've put together a small notebook just to give a concrete example
(streaming Bro IDS network data) https://github.com/Kitwa

Any thoughts/suggestions/pointers are greatly appreciated.
