Cool stuff! A pattern I have seen is to use our CSV/TSV or JSON support to
read Bro logs, rather than a Python library. This is likely to have much
better performance, since we can do all of the parsing on the JVM without
having to flow the data through an external Python process.
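For context, Bro writes tab-separated logs with `#`-prefixed metadata lines, which is what makes Spark's CSV reader (with its separator and comment options) a good fit for JVM-side parsing. A minimal pure-Python sketch of that layout, using an invented two-row sample and a trimmed field list:

```python
import csv
import io

# Invented two-row sample in Bro's TSV layout: '#'-prefixed metadata
# lines followed by tab-separated records (a real conn.log has many
# more fields than the three shown here).
sample = (
    "#separator \\x09\n"
    "#fields\tts\tuid\tid.orig_h\n"
    "1502200000.000000\tCabc123\t10.0.0.1\n"
    "1502200001.000000\tCdef456\t10.0.0.2\n"
)

# Spark would do the equivalent of this on the JVM, roughly:
#   spark.read.option("sep", "\t").option("comment", "#").csv(path)
# (hedged sketch, not a line from the original thread)
rows = [
    dict(zip(["ts", "uid", "id.orig_h"], line))
    for line in csv.reader(io.StringIO(sample), delimiter="\t")
    if not line[0].startswith("#")
]
print(rows[0]["id.orig_h"])  # → 10.0.0.1
```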
On Tue, Aug 8, 2017 at 9:35 AM, Brian Wylie <briford.wy...@gmail.com> wrote:
> Hi All,
> I've read the new information about Structured Streaming in Spark, and it
> looks super great.
> Resources that I've looked at
> - https://spark.apache.org/docs/latest/streaming-programming-guide.html
> - https://databricks.com/blog/2016/07/28/structured-
> - https://spark.apache.org/docs/latest/streaming-custom-receivers.html
> - http://cdn2.hubspot.net/hubfs/438089/notebooks/spark2.0/
> + YouTube videos from Spark Summit 2016/2017
> So finally getting to my question:
> I have Python code that yields a Python generator... this is a great
> streaming approach within Python. I've used it for network packet
> processing and a bunch of other stuff. I'd love to simply hook up this
> generator (which yields Python dictionaries), along with a schema
> definition, to create an 'unbounded DataFrame' as discussed in
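For concreteness, the generator-of-dicts pattern described above might look like the following sketch; the function name and field list are invented for illustration, not taken from the referenced notebook:

```python
# Hypothetical generator yielding one dict per parsed Bro log record.
def bro_records(lines):
    """Yield a dict per non-comment, tab-separated Bro log line."""
    fields = ["ts", "uid", "id.orig_h"]  # real logs carry many more
    for line in lines:
        if line.startswith("#"):
            continue  # skip Bro's '#'-prefixed metadata lines
        yield dict(zip(fields, line.rstrip("\n").split("\t")))

# Lazy by construction: nothing is parsed until a consumer pulls a record.
records = bro_records(["#fields\tts\tuid\tid.orig_h\n",
                       "1502200000.0\tCabc123\t10.0.0.1\n"])
print(next(records))
# → {'ts': '1502200000.0', 'uid': 'Cabc123', 'id.orig_h': '10.0.0.1'}
```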
> Possible approaches:
> - Make a custom receiver in Python: https://spark.apache.
> - Use Kafka (this is definitely possible and good, but overkill for my use
> case)
> - Send data out a socket and use socketTextStream to pull back in (seems a
> bit silly to me)
> - Other???
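Of the options above, the socket route can be sketched end-to-end in a few lines. Here a plain TCP client stands in for Spark's socket source; on the real Spark side this would be `ssc.socketTextStream(host, port)` (DStreams) or `spark.readStream.format("socket")` (Structured Streaming). The records are invented placeholders:

```python
import json
import socket
import threading

# Hedged sketch of the socket option: write generator output to a TCP
# socket one line at a time, and let Spark pull it back in.
srv = socket.socket()
srv.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
srv.listen(1)
host, port = srv.getsockname()

def feed():
    conn, _ = srv.accept()
    # Stand-in for the real generator of dicts from the thread.
    for record in ({"uid": "Cabc123"}, {"uid": "Cdef456"}):
        conn.sendall((json.dumps(record) + "\n").encode())
    conn.close()

t = threading.Thread(target=feed)
t.start()

# A plain client standing in for Spark's socket source.
client = socket.create_connection((host, port))
lines = client.makefile().read().splitlines()
client.close()
t.join()
srv.close()
print(lines[0])  # → {"uid": "Cabc123"}
```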
> Since Python generators so naturally fit into streaming pipelines, I'd
> think it would be straightforward to 'couple' a Python generator into a
> Spark Structured Streaming pipeline.
> I've put together a small notebook just to give a concrete example
> (streaming Bro IDS network data) https://github.com/
> Any thoughts/suggestions/pointers are greatly appreciated.