Hi,

Spark Streaming is a module that builds on top of the core Spark engine, but unfortunately we haven’t added a Python API for it yet. The main obstacle is that Spark Streaming needs to remember the setup of the computation (what streams you created and what functions you applied to them) in a serializable form so that it can recover from master failures, and we haven’t yet implemented that in Python. So if you need to do streaming computation, you’ll have to use Scala or Java for now.
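
As a rough illustration of what "remembering the setup in a serializable way" involves, here is a minimal sketch using only Python's standard `pickle` module. The `named_plan` and `lambda_plan` lists are hypothetical stand-ins for a recorded streaming setup, not Spark data structures, and this shows only the general difficulty of serializing arbitrary Python functions, not the actual mechanism inside Spark Streaming (PySpark ships functions to workers with its own serializer).

```python
import pickle

# Hypothetical record of a streaming setup: (operation, function) pairs.
# `len` is a built-in, so pickle can store it by reference (module + name).
named_plan = [("map", len)]

# The same step written as an anonymous function. A lambda has no importable
# name, so the standard pickle module cannot store a reference to it.
lambda_plan = [("map", lambda line: len(line))]


def can_pickle(obj):
    """Return True if the standard-library pickle can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


named_ok = can_pickle(named_plan)
lambda_ok = can_pickle(lambda_plan)

print("plan with named function picklable:", named_ok)
print("plan with lambda picklable:", lambda_ok)
```

The plan built from a named function can be written out and reloaded later, while the one built from a lambda cannot, at least not with the standard pickler. A recoverable streaming job would need every user-supplied function in its setup to survive that kind of round trip.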

The Python API for Spark currently only supports HDFS, but adding another data source there is easier. There is some work in progress to do that, first to allow more InputFormats from HDFS. We’re definitely interested in seeing more input sources for it, so if you might be able to help out with it, we’d appreciate that. Otherwise, just knowing the top N sources you’d like would be great. We do want to keep expanding the Python API.

The core Spark engine can use any data source for which there is a Hadoop InputFormat class, which I’m pretty sure includes Solr, HBase, and other things. For streaming, the input sources are separate, but you can find them listed on the streaming page. It’s just a matter of exposing these to Python in a way that reasonably passes around binary data types (that’s why we started with HDFS text).

Matei

On Feb 5, 2014, at 2:25 PM, cwhiten <[email protected]> wrote:

> I'm evaluating whether Spark would be a good fit in my current streaming data
> processing pipeline, and I'm just a bit confused about the differentiation
> between spark and spark streaming.
>
> Spark seems to have a mature Python API that I plan on trying out, but Spark
> Streaming appears to NOT have a Python API. What is the key differentiator
> here? Does this mean that the only possible data source when using Python
> is HDFS? Or is it possible to grab data from ZeroMQ to process in Python?
> Going even further, can you process data from other interesting data stores
> (for example, a Solr index)?
>
> Thanks in advance for any response. I'm just trying to get a grasp on the
> data source possibilities, and how that impacts language/technology choices.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-HDFS-the-only-possible-data-source-for-spark-with-python-tp1257.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
