Re: Is HDFS the only possible data source for spark with python?

Matei Zaharia Wed, 05 Feb 2014 14:35:37 -0800

Hi,

Spark Streaming is a module that builds on top of the core Spark engine, but 
unfortunately we haven’t added a Python API for it yet. The main problem is 
actually that Spark Streaming needs to remember the setup of the computation 
(what streams you created and what functions you applied to them) in a 
serializable way to provide master fault recovery, and we haven’t yet 
implemented that in Python. So if you need to do streaming computation, you’ll 
have to use Scala or Java for now.
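
To illustrate what "remembering the setup in a serializable way" means, here 
is a toy sketch in plain Python. It is not how Spark Streaming is implemented; 
StreamPlan, REGISTRY, and the source name "example-source" are all 
hypothetical names made up for this example. The idea is that the plan is 
stored as plain data (which source, which named steps), so it can be written 
out and rebuilt after a failure.

```python
import json

# Hypothetical step functions. They are referred to by name so that the
# plan itself stays plain data and can be serialized.
def parse_int(line):
    return int(line)

def is_even(n):
    return n % 2 == 0

REGISTRY = {"parse_int": parse_int, "is_even": is_even}


class StreamPlan:
    """Toy record of which source was created and which steps were applied."""

    def __init__(self, source, steps=None):
        self.source = source
        self.steps = list(steps or [])

    def map(self, fn_name):
        return StreamPlan(self.source, self.steps + [["map", fn_name]])

    def filter(self, fn_name):
        return StreamPlan(self.source, self.steps + [["filter", fn_name]])

    def to_checkpoint(self):
        # Only strings and lists are stored, so JSON is enough here.
        return json.dumps({"source": self.source, "steps": self.steps})

    @classmethod
    def from_checkpoint(cls, text):
        data = json.loads(text)
        return cls(data["source"], data["steps"])

    def run(self, batch):
        # Apply the recorded steps, in order, to one batch of input.
        items = list(batch)
        for op, fn_name in self.steps:
            fn = REGISTRY[fn_name]
            if op == "map":
                items = [fn(x) for x in items]
            elif op == "filter":
                items = [x for x in items if fn(x)]
            else:
                raise ValueError("unknown step: " + op)
        return items


plan = StreamPlan("example-source").map("parse_int").filter("is_even")
checkpoint = plan.to_checkpoint()

# Pretend the driver died here; rebuild the plan from the saved text.
recovered = StreamPlan.from_checkpoint(checkpoint)
result = recovered.run(["1", "2", "3", "4"])
```

The sketch sidesteps the hard part by looking functions up by name. Arbitrary 
Python functions and closures are much harder to save and restore, which is 
the piece that is not yet done for Python.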

The Python API for Spark currently only supports HDFS, but adding another data 
source there is easier. There is some work in progress to do that, first to 
allow more InputFormats from HDFS. We’re definitely interested in seeing more 
input sources for it, so if you might be able to help out with it, we’d 
appreciate that. Otherwise just knowing the top N sources you’d like would be 
great. We do want to keep expanding the Python API.

The core Spark engine can use any data source for which there is a Hadoop 
InputFormat class, which I’m pretty sure includes Solr, HBase, and other 
things. For streaming, the input sources are separate, but you can find them 
listed on the streaming page. It’s just a matter of exposing these to Python in 
a way that reasonably passes around binary data types (that’s why we started 
with HDFS text).
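
As a rough illustration of why text was the easy starting point: a line of 
text is already a simple value to hand to Python, while a binary key/value 
record may contain newlines or bytes that are not valid text. The sketch below 
shows one possible way to carry such a record through a line-oriented 
interface. It is only an assumption for illustration, not what the Python API 
does; encode_record and decode_record are hypothetical helpers.

```python
import base64
import struct


def encode_record(key: bytes, value: bytes) -> str:
    """Pack a binary key/value pair into one newline-free ASCII line."""
    # 4-byte big-endian key length, then the key bytes, then the value bytes.
    payload = struct.pack(">I", len(key)) + key + value
    return base64.b64encode(payload).decode("ascii")


def decode_record(line: str):
    """Reverse encode_record and return (key, value) as bytes."""
    payload = base64.b64decode(line)
    (key_len,) = struct.unpack(">I", payload[:4])
    key = payload[4:4 + key_len]
    value = payload[4 + key_len:]
    return key, value


# The value contains a NUL byte, a non-UTF-8 byte, and a newline.
line = encode_record(b"row-1", b"\x00\xff\nbinary")
key, value = decode_record(line)
```

A real binding would more likely use a proper serialization format than 
base64 lines, but the round trip shows the kind of conversion that has to be 
settled before other sources can be exposed.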

Matei

On Feb 5, 2014, at 2:25 PM, cwhiten <[email protected]> wrote:

> I'm evaluating whether Spark would be a good fit in my current streaming data
> processing pipeline, and I'm just a bit confused about the differentiation
> between spark and spark streaming.  
> 
> Spark seems to have a mature Python API that I plan on trying out, but Spark
> Streaming appears to NOT have a Python API.  What is the key differentiator
> here?  Does this mean that the only possible data source when using Python
> is HDFS?  Or is it possible to grab data from ZeroMQ to process in Python? 
> Going even further, can you process data from other interesting data stores
> (for example, a Solr index)? 
> 
> Thanks in advance for any response.  I'm just trying to get a grasp on the
> data source possibilities, and how that impacts language/technology choices.
> 
> 
> 
> --
> View this message in context: 
> http://apache-spark-user-list.1001560.n3.nabble.com/Is-HDFS-the-only-possible-data-source-for-spark-with-python-tp1257.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
