On 18 Oct 2013, at 10:10, Peter Aberline <[email protected]> wrote:
> Hi
>
> I've just noticed that the ability to read sequence files does not look like
> it's been implemented yet by the PySpark API?
>
> Would it be a difficult task for me to add this feature without being
> familiar with the code base?
>
> Alternatively, is there any work around for this? My data is in a single very
> large sequence file containing > 250,000 elements. My code is already in
> Python. I'm writing the sequence file using Pydoop, so perhaps there is a way
> to build an RDD by reading it in via Pydoop?
>
> Thanks,
> Peter

Hi again,

I've been taking a look at the source to see how hard it would be to implement this, and I can see that many Python API methods are simply wrappers that call methods in the Scala/Java API through a Python-managed 'JavaSparkContext'. So far, so good. I think I should be able to add a corresponding sequenceFile method to context.py that calls the corresponding method on the JavaSparkContext.

However, I'm struggling with how to represent the key and value types in Python and have them automagically mapped to Java types. Of course, if I get this working a PR will follow.

Thanks,
Peter
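[Editor's note: the delegation pattern described above can be sketched with a minimal mock. The class and method names below are illustrative stand-ins, not the real PySpark or Py4J internals, so the example runs without a JVM.]

```python
# Sketch of the PySpark wrapper pattern: a Python-side context that
# forwards calls to a Java-side context. FakeJavaSparkContext stands in
# for the Py4J proxy to the real JavaSparkContext; all names and
# signatures here are illustrative assumptions.

class FakeJavaSparkContext:
    """Pretend Java-side context reached over the Py4J gateway."""
    def sequenceFile(self, path, minSplits):
        # The real Java method would return a JavaPairRDD; we return a
        # string so the example is self-contained.
        return "JavaPairRDD(%s, minSplits=%d)" % (path, minSplits)

class SparkContext:
    """Python-side context that delegates to the Java side."""
    def __init__(self, jsc):
        self._jsc = jsc  # the Python-managed JavaSparkContext

    def sequenceFile(self, path, min_splits=2):
        # A hypothetical sequenceFile wrapper, mirroring how other
        # context.py methods delegate to self._jsc. The open question
        # from the mail is how the Writable key/value types would be
        # mapped to Python objects at this boundary.
        return self._jsc.sequenceFile(path, min_splits)

sc = SparkContext(FakeJavaSparkContext())
print(sc.sequenceFile("hdfs:///data/events.seq"))
# prints "JavaPairRDD(hdfs:///data/events.seq, minSplits=2)"
```

The point of the sketch is only the shape of the call chain: the Python method does no real work itself, it hands the arguments to the gateway object and wraps whatever comes back.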
