On 18 Oct 2013, at 10:10, Peter Aberline <[email protected]> wrote:
> Hi
>
> I've just noticed that the ability to read sequence files does not look like
> it's been implemented yet by the PySpark API?
>
> Would it be a difficult task for me to add this feature without being
> familiar with the code base?
>
> Alternatively, is there any work around for this? My data is in a single very
> large sequence file containing > 250,000 elements. My code is already in
> Python. I'm writing the sequence file using Pydoop, so perhaps there is a way
> to build an RDD by reading it in via Pydoop?
>
> Thanks,
> Peter

Hi again,

I've been taking a look at the source to see how hard it would be to implement this, and I can see that many Python API methods are simply wrappers that call methods in the Scala/Java API through a Python-managed 'JavaSparkContext'. So far, so good. I think I should be able to add a corresponding sequenceFile method to context.py that calls the corresponding method on the JavaSparkContext.

However, I'm struggling with how to represent the key and value types in Python and have them automagically mapped to Java types. Of course, if I get this working a PR will follow.

Thanks,
Peter
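[Editor's note: the delegation pattern described above can be sketched with a minimal mock. The class and method names below are illustrative stand-ins, not the real PySpark or Py4J internals, so the example runs without a JVM.]

```python
# Sketch of the PySpark wrapper pattern: a Python-side context that
# forwards calls to a Java-side context. FakeJavaSparkContext stands in
# for the Py4J proxy to the real JavaSparkContext; all names and
# signatures here are illustrative assumptions.

class FakeJavaSparkContext:
    """Pretend Java-side context reached over the Py4J gateway."""
    def sequenceFile(self, path, minSplits):
        # The real Java method would return a JavaPairRDD; we return a
        # string so the example is self-contained.
        return "JavaPairRDD(%s, minSplits=%d)" % (path, minSplits)

class SparkContext:
    """Python-side context that delegates to the Java side."""
    def __init__(self, jsc):
        self._jsc = jsc  # the Python-managed JavaSparkContext

    def sequenceFile(self, path, min_splits=2):
        # A hypothetical sequenceFile wrapper, mirroring how other
        # context.py methods delegate to self._jsc. The open question
        # from the mail is how the Writable key/value types would be
        # mapped to Python objects at this boundary.
        return self._jsc.sequenceFile(path, min_splits)

sc = SparkContext(FakeJavaSparkContext())
print(sc.sequenceFile("hdfs:///data/events.seq"))
# prints "JavaPairRDD(hdfs:///data/events.seq, minSplits=2)"
```

The point of the sketch is only the shape of the call chain: the Python method does no real work itself, it hands the arguments to the gateway object and wraps whatever comes back.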
