Hi Peter, just an idea: if you don't mind a preprocessing step, you could use Pydoop to write out a sequence of files that contain your pickled Python objects and read these into PySpark (see read_from_pickle_file inside serializers.py).
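Something along these lines might work as the preprocessing step (untested
sketch; the path is a placeholder, and the 4-byte big-endian length framing
is my assumption of what read_from_pickle_file expects, so check it against
serializers.py):

    import cPickle as pickle
    import struct

    import pydoop.hdfs as hdfs

    def write_pickle_file(path, objects):
        # Each record: a 4-byte big-endian length, then the pickled payload.
        f = hdfs.open(path, "w")
        try:
            for obj in objects:
                data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
                f.write(struct.pack(">i", len(data)))
                f.write(data)
        finally:
            f.close()

    def read_pickle_file(path):
        # Inverse of the above: read length-prefixed records until EOF.
        objs = []
        f = hdfs.open(path, "r")
        try:
            while True:
                header = f.read(4)
                if not header:
                    break
                (length,) = struct.unpack(">i", header)
                objs.append(pickle.loads(f.read(length)))
        finally:
            f.close()
        return objs

As a stopgap, if the data fits on one machine, you could even sidestep the
file format question and just do
sc.parallelize(read_pickle_file("/user/peter/objects.pkl")).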
How large is your input in total? Does it fit on one machine? Do you have
"complicated" nested objects?

BTW: just out of curiosity, what do you use Pydoop for? Some
bioinformatics-related things?

(I've also put a rough sketch of what your sequenceFile wrapper might look
like below your quoted mail.)

Andre

On 10/18/2013 04:56 AM, Peter Aberline wrote:
>
> On 18 Oct 2013, at 10:10, Peter Aberline <[email protected]> wrote:
>
>> Hi
>>
>> I've just noticed that the ability to read sequence files does not look
>> like it's been implemented yet by the PySpark API?
>>
>> Would it be a difficult task for me to add this feature without being
>> familiar with the code base?
>>
>> Alternatively, is there any workaround for this? My data is in a single
>> very large sequence file containing > 250,000 elements. My code is
>> already in Python. I'm writing the sequence file using Pydoop, so
>> perhaps there is a way to build an RDD by reading it in via Pydoop?
>>
>> Thanks,
>> Peter
>
>
> Hi again,
>
> I've been taking a look at the source to see how hard it would be to
> implement this, and I can see that many Python API methods are simply
> wrappers that call methods in the Scala/Java API using a Python-managed
> 'JavaSparkContext'.
>
> So far, so good. I think I should be able to add a corresponding
> sequenceFile method to context.py which calls the corresponding method
> in the JavaSparkContext. However, I'm struggling with how to represent
> the key and value types in Python and have them automagically mapped to
> Java types?
>
> Of course, if I get this working a PR will follow.
>
> Thanks
> Peter
>
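PS: regarding the sequenceFile wrapper you sketch above, something like the
following might be the rough shape (untested; _jsc and _jvm are the py4j
handles that context.py keeps, the Writable class names are just examples,
and the last step is exactly the open question you raise):

    def sequenceFile(self, path,
                     key_class="org.apache.hadoop.io.Text",
                     value_class="org.apache.hadoop.io.Text",
                     min_splits=2):
        # Resolve the Hadoop Writable classes on the JVM via py4j.
        kc = self._jvm.java.lang.Class.forName(key_class)
        vc = self._jvm.java.lang.Class.forName(value_class)
        # JavaSparkContext.sequenceFile returns a JavaPairRDD whose
        # elements are still JVM-side Writable objects at this point.
        jrdd = self._jsc.sequenceFile(path, kc, vc, min_splits)
        # Missing piece: some JVM-side helper has to turn each (key,
        # value) pair into bytes Python can unpickle before this can be
        # wrapped in a Python RDD -- the "automagic" type mapping you
        # mention. Passing the class names as strings (as above) and
        # resolving them on the Scala side might be the simplest route.
        return jrdd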
