Hi Peter, just an idea: if you don't mind a preprocessing step, you could use Pydoop to write out a sequence of files that contain your pickled Python objects and read these into PySpark (see read_from_pickle_file inside serializers.py).
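Something along these lines might work as the preprocessing step (untested
sketch; the path is a placeholder, and the 4-byte big-endian length framing
is my assumption of what read_from_pickle_file expects, so check it against
serializers.py):

    import cPickle as pickle
    import struct

    import pydoop.hdfs as hdfs

    def write_pickle_file(path, objects):
        # Each record: a 4-byte big-endian length, then the pickled payload.
        f = hdfs.open(path, "w")
        try:
            for obj in objects:
                data = pickle.dumps(obj, pickle.HIGHEST_PROTOCOL)
                f.write(struct.pack(">i", len(data)))
                f.write(data)
        finally:
            f.close()

    def read_pickle_file(path):
        # Inverse of the above: read length-prefixed records until EOF.
        objs = []
        f = hdfs.open(path, "r")
        try:
            while True:
                header = f.read(4)
                if not header:
                    break
                (length,) = struct.unpack(">i", header)
                objs.append(pickle.loads(f.read(length)))
        finally:
            f.close()
        return objs

As a stopgap, if the data fits on one machine, you could even sidestep the
file format question and just do
sc.parallelize(read_pickle_file("/user/peter/objects.pkl")).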
How large is your input in total? Does it fit on one machine? Do you have
"complicated" nested objects?

BTW: just out of curiosity, what do you use Pydoop for? Some
bioinformatics-related things?

(I've also put a rough sketch of what your sequenceFile wrapper might look
like below your quoted mail.)

Andre

On 10/18/2013 04:56 AM, Peter Aberline wrote:
>
> On 18 Oct 2013, at 10:10, Peter Aberline <[email protected]> wrote:
>
>> Hi
>>
>> I've just noticed that the ability to read sequence files does not look
>> like it's been implemented yet by the PySpark API?
>>
>> Would it be a difficult task for me to add this feature without being
>> familiar with the code base?
>>
>> Alternatively, is there any workaround for this? My data is in a single
>> very large sequence file containing > 250,000 elements. My code is
>> already in Python. I'm writing the sequence file using Pydoop, so
>> perhaps there is a way to build an RDD by reading it in via Pydoop?
>>
>> Thanks,
>> Peter
>
>
> Hi again,
>
> I've been taking a look at the source to see how hard it would be to
> implement this, and I can see that many Python API methods are simply
> wrappers that call methods in the Scala/Java API using a Python-managed
> 'JavaSparkContext'.
>
> So far, so good. I think I should be able to add a corresponding
> sequenceFile method to context.py which calls the corresponding method
> in the JavaSparkContext. However, I'm struggling with how to represent
> the key and value types in Python and have them automagically mapped to
> Java types?
>
> Of course, if I get this working a PR will follow.
>
> Thanks
> Peter
>
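PS: regarding the sequenceFile wrapper you sketch above, something like the
following might be the rough shape (untested; _jsc and _jvm are the py4j
handles that context.py keeps, the Writable class names are just examples,
and the last step is exactly the open question you raise):

    def sequenceFile(self, path,
                     key_class="org.apache.hadoop.io.Text",
                     value_class="org.apache.hadoop.io.Text",
                     min_splits=2):
        # Resolve the Hadoop Writable classes on the JVM via py4j.
        kc = self._jvm.java.lang.Class.forName(key_class)
        vc = self._jvm.java.lang.Class.forName(value_class)
        # JavaSparkContext.sequenceFile returns a JavaPairRDD whose
        # elements are still JVM-side Writable objects at this point.
        jrdd = self._jsc.sequenceFile(path, kc, vc, min_splits)
        # Missing piece: some JVM-side helper has to turn each (key,
        # value) pair into bytes Python can unpickle before this can be
        # wrapped in a Python RDD -- the "automagic" type mapping you
        # mention. Passing the class names as strings (as above) and
        # resolving them on the Scala side might be the simplest route.
        return jrdd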
