Cool, thanks for the link.

Bertrand Dechoux
On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Also see: https://github.com/apache/spark/pull/455
>
> This will add support for reading SequenceFiles and other InputFormats in
> PySpark, as long as the Writables are either simple (primitives, and maps
> and arrays of the same) or reasonably simple Java objects.
>
> I'm about to push a change from MsgPack to Pyrolite for the serialization.
>
> Support for saving as SequenceFile or InputFormat could then also come
> after that. It would be based on saving pickle-able Python objects as a
> sequence file and being able to read those back.
>
> --
> Sent from Mailbox <https://www.dropbox.com/mailbox> for iPhone
>
>
> On Thu, Apr 17, 2014 at 11:40 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
>> According to the Spark SQL documentation, this project does indeed allow
>> Python to be used while reading/writing tables, i.e. data that is not
>> necessarily in text format.
>>
>> Thanks a lot!
>>
>> Bertrand Dechoux
>>
>>
>> On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>
>>> Thanks for the JIRA reference. I really need to look at Spark SQL.
>>>
>>> Am I right to understand that thanks to Spark SQL, Hive data can be read
>>> (and it does not need to be in a text format), and that 'classical' Spark
>>> can then work on this extraction?
>>>
>>> It seems logical, but I haven't worked with Spark SQL as of now.
>>>
>>> Does it also imply the reverse is true? That I can write data as Hive
>>> data with Spark SQL using results from an arbitrary (Python) Spark
>>> application?
>>>
>>> Bertrand Dechoux
>>>
>>>
>>> On Thu, Apr 17, 2014 at 7:23 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>
>>>> Yes, this JIRA would enable that. The Hive support also handles HDFS.
>>>>
>>>> Matei
>>>>
>>>> On Apr 16, 2014, at 9:55 PM, Jesvin Jose <frank.einst...@gmail.com> wrote:
>>>>
>>>> When this is implemented, can you load/save an RDD of pickled objects
>>>> to HDFS?
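[Editor's note: Nick's message above describes saving pickle-able Python objects and reading them back. A minimal pure-Python sketch of that round-trip idea, assuming one pickle frame per record in a single binary file; the file name and helper functions are illustrative, not Spark's actual on-disk format:]

```python
import pickle

def save_records(path, records):
    """Append each record to the file as its own pickle frame."""
    with open(path, "wb") as f:
        for rec in records:
            pickle.dump(rec, f)

def load_records(path):
    """Read pickle frames back until the file is exhausted."""
    out = []
    with open(path, "rb") as f:
        while True:
            try:
                out.append(pickle.load(f))
            except EOFError:
                break
    return out

records = [{"user": "bertrand", "n": 1}, {"user": "nick", "n": 2}]
save_records("demo.pkl", records)
assert load_records("demo.pkl") == records
```

In Spark itself, each partition would be written and read this way in parallel; the sketch only shows the per-file mechanics.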
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> We should probably add a SparkContext.pickleFile and an
>>>>> RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately
>>>>> this is not in yet, but there is an issue up to track it:
>>>>> https://issues.apache.org/jira/browse/SPARK-1161.
>>>>>
>>>>> In 1.0, one feature we do have now is the ability to load binary data
>>>>> from Hive using Spark SQL's Python API. Later we will also be able to save
>>>>> to Hive.
>>>>>
>>>>> Matei
>>>>>
>>>>> On Apr 16, 2014, at 4:27 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have browsed the online documentation, and it is stated that
>>>>> > PySpark only reads text files as sources. Is that still the case?
>>>>> >
>>>>> > From what I understand, after this first step the RDD can hold any
>>>>> > serialized Python structure, as long as the class definitions are well
>>>>> > distributed.
>>>>> >
>>>>> > Is it not possible to read back those RDDs? That is, create a flow to
>>>>> > parse everything and then, e.g. the next week, start from the binary,
>>>>> > structured data?
>>>>> >
>>>>> > Technically, what is the difficulty? I would assume the code reading
>>>>> > a binary Python RDD and the code reading a binary Python file to be
>>>>> > quite similar. Where can I learn more about this subject?
>>>>> >
>>>>> > Thanks in advance
>>>>> >
>>>>> > Bertrand
>>>>
>>>>
>>>> --
>>>> We don't beat the reaper by living longer. We beat the reaper by living
>>>> well and living fully. The reaper will come for all of us. The question is,
>>>> what do we do between the time we are born and the time he shows up? -Randy
>>>> Pausch
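[Editor's note: Bertrand's original point is that an RDD element can be any pickle-able Python structure, provided the class definition is importable on the reading side (in Spark, shipped to the workers). A minimal illustration of that round trip through binary pickle bytes; the `Event` class is a made-up example:]

```python
import pickle

class Event:
    """A hypothetical user-defined record type."""
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload

    def __eq__(self, other):
        return (self.name, self.payload) == (other.name, other.payload)

original = Event("click", {"x": 3, "y": 7})
blob = pickle.dumps(original)       # binary bytes, not text
restored = pickle.loads(blob)       # works because Event is importable here
assert restored == original
```

This is why a text-only input path is limiting: the bytes above are not line-oriented text, so reading them back requires a binary-aware loader rather than a text parser.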