Cool, thanks for the link.

Bertrand Dechoux
On Mon, Apr 21, 2014 at 7:31 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Also see: https://github.com/apache/spark/pull/455
>
> This will add support for reading SequenceFiles and other InputFormats in
> PySpark, as long as the Writables are either simple (primitives, and maps
> and arrays of the same) or reasonably simple Java objects.
>
> I'm about to push a change from MsgPack to Pyrolite for the serialization.
>
> Support for saving as SequenceFile or InputFormat could then also come
> after that. It would be based on saving pickle-able Python objects as a
> sequence file and being able to read those back.
>
> --
> Sent from Mailbox <https://www.dropbox.com/mailbox> for iPhone
>
>
> On Thu, Apr 17, 2014 at 11:40 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>
>> According to the Spark SQL documentation, this project does indeed allow
>> Python to be used while reading/writing tables, i.e. data that is not
>> necessarily in text format.
>>
>> Thanks a lot!
>>
>> Bertrand Dechoux
>>
>>
>> On Thu, Apr 17, 2014 at 10:06 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>
>>> Thanks for the JIRA reference. I really need to look at Spark SQL.
>>>
>>> Am I right to understand that thanks to Spark SQL, Hive data can be read
>>> (and it does not need to be in a text format), and that 'classical' Spark
>>> can then work on this extraction?
>>>
>>> It seems logical, but I haven't worked with Spark SQL as of now.
>>>
>>> Does it also imply the reverse is true? That I can write data as Hive
>>> data with Spark SQL using results from an arbitrary (Python) Spark
>>> application?
>>>
>>> Bertrand Dechoux
>>>
>>>
>>> On Thu, Apr 17, 2014 at 7:23 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>
>>>> Yes, this JIRA would enable that. The Hive support also handles HDFS.
>>>>
>>>> Matei
>>>>
>>>> On Apr 16, 2014, at 9:55 PM, Jesvin Jose <frank.einst...@gmail.com> wrote:
>>>>
>>>> When this is implemented, can you load/save an RDD of pickled objects
>>>> to HDFS?
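[Editor's note: Nick's message above describes saving pickle-able Python objects and reading them back. A minimal pure-Python sketch of that round-trip idea, assuming one pickle frame per record in a single binary file; the file name and helper functions are illustrative, not Spark's actual on-disk format:]

```python
import pickle

def save_records(path, records):
    """Append each record to the file as its own pickle frame."""
    with open(path, "wb") as f:
        for rec in records:
            pickle.dump(rec, f)

def load_records(path):
    """Read pickle frames back until the file is exhausted."""
    out = []
    with open(path, "rb") as f:
        while True:
            try:
                out.append(pickle.load(f))
            except EOFError:
                break
    return out

records = [{"user": "bertrand", "n": 1}, {"user": "nick", "n": 2}]
save_records("demo.pkl", records)
assert load_records("demo.pkl") == records
```

In Spark itself, each partition would be written and read this way in parallel; the sketch only shows the per-file mechanics.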
>>>>
>>>>
>>>> On Thu, Apr 17, 2014 at 1:51 AM, Matei Zaharia <matei.zaha...@gmail.com> wrote:
>>>>
>>>>> Hi Bertrand,
>>>>>
>>>>> We should probably add a SparkContext.pickleFile and an
>>>>> RDD.saveAsPickleFile that will allow saving pickled objects. Unfortunately
>>>>> this is not in yet, but there is an issue up to track it:
>>>>> https://issues.apache.org/jira/browse/SPARK-1161.
>>>>>
>>>>> In 1.0, one feature we do have now is the ability to load binary data
>>>>> from Hive using Spark SQL's Python API. Later we will also be able to save
>>>>> to Hive.
>>>>>
>>>>> Matei
>>>>>
>>>>> On Apr 16, 2014, at 4:27 AM, Bertrand Dechoux <decho...@gmail.com> wrote:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have browsed the online documentation, and it is stated that
>>>>> > PySpark only reads text files as sources. Is that still the case?
>>>>> >
>>>>> > From what I understand, after this first step the RDD can hold any
>>>>> > serialized Python structure, as long as the class definitions are well
>>>>> > distributed.
>>>>> >
>>>>> > Is it not possible to read back those RDDs? That is, create a flow to
>>>>> > parse everything and then, e.g. the next week, start from the binary,
>>>>> > structured data?
>>>>> >
>>>>> > Technically, what is the difficulty? I would assume the code reading
>>>>> > a binary Python RDD and the code reading a binary Python file to be
>>>>> > quite similar. Where can I learn more about this subject?
>>>>> >
>>>>> > Thanks in advance
>>>>> >
>>>>> > Bertrand
>>>>
>>>>
>>>> --
>>>> We don't beat the reaper by living longer. We beat the reaper by living
>>>> well and living fully. The reaper will come for all of us. The question is,
>>>> what do we do between the time we are born and the time he shows up? -Randy
>>>> Pausch
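[Editor's note: Bertrand's original point is that an RDD element can be any pickle-able Python structure, provided the class definition is importable on the reading side (in Spark, shipped to the workers). A minimal illustration of that round trip through binary pickle bytes; the `Event` class is a made-up example:]

```python
import pickle

class Event:
    """A hypothetical user-defined record type."""
    def __init__(self, name, payload):
        self.name = name
        self.payload = payload

    def __eq__(self, other):
        return (self.name, self.payload) == (other.name, other.payload)

original = Event("click", {"x": 3, "y": 7})
blob = pickle.dumps(original)       # binary bytes, not text
restored = pickle.loads(blob)       # works because Event is importable here
assert restored == original
```

This is why a text-only input path is limiting: the bytes above are not line-oriented text, so reading them back requires a binary-aware loader rather than a text parser.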