Hi

That is great to know, Jakob, thanks. So can I safely say that even if
future features are built on top of the Dataset APIs, the same functionality
will eventually be available in the Python APIs?

Going back to the older question: if the above statement is true, I do not see
a major feature parity difference between Python and Scala. Granted, Python
may be one minor release behind. Is that a safe assumption?

Or am I missing something? How about the Streaming & Structured Streaming APIs?
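
To make the dynamic-typing point from the quoted reply below concrete, here is
a minimal sketch in plain Python. It does not use the PySpark API: Person is a
hypothetical stand-in for a Row, and mean_age and misspelled_field are names
made up for illustration. It only shows that a misspelled field name is caught
when the code runs, not before, which is why a typed Dataset API would add
little in Python.

```python
from collections import namedtuple

# Hypothetical stand-in for a row of structured data (not pyspark.sql.Row).
Person = namedtuple("Person", ["name", "age"])

rows = [Person("Ann", 31), Person("Bob", 45)]


def mean_age(records):
    """Average the 'age' field; field access is resolved at runtime."""
    return sum(r.age for r in records) / len(records)


def misspelled_field(records):
    """Deliberate typo ('agee'): nothing flags it until this line executes."""
    return [r.agee for r in records]
```

Calling mean_age(rows) works, while misspelled_field(rows) raises an
AttributeError only when it is actually called.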

Best
Ayan

On Fri, Sep 2, 2016 at 9:08 AM, Jakob Odersky <ja...@odersky.com> wrote:

> Hi Mich,
>
> the functional difference between Datasets and DataFrames is virtually
> non-existent in Spark 2.0. Historically, DataFrames were the first
> implementation of a collection to use Catalyst, Spark SQL's query
> optimizer. Whilst bringing lots of performance benefits, DataFrames came at
> the expense of type safety since they are essentially a collection of "Row"
> objects regardless of the actual data they represented. Datasets were added
> in Spark 1.6 to add back type-safety whilst still taking advantage of
> Catalyst. In Spark 2.0 DataFrame became an alias for Dataset.
>
> Both Datasets and DataFrames are entry points to Catalyst that will run
> queries (aka transformations and actions) "on top of" RDDs. You can think
> of it this way: when applying an action on a Dataset, Catalyst basically
> will try to figure out which sequence of RDD transformations corresponds to
> the query and is the most efficient (in practice it is slightly more
> complex). Your statement that "Dataset is [...] basically an RDD with some
> optimization gone into it" is true in that regard :)
>
> best,
> --Jakob
>
> On Thu, Sep 1, 2016 at 3:15 PM, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>
>> Hi,
>>
>> Thanks I have already seen that link.
>>
>> We were discussing this topic on another thread today.
>>
>> "Difference between Data set and Data Frame in Spark 2"
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn: https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 1 September 2016 at 23:10, Peyman Mohajerian <mohaj...@gmail.com>
>> wrote:
>>
>>> https://databricks.com/blog/2016/07/14/a-tale-of-three-apache-spark-apis-rdds-dataframes-and-datasets.html
>>>
>>> On Thu, Sep 1, 2016 at 3:01 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> Hi Jakob.
>>>>
>>>> My understanding of Dataset is that it is basically an RDD with some
>>>> optimization gone into it. RDD is meant to deal with unstructured data?
>>>>
>>>> Now DataFrame is the tabular format of RDD designed for tabular work,
>>>> csv, SQL stuff etc.
>>>>
>>>> When you mention DataFrame is just an alias for Dataset[Row] does that
>>>> mean that it converts an RDD to a Dataset, thus producing a tabular format?
>>>>
>>>> Thanks
>>>>
>>>> On 1 September 2016 at 22:49, Jakob Odersky <ja...@odersky.com> wrote:
>>>>
>>>>> > However, what really worries me is not having Dataset APIs at all in
>>>>> > Python. I think that's a deal breaker.
>>>>>
>>>>> What is the functionality you are missing? In Spark 2.0 a DataFrame is
>>>>> just an alias for Dataset[Row] ("type DataFrame = Dataset[Row]" in
>>>>> core/.../o/a/s/sql/package.scala).
>>>>> Since Python is dynamically typed, you wouldn't really gain anything
>>>>> by using Datasets anyway.
>>>>>
>>>>> On Thu, Sep 1, 2016 at 2:20 PM, ayan guha <guha.a...@gmail.com> wrote:
>>>>>
>>>>>> Thanks All for your replies.
>>>>>>
>>>>>> Feature Parity:
>>>>>>
>>>>>> MLlib, RDD and DataFrame features are totally comparable. Streaming
>>>>>> is now on par in functionality too, I believe. However, what really
>>>>>> worries me is not having Dataset APIs at all in Python. I think that's
>>>>>> a deal breaker.
>>>>>>
>>>>>> Performance:
>>>>>> I do get this bit when RDDs are involved, but not when a DataFrame is
>>>>>> the only construct I am operating on. DataFrames are supposed to be
>>>>>> language-agnostic in terms of performance. So why do people think Python is
>>>>>> slower? Is it because of UDFs? Any other reason?
>>>>>>
>>>>>> *Is there any kind of benchmarking/stats comparing Python UDFs with Scala
>>>>>> UDFs, like the ones out there for RDDs?*
>>>>>>
>>>>>> @Kant:  I am not comparing ANY applications. I am comparing SPARK
>>>>>> applications only. I would be glad to hear your opinion on why pyspark
>>>>>> applications will not work, if you have any benchmarks please share if
>>>>>> possible.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Sep 2, 2016 at 12:57 AM, kant kodali <kanth...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> C'mon man, this is a no-brainer. Dynamically typed languages for large code
>>>>>>> bases or large-scale distributed systems make absolutely no sense. I can
>>>>>>> write a 10-page essay on why that wouldn't work so great. You might be
>>>>>>> wondering why Spark would have it then? Well, probably because of its ease of
>>>>>>> use for ML (that would be my best guess).
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Wed, Aug 31, 2016 11:45 PM, AssafMendelson
>>>>>>> assaf.mendel...@rsa.com wrote:
>>>>>>>
>>>>>>>> I believe this would greatly depend on your use case and your
>>>>>>>> familiarity with the languages.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> In general, Scala would have much better performance than Python,
>>>>>>>> and not all interfaces are available in Python.
>>>>>>>>
>>>>>>>> That said, if you are planning to use DataFrames without any UDFs,
>>>>>>>> then the performance hit is practically nonexistent.
>>>>>>>>
>>>>>>>> Even if you need UDFs, it is possible to write those in Scala and
>>>>>>>> wrap them for Python and still get away without the performance hit.
>>>>>>>>
>>>>>>>> Python does not have interfaces for UDAFs.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I believe that if you have large structured data and do not
>>>>>>>> generally need UDFs/UDAFs, you can certainly work in Python without
>>>>>>>> losing too much.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> *From:* ayan guha [hidden email]
>>>>>>>> *Sent:* Thursday, September 01, 2016 5:03 AM
>>>>>>>> *To:* user
>>>>>>>> *Subject:* Scala Vs Python
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Hi Users
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Thought to ask (again and again) the question: While I am building
>>>>>>>> any production application, should I use Scala or Python?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I have read many if not most articles, but all seem to be pre-Spark 2.
>>>>>>>> Has anything changed with Spark 2, either in a pro-Scala or pro-Python way?
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> I am thinking performance, feature parity and future direction, not
>>>>>>>> so much in terms of skillset or ease of use.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Or, if you think it is a moot point, please say so as well.
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> Any real life example, production experience, anecdotes, personal
>>>>>>>> taste, profanity all are welcome :)
>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>>
>>>>>>>> Best Regards,
>>>>>>>> Ayan Guha
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> Best Regards,
>>>>>> Ayan Guha
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>
>


-- 
Best Regards,
Ayan Guha
