Re: Contributed to spark

2017-04-08 Thread Shuai Lin
Links that were helpful to me while learning about the Spark source code:

- Articles with "spark" tag in this blog:
http://hydronitrogen.com/tag/spark.html
- Jacek's "Mastering Apache Spark" GitBook:
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/

Hope those help.

On Sat, Apr 8, 2017 at 1:31 AM, Stephen Fletcher  wrote:

> I'd like to eventually contribute to Spark, but I'm noticing that since Spark 2
> the query planner is heavily used throughout the Dataset code base. Are there
> any sites I can go to that explain the technical details, beyond just a
> high-level perspective?
>


Re: Structured streaming and writing output to Cassandra

2017-04-08 Thread shyla deshpande
Thanks Jules. It was helpful.

On Fri, Apr 7, 2017 at 8:32 PM, Jules Damji  wrote:

> This blog shows how to write a custom sink: https://databricks.com/blog/2017/04/04/real-time-end-to-end-integration-with-apache-kafka-in-apache-sparks-structured-streaming.html
>
> Cheers
> Jules
>
> Sent from my iPhone
> Pardon the dumb thumb typos :)
>
> On Apr 7, 2017, at 11:23 AM, shyla deshpande 
> wrote:
>
> Is anyone using Structured Streaming and writing the results to a Cassandra
> database in a production environment?
>
> I do not think I have enough expertise to write a custom sink that can be
> used in a production environment. Please help!
>
>
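
A minimal sketch of such a custom sink, using Spark 2.x's ForeachWriter together with the DataStax Java driver. The keyspace/table ks.events, its columns, and the contact point are hypothetical, and there is no batching, retry, or async handling, so treat this as a starting point rather than a production-ready sink:

import org.apache.spark.sql.{ForeachWriter, Row}
import com.datastax.driver.core.{Cluster, Session}

class CassandraSinkWriter(host: String) extends ForeachWriter[Row] {
  @transient private var cluster: Cluster = _
  @transient private var session: Session = _

  // called once per partition per batch; return true to have process() invoked
  override def open(partitionId: Long, version: Long): Boolean = {
    cluster = Cluster.builder().addContactPoint(host).build()
    session = cluster.connect()
    true
  }

  // one synchronous insert per row; a real sink would batch or write async
  override def process(row: Row): Unit = {
    session.execute(
      "INSERT INTO ks.events (id, value) VALUES (?, ?)",
      row.getAs[String]("id"), row.getAs[String]("value"))
  }

  override def close(errorOrNull: Throwable): Unit = {
    if (session != null) session.close()
    if (cluster != null) cluster.close()
  }
}

// usage:
// df.writeStream.foreach(new CassandraSinkWriter("127.0.0.1")).start()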


Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Koert Kuipers
how would you use only relational transformations on dataset?
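
For concreteness, "only relational transformations" on a Dataset would look something like this sketch (assuming spark.implicits._ is in scope); note that a Column-based select hands back an untyped DataFrame:

val ds = Seq(("a", 1), ("b", 2)).toDS()
ds.filter($"_2" > 1)  // Column-based predicate: Catalyst can optimize it, result is still Dataset[(String, Int)]
ds.select($"_1")      // Column-based projection: returns a DataFrame (Dataset[Row]), the static type is gone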

On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan  wrote:

> Hi Spark-users,
> I came across a few sources which mentioned that a DataFrame can be more
> efficient than a Dataset. I can understand this is true because a Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well. But can a DataFrame be more efficient than a Dataset
> even if we only use relational transformations on the Dataset? If so, can
> anyone explain why? Are there any benchmarks comparing Dataset vs.
> DataFrame? Thank you!
>
> Shiyuan
>


Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Koert Kuipers
let me try that again. i left some crap at the bottom of my previous email
as i was editing it. sorry about that. here it goes:

it is because you use Dataset[X] but the actual computations are still done
in Dataset[Row] (so DataFrame). well... the actual computations are done in
RDD[InternalRow] with spark's internal types to represent String, Map, Seq,
structs, etc.

so for example if you do:
scala> val x: Dataset[(String, String)] = ...
scala> val f: ((String, String)) => Boolean = _._2 != null
scala> x.filter(f)

in this case you are using a lambda function for the filter. this is a
black-box operation to spark (spark cannot see what is inside the
function). so spark will now convert the internal representation it is
actually using (something like an InternalRow of size 2 holding two
objects of type UTF8String) into a Tuple2[String, String], and then
call your function f on it. so for this very simple null comparison you
are doing a relatively expensive conversion.

now compare this to the case where you have a DataFrame with 2 columns of
type String.
scala> val x: DataFrame = ...
x: org.apache.spark.sql.DataFrame = [x: string, y: string]
scala> x.filter($"y" isNotNull)

spark will parse your expression, and since it has an understanding of what
you are trying to do, it can apply the logic directly on the InternalRow,
which avoids the conversion. this will be faster. of course you pay the
price for this in that you are forced to use a much more constrained
framework to express what you want to do, which can lead to some hair
pulling at times.
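
to see the difference yourself, you can compare the physical plans for the two styles (a minimal sketch, assuming a SparkSession named spark):

import spark.implicits._

val ds = Seq(("a", "b"), ("c", null)).toDS()

// typed lambda: the plan deserializes each row into a Tuple2 before calling the function
ds.filter((t: (String, String)) => t._2 != null).explain()

// column expression: the predicate runs directly on the internal row representation
ds.filter($"_2".isNotNull).explain()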

On Sat, Apr 8, 2017 at 2:15 PM, Shiyuan  wrote:

> Hi Spark-users,
> I came across a few sources which mentioned that a DataFrame can be more
> efficient than a Dataset. I can understand this is true because a Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well. But can a DataFrame be more efficient than a Dataset
> even if we only use relational transformations on the Dataset? If so, can
> anyone explain why? Are there any benchmarks comparing Dataset vs.
> DataFrame? Thank you!
>
> Shiyuan
>


Re: Why dataframe can be more efficient than dataset?

2017-04-08 Thread Jörn Franke
As far as I am aware, in newer Spark versions a DataFrame is the same as
Dataset[Row].
In fact, performance depends on so many factors that I am not sure such a
comparison makes sense.
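
(Indeed, since Spark 2.0 the alias in the org.apache.spark.sql package object is literally:)

type DataFrame = Dataset[Row]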

> On 8. Apr 2017, at 20:15, Shiyuan  wrote:
> 
> Hi Spark-users,
> I came across a few sources which mentioned that a DataFrame can be more
> efficient than a Dataset. I can understand this is true because a Dataset
> allows functional transformations which Catalyst cannot look into and hence
> cannot optimize well. But can a DataFrame be more efficient than a Dataset
> even if we only use relational transformations on the Dataset? If so, can
> anyone explain why? Are there any benchmarks comparing Dataset vs.
> DataFrame? Thank you!
>
> Shiyuan


Re: Assigning a unique row ID

2017-04-08 Thread Everett Anderson
On Fri, Apr 7, 2017 at 8:04 PM, Subhash Sriram 
wrote:

> Hi,
>
> We use monotonically_increasing_id() as well, but just cache the table
> first like Ankur suggested. With that method, we get the same keys in all
> derived tables.
>

Ah, okay, awesome. Let me give that a go.
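
For reference, a minimal sketch of that cache-first pattern (df, status, and payload are hypothetical; the count() only serves to materialize the cache before any tables are derived):

import org.apache.spark.sql.functions.monotonically_increasing_id
import spark.implicits._

val withId = df.withColumn("row_id", monotonically_increasing_id()).cache()
withId.count() // materialize the cache so the generated IDs are fixed
val x = withId.filter($"status" === "active") // derived tables now agree on row_id
val y = withId.select($"row_id", $"payload")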



>
> Thanks,
> Subhash
>
> Sent from my iPhone
>
> On Apr 7, 2017, at 7:32 PM, Everett Anderson 
> wrote:
>
> Hi,
>
> Thanks, but that's using a random UUID. Certainly unlikely to have
> collisions, but not guaranteed.
>
> I'd prefer something like monotonically_increasing_id or RDD's
> zipWithUniqueId, but with better behavioral characteristics -- so they don't
> surprise people when 2+ outputs derived from an original table end up not
> having the same IDs for the same rows.
>
> It seems like this would be possible under the covers, but would have the
> performance penalty of needing to do perhaps a count() and then also a
> checkpoint.
>
> I was hoping there's a better way.
>
>
> On Fri, Apr 7, 2017 at 4:24 PM, Tim Smith  wrote:
>
>> http://stackoverflow.com/questions/37231616/add-a-new-column
>> -to-a-dataframe-new-column-i-want-it-to-be-a-uuid-generator
>>
>>
>> On Fri, Apr 7, 2017 at 3:56 PM, Everett Anderson <
>> ever...@nuna.com.invalid> wrote:
>>
>>> Hi,
>>>
>>> What's the best way to assign a truly unique row ID (rather than a hash)
>>> to a DataFrame/Dataset?
>>>
>>> I originally thought that functions.monotonically_increasing_id would
>>> do this, but it seems to have a rather unfortunate property that if you add
>>> it as a column to table A and then derive tables X, Y, Z and save those,
>>> the row ID values in X, Y, and Z may end up different. I assume this is
>>> because it delays the actual computation to the point where each of those
>>> tables is computed.
>>>
>>>
>>
>>
>> --
>>
>> --
>> Thanks,
>>
>> Tim
>>
>
>


Why dataframe can be more efficient than dataset?

2017-04-08 Thread Shiyuan
Hi Spark-users,
I came across a few sources which mentioned that a DataFrame can be more
efficient than a Dataset. I can understand this is true because a Dataset
allows functional transformations which Catalyst cannot look into and hence
cannot optimize well. But can a DataFrame be more efficient than a Dataset
even if we only use relational transformations on the Dataset? If so, can
anyone explain why? Are there any benchmarks comparing Dataset vs.
DataFrame? Thank you!

Shiyuan