RE: Scala Vs Python

2016-09-02 Thread Santoshakhilesh
I have seen a talk by Brian Clapper at NE Scala 2016: "RDDs, DataFrames and 
Datasets @ Apache Spark".

At 15:00 there is a slide showing a comparison of aggregating 10 million 
integer pairs using the RDD and DataFrame APIs with different language 
bindings: Scala, Python, and R.

As per this slide, the DataFrame API outperforms RDDs, and its performance is 
the same across all language bindings, while the RDD API with Python is way 
slower than the Scala version. So I guess there is some truth to the claim 
that the Scala bindings are faster in some cases.

At 30:23 he presents a slide showing serialization performance: Dataset 
encoders are way faster than both Java and Kryo serialization.

But as always, the proof of the pudding is in the eating, so why not try some 
samples and see for yourself?
I personally have found that my app runs a bit faster in the Scala version 
than in Java, but I have not yet been able to figure out the reason.


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 02 September 2016 15:25
To: Tal Grynbaum
Cc: darren; Mich Talebzadeh; Jakob Odersky; kant kodali; AssafMendelson; user
Subject: Re: Scala Vs Python

Tal: I think that, by the nature of the project itself, the Python APIs are 
developed after the Scala and Java ones, and that is a fair trade-off against 
the speed of getting stuff to market. And as this discussion progresses, I see 
not much of an issue in terms of feature parity.

Coming back to performance, Darren raised a good point: if I can scale out, 
individual VM performance should not matter much. But performance is often 
stated as a definitive downside of using Python over Scala/Java. I am trying 
to understand the truth and the myth behind this claim. Any pointers would be 
great.

best
Ayan

On Fri, Sep 2, 2016 at 4:10 PM, Tal Grynbaum wrote:

On Fri, Sep 2, 2016 at 1:15 AM, darren wrote:
This topic is a concern for us as well. In the data science world no one uses 
native Scala or Java by choice. It's R and Python. And Python is growing. Yet 
in Spark, Python is 3rd in line for feature support, if supported at all.

This is why we have decoupled from Spark in our project. It's really 
unfortunate that the Spark team has invested so heavily in Scala.

As for speed, it comes from horizontal scaling and throughput. When you can 
scale outward, individual VM performance is less of an issue. Basic HPC 
principles.

Darren,

My guess is that data scientists who decouple themselves from Spark will 
eventually be left with more or less nothing (single-process capabilities, or 
purely HPC workloads), unless some good Spark competitor emerges, which is 
unlikely, simply because there is no need for one.
But putting guessing aside: the reason Python is 3rd in line for feature 
support is not that the Spark developers were busy with Scala; it's that the 
missing features are those that rely on strong typing, which is not relevant 
to Python. In other words, even if Spark were rewritten in Python, and focused 
on Python only, you would still not get those features.
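
For concreteness, the feature in question is the typed Dataset API. Here is a 
minimal sketch of what compile-time typing buys you (my illustration, assuming 
Spark 2.0; the Person bean and the people.json input are hypothetical):

import java.io.Serializable;

import org.apache.spark.api.java.function.MapFunction;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

public class TypedDatasetExample {
  // Hypothetical bean; Encoders.bean() derives a typed encoder from it.
  public static class Person implements Serializable {
    private String name;
    private long age;
    public String getName() { return name; }
    public void setName(String name) { this.name = name; }
    public long getAge() { return age; }
    public void setAge(long age) { this.age = age; }
  }

  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder().appName("TypedDataset").getOrCreate();

    // Assumed input: one {"name": ..., "age": ...} JSON record per line.
    Dataset<Person> people = spark.read().json("people.json")
        .as(Encoders.bean(Person.class));

    // p.getName() is checked at compile time; a typo like p.getNmae()
    // fails the build instead of failing at runtime as it would in Python.
    Dataset<String> names = people.map(
        (MapFunction<Person, String>) p -> p.getName(), Encoders.STRING());
    names.show();
  }
}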



--
Tal Grynbaum / CTO & co-founder

m# +972-54-7875797
mobile retention done right



--
Best Regards,
Ayan Guha


RE: Scala Vs Python

2016-08-31 Thread Santoshakhilesh
Hi,
I would prefer Scala if you are starting afresh, considering ease of use, 
features, performance, and support.
You will find numerous examples and plenty of support for Scala, which might 
not be true for other languages.
I personally developed the first version of my app using Java 1.6 for some 
unavoidable reasons, and my code is very verbose and ugly.
But now, with Java 8's lambda support, I think this is not a problem anymore. 
About Python: since it has no compile-time safety, the Dataset API is not 
available to it, even if you plan to use Spark 2.0.
Given a choice I would prefer Scala any day, for the very simple reason that 
I would get all future features and optimizations out of the box, and I need 
to type less ☺.


Regards,
Santosh Akhilesh


From: ayan guha [mailto:guha.a...@gmail.com]
Sent: 01 September 2016 11:03
To: user
Subject: Scala Vs Python

Hi Users

Thought to ask (again and again) the question: while building a production 
application, should I use Scala or Python?

I have read many if not most articles on this, but all seem pre-Spark 2. Has 
anything changed with Spark 2, either in a pro-Scala or a pro-Python direction?

I am thinking of performance, feature parity, and future direction, not so 
much of skillset or ease of use.

Or, if you think it is a moot point, please say so as well.

Any real-life examples, production experience, anecdotes, personal taste, and 
profanity are all welcome :)

--
Best Regards,
Ayan Guha


RE: Cumulative Sum function using Dataset API

2016-08-09 Thread Santoshakhilesh
You could check the following link:
http://stackoverflow.com/questions/35154267/how-to-compute-cumulative-sum-using-spark

From: Jon Barksdale [mailto:jon.barksd...@gmail.com]
Sent: 09 August 2016 08:21
To: ayan guha
Cc: user
Subject: Re: Cumulative Sum function using Dataset API

I don't think that would work properly, and would probably just give me the sum 
for each partition. I'll give it a try when I get home just to be certain.

To maybe explain the intent better: if I have a pre-sorted column of 
(1,2,3,4), then the cumulative sum would return (1,3,6,10).

Does that make sense? Naturally, if ordering a sum turns it into a cumulative 
sum, I'll gladly use that :)

Jon
On Mon, Aug 8, 2016 at 4:55 PM ayan guha wrote:
You mean you are not able to use sum(col) over (partition by key order by 
some_col)?
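
Here is a minimal sketch of the same thing via the DataFrame window API (my 
illustration, assuming a Dataset<Row> df with hypothetical columns key, ord, 
and value; an ordered window already defaults to a running frame, so the 
rowsBetween call only makes that explicit):

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.sum;

// Running total per key, ordered by "ord": all rows from the start of the
// partition up to the current row are summed, so (1,2,3,4) -> (1,3,6,10).
WindowSpec w = Window.partitionBy("key")
                     .orderBy("ord")
                     .rowsBetween(Long.MIN_VALUE, 0);
Dataset<Row> withCumSum = df.withColumn("cum_sum", sum(col("value")).over(w));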

On Tue, Aug 9, 2016 at 9:53 AM, jon wrote:
Hi all,

I'm trying to write a function that calculates a cumulative sum as a column
using the Dataset API, and I'm a little stuck on the implementation. From
what I can tell, UserDefinedAggregateFunctions don't seem to support
windowing clauses, which I think I need for this use case. If I write a
function that extends AggregateWindowFunction, I end up needing classes
that are package-private to the sql package, so I have to put my function
under the org.apache.spark.sql package, which just feels wrong.

I've also considered writing a custom transformer, but I haven't spent as much
time reading through the code, so I don't know how easy or hard that would
be.

TL;DR: What's the best way to write a function that returns a value for every
row, but has mutable state, and gets rows in a specific order?

Does anyone have any ideas, or examples?

Thanks,

Jon







--
Best Regards,
Ayan Guha


RE: GraphX Java API

2016-06-05 Thread Santoshakhilesh
OK, thanks for letting me know. Yes, since Java and Scala programs ultimately 
run on the JVM, APIs written in one language can be called from the other.
When I used GraphX (around the beginning of 2015), native Java APIs were not 
yet available for it.
So I chose to develop my application in Scala, and it turned out much simpler 
to develop in Scala thanks to some of its powerful features like lambdas, map, 
filter, etc., which were not available to me in Java 7.
Regards,
Santosh Akhilesh

From: Sonal Goyal [mailto:sonalgoy...@gmail.com]
Sent: 01 June 2016 00:56
To: Santoshakhilesh
Cc: Kumar, Abhishek (US - Bengaluru); user@spark.apache.org; Golatkar, Jayesh 
(US - Bengaluru); Soni, Akhil Dharamprakash (US - Bengaluru); Matta, Rishul (US 
- Bengaluru); Aich, Risha (US - Bengaluru); Kumar, Rajinish (US - Bengaluru); 
Jain, Isha (US - Bengaluru); Kumar, Sandeep (US - Bengaluru)
Subject: Re: GraphX Java API

It's very much possible to use GraphX from Java, though some boilerplate may 
be needed. Here is an example.

Create a graph from vertex and edge RDDs (JavaRDD<Tuple2<Object, Long>> 
vertices, JavaRDD<Edge<Float>> edges):


// ClassTag instances stand in for the implicit class tags that Graph.apply
// takes in Scala (vertex attribute tag first, then edge attribute tag).
ClassTag<Long> longTag = scala.reflect.ClassTag$.MODULE$.apply(Long.class);
ClassTag<Float> floatTag = scala.reflect.ClassTag$.MODULE$.apply(Float.class);
Graph<Long, Float> graph = Graph.apply(
        vertices.rdd(), edges.rdd(), 0L,
        StorageLevel.MEMORY_ONLY(), StorageLevel.MEMORY_ONLY(),
        longTag, floatTag);



Then you can basically call graph.ops() and use the available operations, 
like triangleCount(), etc.
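
For instance, a minimal sketch of my own (an assumption of typical usage, not 
from the original mail; triangleCount() returns a graph whose vertex attribute 
is the per-vertex triangle count):

// Triangle counting through the ops() helper; Scala's Int type parameter
// surfaces as Object on the Java side.
Graph<Object, Float> triangles = graph.ops().triangleCount();
triangles.vertices().toJavaRDD().take(10).forEach(System.out::println);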

Best Regards,
Sonal
Founder, Nube Technologies<http://www.nubetech.co>
Reifier at Strata Hadoop World<https://www.youtube.com/watch?v=eD3LkpPQIgM>
Reifier at Spark Summit 
2015<https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>




On Tue, May 31, 2016 at 11:40 AM, Santoshakhilesh 
<santosh.akhil...@huawei.com> wrote:
Hi,
Scala has a package structure similar to Java's, and it ultimately runs on 
the JVM, so you probably got the impression that it is in Java.
As far as I know there is no Java API for GraphX. I used GraphX last year, 
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
•   I see some graphx packages listed here: 
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
•   org.apache.spark.graphx.impl<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
•   org.apache.spark.graphx.lib<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
•   org.apache.spark.graphx.util<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with Java?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX, you need 
to switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no 
documentation or examples available online on its usage. It would be great if 
we could get some examples in Java.

Thanks and regards,

Abhishek Kumar







RE: GraphX Java API

2016-05-31 Thread Santoshakhilesh
Hi,
Scala has a package structure similar to Java's, and it ultimately runs on 
the JVM, so you probably got the impression that it is in Java.
As far as I know there is no Java API for GraphX. I used GraphX last year, 
and at that time I had to code in Scala to use the GraphX APIs.
Regards,
Santosh Akhilesh


From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 30 May 2016 13:24
To: Santoshakhilesh; user@spark.apache.org
Cc: Golatkar, Jayesh (US - Bengaluru); Soni, Akhil Dharamprakash (US - 
Bengaluru); Matta, Rishul (US - Bengaluru); Aich, Risha (US - Bengaluru); 
Kumar, Rajinish (US - Bengaluru); Jain, Isha (US - Bengaluru); Kumar, Sandeep 
(US - Bengaluru)
Subject: RE: GraphX Java API

Hey,
•   I see some graphx packages listed here: 
http://spark.apache.org/docs/latest/api/java/index.html
•   org.apache.spark.graphx<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/package-frame.html>
•   org.apache.spark.graphx.impl<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/impl/package-frame.html>
•   org.apache.spark.graphx.lib<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/lib/package-frame.html>
•   org.apache.spark.graphx.util<http://spark.apache.org/docs/latest/api/java/org/apache/spark/graphx/util/package-frame.html>
Aren’t they meant to be used with Java?
Thanks

From: Santoshakhilesh [mailto:santosh.akhil...@huawei.com]
Sent: Friday, May 27, 2016 4:52 PM
To: Kumar, Abhishek (US - Bengaluru) <abhishekkuma...@deloitte.com>; 
user@spark.apache.org
Subject: RE: GraphX Java API

GraphX APIs are available only in Scala. If you need to use GraphX, you need 
to switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no 
documentation or examples available online on its usage. It would be great if 
we could get some examples in Java.

Thanks and regards,

Abhishek Kumar







RE: GraphX Java API

2016-05-27 Thread Santoshakhilesh
GraphX APIs are available only in Scala. If you need to use GraphX, you need 
to switch to Scala.

From: Kumar, Abhishek (US - Bengaluru) [mailto:abhishekkuma...@deloitte.com]
Sent: 27 May 2016 19:59
To: user@spark.apache.org
Subject: GraphX Java API

Hi,

We are trying to consume the Java API for GraphX, but there is no 
documentation or examples available online on its usage. It would be great if 
we could get some examples in Java.

Thanks and regards,

Abhishek Kumar
Products & Services | iLab
Deloitte Consulting LLP
Block ‘C’, Divyasree Technopolis, Survey No.: 123 & 132/2, Yemlur Post, Yemlur, 
Bengaluru – 560037, Karnataka, India
Mobile: +91 7736795770
abhishekkuma...@deloitte.com | 
www.deloitte.com

Please consider the environment before printing.







How to setup a long running spark streaming job with continuous window refresh

2016-01-21 Thread Santoshakhilesh
Hi,
I have the following scenario in my project:
1. I will continue to get a stream of data from a source.
2. I need to calculate the mean and variance for each key every minute.
3. After the minute is over, I should restart the computation fresh for the 
new minute.
Example:
10:00:00 computation and output
10:00:00 key = 1, mean = 10, variance = 2
10:00:00 key = N, mean = 10, variance = 2
10:00:01 computation and output
10:00:01 key = 1, mean = 11, variance = 2
10:00:01 key = N, mean = 12, variance = 2
The 10:00:01 data has no dependency on the 10:00:00 data.
How do I set up such a job in a single Java Spark Streaming application?
Regards,
Santosh Akhilesh
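
A minimal sketch of one way to set this up (my illustration, not from the 
original thread; the socket source, the "key,value" line format, and the 
class name are all assumptions): use a one-minute batch interval so that every 
batch is aggregated independently of the previous one, and reduce 
(count, sum, sum of squares) per key to derive the mean and variance.

import scala.Tuple2;

import org.apache.spark.SparkConf;
import org.apache.spark.streaming.Durations;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;

public class PerMinuteStats {
  public static void main(String[] args) throws InterruptedException {
    SparkConf conf = new SparkConf().setAppName("PerMinuteStats");
    // One-minute batches: each batch is computed from scratch, so the
    // next minute's results have no dependency on the previous minute's.
    JavaStreamingContext jssc = new JavaStreamingContext(conf, Durations.minutes(1));

    JavaPairDStream<String, double[]> stats = jssc
        .socketTextStream("localhost", 9999)   // hypothetical source of "key,value" lines
        .mapToPair(line -> {
          String[] parts = line.split(",");
          double v = Double.parseDouble(parts[1]);
          // one observation as (count, sum, sum of squares)
          return new Tuple2<>(parts[0], new double[] {1.0, v, v * v});
        })
        .reduceByKey((a, b) ->
            new double[] {a[0] + b[0], a[1] + b[1], a[2] + b[2]});

    stats.foreachRDD(rdd -> rdd.foreach(t -> {
      double n = t._2()[0];
      double mean = t._2()[1] / n;
      double variance = t._2()[2] / n - mean * mean; // population variance
      System.out.println(t._1() + " mean=" + mean + " variance=" + variance);
    }));

    jssc.start();
    jssc.awaitTermination();
  }
}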