[CFP] DataWorks Summit/Hadoop Summit Sydney - Call for abstracts

2017-05-03 Thread Yanbo Liang
The Australia/Pacific version of DataWorks Summit is in Sydney this year,
September 20-21. This is a great place to talk about work you are doing in
Apache Spark or how you are using Spark. Information on submitting an
abstract is at
https://dataworkssummit.com/sydney-2017/abstracts/submit-abstract/



Tracks:

Apache Hadoop

Apache Spark and Data Science

Cloud and Applications

Data Processing and Warehousing

Enterprise Adoption

IoT and Streaming

Operations, Governance and Security



Deadline: Friday, May 26th, 2017.


Spark 2.1.0 and Hive 2.1.1

2017-05-03 Thread Lohith Samaga M
Hi,
Good day.

My setup:

  1.  Single node Hadoop 2.7.3 on Ubuntu 16.04.
  2.  Hive 2.1.1 with metastore in MySQL.
  3.  Spark 2.1.0 configured using hive-site.xml to use MySQL metastore.
  4.  The VERSION table contains SCHEMA_VERSION = 2.1.0

Hive CLI works fine.
However, when I start spark-shell or spark-sql, SCHEMA_VERSION
is set to 1.2.0 by Spark.
The Hive CLI then fails to start. After a manual update of the VERSION
table, it works fine again.

I see that the Hive-related jars in the spark/jars directory are of
version 1.2.1.
I tried building Spark from source, but as Spark uses Hive 1.2.1
by default, I get the same set of jars.

How can we make Spark 2.1.0 work with Hive 2.1.1?
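
For reference, the knobs that control which Hive client Spark uses to talk to the
metastore are spark.sql.hive.metastore.version and spark.sql.hive.metastore.jars. A
minimal sketch of setting them is below; the jar path is an assumption, and whether a
given Spark release accepts 2.1.x as a metastore version has to be checked against that
release's Spark SQL documentation, so treat this as a pointer rather than a fix.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("hive-metastore-version-sketch")
  // Ask Spark to use a specific Hive client version for metastore access...
  .config("spark.sql.hive.metastore.version", "2.1.1")
  // ...and point it at the jars of that Hive version (path is an assumption).
  .config("spark.sql.hive.metastore.jars", "/usr/local/apache-hive-2.1.1-bin/lib/*")
  .enableHiveSupport()
  .getOrCreate()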

Thanks in advance!

Best regards / Mit freundlichen Grüßen / Sincères salutations
M. Lohith Samaga




Re: Spark books

2017-05-03 Thread Pushkar.Gujar
*"I would suggest do not buy any book, just start with databricks community
edition"*

I don't agree with the above; the "Learning Spark" book was definitely a stepping
stone for me. All the basics that a beginner will need are covered in a
very easy-to-understand format with examples. Great book, highly
recommended.

Of course, one has to keep maturing by moving on to other
resources; the Apache documentation along with GitHub repos are excellent
resources.


Thank you,
*Pushkar Gujar*


On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian 
wrote:

> The Apache Spark documentation is good to begin with.
> All the programming guides, particularly.
>
>
> On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:
>
>> I would suggest do not buy any book, just start with databricks community
>> edition
>>
>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>>
>>> Well that is the nature of technology, ever evolving. There will always
>>> be new concepts. If you're trying to get started ASAP and the internet
>>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>>> production stacks are still on that version and the knowledge from
>>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>>
>>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>>>
 I'm trying to decide whether to buy the book learning spark, spark for
 machine learning etc. or wait for a new edition covering the new concepts
 like dataframe and datasets. Anyone got any suggestions?

>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Regards,
> Neelesh S. Salian
>
>


Re: Spark books

2017-05-03 Thread Stephen Fletcher
Zeming,

Jacek also has a really good online Spark book for Spark 2, "Mastering
Apache Spark". I found it very helpful when trying to understand Spark 2's
encoders.

His book is here:
https://www.gitbook.com/book/jaceklaskowski/mastering-apache-spark/details


On Wed, May 3, 2017 at 8:16 PM, Neelesh Salian 
wrote:

> The Apache Spark documentation is good to begin with.
> All the programming guides, particularly.
>
>
> On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:
>
>> I would suggest do not buy any book, just start with databricks community
>> edition
>>
>> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>>
>>> Well that is the nature of technology, ever evolving. There will always
>>> be new concepts. If you're trying to get started ASAP and the internet
>>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>>> production stacks are still on that version and the knowledge from
>>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>>
>>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>>>
 I'm trying to decide whether to buy the book learning spark, spark for
 machine learning etc. or wait for a new edition covering the new concepts
 like dataframe and datasets. Anyone got any suggestions?

>>>
>>>
>>
>>
>> --
>> Best Regards,
>> Ayan Guha
>>
>
>
>
> --
> Regards,
> Neelesh S. Salian
>
>


What is the correct JSON parameter format used to submit Spark2 apps with the YARN REST API?

2017-05-03 Thread Kun Liu
Hi folks,

I am trying to submit a spark app via YARN REST API, by following this
tutorial from Hortonworks:
https://community.hortonworks.com/articles/28070/starting-spark-jobs-directly-via-yarn-rest-api.html
.

Here is the general flow: GET a new app ID, then POST a new app with the ID
and several parameters as a JSON object.
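
For reference, a rough sketch of that two-call flow (the ResourceManager host and the
JSON body are illustrative; the exact body expected for a Spark2 submission is the open
question here):

import java.net.{HttpURLConnection, URL}
import scala.io.Source

val rm = "http://resourcemanager:8088"

// Step 1: ask the ResourceManager for a new application id.
val newApp = new URL(s"$rm/ws/v1/cluster/apps/new-application")
  .openConnection().asInstanceOf[HttpURLConnection]
newApp.setRequestMethod("POST")
println(Source.fromInputStream(newApp.getInputStream).mkString) // contains "application-id"

// Step 2: POST the application submission context as JSON to /ws/v1/cluster/apps.
val submit = new URL(s"$rm/ws/v1/cluster/apps").openConnection().asInstanceOf[HttpURLConnection]
submit.setRequestMethod("POST")
submit.setRequestProperty("Content-Type", "application/json")
submit.setDoOutput(true)
val body =
  """{ "application-id": "application_...",
    |  "application-name": "spark-yarn-rest-test",
    |  "application-type": "SPARK",
    |  "am-container-spec": { "commands": { "command": "..." } } }""".stripMargin
submit.getOutputStream.write(body.getBytes("UTF-8"))
println(submit.getResponseCode)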

However, since Spark 2 the JSON parameter format has changed, whereas this
tutorial is about Spark 1.6. More specifically, for Spark 1.6, the Spark
assembly JAR, the app JAR, and the spark-yarn properties files are provided
as "local-resources" and cached files, according to the tutorial. But for
Spark 2, there is no local-resource or cached file, and the Spark assembly
JAR and properties are provided as "resources". See more details in the two
logs attached here.

As a result, the Spark assembly JAR is not visible to the YARN container.
Thus, although I was able to submit an app, it would always finish as
FAILED. More specifically, "Could not find or load main class
org.apache.spark.executor.CoarseGrainedExecutorBackend" is reported in
the container log (see attachment), where "CoarseGrainedExecutorBackend"
is the class used in the command that launches the executors.

So, what is the correct JSON parameter format to be used to submit a Spark2
app via YARN REST API?

Thanks,

Kun
Log Type: AppMaster.stderr
Log Upload Time: Wed May 03 16:45:54 -0700 2017
Log Length: 24362
17/05/03 16:45:42 INFO SignalUtils: Registered signal handler for TERM
17/05/03 16:45:42 INFO SignalUtils: Registered signal handler for HUP
17/05/03 16:45:42 INFO SignalUtils: Registered signal handler for INT
17/05/03 16:45:43 INFO ApplicationMaster: Preparing Local resources
17/05/03 16:45:43 INFO ApplicationMaster: ApplicationAttemptId: 
appattempt_1493660524300_0103_02
17/05/03 16:45:43 INFO SecurityManager: Changing view acls to: yarn,hdfs
17/05/03 16:45:43 INFO SecurityManager: Changing modify acls to: yarn,hdfs
17/05/03 16:45:43 INFO SecurityManager: Changing view acls groups to: 
17/05/03 16:45:43 INFO SecurityManager: Changing modify acls groups to: 
17/05/03 16:45:43 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(yarn, hdfs); 
groups with view permissions: Set(); users  with modify permissions: Set(yarn, 
hdfs); groups with modify permissions: Set()
17/05/03 16:45:43 INFO ApplicationMaster: Starting the user application in a 
separate Thread
17/05/03 16:45:43 INFO ApplicationMaster: Waiting for spark context 
initialization...
17/05/03 16:45:43 INFO SparkContext: Running Spark version 2.1.0
17/05/03 16:45:43 INFO SecurityManager: Changing view acls to: yarn,hdfs
17/05/03 16:45:43 INFO SecurityManager: Changing modify acls to: yarn,hdfs
17/05/03 16:45:43 INFO SecurityManager: Changing view acls groups to: 
17/05/03 16:45:43 INFO SecurityManager: Changing modify acls groups to: 
17/05/03 16:45:43 INFO SecurityManager: SecurityManager: authentication 
disabled; ui acls disabled; users  with view permissions: Set(yarn, hdfs); 
groups with view permissions: Set(); users  with modify permissions: Set(yarn, 
hdfs); groups with modify permissions: Set()
17/05/03 16:45:44 INFO Utils: Successfully started service 'sparkDriver' on 
port 41447.
17/05/03 16:45:44 INFO SparkEnv: Registering MapOutputTracker
17/05/03 16:45:44 INFO SparkEnv: Registering BlockManagerMaster
17/05/03 16:45:44 INFO BlockManagerMasterEndpoint: Using 
org.apache.spark.storage.DefaultTopologyMapper for getting topology information
17/05/03 16:45:44 INFO BlockManagerMasterEndpoint: BlockManagerMasterEndpoint up
17/05/03 16:45:44 INFO DiskBlockManager: Created local directory at 
/hadoop/yarn/local/usercache/hdfs/appcache/application_1493660524300_0103/blockmgr-5231c30a-c7aa-475b-8b45-5a39d2f2bf8a
17/05/03 16:45:44 INFO MemoryStore: MemoryStore started with capacity 912.3 MB
17/05/03 16:45:44 INFO SparkEnv: Registering OutputCommitCoordinator
17/05/03 16:45:44 INFO log: Logging initialized @2511ms
17/05/03 16:45:44 INFO JettyUtils: Adding filter: 
org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
17/05/03 16:45:44 INFO Server: jetty-9.2.z-SNAPSHOT
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@3b40bd2c{/jobs,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@1307a150{/jobs/json,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@6ad020cc{/jobs/job,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@129030b2{/jobs/job/json,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@2c874611{/stages,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@2247ff0{/stages/json,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: Started 
o.s.j.s.ServletContextHandler@6919e938{/stages/stage,null,AVAILABLE}
17/05/03 16:45:44 INFO ContextHandler: 

Re: Spark books

2017-05-03 Thread Neelesh Salian
The Apache Spark documentation is good to begin with.
All the programming guides, particularly.


On Wed, May 3, 2017 at 5:07 PM, ayan guha  wrote:

> I would suggest do not buy any book, just start with databricks community
> edition
>
> On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:
>
>> Well that is the nature of technology, ever evolving. There will always
>> be new concepts. If you're trying to get started ASAP and the internet
>> isn't enough, I'd recommend buying a book and using Spark 1.6. A lot of
>> production stacks are still on that version and the knowledge from
>> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>>
>> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>>
>>> I'm trying to decide whether to buy the book learning spark, spark for
>>> machine learning etc. or wait for a new edition covering the new concepts
>>> like dataframe and datasets. Anyone got any suggestions?
>>>
>>
>>
>
>
> --
> Best Regards,
> Ayan Guha
>



-- 
Regards,
Neelesh S. Salian


Re: Spark books

2017-05-03 Thread ayan guha
I would suggest do not buy any book, just start with databricks community
edition

On Thu, May 4, 2017 at 9:30 AM, Tobi Bosede  wrote:

> Well that is the nature of technology, ever evolving. There will always be
> new concepts. If you're trying to get started ASAP and the internet isn't
> enough, I'd recommend buying a book and using Spark 1.6. A lot of
> production stacks are still on that version and the knowledge from
> mastering 1.6 is transferable to 2+. I think that beats waiting forever.
>
> On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:
>
>> I'm trying to decide whether to buy the book learning spark, spark for
>> machine learning etc. or wait for a new edition covering the new concepts
>> like dataframe and datasets. Anyone got any suggestions?
>>
>
>


-- 
Best Regards,
Ayan Guha


Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread Michael Armbrust
>
>  if I do dataset.select("nonExistentColumn") then the Analysis Error is
> thrown at compile time right?
>

If you do df.as[MyClass].map(_.badFieldName) you will get a compile error.
However, if df doesn't have the right columns for MyClass, that error will
only be thrown at runtime (whether the DF is backed by something in memory or
by some remote database).
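
A small Scala illustration of the two cases (MyClass and the column names are made up
for the example):

import org.apache.spark.sql.SparkSession

case class MyClass(name: String, age: Long)

val spark = SparkSession.builder().master("local[*]").appName("analysis-error-sketch").getOrCreate()
import spark.implicits._

// Compile-time error: badFieldName is not a member of MyClass.
// spark.emptyDataset[MyClass].map(_.badFieldName)

// Runtime error: the DataFrame lacks the "age" column MyClass needs, so .as[MyClass]
// throws org.apache.spark.sql.AnalysisException when the plan is analyzed.
val df = Seq(("alice", 30L)).toDF("name", "years")
// df.as[MyClass]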


Re: Refreshing a persisted RDD

2017-05-03 Thread Tathagata Das
Yes, you will have to recreate the streaming DataFrame along with the
static DataFrame, and restart the query. There isn't currently a feasible way
to do this without a query restart. But restarting a query WITHOUT restarting
the whole application + Spark cluster is reasonably fast. If your
application can tolerate 10-second latencies, then stopping and restarting
a query within the same Spark application is a reasonable solution.
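
A minimal sketch of that restart pattern, reusing the blacklist example from this
thread (the Kafka broker, topic, and S3 path are illustrative):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.StreamingQuery

val spark = SparkSession.builder().appName("blacklist-refresh-sketch").getOrCreate()

// Streaming side: accounts from Kafka.
val dfAccount = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "accounts")
  .load()
  .selectExpr("CAST(value AS STRING) AS accountName")

// (Re)build the join against a freshly loaded static blacklist and start the query.
def startQuery(): StreamingQuery = {
  val dfBlackList = spark.read.csv("s3a://bucket/blacklist.csv").toDF("accountName")
  dfAccount.join(dfBlackList, "accountName")
    .writeStream
    .format("console")
    .start()
}

var query = startQuery()
// ... later, when the blacklist is known to have changed:
query.stop()          // stops only this query, not the Spark application
query = startQuery()  // picks up the new blacklist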

On Wed, May 3, 2017 at 4:13 PM, Lalwani, Jayesh <
jayesh.lalw...@capitalone.com> wrote:

> Thanks, TD for answering this question on the Spark mailing list.
>
>
>
> A follow-up. So, let’s say we are joining a cached dataframe with a
> streaming dataframe, and we recreate the cached dataframe, do we have to
> recreate the streaming dataframe too?
>
>
>
> One possible solution that we have is
>
>
>
> val dfBlackList = spark.read.csv(….) //batch dataframe… assume that this
> dataframe has a single column namedAccountName
> dfBlackList.createOrReplaceTempView(“blacklist”)
> val dfAccount = spark.readStream.kafka(…..) // assume for brevity’s sake
> that we have parsed the kafka payload and have a data frame here with
> multiple columns.. one of them called accountName
>
> dfAccount. createOrReplaceTempView(“account”)
>
> val dfBlackListedAccount = spark.sql(“select * from account inner join
> blacklist on account.accountName = blacklist.accountName”)
>
> df.writeStream(…..).start() // boom started
>
>
>
> Now some time later while the query is running we do
>
>
>
> val dfRefreshedBlackList = spark.read.csv(….)
> dfRefreshedBlackList.createOrReplaceTempView(“blacklist”)
>
>
>
> Now, will dfBlackListedAccount pick up the newly created blacklist? Or
> will it continue to hold the reference to the old dataframe? What if we had
> done RDD operations instead of using Spark SQL to join the dataframes?
>
>
>
> *From: *Tathagata Das 
> *Date: *Wednesday, May 3, 2017 at 6:32 PM
> *To: *"Lalwani, Jayesh" 
> *Cc: *user 
> *Subject: *Re: Refreshing a persisted RDD
>
>
>
> If you want to always get the latest data in files, its best to always
> recreate the DataFrame.
>
>
>
> On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani <
> jayesh.lalw...@capitalone.com> wrote:
>
> We have a Structured Streaming application that gets accounts from Kafka
> into
> a streaming data frame. We have a blacklist of accounts stored in S3 and we
> want to filter out all the accounts that are blacklisted. So, we are
> loading
> the blacklisted accounts into a batch data frame and joining it with the
> streaming data frame to filter out the bad accounts.
> Now, the blacklist doesn't change very often.. once a week at max. SO, we
> wanted to cache the blacklist data frame to prevent going out to S3
> everytime. Since, the blacklist might change, we want to be able to refresh
> the cache at a cadence, without restarting the whole app.
> So, to begin with we wrote a simple app that caches and refreshes a simple
> data frame. The steps we followed are
> /Create a CSV file
> load CSV into a DF: df = spark.read.csv(filename)
> Persist the data frame: df.persist
> Now when we do df.show, we see the contents of the csv.
> We change the CSV, and call df.show, we can see that the old contents are
> being displayed, proving that the df is cached
> df.unpersist
> df.persist
> df.show/
>
> What we see is that the rows that were modified in the CSV are reloaded..
> But new rows aren't
> Is this expected behavior? Is there a better way to refresh cached data
> without restarting the Spark application?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>
>


Re: Spark books

2017-05-03 Thread Tobi Bosede
Well that is the nature of technology, ever evolving. There will always be
new concepts. If you're trying to get started ASAP and the internet isn't
enough, I'd recommend buying a book and using Spark 1.6. A lot of
production stacks are still on that version and the knowledge from
mastering 1.6 is transferable to 2+. I think that beats waiting forever.

On Wed, May 3, 2017 at 6:35 PM, Zeming Yu  wrote:

> I'm trying to decide whether to buy the book learning spark, spark for
> machine learning etc. or wait for a new edition covering the new concepts
> like dataframe and datasets. Anyone got any suggestions?
>


Re: Refreshing a persisted RDD

2017-05-03 Thread Lalwani, Jayesh
Thanks, TD for answering this question on the Spark mailing list.

A follow-up. So, let’s say we are joining a cached dataframe with a streaming 
dataframe, and we recreate the cached dataframe, do we have to recreate the 
streaming dataframe too?

One possible solution that we have is

val dfBlackList = spark.read.csv(….) //batch dataframe… assume that this 
dataframe has a single column namedAccountName
dfBlackList.createOrReplaceTempView(“blacklist”)
val dfAccount = spark.readStream.kafka(…..) // assume for brevity’s sake that 
we have parsed the kafka payload and have a data frame here with multiple 
columns.. one of them called accountName
dfAccount. createOrReplaceTempView(“account”)
val dfBlackListedAccount = spark.sql(“select * from account inner join 
blacklist on account.accountName = blacklist.accountName”)
df.writeStream(…..).start() // boom started

Now some time later while the query is running we do

val dfRefreshedBlackList = spark.read.csv(….)
dfRefreshedBlackList.createOrReplaceTempView(“blacklist”)

Now, will dfBlackListedAccount pick up the newly created blacklist? Or will it 
continue to hold the reference to the old dataframe? What if we had done RDD 
operations instead of using Spark SQL to join the dataframes?

From: Tathagata Das 
Date: Wednesday, May 3, 2017 at 6:32 PM
To: "Lalwani, Jayesh" 
Cc: user 
Subject: Re: Refreshing a persisted RDD

If you want to always get the latest data in files, its best to always recreate 
the DataFrame.

On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani 
> wrote:
We have a Structured Streaming application that gets accounts from Kafka into
a streaming data frame. We have a blacklist of accounts stored in S3 and we
want to filter out all the accounts that are blacklisted. So, we are loading
the blacklisted accounts into a batch data frame and joining it with the
streaming data frame to filter out the bad accounts.
Now, the blacklist doesn't change very often.. once a week at max. SO, we
wanted to cache the blacklist data frame to prevent going out to S3
everytime. Since, the blacklist might change, we want to be able to refresh
the cache at a cadence, without restarting the whole app.
So, to begin with we wrote a simple app that caches and refreshes a simple
data frame. The steps we followed are
/Create a CSV file
load CSV into a DF: df = spark.read.csv(filename)
Persist the data frame: df.persist
Now when we do df.show, we see the contents of the csv.
We change the CSV, and call df.show, we can see that the old contents are
being displayed, proving that the df is cached
df.unpersist
df.persist
df.show/

What we see is that the rows that were modified in the CSV are reloaded..
But new rows aren't
Is this expected behavior? Is there a better way to refresh cached data
without restarting the Spark application?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Synonym handling replacement issue with UDF in Apache Spark

2017-05-03 Thread JayeshLalwani
You need to understand how join works to make sense of it. Logically, a join
does a cartesian product of the 2 tables, and then filters the rows that
satisfy the contains UDF. So, let's say you have

Input

Allen Armstrong nishanth hemanth Allen
shivu Armstrong nishanth
shree shivu DeWALT

Replacement of words
The word in LHS has to replace with the words in RHS given in the input
sentence
Allen=> Apex Tool Group
Armstrong => Apex Tool Group
DeWALT=> StanleyBlack

Logically speaking it will first do a cartesian product, which will give you
this

Input x Replacement
Allen Armstrong nishanth hemanth Allen, Allen, Apex Tool Group
Allen Armstrong nishanth hemanth Allen, Armstrong, Apex Tool Group
Allen Armstrong nishanth hemanth Allen, DeWalt, Apex Tool Group
shivu Armstrong nishanth, Allen, Apex Tool Group
shivu Armstrong nishanth, Armstrong, Apex Tool Group
shivu Armstrong nishanth, DeWalt, Apex Tool Group
shree shivu DeWALT, Allen, Apex Tool Group
shree shivu DeWALT, Armstrong, Apex Tool Group
shree shivu DeWALT, DeWalt, Apex Tool Group

Then it will filter and keep only the records that satisfies contains

Join output
Allen Armstrong nishanth hemanth Allen, Allen, Apex Tool Group
Allen Armstrong nishanth hemanth Allen, Armstrong, Apex Tool Group
shivu Armstrong nishanth, Armstrong, Apex Tool Group
shree shivu DeWALT, DeWalt, Apex Tool Group

So, as you can see, you have 4 output rows instead of 3. When it then performs
the replaceWithTerm operation, you get the output that you are getting.
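
A small sketch that reproduces the intermediate join result described above (using a
LIKE condition in place of the original contains UDF):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.expr

val spark = SparkSession.builder().master("local[*]").appName("synonym-join-sketch").getOrCreate()
import spark.implicits._

val input = Seq(
  "Allen Armstrong nishanth hemanth Allen",
  "shivu Armstrong nishanth",
  "shree shivu DeWALT"
).toDF("sentence")

val replacements = Seq(
  ("Allen", "Apex Tool Group"),
  ("Armstrong", "Apex Tool Group"),
  ("DeWALT", "StanleyBlack")
).toDF("lhs", "rhs")

// The join condition filters the cartesian product: a sentence matches every
// replacement word it contains, so the first sentence matches both "Allen" and
// "Armstrong" and produces two rows, giving 4 output rows in total instead of 3.
val joined = input.join(replacements, expr("sentence LIKE concat('%', lhs, '%')"))
joined.show(false)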





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Synonym-handling-replacement-issue-with-UDF-in-Apache-Spark-tp28638p28648.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: Spark-SQL collect function

2017-05-03 Thread JayeshLalwani
In any distributed application, you scale up by splitting execution up on
multiple machines. The way Spark does this is by slicing the data into
partitions and spreading them on multiple machines. Logically, an RDD is
exactly that: data is split up and spread around on multiple machines. When
you perform operations on an RDD, Spark tells all the machines to perform
that operation on their own slice of data. SO, for example, if you perform a
filter operation (or if you are using SQL, you do /Select * from tablename
where col=colval/, Spark tells each machine to look for rows that match your
filter criteria in their own slice of data. This operation results in
another distributed dataset that contains the filtered records. Note that
when you do a filter operation, Spark doesn't move data outside of the
machines that they reside in. It keeps the filtered records in the same
machine. This ability of Spark to keep data in place is what provides
scalability. As long as your operations keep data in place, you can scale up
infinitely. If you got 10x more records, you can add 10x more machines, and
you will get the same performance

However, the problem is that a lot of operations cannot be done by keeping
data in place. For example, let's say you have 2 tables/dataframes. Spark
will slice both up and spread them around the machines. Now let's say, you
joined both tables. It may happen that the slice of data that resides in one
machine has matching records in another machine. So, now, Spark has to bring
data over from one machine to another. This is what Spark calls a
/shuffle/. Spark does this intelligently. However, whenever data leaves one
machine and goes to other machines, you cannot scale infinitely. There will
be a point at which you will overwhelm the network, and adding more machines
isn't going to improve performance. 

So, the point is that you have to avoid shuffles as much as possible. You
cannot eliminate shuffles altogether, but you can reduce them

Now, /collect/ is the granddaddy of all shuffles. It causes Spark to bring
all the data that it has distributed over the machines into a single
machine. If you call collect on a large table, it's analogous to drinking
from a firehose. You are going to drown. Calling collect on a small table is
fine, because very little data will move.

Usually, it's recommended to run all your aggregations using Spark SQL, and
when you get the data boiled down to a small enough size that can be
presented to a human, you can call collect on it to fetch it and present it
to the human user. 
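
A minimal sketch of that pattern (aggregate with Spark SQL first, collect only the
small result; the data is made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("collect-sketch").getOrCreate()
import spark.implicits._

val events = Seq(("us", 1), ("us", 3), ("eu", 2)).toDF("region", "clicks")

// The aggregation runs distributed; only the per-region totals get shuffled.
val totals = events.groupBy("region").sum("clicks")

// collect() brings the (now tiny) result to the driver for presentation.
totals.collect().foreach(println)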



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-collect-function-tp28644p28647.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Spark books

2017-05-03 Thread Zeming Yu
I'm trying to decide whether to buy the book learning spark, spark for
machine learning etc. or wait for a new edition covering the new concepts
like dataframe and datasets. Anyone got any suggestions?


Re: Refreshing a persisted RDD

2017-05-03 Thread Tathagata Das
If you want to always get the latest data in files, its best to always
recreate the DataFrame.

On Wed, May 3, 2017 at 7:30 AM, JayeshLalwani  wrote:

> We have a Structured Streaming application that gets accounts from Kafka
> into
> a streaming data frame. We have a blacklist of accounts stored in S3 and we
> want to filter out all the accounts that are blacklisted. So, we are
> loading
> the blacklisted accounts into a batch data frame and joining it with the
> streaming data frame to filter out the bad accounts.
> Now, the blacklist doesn't change very often.. once a week at max. SO, we
> wanted to cache the blacklist data frame to prevent going out to S3
> everytime. Since, the blacklist might change, we want to be able to refresh
> the cache at a cadence, without restarting the whole app.
> So, to begin with we wrote a simple app that caches and refreshes a simple
> data frame. The steps we followed are
> /Create a CSV file
> load CSV into a DF: df = spark.read.csv(filename)
> Persist the data frame: df.persist
> Now when we do df.show, we see the contents of the csv.
> We change the CSV, and call df.show, we can see that the old contents are
> being displayed, proving that the df is cached
> df.unpersist
> df.persist
> df.show/
>
> What we see is that the rows that were modified in the CSV are reloaded..
> But new rows aren't
> Is this expected behavior? Is there a better way to refresh cached data
> without restarting the Spark application?
>
>
>
> --
> View this message in context: http://apache-spark-user-list.
> 1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>


Re: In-order processing using spark streaming

2017-05-03 Thread JayeshLalwani
Option A

If you can get all the messages in a session into the same Spark partition,
you can use df.mapPartitions to process the whole partition. This will
allow you to control the order in which the messages are processed within
the partition.
This will work if messages are posted to Kafka in order and are guaranteed
by Kafka to be delivered in order.
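
A minimal sketch of Option A (the Msg fields are illustrative): bring every message of a
session into the same partition, then order and process them sequentially inside it.

import org.apache.spark.sql.SparkSession

case class Msg(sessionId: String, seq: Long, payload: String)

val spark = SparkSession.builder().master("local[*]").appName("in-order-sketch").getOrCreate()
import spark.implicits._

val ds = Seq(Msg("s1", 2, "b"), Msg("s1", 1, "a"), Msg("s2", 1, "c")).toDS()

val processed = ds.repartition($"sessionId").mapPartitions { msgs =>
  // Within a partition we can impose the processing order ourselves.
  msgs.toSeq.sortBy(m => (m.sessionId, m.seq)).iterator
    .map(m => s"${m.sessionId}:${m.seq}:${m.payload}")
}
processed.show()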

Option B
If the messages can come out of order, and  have a timestamp associated with
them, you can use window operations to sort messages within a window. You
will need to make sure that messages in the same session land in the same
Spark partition. This will add latency to the system though, because you
won't process the messages until the watermark has expired. 



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/In-order-processing-using-spark-streaming-tp28457p28646.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread kant kodali
Got it! So if I do dataset.select("nonExistentColumn") then the Analysis
Error is thrown at compile time, right? But what if I have a dataset in
memory and I go to a database server and execute a delete-column query?
Will the dataset object stay in sync with the underlying database table?

Thanks!

On Wed, May 3, 2017 at 2:25 PM, Michael Armbrust 
wrote:

> An analysis exception occurs whenever the scala/java/python program is
> valid, but the dataframe operations being performed are not.  For example,
> df.select("nonExistentColumn") would throw an analysis exception.
>
> On Wed, May 3, 2017 at 1:38 PM, kant kodali  wrote:
>
>> Hi All,
>>
>> I understand the compile time Errors this blog
>> 
>>  is
>> talking about but I don't understand what are Analysis Errors? Any Examples?
>>
>> Thanks!
>>
>
>


Re: What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread Michael Armbrust
An analysis exception occurs whenever the scala/java/python program is
valid, but the dataframe operations being performed are not.  For example,
df.select("nonExistentColumn") would throw an analysis exception.

On Wed, May 3, 2017 at 1:38 PM, kant kodali  wrote:

> Hi All,
>
> I understand the compile time Errors this blog
> 
>  is
> talking about but I don't understand what are Analysis Errors? Any Examples?
>
> Thanks!
>


What are Analysis Errors With respect to Spark Sql DataFrames and DataSets?

2017-05-03 Thread kant kodali
Hi All,

I understand the compile time Errors this blog

is
talking about but I don't understand what are Analysis Errors? Any Examples?

Thanks!


Re: [Spark Streaming] - Killing application from within code

2017-05-03 Thread Tathagata Das
There isn't a clean programmatic way to kill the application running in the
driver from an executor. You will have to set up an additional RPC mechanism to
explicitly send a signal from the executors to the application/driver to
quit.
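
A rough sketch of the kind of out-of-band signal described above (the shared path is an
assumption; it must be somewhere both the driver and the executors can see, e.g. a
shared mount, and a message queue or a database row would serve the same purpose):

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

object KillSwitchSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("kill-switch-sketch").getOrCreate()
    val killFile = Paths.get("/shared/kill-switch")   // assumption: visible to driver and executors

    // Executor side (e.g. in the Future's failure handler): mark the fatal condition.
    //   Files.createFile(killFile)

    // Driver side: a monitor thread polls the flag and shuts the application down cleanly.
    val monitor = new Thread(new Runnable {
      def run(): Unit = {
        while (!Files.exists(killFile)) Thread.sleep(5000)
        spark.stop()
        sys.exit(1)
      }
    })
    monitor.setDaemon(true)
    monitor.start()

    // ... build and start the streaming job here ...
  }
}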

On Wed, May 3, 2017 at 8:44 AM, Sidney Feiner 
wrote:

> Hey, I'm using connections to Elasticsearch from within my Spark Streaming
> application.
>
> I'm using Futures to maximize performance when it sends network requests
> to the ES cluster.
>
> Basically, I want my app to crash if any one of the executors fails to
> connect to ES.
>
>
>
> The exception gets caught and returned in my Future as a Failure(ex:
> NoNodeAvailableException) but when I handle it, I can't seem to kill my app.
>
> I tried using:
>
>
>
> fut andThen {
>   case Failure(ex: NoNodeAvailableException) =>
>     throw ex
> }
>
> fut andThen {
>   case Failure(ex: NoNodeAvailableException) =>
>     System.exit(-1)
> }
>
> fut onFailure {
>   case ex: NoNodeAvailableException =>
>     throw ex
> }
>
> fut onFailure {
>   case ex: NoNodeAvailableException =>
>     System.exit(-1)
> }
>
>
>
>
>
> But none of them seem to be killing my app. The System.exit(-1) kills my
> executor but that doesn't seem like the correct way to do it.
>
> And no matter what way I try, the driver stays alive.
>
>
>
> Is there a way to programmatically kill the application from within one of
> the workers?
>
>
>
> Thanks a lot J
>
>
>
>
>
> *Sidney Feiner* */* SW Developer
>
> M: +972.528197720 <+972%2052-819-7720> */* Skype: sidney.feiner.startapp
>
>
>
> [image: emailsignature]
>
>
>
>
>




[Spark Streaming] - Killing application from within code

2017-05-03 Thread Sidney Feiner
Hey, I'm using connections to Elasticsearch from within my Spark Streaming 
application.
I'm using Futures to maximize performance when it sends network requests to the 
ES cluster.
Basically, I want my app to crash if any one of the executors fails to connect 
to ES.

The exception gets caught and returned in my Future as a Failure(ex:
NoNodeAvailableException) but when I handle it, I can't seem to kill my app.
I tried using:

fut andThen {
  case Failure(ex: NoNodeAvailableException) =>
    throw ex
}

fut andThen {
  case Failure(ex: NoNodeAvailableException) =>
    System.exit(-1)
}

fut onFailure {
  case ex: NoNodeAvailableException =>
    throw ex
}

fut onFailure {
  case ex: NoNodeAvailableException =>
    System.exit(-1)
}


But none of them seem to be killing my app. The System.exit(-1) kills my 
executor but that doesn't seem like the correct way to do it.
And no matter what way I try, the driver stays alive.

Is there a way to programmatically kill the application from within one of the 
workers?

Thanks a lot :)


Sidney Feiner / SW Developer
M: +972.528197720 / Skype: sidney.feiner.startapp

[emailsignature]




Refreshing a persisted RDD

2017-05-03 Thread JayeshLalwani
We have a Structured Streaming application that gets accounts from Kafka into
a streaming data frame. We have a blacklist of accounts stored in S3 and we
want to filter out all the accounts that are blacklisted. So, we are loading
the blacklisted accounts into a batch data frame and joining it with the
streaming data frame to filter out the bad accounts.
Now, the blacklist doesn't change very often, once a week at most. So, we
wanted to cache the blacklist data frame to prevent going out to S3
every time. Since the blacklist might change, we want to be able to refresh
the cache at a cadence, without restarting the whole app.
So, to begin with we wrote a simple app that caches and refreshes a simple
data frame. The steps we followed are
/Create a CSV file
load CSV into a DF: df = spark.read.csv(filename)
Persist the data frame: df.persist
Now when we do df.show, we see the contents of the csv.
We change the CSV, and call df.show, we can see that the old contents are
being displayed, proving that the df is cached
df.unpersist
df.persist
df.show/

What we see is that the rows that were modified in the CSV are reloaded..
But new rows aren't
Is this expected behavior? Is there a better way to refresh cached data
without restarting the Spark application?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Refreshing-a-persisted-RDD-tp28642.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: parquet optimal file structure - flat vs nested

2017-05-03 Thread Steve Loughran

> On 30 Apr 2017, at 09:19, Zeming Yu  wrote:
> 
> Hi,
> 
> We're building a parquet based data lake. I was under the impression that 
> flat files are more efficient than deeply nested files (say 3 or 4 levels 
> down). Is that correct?
> 
> Thanks,
> Zeming

Where's the data going to live: HDFS or an object store? If it's somewhere like
Amazon S3, I'd be biased towards the flatter structure, as the way the client
libraries mimic tree-walking is pretty expensive in terms of HTTP calls, and
those calls all take place during the initial, serialized query-planning
stage, so the cost is paid up front.



-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Redefining Spark native UDFs

2017-05-03 Thread Miguel Figueiredo
Hi,

I have pre-defined SQL code that uses a decode UDF which is not compatible
with the existing Spark decode UDF.
I tried to re-define the decode UDF without success. Is there a way to do
this in Spark?

Best regards,
Miguel

-- 
Miguel Figueiredo
Software Developer
http://jaragua.hopto.org 

"I'm a pretty lazy person and am prepared to work quite hard in order to
avoid work."
-- Martin Fowler


Re: map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-03 Thread Tathagata Das
You can apply any kind of aggregation on windows. There are some
built-in aggregations (e.g. sum and count), and there is an API for
user-defined aggregations (Scala/Java) that works with both batch and
streaming DFs.
See the programming guide if you haven't seen it already:
- windowing -
http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time
- UDAFs on typesafe Dataset (scala/java) -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.Aggregator
- UDAFs on generic DataFrames (scala/java) -
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.expressions.UserDefinedAggregateFunction

You can register a UDAF defined in Scala with a name and then call that
function by name in SQL
- https://docs.databricks.com/spark/latest/spark-sql/udaf-scala.html

I feel a combination of these should be sufficient for you. Hope this helps.
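
A minimal Scala sketch of the built-in event-time windowing from the first link (the
broker, topic, and column names are illustrative; the same window/groupBy API exists in
pyspark.sql.functions):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, window}

val spark = SparkSession.builder().appName("window-agg-sketch").getOrCreate()

val words = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .load()
  .selectExpr("CAST(value AS STRING) AS word", "timestamp")

// Count words per 10-minute event-time window, sliding every 5 minutes, and let the
// watermark decide when a window is complete.
val counts = words
  .withWatermark("timestamp", "10 minutes")
  .groupBy(window(col("timestamp"), "10 minutes", "5 minutes"), col("word"))
  .count()

counts.writeStream.outputMode("update").format("console").start()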


On Wed, May 3, 2017 at 1:51 AM, peay  wrote:

> Hello,
>
> I would like to get started on Spark Streaming with a simple window.
>
> I've got some existing Spark code that takes a dataframe, and outputs a
> dataframe. This includes various joins and operations that are not
> supported by structured streaming yet. I am looking to essentially
> map/apply this on the data for each window.
>
> Is there any way to apply a function to a dataframe that would correspond
> to each window? This would mean accumulate data until watermark is reached,
> and then mapping the full corresponding dataframe.
>
> I am using pyspark. I've seen the foreach writer, but it seems to operate
> at partition level instead of a full "window dataframe" and is not
> available for Python anyway.
>
> Thanks!
>


Benchmark of XGBoost, Vowpal Wabbit and Spark ML on Criteo 1TB Dataset

2017-05-03 Thread pklemenkov
Hi!

We've done a cool benchmark of popular ML libraries (including Spark ML) on
the Criteo 1TB dataset:
https://github.com/rambler-digital-solutions/criteo-1tb-benchmark

Spark ML was tested on a real production cluster and showed great results at
scale.

We'd like to see some feedback and tips for improvement. Have a look and
spread the word!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Benchmark-of-XGBoost-Vowpal-Wabbit-and-Spark-ML-on-Criteo-1TB-Dataset-tp28640.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



map/foreachRDD equivalent for pyspark Structured Streaming

2017-05-03 Thread peay
Hello,

I would like to get started on Spark Streaming with a simple window.

I've got some existing Spark code that takes a dataframe, and outputs a 
dataframe. This includes various joins and operations that are not supported by 
structured streaming yet. I am looking to essentially map/apply this on the 
data for each window.

Is there any way to apply a function to a dataframe that would correspond to 
each window? This would mean accumulate data until watermark is reached, and 
then mapping the full corresponding dataframe.

I am using pyspark. I've seen the foreach writer, but it seems to operate at 
partition level instead of a full "window dataframe" and is not available for 
Python anyway.

Thanks!

spark 1.6.0 and gridsearchcv

2017-05-03 Thread issues solution
Hi,
I wonder if we have a method under PySpark 1.6 to perform GridSearchCV-style
parameter tuning? If yes, can I ask for an example please?
Thanks.
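
For what it's worth, the ML pipeline API in Spark 1.6 does have grid search via
ParamGridBuilder plus CrossValidator, and the same classes are exposed in
pyspark.ml.tuning. A minimal sketch in Scala, with an illustrative logistic-regression
pipeline (trainingDF is assumed to be a DataFrame with "features" and "label" columns):

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

val lr = new LogisticRegression()
val pipeline = new Pipeline().setStages(Array(lr))

// The hyper-parameter grid to search over.
val paramGrid = new ParamGridBuilder()
  .addGrid(lr.regParam, Array(0.01, 0.1, 1.0))
  .addGrid(lr.elasticNetParam, Array(0.0, 0.5, 1.0))
  .build()

val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(new BinaryClassificationEvaluator())
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// val cvModel = cv.fit(trainingDF)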