Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

2021-03-05 Thread Felix Cheung
Congrats and thanks!


From: Hyukjin Kwon 
Sent: Wednesday, March 3, 2021 4:09:23 PM
To: Dongjoon Hyun 
Cc: Gabor Somogyi ; Jungtaek Lim 
; angers zhu ; Wenchen Fan 
; Kent Yao ; Takeshi Yamamuro 
; dev ; user @spark 

Subject: Re: [ANNOUNCE] Announcing Apache Spark 3.1.1

Thank you so much, guys... it indeed took a long time and it was pretty tough
this time :-).
It was all possible because of your support. I sincerely appreciate it.

On Thu, Mar 4, 2021 at 2:26 AM, Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
It took a long time. Thank you, Hyukjin and all!

Bests,
Dongjoon.

On Wed, Mar 3, 2021 at 3:23 AM Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
Good to hear and great work Hyukjin! 

On Wed, 3 Mar 2021, 11:15 Jungtaek Lim <kabhwan.opensou...@gmail.com> wrote:
Thanks Hyukjin for driving the huge release, and thanks everyone for 
contributing the release!

On Wed, Mar 3, 2021 at 6:54 PM angers zhu <angers@gmail.com> wrote:
Great work, Hyukjin !

Bests,
Angers

On Wed, Mar 3, 2021 at 5:02 PM, Wenchen Fan <cloud0...@gmail.com> wrote:
Great work and congrats!

On Wed, Mar 3, 2021 at 3:51 PM Kent Yao <yaooq...@qq.com> wrote:
Congrats, all!

Bests,
Kent Yao
@ Data Science Center, Hangzhou Research Institute, NetEase Corp.
a spark enthusiast
kyuubi: a unified multi-tenant JDBC interface for large-scale data processing
and analytics, built on top of Apache Spark.
spark-authorizer: a Spark SQL extension which provides SQL Standard
Authorization for Apache Spark.
spark-postgres: a library for reading data from and transferring data to
Postgres / Greenplum with Spark SQL and DataFrames, 10~100x faster.
spark-func-extras: a library that brings excellent and useful functions from
various modern database management systems to Apache Spark.



On 03/3/2021 15:11, Takeshi Yamamuro wrote:
Great work and Congrats, all!

Bests,
Takeshi

On Wed, Mar 3, 2021 at 2:18 PM Mridul Muralidharan <mri...@gmail.com> wrote:

Thanks Hyukjin and congratulations everyone on the release !

Regards,
Mridul

On Tue, Mar 2, 2021 at 8:54 PM Yuming Wang <wgy...@gmail.com> wrote:
Great work, Hyukjin!

On Wed, Mar 3, 2021 at 9:50 AM Hyukjin Kwon <gurwls...@gmail.com> wrote:
We are excited to announce Spark 3.1.1 today.

Apache Spark 3.1.1 is the second release of the 3.x line. This release adds
Python type annotations and Python dependency management support as part of 
Project Zen.
Other major updates include improved ANSI SQL compliance support, history 
server support
in structured streaming, the general availability (GA) of Kubernetes and node 
decommissioning
in Kubernetes and Standalone. In addition, this release continues to focus on 
usability, stability,
and polish while resolving around 1500 tickets.
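As a small illustration of the ANSI compliance work mentioned above (a sketch, not
part of the original announcement; the query is only an example), the behaviour can
be toggled per session, shown here from SparkR:

library(SparkR)
sparkR.session(sparkConfig = list("spark.sql.ansi.enabled" = "true"))
head(sql("SELECT CAST('abc' AS INT)"))  # fails fast under ANSI mode; returns NA with ANSI off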

We'd like to thank our contributors and users for their contributions and early 
feedback to
this release. This release would not have been possible without you.

To download Spark 3.1.1, head over to the download page:
http://spark.apache.org/downloads.html

To view the release notes:
https://spark.apache.org/releases/spark-release-3-1-1.html



--
---
Takeshi Yamamuro


Fwd: Announcing ApacheCon @Home 2020

2020-07-01 Thread Felix Cheung

-- Forwarded message -

We are pleased to announce that ApacheCon @Home will be held online,
September 29 through October 1.

More event details are available at https://apachecon.com/acah2020 but
there are a few things that I want to highlight for you, the members.

Yes, the CFP has been reopened. It will be open until the morning of
July 13th. With no restrictions on space/time at the venue, we can
accept talks from a much wider pool of speakers, so we look forward to
hearing from those of you who may have been reluctant, or unwilling, to
travel to the US.
Yes, you can add your project to the event, whether that’s one talk, or
an entire track - we have the room now. Those of you who are PMC members
will be receiving information about how to get your projects represented
at the event.
Attendance is free, as has been the trend in these events in our
industry. We do, however, offer donation options for attendees who feel
that our content is worth paying for.
Sponsorship opportunities are available immediately at
https://www.apachecon.com/acna2020/sponsors.html

If you would like to volunteer to help, we ask that you join the
plann...@apachecon.com mailing list and discuss 
it there, rather than
here, so that we do not have a split discussion, while we’re trying to
coordinate all of the things we have to get done in this very short time
window.

Rich Bowen,
VP Conferences, The Apache Software Foundation




Re: [ANNOUNCE] Apache Spark 3.0.0

2020-06-18 Thread Felix Cheung
Congrats


From: Jungtaek Lim 
Sent: Thursday, June 18, 2020 8:18:54 PM
To: Hyukjin Kwon 
Cc: Mridul Muralidharan ; Reynold Xin ; 
dev ; user 
Subject: Re: [ANNOUNCE] Apache Spark 3.0.0

Great, thanks all for your efforts on the huge step forward!

On Fri, Jun 19, 2020 at 12:13 PM Hyukjin Kwon <gurwls...@gmail.com> wrote:
Yay!

On Fri, Jun 19, 2020 at 4:46 AM, Mridul Muralidharan <mri...@gmail.com> wrote:
Great job everyone ! Congratulations :-)

Regards,
Mridul

On Thu, Jun 18, 2020 at 10:21 AM Reynold Xin <r...@databricks.com> wrote:

Hi all,

Apache Spark 3.0.0 is the first release of the 3.x line. It builds on many of 
the innovations from Spark 2.x, bringing new ideas as well as continuing 
long-term projects that have been in development. This release resolves more 
than 3400 tickets.

We'd like to thank our contributors and users for their contributions and early 
feedback to this release. This release would not have been possible without you.

To download Spark 3.0.0, head over to the download page: 
http://spark.apache.org/downloads.html

To view the release notes: 
https://spark.apache.org/releases/spark-release-3-0-0.html





Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Felix Cheung
Maybe it’s the reverse - the package is built to run on the latest R but is not
compatible with a slightly older one (3.5.2 was released in Dec 2018).
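For reference, a quick way to check for that kind of mismatch from the R console
(a sketch, not part of the original reply):

R.version.string                    # the R you are running, e.g. "R version 3.5.2 ..."
packageDescription("SparkR")$Built  # the R version SparkR was built under, e.g. "R 3.6.2; ..."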


From: Jeff Zhang 
Sent: Thursday, December 26, 2019 5:36:50 PM
To: Felix Cheung 
Cc: user.spark 
Subject: Re: Fail to use SparkR of 3.0 preview 2

I use R 3.5.2

On Fri, Dec 27, 2019 at 4:32 AM, Felix Cheung <felixcheun...@hotmail.com> wrote:
It looks like a change in the method signature in R base packages.

Which version of R are you running on?


From: Jeff Zhang <zjf...@gmail.com>
Sent: Thursday, December 26, 2019 12:46:12 AM
To: user.spark <user@spark.apache.org>
Subject: Fail to use SparkR of 3.0 preview 2

I tried SparkR of spark 3.0 preview 2, but hit the following issue.

Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
  number of columns of matrices must match (see arg 2)
Error: package or namespace load failed for ‘SparkR’ in rbind(info, 
getNamespaceInfo(env, "S3methods")):
 number of columns of matrices must match (see arg 2)
During startup - Warning messages:
1: package ‘SparkR’ was built under R version 3.6.2
2: package ‘SparkR’ in options("defaultPackages") was not found

Does anyone know what might be wrong ? Thanks



--
Best Regards

Jeff Zhang


--
Best Regards

Jeff Zhang


Re: Fail to use SparkR of 3.0 preview 2

2019-12-26 Thread Felix Cheung
It looks like a change in the method signature in R base packages.

Which version of R are you running on?


From: Jeff Zhang 
Sent: Thursday, December 26, 2019 12:46:12 AM
To: user.spark 
Subject: Fail to use SparkR of 3.0 preview 2

I tried SparkR of spark 3.0 preview 2, but hit the following issue.

Error in rbind(info, getNamespaceInfo(env, "S3methods")) :
  number of columns of matrices must match (see arg 2)
Error: package or namespace load failed for ‘SparkR’ in rbind(info, 
getNamespaceInfo(env, "S3methods")):
 number of columns of matrices must match (see arg 2)
During startup - Warning messages:
1: package ‘SparkR’ was built under R version 3.6.2
2: package ‘SparkR’ in options("defaultPackages") was not found

Does anyone know what might be wrong ? Thanks



--
Best Regards

Jeff Zhang


Re: SparkR integration with Hive 3 spark-r

2019-11-24 Thread Felix Cheung
I think you will get more answers if you ask without the SparkR part.

Your question is independent of SparkR.

Spark support for Hive 3.x (3.1.2) was added here

https://github.com/apache/spark/commit/1b404b9b9928144e9f527ac7b1caa15f932c2649

You should be able to connect Spark to Hive metastore.
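A minimal SparkR sketch of doing that (assuming a Spark build with Hive 3.1
metastore support; the metastore URI and version values are placeholders, not
taken from this thread):

library(SparkR)
sparkR.session(enableHiveSupport = TRUE,
               sparkConfig = list(
                 "spark.sql.hive.metastore.version" = "3.1.2",
                 "spark.sql.hive.metastore.jars"    = "maven",
                 "spark.hadoop.hive.metastore.uris" = "thrift://metastore-host:9083"))
head(sql("SHOW DATABASES"))   # should list databases from the external metastore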




From: Alfredo Marquez 
Sent: Friday, November 22, 2019 4:26:49 PM
To: user@spark.apache.org 
Subject: Re: SparkR integration with Hive 3 spark-r

Does anyone else have some insight to this question?

Thanks,

Alfredo

On Mon, Nov 18, 2019, 3:00 PM Alfredo Marquez <alfredo.g.marq...@gmail.com> wrote:
Hello Nicolas,

Well, the issue is that with Hive 3, Spark gets its own metastore, separate
from the Hive 3 metastore.  So how do you reconcile this separation of
metastores?

Can you continue to "enableHivemetastore" and be able to connect to Hive 3? 
Does this connection take advantage of Hive's LLAP?

Our team doesn't believe that it's possible to make the connection as you would
in the past.  But if it is that simple, I would be ecstatic.

Thanks,

Alfredo

On Mon, Nov 18, 2019, 12:53 PM Nicolas Paris <nicolas.pa...@riseup.net> wrote:
Hi Alfredo

my 2 cents:
To my knowledge, and reading the spark3 pre-release notes, it will handle
hive metastore 2.3.5 - no mention of a hive 3 metastore. I made several
tests on this in the past[1] and it seems to handle any hive metastore
version.

However, spark cannot read Hive managed tables (AKA transactional tables).
So I would say you should be able to read any hive 3 regular table with
any of spark, pyspark or sparkR.


[1] https://parisni.frama.io/posts/playing-with-hive-spark-metastore-versions/

On Mon, Nov 18, 2019 at 11:23:50AM -0600, Alfredo Marquez wrote:
> Hello,
>
> Our company is moving to Hive 3, and they are saying that there is no SparkR
> implementation in Spark 2.3.x + that will connect to Hive 3.  Is this true?
>
> If it is true, will this be addressed in the Spark 3 release?
>
> I don't use python, so losing SparkR to get work done on Hadoop is a huge 
> loss.
>
> P.S. This is my first email to this community; if there is something I should
> do differently, please let me know.
>
> Thank you
>
> Alfredo

--
nicolas




Re: JDK11 Support in Apache Spark

2019-08-24 Thread Felix Cheung
That’s great!


From: ☼ R Nair 
Sent: Saturday, August 24, 2019 10:57:31 AM
To: Dongjoon Hyun 
Cc: d...@spark.apache.org; user @spark
Subject: Re: JDK11 Support in Apache Spark

Finally!!! Congrats

On Sat, Aug 24, 2019, 11:11 AM Dongjoon Hyun <dongjoon.h...@gmail.com> wrote:
Hi, All.

Thanks to your many many contributions,
Apache Spark master branch starts to pass on JDK11 as of today.
(with `hadoop-3.2` profile: Apache Hadoop 3.2 and Hive 2.3.6)


https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/326/
(JDK11 is used for building and testing.)

We already verified all UTs (including PySpark/SparkR) before.

Please feel free to use JDK11 in order to build/test/run `master` branch and
share your experience including any issues. It will help Apache Spark 3.0.0 
release.

For the follow-ups, please follow 
https://issues.apache.org/jira/browse/SPARK-24417 .
The next step is `how to support JDK8/JDK11 together in a single artifact`.

Bests,
Dongjoon.


Re: [PySpark] [SparkR] Is it possible to invoke a PySpark function with a SparkR DataFrame?

2019-07-16 Thread Felix Cheung
Not currently in Spark.

However, there are systems out there that can share DataFrame between languages 
on top of Spark - it’s not calling the python UDF directly but you can pass the 
DataFrame to python and then .map(UDF) that way.



From: Fiske, Danny 
Sent: Monday, July 15, 2019 6:58:32 AM
To: user@spark.apache.org
Subject: [PySpark] [SparkR] Is it possible to invoke a PySpark function with a 
SparkR DataFrame?

Hi all,

Forgive this naïveté, I’m looking for reassurance from some experts!

In the past we created a tailored Spark library for our organisation, 
implementing Spark functions in Scala with Python and R “wrappers” on top, but 
the focus on Scala has alienated our analysts/statisticians/data scientists and 
collaboration is important for us (yeah… we’re aware that your SDKs are very 
similar across languages… :/ ). We’d like to see if we could forego the Scala 
facet in order to present the source code in a language more familiar to users 
and internal contributors.

We’d ideally write our functions with PySpark and potentially create a SparkR 
“wrapper” over the top, leading to the question:

Given a function written with PySpark that accepts a DataFrame parameter, is 
there a way to invoke this function using a SparkR DataFrame?

Is there any reason to pursue this? Is it even possible?

Many thanks,

Danny



Re: Spark SQL in R?

2019-06-08 Thread Felix Cheung
I don’t think you should get a hive-site.xml from the internet.

It should have connection information about a running hive metastore - if you
don’t have a hive metastore service, as you are running locally (from a laptop?),
then you don’t really need it. You can get spark to work with its own.
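For a purely local setup, a minimal sketch (the paths follow the ones used later
in the thread; the SQL is illustrative) would be:

library(SparkR)
sparkR.session(master = "local[*]",
               sparkHome = "/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
sql("CREATE DATABASE IF NOT EXISTS learnsql")
head(sql("SHOW DATABASES"))   # Spark uses its own embedded metastore here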




From: ya 
Sent: Friday, June 7, 2019 8:26:27 PM
To: Rishikesh Gawade; felixcheun...@hotmail.com; user@spark.apache.org
Subject: Spark SQL in R?

Dear Felix and Richikesh and list,

Thank you very much for your previous help. So far I have tried two ways to 
trigger Spark SQL: one is to use R with sparklyr library and SparkR library; 
the other way is to use SparkR shell from Spark. I am not connecting a remote 
spark cluster, but a local one. Both failed with or without hive-site.xml. I 
suspect the content of hive-site.xml I found online was not appropriate for 
this case, as the spark session can not be initialized after adding this 
hive-site.xml. My questions are:

1. Is there any example for the content of hive-site.xml for this case?

2. I used sql() function to call the Spark SQL, is this the right way to do it?

###
##Here is the content in the hive-site.xml:##
###



<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.76.100:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>






##Here is what happened in R:##


> library(sparklyr) # load sparklyr package
> sc=spark_connect(master="local",spark_home="/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
>  # connect sparklyr with spark
> sql('create database learnsql')
Error in sql("create database learnsql") : could not find function "sql"
> library(SparkR)

Attaching package: ‘SparkR’

The following object is masked from ‘package:sparklyr’:

collect

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind,
sample, startsWith, subset, summary, transform, union

> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
Spark not found in SPARK_HOME:
Spark package found in SPARK_HOME: 
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7
Launching java with spark-submit command 
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7/bin/spark-submit   
sparkr-shell 
/var/folders/d8/7j6xswf92c3gmhwy_lrk63pmgn/T//Rtmpz22kK9/backend_port103d4cfcfd2c
19/06/08 11:14:57 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Error in handleErrors(returnStatus, conn) :

…... hundreds of lines of information and mistakes here ……

> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized



###
##Here is what happened in SparkR shell:##


Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
at 
org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:145)
at 
org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:144)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:80)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:79)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.Iterator$class.foreach(Iterator.sca
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized



Thank you very much.

YA







On Jun 8, 2019, at 1:44 AM, Rishikesh Gawade wrote:

Re: sparksql in sparkR?

2019-06-07 Thread Felix Cheung
This seems to be more a question about the spark-sql shell? I'd suggest you change
the email title to get more attention.
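One detail, not part of the original reply but suggested by the errors quoted
further down: sql() executes a single statement per call, and the parse errors
indicate that the NOT NULL column constraints and integer(n) precision are not
accepted by Spark's DDL parser. A rough sketch along those lines (columns trimmed
for brevity):

sql("CREATE DATABASE IF NOT EXISTS learnsql")
sql("USE learnsql")
sql("CREATE TABLE employee_tbl (emp_id STRING, emp_name STRING, emp_city STRING, emp_zip INT, emp_phone BIGINT)")
sql("INSERT INTO employee_tbl VALUES ('0001', 'john', 'gz', 510006, 1353)")
head(sql("SELECT * FROM employee_tbl"))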


From: ya 
Sent: Wednesday, June 5, 2019 11:48:17 PM
To: user@spark.apache.org
Subject: sparksql in sparkR?

Dear list,

I am trying to use Spark SQL within R. I have the following questions;
could you give me some advice please? Thank you very much.

1. I connect my R and spark using the library sparkR, probably some of the 
members here also are R users? Do I understand correctly that SparkSQL can be 
connected and triggered via SparkR and used in R (not in sparkR shell of spark)?

2. I loaded the SparkR library in R, trying to create a new SQL database and a table,
but I could not get the database and the table I want. The code looks like below:

library(SparkR)
Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
sql("create database learnsql; use learnsql")
sql("
create table employee_tbl
(emp_id varchar(10) not null,
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null);
insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
select*from employee_tbl;
“)

I ran the following code in spark-sql shell, I get the database learnsql, 
however, I still can’t get the table.

spark-sql> create database learnsql;show databases;
19/06/06 14:42:36 INFO HiveMetaStore: 0: create_database: 
Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
19/06/06 14:42:36 INFO audit: ugi=yaip=unknown-ip-addr  
cmd=create_database: Database(name:learnsql, description:, 
locationUri:file:/Users/ya/spark-warehouse/learnsql.db, parameters:{})
Error in query: org.apache.hadoop.hive.metastore.api.AlreadyExistsException: 
Database learnsql already exists;

spark-sql> create table employee_tbl
 > (emp_id varchar(10) not null,
 > emp_name char(10) not null,
 > emp_st_addr char(10) not null,
 > emp_city char(10) not null,
 > emp_st char(10) not null,
 > emp_zip integer(5) not null,
 > emp_phone integer(10) null,
 > emp_pager integer(10) null);
Error in query:
no viable alternative at input 'create table employee_tbl\n(emp_id varchar(10) 
not'(line 2, pos 20)

== SQL ==
create table employee_tbl
(emp_id varchar(10) not null,
^^^
emp_name char(10) not null,
emp_st_addr char(10) not null,
emp_city char(10) not null,
emp_st char(10) not null,
emp_zip integer(5) not null,
emp_phone integer(10) null,
emp_pager integer(10) null)

spark-sql> insert into employee_tbl values ('0001','john','yanlanjie 
1','gz','jiaoqiaojun','510006','1353');
19/06/06 14:43:43 INFO HiveMetaStore: 0: get_table : db=default tbl=employee_tbl
19/06/06 14:43:43 INFO audit: ugi=yaip=unknown-ip-addr  cmd=get_table : 
db=default tbl=employee_tbl
Error in query: Table or view not found: employee_tbl; line 1 pos 0


Does Spark SQL have a different grammar? What did I miss?

Thank you very much.

Best regards,

YA







Re: Should python-2 be supported in Spark 3.0?

2019-05-31 Thread Felix Cheung
Very subtle, but someone might take

“We will drop Python 2 support in a future release in 2020”

to mean any / the first release in 2020, whereas the next statement indicates
patch releases are not included in the above. It might help to reorder the items
or clarify the wording.



From: shane knapp 
Sent: Friday, May 31, 2019 7:38:10 PM
To: Denny Lee
Cc: Holden Karau; Bryan Cutler; Erik Erlandson; Felix Cheung; Mark Hamstra; 
Matei Zaharia; Reynold Xin; Sean Owen; Wenchen Fen; Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1000  ;)

On Sat, Jun 1, 2019 at 6:53 AM Denny Lee <denny.g@gmail.com> wrote:
+1

On Fri, May 31, 2019 at 17:58 Holden Karau <hol...@pigscanfly.ca> wrote:
+1

On Fri, May 31, 2019 at 5:41 PM Bryan Cutler <cutl...@gmail.com> wrote:
+1 and the draft sounds good

On Thu, May 30, 2019, 11:32 AM Xiangrui Meng <men...@gmail.com> wrote:
Here is the draft announcement:

===
Plan for dropping Python 2 support

As many of you already know, the Python core development team and many widely used
Python packages like pandas and NumPy will drop Python 2 support on or before
2020/01/01. Apache Spark has supported both Python 2 and 3 since the Spark 1.4
release in 2015. However, maintaining Python 2/3 compatibility is an increasing
burden and it essentially limits the use of Python 3 features in Spark. Given
that the end of life (EOL) of Python 2 is coming, we plan to eventually drop Python
2 support as well. The current plan is as follows:

* In the next major release in 2019, we will deprecate Python 2 support. 
PySpark users will see a deprecation warning if Python 2 is used. We will 
publish a migration guide for PySpark users to migrate to Python 3.
* We will drop Python 2 support in a future release in 2020, after Python 2 EOL 
on 2020/01/01. PySpark users will see an error if Python 2 is used.
* For releases that support Python 2, e.g., Spark 2.4, their patch releases 
will continue supporting Python 2. However, after Python 2 EOL, we might not 
take patches that are specific to Python 2.
===

Sean helped make a pass. If it looks good, I'm going to upload it to Spark 
website and announce it here. Let me know if you think we should do a VOTE 
instead.

On Thu, May 30, 2019 at 9:21 AM Xiangrui Meng <men...@gmail.com> wrote:
I created https://issues.apache.org/jira/browse/SPARK-27884 to track the work.

On Thu, May 30, 2019 at 2:18 AM Felix Cheung <felixcheun...@hotmail.com> wrote:
We don’t usually reference a future release on website

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when Spark 3.0 is coming out. Might need to
provide some clarity on that.

We can say the "next major release in 2019" instead of Spark 3.0. Spark 3.0 
timeline certainly requires a new thread to discuss.




From: Reynold Xin <r...@databricks.com>
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; 
Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp <skn...@berkeley.edu> wrote:
I don't have a good sense of the overhead of continuing to support
Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support 
for python2.7 for spark 2.x as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all 
of the ansible and anaconda configs for python2.x.  :)

On the development side, in a future release that drops Python 2 support we can 
remove code that maintains python 2/3 compatibility and start using python 3 
only features, which is also quite exciting.


shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


--
Twitter: https://twitter.com/holdenkarau
Books (Learning Spark, High Performance Spark, etc.): https://amzn.to/2MaRAG9
YouTube Live Streams: https://www.youtube.com/user/holdenkarau


--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Should python-2 be supported in Spark 3.0?

2019-05-30 Thread Felix Cheung
We don’t usually reference a future release on website

> Spark website and state that Python 2 is deprecated in Spark 3.0

I suspect people will then ask when Spark 3.0 is coming out. Might need to
provide some clarity on that.



From: Reynold Xin 
Sent: Thursday, May 30, 2019 12:59:14 AM
To: shane knapp
Cc: Erik Erlandson; Mark Hamstra; Matei Zaharia; Sean Owen; Wenchen Fen; 
Xiangrui Meng; dev; user
Subject: Re: Should python-2 be supported in Spark 3.0?

+1 on Xiangrui’s plan.

On Thu, May 30, 2019 at 7:55 AM shane knapp <skn...@berkeley.edu> wrote:
I don't have a good sense of the overhead of continuing to support
Python 2; is it large enough to consider dropping it in Spark 3.0?

from the build/test side, it will actually be pretty easy to continue support 
for python2.7 for spark 2.x as the feature sets won't be expanding.

that being said, i will be cracking a bottle of champagne when i can delete all 
of the ansible and anaconda configs for python2.x.  :)

shane
--
Shane Knapp
UC Berkeley EECS Research / RISELab Staff Technical Lead
https://rise.cs.berkeley.edu


Re: Static partitioning in partitionBy()

2019-05-07 Thread Felix Cheung
You could

df.filter(col("c") === "c1").write.partitionBy("c").save(path)

It could hit some data skew problems, but it might work for you.
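For completeness, a SparkR sketch of the same idea (assuming Spark 2.3+, where
write.df accepts a partitionBy argument; the path is a placeholder):

write.df(filter(df, df$c == "c1"),
         path = "/output/table", source = "parquet",
         mode = "append", partitionBy = "c")
# append mode leaves other partition directories untouched; only rows with c = c1 are written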




From: Burak Yavuz 
Sent: Tuesday, May 7, 2019 9:35:10 AM
To: Shubham Chaurasia
Cc: dev; user@spark.apache.org
Subject: Re: Static partitioning in partitionBy()

It depends on the data source. Delta Lake (https://delta.io) allows you to do 
it with the .option("replaceWhere", "c = c1"). With other file formats, you can 
write directly into the partition directory (tablePath/c=c1), but you lose 
atomicity.

On Tue, May 7, 2019, 6:36 AM Shubham Chaurasia <shubh.chaura...@gmail.com> wrote:
Hi All,

Is there a way I can provide static partitions in partitionBy()?

Like:
df.write.mode("overwrite").format("MyDataSource").partitionBy("c=c1").save

Above code gives following error as it tries to find column `c=c1` in df.

org.apache.spark.sql.AnalysisException: Partition column `c=c1` not found in 
schema struct;

Thanks,
Shubham


Re: ApacheCon NA 2019 Call For Proposal and help promoting Spark project

2019-04-14 Thread Felix Cheung
And a plug for the Graph Processing track -

A discussion or comparison talk between the various Spark options (GraphX,
GraphFrames, CAPS), or on the ongoing work with SPARK-25994 (Property Graphs,
Cypher Queries, and Algorithms), would be great!




From: Felix Cheung 
Sent: Saturday, April 13, 2019 9:51 AM
To: Spark Dev List; user@spark.apache.org
Subject: ApacheCon NA 2019 Call For Proposal and help promoting Spark project

Hi Spark community!

As you know, ApacheCon NA 2019 is coming this Sept and its CFP is now open!
This is an important milestone as we celebrate 20 years of the ASF. We have tracks
like Big Data and Machine Learning among many others. Please submit your
talks/thoughts/challenges/learnings here:
https://www.apachecon.com/acna19/cfp.html

Second, as a community I think it’d be great if we had a post on the
http://spark.apache.org/ website to promote this event as well. We already have a
logo link up, and perhaps we could add a post that talks about:
what the Spark project is, what you might learn, a few suggestions of talk
topics, why speak at ApacheCon, etc. This will then be linked from the
ApacheCon official website. Any volunteer from the community?

Third, Twitter. I’m not sure who has access to the ApacheSpark Twitter account 
but it’d be great to promote this. Use the hashtags #ApacheCon and #ACNA19. 
Mention @Apachecon. Please use
https://www.apachecon.com/acna19/cfp.html to promote the CFP, and
https://www.apachecon.com/acna19 to promote the event as a whole.



ApacheCon NA 2019 Call For Proposal and help promoting Spark project

2019-04-13 Thread Felix Cheung
Hi Spark community!

As you know, ApacheCon NA 2019 is coming this Sept and its CFP is now open!
This is an important milestone as we celebrate 20 years of the ASF. We have tracks
like Big Data and Machine Learning among many others. Please submit your
talks/thoughts/challenges/learnings here:
https://www.apachecon.com/acna19/cfp.html

Second, as a community I think it’d be great if we had a post on the
http://spark.apache.org/ website to promote this event as well. We already have a
logo link up, and perhaps we could add a post that talks about:
what the Spark project is, what you might learn, a few suggestions of talk
topics, why speak at ApacheCon, etc. This will then be linked from the
ApacheCon official website. Any volunteer from the community?

Third, Twitter. I’m not sure who has access to the ApacheSpark Twitter account 
but it’d be great to promote this. Use the hashtags #ApacheCon and #ACNA19. 
Mention @Apachecon. Please use
https://www.apachecon.com/acna19/cfp.html to promote the CFP, and
https://www.apachecon.com/acna19 to promote the event as a whole.



Re: spark.submit.deployMode: cluster

2019-03-28 Thread Felix Cheung
If anyone wants to improve docs please create a PR.

lol


But seriously you might want to explore other projects that manage job 
submission on top of spark instead of rolling your own with spark-submit.



From: Pat Ferrel 
Sent: Tuesday, March 26, 2019 2:38 PM
To: Marcelo Vanzin
Cc: user
Subject: Re: spark.submit.deployMode: cluster

Ahh, thank you indeed!

It would have saved us a lot of time if this had been documented. I know, OSS 
so contributions are welcome… I can also imagine your next comment; “If anyone 
wants to improve docs see the Apache contribution rules and create a PR.” or 
something like that.

BTW, the code where the context is known and can be used is what I’d call a
Driver, and since all code is copied to nodes and is known in jars, it was not
obvious to us that this rule existed, but it does make sense.

We will need to refactor our code to use spark-submit it appears.

Thanks again.


From: Marcelo Vanzin 
Reply: Marcelo Vanzin 
Date: March 26, 2019 at 1:59:36 PM
To: Pat Ferrel 
Cc: user 
Subject:  Re: spark.submit.deployMode: cluster

If you're not using spark-submit, then that option does nothing.

If by "context creation API" you mean "new SparkContext()" or an
equivalent, then you're explicitly creating the driver inside your
application.

On Tue, Mar 26, 2019 at 1:56 PM Pat Ferrel <p...@occamsmachete.com> wrote:
>
> I have a server that starts a Spark job using the context creation API. It 
> DOES NOT use spark-submit.
>
> I set spark.submit.deployMode = “cluster”
>
> In the GUI I see 2 workers with 2 executors. The link for running application 
> “name” goes back to my server, the machine that launched the job.
>
> This is spark.submit.deployMode = “client” according to the docs. I set the 
> Driver to run on the cluster but it runs on the client, ignoring the 
> spark.submit.deployMode.
>
> Is this as expected? It is documented nowhere I can find.
>


--
Marcelo


Re: Spark - Hadoop custom filesystem service loading

2019-03-23 Thread Felix Cheung
Hmm thanks. Do you have a proposed solution?



From: Jhon Anderson Cardenas Diaz 
Sent: Monday, March 18, 2019 1:24 PM
To: user
Subject: Spark - Hadoop custom filesystem service loading

Hi everyone,

On spark 2.2.0, if you wanted to create a custom file system implementation, 
you just created an extension of org.apache.hadoop.fs.FileSystem and put the 
canonical name of the custom class on the file 
src/main/resources/META-INF/services/org.apache.hadoop.fs.FileSystem.

Once you imported that jar dependency on your spark submit application, the 
custom schema was automatically loaded, and you could start to use it just like 
ds.load("customfs://path").

But on spark 2.4.0 that does not seem to work the same. If you do exactly the 
same you will get an error like "No FileSystem for customfs".

The only way I achieved this on 2.4.0, was specifying the spark property 
spark.hadoop.fs.customfs.impl.
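(For illustration, a SparkR sketch of that workaround; the "customfs" scheme comes
from this thread, while the implementation class name is a made-up placeholder:)

sparkR.session(sparkConfig = list(
  "spark.hadoop.fs.customfs.impl" = "com.example.fs.CustomFileSystem"))
df <- read.df("customfs://path", source = "parquet")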

Do you guys consider this a bug? Or is it an intentional change that should
be documented somewhere?

Btw, digging a little bit into this, it seems that the cause is that now the
FileSystem is initialized before the actual dependencies are downloaded from the
Maven repo (see here). And as that initialization loads the available filesystems
at that point, and only once, the filesystems in the jars downloaded later are not
taken into account.

Thanks.


Re: Spark-hive integration on HDInsight

2019-02-21 Thread Felix Cheung
You should check with HDInsight support


From: Jay Singh 
Sent: Wednesday, February 20, 2019 11:43:23 PM
To: User
Subject: Spark-hive integration on HDInsight

I am trying to integrate Spark with Hive on an HDInsight Spark cluster.
I copied hive-site.xml into the spark/conf directory. In addition I added hive
metastore properties like the JDBC connection info in Ambari as well. But still the
databases and tables created using spark-sql are not visible in Hive. I changed the
‘spark.sql.warehouse.dir’ value as well to point to the Hive warehouse directory.
Spark does work with Hive when LLAP is not on. What am I missing in the
configuration to integrate Spark with Hive? Any pointer will be appreciated.

thx


Re: SparkR + binary type + how to get value

2019-02-19 Thread Felix Cheung
From the second image it looks like there is a protocol mismatch. I’d check if
the SparkR package running on the Livy machine matches the Spark Java release.

But in any case this seems more an issue with the Livy config. I’d suggest checking
with the community there:




From: Thijs Haarhuis 
Sent: Tuesday, February 19, 2019 5:28 AM
To: Felix Cheung; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Hi Felix,

Thanks. I got it working now by using the unlist function.

I have another question you can maybe help me with, since I did see your
name popping up regarding the spark.lapply function.
I am using Apache Livy and am having troubles using this function, I even 
reported a jira ticket for it at:
https://jira.apache.org/jira/browse/LIVY-558

When I call the spark.lapply function it reports that SparkR is not initialized.
I have looked into the spark.lapply function and it seems there is no spark 
context.
Any idea how I can debug this?

I hope you can help.

Regards,
Thijs


From: Felix Cheung 
Sent: Sunday, February 17, 2019 7:18 PM
To: Thijs Haarhuis; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

A byte buffer in R is the raw vector type, so it seems like it is working as
expected. What do you have in the raw bytes? You could convert them into other
types or access individual bytes directly...

https://stat.ethz.ch/R-manual/R-devel/library/base/html/raw.html



From: Thijs Haarhuis 
Sent: Thursday, February 14, 2019 4:01 AM
To: Felix Cheung; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Hi Felix,
Sure..

I have the following code:

  printSchema(results)
  cat("\n\n\n")

  firstRow <- first(results)
  value <- firstRow$value

  cat(paste0("Value Type: '",typeof(value),"'\n\n\n"))
  cat(paste0("Value: '",value,"'\n\n\n"))

results is a Spark Data Frame here.

When I run this code the following is printed to console:

[inline image of the console output omitted]

You can see there is only a single column in this sdf, of type binary.
When I collect this value and print the type, it prints that it is a list.

Any idea how to get the actual value, or how to process the individual bytes?

Thanks
Thijs


From: Felix Cheung 
Sent: Thursday, February 14, 2019 5:31 AM
To: Thijs Haarhuis; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Please share your code



From: Thijs Haarhuis 
Sent: Wednesday, February 13, 2019 6:09 AM
To: user@spark.apache.org
Subject: SparkR + binary type + how to get value


Hi all,



Does anybody have any experience in accessing the data from a column which has 
a binary type in a Spark Data Frame in R?

I have a Spark Data Frame which has a column which is of a binary type. I want 
to access this data and process it.

In my case I collect the spark data frame to a R data frame and access the 
first row.

When I print this row to the console it does print all the hex values correctly.



However, when I access the column it prints that it is a list of 1… when I print the
type of the child element, it again prints that it is a list.

I expected this value to be of a raw type.



Anybody has some experience with this?



Thanks

Thijs




Re: SparkR + binary type + how to get value

2019-02-17 Thread Felix Cheung
A byte buffer in R is the raw vector type, so it seems like it is working as
expected. What do you have in the raw bytes? You could convert them into other
types or access individual bytes directly...

https://stat.ethz.ch/R-manual/R-devel/library/base/html/raw.html
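A small sketch of what that looks like in practice (the names follow the thread's
example; the conversions shown are just options, not the only ones):

firstRow <- first(results)
bytes <- unlist(firstRow$value)   # the collected cell is a list holding a raw vector
as.integer(bytes)                 # look at individual byte values
rawToChar(bytes)                  # or reinterpret as text, if the payload is text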



From: Thijs Haarhuis 
Sent: Thursday, February 14, 2019 4:01 AM
To: Felix Cheung; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Hi Felix,
Sure..

I have the following code:

  printSchema(results)
  cat("\n\n\n")

  firstRow <- first(results)
  value <- firstRow$value

  cat(paste0("Value Type: '",typeof(value),"'\n\n\n"))
  cat(paste0("Value: '",value,"'\n\n\n"))

results is a Spark Data Frame here.

When I run this code the following is printed to console:

[inline image of the console output omitted]

You can see there is only a single column in this sdf, of type binary.
When I collect this value and print the type, it prints that it is a list.

Any idea how to get the actual value, or how to process the individual bytes?

Thanks
Thijs


From: Felix Cheung 
Sent: Thursday, February 14, 2019 5:31 AM
To: Thijs Haarhuis; user@spark.apache.org
Subject: Re: SparkR + binary type + how to get value

Please share your code



From: Thijs Haarhuis 
Sent: Wednesday, February 13, 2019 6:09 AM
To: user@spark.apache.org
Subject: SparkR + binary type + how to get value


Hi all,



Does anybody have any experience in accessing the data from a column which has 
a binary type in a Spark Data Frame in R?

I have a Spark Data Frame which has a column which is of a binary type. I want 
to access this data and process it.

In my case I collect the spark data frame to a R data frame and access the 
first row.

When I print this row to the console it does print all the hex values correctly.



However, when I access the column it prints that it is a list of 1… when I print the
type of the child element, it again prints that it is a list.

I expected this value to be of a raw type.



Anybody has some experience with this?



Thanks

Thijs




Re: SparkR + binary type + how to get value

2019-02-13 Thread Felix Cheung
Please share your code



From: Thijs Haarhuis 
Sent: Wednesday, February 13, 2019 6:09 AM
To: user@spark.apache.org
Subject: SparkR + binary type + how to get value

Hi all,

Does anybody have any experience in accessing the data from a column which has 
a binary type in a Spark Data Frame in R?
I have a Spark Data Frame which has a column which is of a binary type. I want 
to access this data and process it.
In my case I collect the spark data frame to a R data frame and access the 
first row.
When I print this row to the console it does print all the hex values correctly.

However, when I access the column it prints that it is a list of 1… when I print the
type of the child element, it again prints that it is a list.
I expected this value to be of a raw type.

Anybody has some experience with this?

Thanks
Thijs



Re: java.lang.IllegalArgumentException: Unsupported class file major version 55

2019-02-10 Thread Felix Cheung
And it might not work completely. Spark only officially supports JDK 8.

I’m not sure if JDK 9+ support is complete?



From: Jungtaek Lim 
Sent: Thursday, February 7, 2019 5:22 AM
To: Gabor Somogyi
Cc: Hande, Ranjit Dilip (Ranjit); user@spark.apache.org
Subject: Re: java.lang.IllegalArgumentException: Unsupported class file major 
version 55

ASM 6 doesn't support Java 11. In the master branch (for Spark 3.0) there's a
dependency upgrade to ASM 7 and also some effort (if my understanding is
right) to support Java 11, so you may need to use a lower version of the JDK (8 is
safest) for Spark 2.4.0, and try out the master branch to prepare for Java 11.

Thanks,
Jungtaek Lim (HeartSaVioR)

On Thu, Feb 7, 2019 at 9:18 PM, Gabor Somogyi <gabor.g.somo...@gmail.com> wrote:
Hi Hande,

"Unsupported class file major version 55" means java incompatibility.
This error means you're trying to load a Java "class" file that was compiled 
with a newer version of Java than you have installed.
For example, your .class file could have been compiled for JDK 8, and you're 
trying to run it with JDK 7.
Are you sure 11 is the only JDK which is the default?

Small number of peoples playing with JDK 11 but not heavily tested and used.
Spark may or may not work but not suggested for production in general.

BR,
G


On Thu, Feb 7, 2019 at 12:53 PM Hande, Ranjit Dilip (Ranjit) <ha...@avaya.com> wrote:
Hi,

I am developing one java process which will consume data from Kafka using 
Apache Spark Streaming.
For this I am using following:

Java:
openjdk version "11.0.1" 2018-10-16 LTS
OpenJDK Runtime Environment Zulu11.2+3 (build 11.0.1+13-LTS) OpenJDK 64-Bit 
Server VM Zulu11.2+3 (build 11.0.1+13-LTS, mixed mode)

Maven: (Spark Streaming)

org.apache.spark
spark-streaming-kafka-0-10_2.11
2.4.0


org.apache.spark
spark-streaming_2.11
2.4.0


I am able to compile project successfully but when I try to run I get following 
error:

{"@timestamp":"2019-02-07T11:54:30.624+05:30","@version":"1","message":"Application
 run 
failed","logger_name":"org.springframework.boot.SpringApplication","thread_name":"main","level":"ERROR","level_value":4,"stack_trace":"java.lang.IllegalStateException:
 Failed to execute CommandLineRunner at 
org.springframework.boot.SpringApplication.callRunner(SpringApplication.java:816)
 at 
org.springframework.boot.SpringApplication.callRunners(SpringApplication.java:797)
 at org.springframework.boot.SpringApplication.run(SpringApplication.java:324) 
at 
com.avaya.measures.AgentMeasures.AgentMeasuresApplication.main(AgentMeasuresApplication.java:41)
 at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native 
Method) at 
java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at 
java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.base/java.lang.reflect.Method.invoke(Method.java:566) at 
org.springframework.boot.loader.MainMethodRunner.run(MainMethodRunner.java:48) 
at org.springframework.boot.loader.Launcher.launch(Launcher.java:87) at

org.springframework.boot.loader.Launcher.launch(Launcher.java:50) at 
org.springframework.boot.loader.JarLauncher.main(JarLauncher.java:51)\r\nCaused 
by: java.lang.IllegalArgumentException: Unsupported class file major version 55 
at

 org.apache.xbean.asm6.ClassReader.(ClassReader.java:166) at 
org.apache.xbean.asm6.ClassReader.(ClassReader.java:148) at 
org.apache.xbean.asm6.ClassReader.(ClassReader.java:136) at 
org.apache.xbean.asm6.ClassReader.(ClassReader.java:237) at 
org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:49) 
at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:517)
 at 
org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:500)
 at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
 at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
 at 
scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:134)
 at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236) 
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40) at 
scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:134) at 
scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732) 
at 
org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:500)
 at org.apache.xbean.asm6.ClassReader.readCode(ClassReader.java:2175) at 
org.apache.xbean.asm6.ClassReader.readMethod(ClassReader.java:1238) at 
org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:631) at 
org.apache.xbean.asm6.ClassReader.accept(ClassReader.java:355) at 

Re: I have trained a ML model, now what?

2019-01-23 Thread Felix Cheung
Please comment in the JIRA/SPIP if you are interested! We can see the community 
support for a proposal like this.



From: Pola Yao 
Sent: Wednesday, January 23, 2019 8:01 AM
To: Riccardo Ferrari
Cc: Felix Cheung; User
Subject: Re: I have trained a ML model, now what?

Hi Riccardo,

Right now, Spark does not support low-latency predictions in Production. MLeap 
is an alternative and it's been used in many scenarios. But it's good to see 
that Spark Community has decided to provide such support.

On Wed, Jan 23, 2019 at 7:53 AM Riccardo Ferrari <ferra...@gmail.com> wrote:
Felix, thank you very much for the link. Much appreciated.

The attached PDF is very interesting, I found myself evaluating many of the 
scenarios described in Q3. It's unfortunate the proposal is not being worked 
on, would be great to see that part of the code base.

It is cool to see big players like Uber trying to make Open Source better, 
thanks!


On Tue, Jan 22, 2019 at 5:24 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
About deployment/serving

SPIP
https://issues.apache.org/jira/browse/SPARK-26247



From: Riccardo Ferrari <ferra...@gmail.com>
Sent: Tuesday, January 22, 2019 8:07 AM
To: User
Subject: I have trained a ML model, now what?

Hi list!

I am writing here to hear about your experience with putting Spark ML models into
production at scale.

I know it is a very broad topic with many different faces depending on the 
use-case, requirements, user base and whatever is involved in the task. Still 
I'd like to open a thread about this topic that is as important as properly 
training a model and I feel is often neglected.

The task is serving web users with predictions and the main challenge I see is 
making it agile and swift.

I think there are mainly 3 general categories of such deployment that can be 
described as:

  *   Offline/Batch: Load a model, perform the inference, store the results in
some datastore (DB, indexes, ...)
  *   Spark in the loop: Having a long running Spark context exposed in some 
way, this include streaming as well as some custom application that wraps the 
context.
  *   Use a different technology to load the Spark MLlib model and run the 
inference pipeline. I have read about MLeap and other PMML based solutions.

I would love to hear about opensource solutions and possibly without requiring 
cloud provider specific framework/component.

Again, I am aware each of the previous categories has benefits and drawbacks, so
what would you pick? Why? And how?

Thanks!


Re: I have trained a ML model, now what?

2019-01-22 Thread Felix Cheung
About deployment/serving

SPIP
https://issues.apache.org/jira/browse/SPARK-26247



From: Riccardo Ferrari 
Sent: Tuesday, January 22, 2019 8:07 AM
To: User
Subject: I have trained a ML model, now what?

Hi list!

I am writing here to hear about your experience with putting Spark ML models into
production at scale.

I know it is a very broad topic with many different faces depending on the 
use-case, requirements, user base and whatever is involved in the task. Still 
I'd like to open a thread about this topic that is as important as properly 
training a model and I feel is often neglected.

The task is serving web users with predictions and the main challenge I see is 
making it agile and swift.

I think there are mainly 3 general categories of such deployment that can be 
described as:

  *   Offline/Batch: Load a model, perform the inference, store the results in
some datastore (DB, indexes, ...)
  *   Spark in the loop: Having a long running Spark context exposed in some 
way, this include streaming as well as some custom application that wraps the 
context.
  *   Use a different technology to load the Spark MLlib model and run the 
inference pipeline. I have read about MLeap and other PMML based solutions.

I would love to hear about opensource solutions and possibly without requiring 
cloud provider specific framework/component.

Again, I am aware each of the previous categories has benefits and drawbacks, so
what would you pick? Why? And how?

Thanks!


Re: Persist Dataframe to HDFS considering HDFS Block Size.

2019-01-19 Thread Felix Cheung
You can call coalesce to combine partitions.
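A minimal SparkR sketch of that suggestion (the partition count and path are
illustrative, not tuned to an HDFS block size):

df2 <- coalesce(df, 50L)   # merge the 400 shuffle partitions into ~50 output files
write.df(df2, path = "/warehouse/mytable", source = "parquet", mode = "overwrite")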



From: Shivam Sharma <28shivamsha...@gmail.com>
Sent: Saturday, January 19, 2019 7:43 AM
To: user@spark.apache.org
Subject: Persist Dataframe to HDFS considering HDFS Block Size.

Hi All,

I wanted to persist a dataframe to HDFS. Basically, I am inserting data into a
HIVE table using Spark. Currently, at the time of writing to the HIVE table I have
set total shuffle partitions = 400, so 400 files are being created, which
does not even consider the HDFS block size. How can I tell Spark to persist
according to HDFS blocks?

We have something like this HIVE which solves this problem:

set hive.merge.sparkfiles=true;
set hive.merge.smallfiles.avgsize=204800;
set hive.merge.size.per.task=409600;

Thanks

--
Shivam Sharma
Indian Institute Of Information Technology, Design and Manufacturing Jabalpur
Mobile No- (+91) 8882114744
Email:- 28shivamsha...@gmail.com
LinkedIn:-https://www.linkedin.com/in/28shivamsharma


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-19 Thread Felix Cheung
To clarify, yarn actually supports excluding nodes right when requesting
resources. It’s Spark that doesn’t provide a way to populate such a blacklist.

If you can change yarn config, the equivalent is node label: 
https://hadoop.apache.org/docs/r2.7.4/hadoop-yarn/hadoop-yarn-site/NodeLabel.html




From: Li Gao 
Sent: Saturday, January 19, 2019 8:43 AM
To: Felix Cheung
Cc: Serega Sheypak; user
Subject: Re: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

on yarn it is impossible afaik. on kubernetes you can use taints to keep 
certain nodes outside of spark

On Fri, Jan 18, 2019 at 9:35 PM Felix Cheung <felixcheun...@hotmail.com> wrote:
Not as far as I recall...



From: Serega Sheypak <serega.shey...@gmail.com>
Sent: Friday, January 18, 2019 3:21 PM
To: user
Subject: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

Hi, is there any possibility to tell Scheduler to blacklist specific nodes in 
advance?


Re: Spark on Yarn, is it possible to manually blacklist nodes before running spark job?

2019-01-18 Thread Felix Cheung
Not as far as I recall...



From: Serega Sheypak 
Sent: Friday, January 18, 2019 3:21 PM
To: user
Subject: Spark on Yarn, is it possible to manually blacklist nodes before 
running spark job?

Hi, is there any possibility to tell Scheduler to blacklist specific nodes in 
advance?


Re: spark2.4 arrow enabled true,error log not returned

2019-01-12 Thread Felix Cheung
Do you mean you run the same code on yarn and standalone? Can you check if they 
are running the same python versions?



From: Bryan Cutler 
Sent: Thursday, January 10, 2019 5:29 PM
To: libinsong1...@gmail.com
Cc: zlist Spark
Subject: Re: spark2.4 arrow enabled true,error log not returned

Hi, could you please clarify if you are running a YARN cluster when you see 
this problem?  I tried on Spark standalone and could not reproduce.  If it's on 
a YARN cluster, please file a JIRA and I can try to investigate further.

Thanks,
Bryan

On Sat, Dec 15, 2018 at 3:42 AM 李斌松 <libinsong1...@gmail.com> wrote:
With spark 2.4 and arrow enabled (true), the error log is not returned; in
spark 2.3 there's no such problem.

1. spark.sql.execution.arrow.enabled=true
[inline image omitted]
yarn log:

18/12/15 14:35:52 INFO CodeGenerator: Code generated in 1030.698785 ms
18/12/15 14:35:54 INFO PythonRunner: Times: total = 1985, boot = 1892, init = 
92, finish = 1
18/12/15 14:35:54 INFO Executor: Finished task 0.0 in stage 0.0 (TID 0). 1799 
bytes result sent to driver
18/12/15 14:35:55 INFO CoarseGrainedExecutorBackend: Got assigned task 1
18/12/15 14:35:55 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
18/12/15 14:35:55 INFO TorrentBroadcast: Started reading broadcast variable 1
18/12/15 14:35:55 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in 
memory (estimated size 8.3 KB, free 1048.8 MB)
18/12/15 14:35:55 INFO TorrentBroadcast: Reading broadcast variable 1 took 18 ms
18/12/15 14:35:55 INFO MemoryStore: Block broadcast_1 stored as values in 
memory (estimated size 14.0 KB, free 1048.8 MB)
18/12/15 14:35:55 INFO CodeGenerator: Code generated in 30.269745 ms
18/12/15 14:35:55 INFO PythonRunner: Times: total = 13, boot = 5, init = 7, 
finish = 1
18/12/15 14:35:55 INFO Executor: Finished task 0.0 in stage 1.0 (TID 1). 1893 
bytes result sent to driver
18/12/15 14:35:55 INFO CoarseGrainedExecutorBackend: Got assigned task 2
18/12/15 14:35:55 INFO Executor: Running task 1.0 in stage 1.0 (TID 2)
18/12/15 14:35:55 ERROR Executor: Exception in task 1.0 in stage 1.0 (TID 2)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/install/pyspark/2.4.0/pyspark.zip/pyspark/worker.py", line 377, in 
main
process()
  File "/usr/install/pyspark/2.4.0/pyspark.zip/pyspark/worker.py", line 372, in 
process
serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/install/pyspark/2.4.0/pyspark.zip/pyspark/serializers.py", line 
390, in dump_stream
vs = list(itertools.islice(iterator, batch))
  File "/usr/install/pyspark/2.4.0/pyspark.zip/pyspark/util.py", line 99, in 
wrapper
return f(*args, **kwargs)
  File 
"/yarn/nm/usercache/admin/appcache/application_1544579748138_0215/container_e43_1544579748138_0215_01_01/python1.py",
 line 435, in mapfunc
ValueError: could not convert string to float: 'a'

at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
at 
org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
at 
org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:439)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.hasNext(ArrowConverters.scala:99)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.foreach(ArrowConverters.scala:97)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at 
scala.collection.TraversableOnce$class.to(TraversableOnce.scala:310)
at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.to(ArrowConverters.scala:97)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:302)
at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.toBuffer(ArrowConverters.scala:97)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:289)
at 
org.apache.spark.sql.execution.arrow.ArrowConverters$$anon$1.toArray(ArrowConverters.scala:97)
at 
org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17$$anonfun$apply$18.apply(Dataset.scala:3314)
at 
org.apache.spark.sql.Dataset$$anonfun$collectAsArrowToPython$1$$anonfun$apply$17$$anonfun$apply$18.apply(Dataset.scala:3314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)

Re: SparkR issue

2018-10-14 Thread Felix Cheung
1. It seems like it's spending a lot of time in R (slicing the data, I guess?) and
not in Spark.
2. Could you write it into a CSV file locally and then read it from Spark?



From: ayan guha 
Sent: Monday, October 8, 2018 11:21 PM
To: user
Subject: SparkR issue

Hi

We are seeing some weird behaviour in Spark R.

We created an R data frame with 600K records and 29 columns. Then we tried to
convert the R DF to a Spark DF using

df <- SparkR::createDataFrame(rdf)

from RStudio. It hung; we had to kill the process after 1-2 hours.

We also tried following:
df <- SparkR::createDataFrame(rdf, numPartition=4000)
df <- SparkR::createDataFrame(rdf, numPartition=300)
df <- SparkR::createDataFrame(rdf, numPartition=10)

Same result. In both scenarios RStudio appears to be working, but there is no trace of
jobs in the Spark Application Master view.

Finally, we used this:

df <- SparkR::createDataFrame(rdf, schema=schema) , schema is a StructType.

This took 25 mins to create the Spark DF. However, the job did show up in the
Application Master view, and it shows only 20-30 secs. So where did the rest of
the time go?

Question:
1. Is this expected behavior? (I hope not.) How should we speed up this bit?
2. We understand better options would be to read the data from external sources,
but we need this data to be generated for simulation purposes. What is possibly
going wrong?


Best
Ayan



--
Best Regards,
Ayan Guha


Re: can Spark 2.4 work on JDK 11?

2018-09-29 Thread Felix Cheung
Not officially. We have seen problems with JDK 10 as well. It would be great if
you or someone else would like to contribute to get it to work.



From: kant kodali 
Sent: Tuesday, September 25, 2018 2:31 PM
To: user @spark
Subject: can Spark 2.4 work on JDK 11?

Hi All,

Can Spark 2.4 work on JDK 11? I feel like there are a lot of features added in
JDK 9, 10, and 11 that can make the deployment process a whole lot better, and of
course some more syntactic sugar similar to Scala.

Thanks!


Re: spark.lapply

2018-09-26 Thread Felix Cheung
It looks like the native R process is terminated by a buffer overflow. Do you
know how much data is involved?



From: Junior Alvarez 
Sent: Wednesday, September 26, 2018 7:33 AM
To: user@spark.apache.org
Subject: spark.lapply

Hi!

I'm using spark.lapply() in SparkR on a Mesos service and I get the following crash
randomly (the spark.lapply() function is called around 150 times; sometimes it
crashes after 16 calls, other times after 25 calls, and so on. It is completely
random, even though the data used in the actual call is always the same across the
150 calls):

…

18/09/26 07:30:42 INFO TaskSetManager: Finished task 129.0 in stage 78.0 (TID 
1192) in 98 ms on 10.255.0.18 (executor 0) (121/143)

18/09/26 07:30:42 WARN TaskSetManager: Lost task 128.0 in stage 78.0 (TID 1191, 
10.255.0.18, executor 0): org.apache.spark.SparkException: R computation failed 
with

 7f327f4dd000-7f327f50 r-xp  08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f51c000-7f327f6f2000 rw-p  00:00 0

7f327f6fc000-7f327f6fd000 rw-p  00:00 0

7f327f6fd000-7f327f6ff000 rw-p  00:00 0

7f327f6ff000-7f327f70 r--p 00022000 08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f70-7f327f701000 rw-p 00023000 08:11 174916727  
/lib/x86_64-linux-gnu/ld-2.19.so

7f327f701000-7f327f702000 rw-p  00:00 0

7fff6070f000-7fff60767000 rw-p  00:00 0  [stack]

7fff6077f000-7fff60781000 r-xp  00:00 0  [vdso]

ff60-ff601000 r-xp  00:00 0  
[vsyscall]

*** buffer overflow detected ***: /usr/local/lib/R/bin/exec/R terminated

=== Backtrace: =

/lib/x86_64-linux-gnu/libc.so.6(+0x7329f)[0x7f327db9529f]

/lib/x86_64-linux-gnu/libc.so.6(__fortify_fail+0x5c)[0x7f327dc3087c]

/lib/x86_64-linux-gnu/libc.so.6(+0x10d750)[0x7f327dc2f750]

…

If I of course use the native R lapply() everything works fine.

I wonder if this is a known issue, and/or if there is a way to avoid it when
using SparkR.

B r
/Junior



Re: Should python-2 be supported in Spark 3.0?

2018-09-16 Thread Felix Cheung
I don’t think we should remove any API even in a major release without 
deprecating it first...



From: Mark Hamstra 
Sent: Sunday, September 16, 2018 12:26 PM
To: Erik Erlandson
Cc: user@spark.apache.org; dev
Subject: Re: Should python-2 be supported in Spark 3.0?

We could also deprecate Py2 already in the 2.4.0 release.

On Sat, Sep 15, 2018 at 11:46 AM Erik Erlandson 
mailto:eerla...@redhat.com>> wrote:
In case this didn't make it onto this thread:

There is a 3rd option, which is to deprecate Py2 for Spark-3.0, and remove it 
entirely on a later 3.x release.

On Sat, Sep 15, 2018 at 11:09 AM, Erik Erlandson 
mailto:eerla...@redhat.com>> wrote:
On a separate dev@spark thread, I raised a question of whether or not to 
support python 2 in Apache Spark, going forward into Spark 3.0.

Python-2 is going EOL at the end 
of 2019. The upcoming release of Spark 3.0 is an opportunity to make breaking 
changes to Spark's APIs, and so it is a good time to consider support for 
Python-2 on PySpark.

Key advantages to dropping Python 2 are:

  *   Support for PySpark becomes significantly easier.
  *   Avoid having to support Python 2 until Spark 4.0, which is likely to 
imply supporting Python 2 for some time after it goes EOL.

(Note that supporting python 2 after EOL means, among other things, that 
PySpark would be supporting a version of python that was no longer receiving 
security patches)

The main disadvantage is that PySpark users who have legacy python-2 code would 
have to migrate their code to python 3 to take advantage of Spark 3.0

This decision obviously has large implications for the Apache Spark community 
and we want to solicit community feedback.




Re: Spark 2.3.1 not working on Java 10

2018-06-21 Thread Felix Cheung
I'm not sure we have completed support for Java 10


From: Rahul Agrawal 
Sent: Thursday, June 21, 2018 7:22:42 AM
To: user@spark.apache.org
Subject: Spark 2.3.1 not working on Java 10

Dear Team,


I have installed Java 10, Scala 2.12.6 and Spark 2.3.1 on my desktop running
Ubuntu 16.04. I am getting an error opening spark-shell:

Failed to initialize compiler: object java.lang.Object in compiler mirror not 
found.

Please let me know if there is any way to run spark in Java 10.

Thanks,
Rahul


Re: all calculations finished, but "VCores Used" value remains at its max

2018-05-01 Thread Felix Cheung
Zeppelin keeps the Spark job alive. This is likely a better question for the 
Zeppelin project.


From: Valery Khamenya 
Sent: Tuesday, May 1, 2018 4:30:24 AM
To: user@spark.apache.org
Subject: all calculations finished, but "VCores Used" value remains at its max

Hi all

I am experiencing a strange thing: when Spark 2.3.0 calculations started from
Zeppelin 0.7.3 are finished, the "VCores Used" value in the resource manager stays
at its maximum, although nothing should be running anymore. How come?

if relevant, I experience this issue since AWS EMR 5.13.0

best regards
--
Valery


best regards
--
Valery A.Khamenya


Re: Problem running Kubernetes example v2.2.0-kubernetes-0.5.0

2018-04-22 Thread Felix Cheung
You might want to check with the spark-on-k8s project.
Or try using the Kubernetes support in the official Spark 2.3.0 release. (We don't
have an official Docker image yet, but you can build one with the provided script.)


From: Rico Bergmann 
Sent: Wednesday, April 11, 2018 11:02:38 PM
To: user@spark.apache.org
Subject: Problem running Kubernetes example v2.2.0-kubernetes-0.5.0

Hi!

I was trying to get the SparkPi example running using the spark-on-k8s
distro from kubespark. But I get the following error:
+ /sbin/tini -s -- driver
[FATAL tini (11)] exec driver failed: No such file or directory

Did anyone get the example running on a Kubernetes cluster?

Best,
Rico.

invoked cmd:
bin/spark-submit \
  --deploy-mode cluster \
  --class org.apache.spark.examples.SparkPi \
  --master k8s://https://cluster:port \
  --conf spark.executor.instances=2 \
  --conf spark.app.name=spark-pi \
  --conf
spark.kubernetes.container.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
\
  --conf
spark.kubernetes.driver.docker.image=kubespark/spark-driver:v2.2.0-kubernetes-0.5.0
\
  --conf
spark.kubernetes.executor.docker.image=kubespark/spark-executor:v2.2.0-kubernetes-0.5.0
\

local:///opt/spark/examples/jars/spark-examples_2.11-v2.2.0-kubernetes-0.5.0.jar

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: [Structured Streaming Query] Calculate Running Avg from Kafka feed using SQL query

2018-04-06 Thread Felix Cheung
Instead of writing to the console, you need to write to the memory sink for it to be queryable:

  .format("memory")
  .queryName("tableName")
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#output-sinks
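
For example, a minimal PySpark sketch (reusing the `aggregate_func` streaming DataFrame from the code quoted further down in this thread; the query name is arbitrary):

query = (aggregate_func.writeStream
         .format("memory")
         .queryName("avg_table")
         .outputMode("complete")
         .start())

# The running aggregate is now queryable as an in-memory table.
spark.sql("select * from avg_table").show()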


From: Aakash Basu 
Sent: Friday, April 6, 2018 3:22:07 AM
To: user
Subject: Fwd: [Structured Streaming Query] Calculate Running Avg from Kafka 
feed using SQL query

Any help?

Need urgent help. Someone please clarify the doubt?


-- Forwarded message --
From: Aakash Basu 
>
Date: Mon, Apr 2, 2018 at 1:01 PM
Subject: [Structured Streaming Query] Calculate Running Avg from Kafka feed 
using SQL query
To: user >, "Bowden, Chris" 
>


Hi,

This is a very interesting requirement, where I am getting stuck at a few 
places.

Requirement -

Col1  Col2
1     10
2     11
3     12
4     13
5     14

I have to calculate avg of col1 and then divide each row of col2 by that avg. 
And, the Avg should be updated with every new data being fed through Kafka into 
Spark Streaming.

Avg(Col1) = Running Avg
Col2 = Col2/Avg(Col1)


Queries -


1) I am currently trying to simply run an inner query inside a query and print the
Avg with the other Col value, and then later do the calculation. But I am getting an error.

Query -

select t.Col2 , (Select AVG(Col1) as Avg from transformed_Stream_DF) as myAvg 
from transformed_Stream_DF t

Error -

pyspark.sql.utils.StreamingQueryException: u'Queries with streaming sources 
must be executed with writeStream.start();

Even though I already have writeStream.start() in my code, it is probably
throwing the error because of the inner select query (I think Spark is treating it
as another query altogether, which requires its own writeStream.start()). Any
help?


2) How should I go about it? I have another approach in mind, i.e., querying the table to
get the avg and storing it in a variable, then in the second query simply passing the
variable and dividing the second column to produce the appropriate result. But is that
the right approach?

3) Final question: how do I do the calculation over the entire data and not just the
latest? Do I need to keep appending somewhere and repeatedly use it? My average and
all the rows of Col2 should change with every new piece of incoming data.


Code -


from pyspark.sql import SparkSession
import time
from pyspark.sql.functions import split, col

class test:

    spark = SparkSession.builder \
        .appName("Stream_Col_Oper_Spark") \
        .getOrCreate()

    data = spark.readStream.format("kafka") \
        .option("startingOffsets", "latest") \
        .option("kafka.bootstrap.servers", "localhost:9092") \
        .option("subscribe", "test1") \
        .load()

    ID = data.select('value') \
        .withColumn('value', data.value.cast("string")) \
        .withColumn("Col1", split(col("value"), ",").getItem(0)) \
        .withColumn("Col2", split(col("value"), ",").getItem(1)) \
        .drop('value')

    ID.createOrReplaceTempView("transformed_Stream_DF")
    aggregate_func = spark.sql(
        "select t.Col2 , (Select AVG(Col1) as Avg from transformed_Stream_DF) as myAvg from transformed_Stream_DF t")  #  (Col2/(AVG(Col1)) as Col3)")

    # ---For Console Print---
    query = aggregate_func \
        .writeStream \
        .format("console") \
        .start()
    # .outputMode("complete") \
    # ---Console Print ends---

    query.awaitTermination()
    # /home/kafka/Downloads/spark-2.3.0-bin-hadoop2.7/bin/spark-submit
    #   --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.0
    #   /home/aakashbasu/PycharmProjects/AllMyRnD/Kafka_Spark/Stream_Col_Oper_Spark.py



Thanks,
Aakash.



Re: [Spark R]: Linear Mixed-Effects Models in Spark R

2018-03-26 Thread Felix Cheung
If your data can be split into groups and you can call into your favorite R 
package on each group of data (in parallel):

https://spark.apache.org/docs/latest/sparkr.html#run-a-given-function-on-a-large-dataset-grouping-by-input-columns-and-using-gapply-or-gapplycollect



From: Nisha Muktewar 
Sent: Monday, March 26, 2018 2:27:52 PM
To: Josh Goldsborough
Cc: user
Subject: Re: [Spark R]: Linear Mixed-Effects Models in Spark R

Look at LinkedIn's Photon ML package: https://github.com/linkedin/photon-ml

One of the caveats is/was that the input data has to be in Avro in a specific 
format.

On Mon, Mar 26, 2018 at 1:46 PM, Josh Goldsborough 
> wrote:
The company I work for is trying to do some mixed-effects regression modeling 
in our new big data platform including SparkR.

We can run via SparkR's support of native R & use lme4.  But it runs single 
threaded.  So we're looking for tricks/techniques to process large data sets.


This was asked a couple years ago:
https://stackoverflow.com/questions/39790820/mixed-effects-models-in-spark-or-other-technology

But I wanted to ask again, in case anyone had an answer now.

Thanks,
Josh Goldsborough



Re: Custom metrics sink

2018-03-16 Thread Felix Cheung
There is a proposal to expose them. See SPARK-14151


From: Christopher Piggott 
Sent: Friday, March 16, 2018 1:09:38 PM
To: user@spark.apache.org
Subject: Custom metrics sink

Just for fun, I want to make a stupid program that makes different frequency
chimes as each worker becomes active.  That way you can 'hear' what the cluster 
is doing and how it's distributing work.

I thought to do this I would make a custom Sink, but the Sink and everything 
else in org.apache.spark.metrics.sink is private to spark.  What I was hoping 
to do was to just pick up the # of active workers in semi real time (once a 
second?) and have them send a UDP message somewhere... then each worker would 
be assigned to a different frequency chime.  It's just a toy, for fun.

How do you add a custom Sink when these classes don't seem to be exposed?

--C



Re: How to start practicing Python Spark Streaming in Linux?

2018-03-14 Thread Felix Cheung
It’s best to start with Structured Streaming

https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#tab_python_0

https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html#tab_python_0
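
To get hands-on quickly, here is a minimal PySpark sketch (broker address and topic name are placeholders; it assumes the spark-sql-kafka package is on the classpath, e.g. via --packages) that reads a Kafka topic as a streaming DataFrame and prints each micro-batch:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-quickstart").getOrCreate()

stream = (spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "localhost:9092")
          .option("subscribe", "test1")
          .load()
          .selectExpr("CAST(value AS STRING) AS value"))

# Print each micro-batch to the console as it arrives.
query = stream.writeStream.format("console").start()
query.awaitTermination()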

_
From: Aakash Basu 
Sent: Wednesday, March 14, 2018 1:09 AM
Subject: How to start practicing Python Spark Streaming in Linux?
To: user 


Hi all,

Any guide on how to kick-start learning PySpark Streaming on an Ubuntu standalone
system? Step-wise, practical hands-on material would be great.

Also, connecting Kafka with Spark and getting real time data and processing it 
in micro-batches...

Any help?

Thanks,
Aakash.




Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
For PySpark specifically, IMO it should be very high on the list to port back...

As for roadmap - should be sharing more soon.


From: lucas.g...@gmail.com <lucas.g...@gmail.com>
Sent: Friday, March 2, 2018 9:41:46 PM
To: user@spark.apache.org
Cc: Felix Cheung
Subject: Re: Question on Spark-kubernetes integration

Oh interesting, given that pyspark was working in spark on kub 2.2 I assumed it 
would be part of what got merged.

Is there a roadmap in terms of when that may get merged up?

Thanks!



On 2 March 2018 at 21:32, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
That’s in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and that fork has
dynamic scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh 
<jayesh.lalw...@capitalone.com<mailto:jayesh.lalw...@capitalone.com>>
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?






Re: Question on Spark-kubernetes integration

2018-03-02 Thread Felix Cheung
That's in the plan. We should be sharing a bit more about the roadmap in future 
releases shortly.

In the mean time this is in the official documentation on what is coming:
https://spark.apache.org/docs/latest/running-on-kubernetes.html#future-work

This support started as a fork of the Apache Spark project, and that fork has
dynamic scaling support you can check out here:
https://apache-spark-on-k8s.github.io/userdocs/running-on-kubernetes.html#dynamic-executor-scaling



From: Lalwani, Jayesh 
Sent: Friday, March 2, 2018 8:08:55 AM
To: user@spark.apache.org
Subject: Question on Spark-kubernetes integration

Does the Resource scheduler support dynamic resource allocation? Are there any 
plans to add in the future?





Re: Spark on K8s - using files fetched by init-container?

2018-02-27 Thread Felix Cheung
Yes you were pointing to HDFS on a loopback address...


From: Jenna Hoole 
Sent: Monday, February 26, 2018 1:11:35 PM
To: Yinan Li; user@spark.apache.org
Subject: Re: Spark on K8s - using files fetched by init-container?

Oh, duh. I completely forgot that file:// is a prefix I can use. Up and running 
now :)

Thank you so much!
Jenna

On Mon, Feb 26, 2018 at 1:00 PM, Yinan Li 
> wrote:
OK, it looks like you will need to use 
`file:///var/spark-data/spark-files/flights.csv` instead. The 'file://' scheme 
must be explicitly used as it seems it defaults to 'hdfs' in your setup.
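
The same point as a minimal sketch (shown in PySpark for brevity; the path is the one from this thread): without the explicit scheme, a bare path is resolved against the default filesystem, which is HDFS in this setup.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# The file:// prefix forces the local filesystem instead of the default HDFS namenode.
df = spark.read.csv("file:///var/spark-data/spark-files/flights.csv", header=True)
df.show(5)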

On Mon, Feb 26, 2018 at 12:57 PM, Jenna Hoole 
> wrote:
Thank you for the quick response! However, I'm still having problems.

When I try to look for /var/spark-data/spark-files/flights.csv I get told:

Error: Error in loadDF : analysis error - Path does not exist: 
hdfs://192.168.0.1:8020/var/spark-data/spark-files/flights.csv;

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

And when I try to look for local:///var/spark-data/spark-files/flights.csv, I 
get:

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 'local:///var/spark-data/spark-files/flights.csv': No such 
file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

I can see from a kubectl describe that the directory is getting mounted.

Mounts:

  /etc/hadoop/conf from hadoop-properties (rw)

  
/var/run/secrets/kubernetes.io/serviceaccount
 from spark-token-pxz79 (ro)

  /var/spark-data/spark-files from download-files (rw)

  /var/spark-data/spark-jars from download-jars-volume (rw)

  /var/spark/tmp from spark-local-dir-0-tmp (rw)

Is there something else I need to be doing in my set up?

Thanks,
Jenna

On Mon, Feb 26, 2018 at 12:02 PM, Yinan Li 
> wrote:
The files specified through --files are localized by the init-container to 
/var/spark-data/spark-files by default. So in your case, the file should be 
located at /var/spark-data/spark-files/flights.csv locally in the container.

On Mon, Feb 26, 2018 at 10:51 AM, Jenna Hoole 
> wrote:
This is probably stupid user error, but I can't for the life of me figure out 
how to access the files that are staged by the init-container.

I'm trying to run the SparkR example data-manipulation.R which requires the 
path to its datafile. I supply the hdfs location via --files and then the full 
hdfs path.


--files 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv
 local:///opt/spark/examples/src/main/r/data-manipulation.R 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv

The init-container seems to load my file.

18/02/26 18:29:09 INFO spark.SparkContext: Added file 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv
 at 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv
 with timestamp 1519669749519

18/02/26 18:29:09 INFO util.Utils: Fetching 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv
 to 
/var/spark/tmp/spark-d943dae6-9b95-4df0-87a3-9f7978d6d4d2/userFiles-4112b7aa-b9e7-47a9-bcbc-7f7a01f93e38/fetchFileTemp7872615076522023165.tmp

However, I get an error that my file does not exist.

Error in file(file, "rt") : cannot open the connection

Calls: read.csv -> read.table -> file

In addition: Warning message:

In file(file, "rt") :

  cannot open file 
'hdfs://192.168.0.1:8020/user/jhoole/flights.csv':
 No such file or directory

Execution halted

Exception in thread "main" org.apache.spark.SparkUserAppException: User 
application exited with 1

at org.apache.spark.deploy.RRunner$.main(RRunner.scala:104)

at org.apache.spark.deploy.RRunner.main(RRunner.scala)

If I try supplying just flights.csv, I get a different error

--files 
hdfs://192.168.0.1:8020/user/jhoole/flights.csv
 local:///opt/spark/examples/src/main/r/data-manipulation.R flights.csv


Error: Error in loadDF : analysis error - Path does not exist: 

Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

2018-02-20 Thread Felix Cheung
No, it does not support bidirectional edges as of now.
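
A minimal PySpark sketch of the usual workaround (toy vertex/edge data; assumes the graphframes package is on the classpath): union each edge with its reverse so both directions exist before building the graph.

from pyspark.sql import SparkSession
from graphframes import GraphFrame

spark = SparkSession.builder.getOrCreate()

vertices = spark.createDataFrame([("a",), ("b",), ("c",)], ["id"])
edges = spark.createDataFrame([("a", "b"), ("b", "c")], ["src", "dst"])

# Add the reversed copy of every edge, then deduplicate.
sym_edges = edges.union(edges.selectExpr("dst AS src", "src AS dst")).distinct()

g = GraphFrame(vertices, sym_edges)
g.edges.show()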

_
From: xiaobo <guxiaobo1...@qq.com>
Sent: Tuesday, February 20, 2018 4:35 AM
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships
To: Felix Cheung <felixcheun...@hotmail.com>, <user@spark.apache.org>


So the question becomes: does GraphFrames support bidirectional relationships
natively with only one edge?



-- Original ------
From: Felix Cheung <felixcheun...@hotmail.com>
Date: Tue,Feb 20,2018 10:01 AM
To: xiaobo <guxiaobo1...@qq.com>, user@spark.apache.org <user@spark.apache.org>
Subject: Re: [graphframes]how Graphframes Deal With BidirectionalRelationships

Generally that would be the approach.
But since you effectively double the number of edges, this will likely
affect the scale at which your job will run.


From: xiaobo <guxiaobo1...@qq.com>
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges
for each vertex pair. My question is: do the algorithms of GraphFrames still
work when we do this?

Thanks





Re: [graphframes]how Graphframes Deal With Bidirectional Relationships

2018-02-19 Thread Felix Cheung
Generally that would be the approach.
But since you effectively double the number of edges, this will likely
affect the scale at which your job will run.


From: xiaobo 
Sent: Monday, February 19, 2018 3:22:02 AM
To: user@spark.apache.org
Subject: [graphframes]how Graphframes Deal With Bidirectional Relationships

Hi,
To represent a bidirectional relationship, one solution is to insert two edges
for each vertex pair. My question is: do the algorithms of GraphFrames still
work when we do this?

Thanks



Re: Does Pyspark Support Graphx?

2018-02-18 Thread Felix Cheung
Hi - I’m maintaining it. As of now there is an issue with 2.2 that breaks 
personalized page rank, and that’s largely the reason there isn’t a release for 
2.2 support.

There are attempts to address this issue - if you are interested we would love 
for your help.


From: Nicolas Paris 
Sent: Sunday, February 18, 2018 12:31:27 AM
To: Denny Lee
Cc: xiaobo; user@spark.apache.org
Subject: Re: Does Pyspark Support Graphx?

> Most likely not as most of the effort is currently on GraphFrames - a great
> blog post on what GraphFrames offers can be found at: https://

Is the graphframes package still active? The GitHub repository
indicates it is not extremely active. Right now, there is no available
package for Spark 2.2, so one needs to compile it from source.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: SparkR test script issue: unable to run run-tests.h on spark 2.2

2018-02-14 Thread Felix Cheung
Yes, it is an issue with the newer release of testthat.

To work around it, could you install an earlier version with devtools? We will follow
up with a fix.

_
From: Hyukjin Kwon 
Sent: Wednesday, February 14, 2018 6:49 PM
Subject: Re: SparkR test script issue: unable to run run-tests.h on spark 2.2
To: chandan prakash 
Cc: user @spark 


From a very quick look, I think it is a testthat version issue with SparkR.

I had to fix that version to 1.x before in AppVeyor. There are few details in 
https://github.com/apache/spark/pull/20003

Can you check and lower testthat version?


On 14 Feb 2018 6:09 pm, "chandan prakash" 
> wrote:
Hi All,
I am trying to run the R test script under ./R/run-tests.sh but hit the same
ERROR every time.
I tried running on a Mac as well as a CentOS machine; the same issue comes up.
I am using Spark 2.2 (branch-2.2).
I followed the Apache doc and these steps:
1. installed R
2. installed packages like testthat as mentioned in the doc
3. ran run-tests.sh


Every time I am getting this error line:

Error in get(name, envir = asNamespace(pkg), inherits = FALSE) :
  object 'run_tests' not found
Calls: ::: -> get
Execution halted


Any Help?

--
Chandan Prakash





Re: py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet

2018-01-10 Thread Felix Cheung
java.nio.BufferUnderflowException

Can you try reading the same data in Scala?



From: Liana Napalkova 
Sent: Wednesday, January 10, 2018 12:04:00 PM
To: Timur Shenkao
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling 
o794.parquet

The DataFrame is not empty.
Indeed, it has nothing to do with serialization. I think that the issue is 
related to this bug: https://issues.apache.org/jira/browse/SPARK-22769
In my question I have not posted the whole error stack trace, but one of the 
error messages says `Could not find CoarseGrainedScheduler`. So, it's probably 
something related to the resources.


From: Timur Shenkao 
Sent: 10 January 2018 20:07:37
To: Liana Napalkova
Cc: user@spark.apache.org
Subject: Re: py4j.protocol.Py4JJavaError: An error occurred while calling 
o794.parquet


Caused by: org.apache.spark.SparkException: Task not serializable


That's the answer :)

What are you trying to save? Is it empty or None / null?

On Wed, Jan 10, 2018 at 4:58 PM, Liana Napalkova 
> wrote:

Hello,

Has anybody faced the following problem in PySpark? (Python 2.7.12):

df.show() # works fine and shows the first 5 rows of DataFrame

df.write.parquet(outputPath + '/data.parquet', mode="overwrite")  # throws 
the error

The last line throws the following error:


py4j.protocol.Py4JJavaError: An error occurred while calling o794.parquet.
: org.apache.spark.SparkException: Job aborted.
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply$mcV$sp(FileFormatWriter.scala:215)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$$anonfun$write$1.apply(FileFormatWriter.scala:173)
at 
org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
at 
org.apache.spark.sql.execution.datasources.FileFormatWriter$.write(FileFormatWriter.scala:173)

Caused by: org.apache.spark.SparkException: Exception thrown in awaitResult:
at org.apache.spark.util.ThreadUtils$.awaitResult(ThreadUtils.scala:205)
at 
org.apache.spark.sql.execution.exchange.BroadcastExchangeExec.doExecuteBroadcast(BroadcastExchangeExec.scala:123)
at 
org.apache.spark.sql.execution.InputAdapter.doExecuteBroadcast(WholeStageCodegenExec.scala:248)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeBroadcast$1.apply(SparkPlan.scala:127)
at 
org.apache.spark.sql.execution.SparkPlan$$anonfun$executeQuery$1.apply(SparkPlan.scala:138)

Caused by: org.apache.spark.SparkException: Task not serializable
at 
org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at 
org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2287)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:794)
at 
org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:793)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at 
org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)

Caused by: java.lang.IllegalArgumentException
at java.nio.Buffer.position(Buffer.java:244)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:153)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)

Caused by: java.nio.BufferUnderflowException

at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:151)
at java.nio.ByteBuffer.get(ByteBuffer.java:715)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytes(Binary.java:405)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.getBytesUnsafe(Binary.java:414)
at 
org.apache.parquet.io.api.Binary$ByteBufferBackedBinary.writeObject(Binary.java:484)
at sun.reflect.GeneratedMethodAccessor48.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)

Thanks.

L.



Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

2018-01-08 Thread Felix Cheung
And Hadoop-3.x is not part of the release and sign off for 2.2.1.

Maybe we could update the website to avoid any confusion with "later".


From: Josh Rosen 
Sent: Monday, January 8, 2018 10:17:14 AM
To: akshay naidu
Cc: Saisai Shao; Raj Adyanthaya; spark users
Subject: Re: Is Apache Spark-2.2.1 compatible with Hadoop-3.0.0

My current best guess is that Spark does not fully support Hadoop 3.x because 
https://issues.apache.org/jira/browse/SPARK-18673 (updates to Hive shims for 
Hadoop 3.x) has not been resolved. There are also likely to be transitive 
dependency conflicts which will need to be resolved.

On Mon, Jan 8, 2018 at 8:52 AM akshay naidu 
> wrote:
Yes, the Spark download page does mention that 2.2.1 is for 'Hadoop 2.7 and
later', but my confusion is because Spark was released on 1st Dec and the Hadoop 3
stable version was released on 13th Dec. And in reply to my similar question on
stackoverflow.com, Mr. Jacek Laskowski said that Spark 2.2.1 doesn't support
Hadoop 3, so I am just looking for more clarity on this doubt before moving on
to upgrades.

Thanks all for help.

Akshay.

On Mon, Jan 8, 2018 at 8:47 AM, Saisai Shao 
> wrote:
AFAIK, there's no large scale test for Hadoop 3.0 in the community. So it is 
not clear whether it is supported or not (or has some issues). I think in the 
download page "Pre-Built for Apache Hadoop 2.7 and later" mostly means that it 
supports Hadoop 2.7+ (2.8...), but not 3.0 (IIUC).

Thanks
Jerry

2018-01-08 4:50 GMT+08:00 Raj Adyanthaya 
>:
Hi Akshay

On the Spark Download page when you select Spark 2.2.1 it gives you an option 
to select package type. In that, there is an option to select  "Pre-Built for 
Apache Hadoop 2.7 and later". I am assuming it means that it does support 
Hadoop 3.0.

http://spark.apache.org/downloads.html

Thanks,
Raj A.

On Sat, Jan 6, 2018 at 8:23 PM, akshay naidu 
> wrote:
hello Users,
I need to know whether we can run latest spark on  latest hadoop version i.e., 
spark-2.2.1 released on 1st dec and hadoop-3.0.0 released on 13th dec.
thanks.





Re: Passing an array of more than 22 elements in a UDF

2017-12-26 Thread Felix Cheung
Generally, the 22-element limitation comes from Scala 2.10.

In Scala 2.11 the issue with case classes is fixed, but with that said, I'm not
sure whether other limitations might apply to UDFs in Java.
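
The asker is on Java, but a minimal PySpark sketch of the same idea (the column values below are arbitrary) is to pack the many inputs into a single array column and pass that one column to the UDF:

from pyspark.sql import SparkSession
from pyspark.sql.functions import array, lit, udf
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.getOrCreate()

# One UDF argument: a single array column carrying all 30 "parameters".
sum_all = udf(lambda xs: sum(xs), IntegerType())

df = spark.range(1)
params = [lit(i) for i in range(30)]
df.select(sum_all(array(*params)).alias("total")).show()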

_
From: Aakash Basu <aakash.spark@gmail.com>
Sent: Monday, December 25, 2017 9:13 PM
Subject: Re: Passing an array of more than 22 elements in a UDF
To: Felix Cheung <felixcheun...@hotmail.com>
Cc: ayan guha <guha.a...@gmail.com>, user <user@spark.apache.org>


What's the advantage of using that specific version for this? Please shed some
light on it.

On Mon, Dec 25, 2017 at 6:51 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Or use it with Scala 2.11?


From: ayan guha <guha.a...@gmail.com<mailto:guha.a...@gmail.com>>
Sent: Friday, December 22, 2017 3:15:14 AM
To: Aakash Basu
Cc: user
Subject: Re: Passing an array of more than 22 elements in a UDF

Hi, I think you are on the right track. You can stuff all your params in a suitable
data structure like an array or dict and pass this structure as a single param to
your UDF.

On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu 
<aakash.spark@gmail.com<mailto:aakash.spark@gmail.com>> wrote:
Hi,

I am using Spark 2.2 with Java; can anyone please suggest how to take more
than 22 parameters in a UDF? I mean, what if I want to pass all the parameters as
an array of integers?

Thanks,
Aakash.
--
Best Regards,
Ayan Guha





Re: Spark 2.2.1 worker invocation

2017-12-26 Thread Felix Cheung
I think you are looking for spark.executor.extraJavaOptions

https://spark.apache.org/docs/latest/configuration.html#runtime-environment
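
A minimal sketch (shown in PySpark; the library path is the one from this thread) that passes the option through Spark configuration rather than spark-env.sh. The driver-side equivalent, spark.driver.extraJavaOptions, usually has to go on spark-submit via --conf, since the driver JVM is already running by the time application code executes.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.extraJavaOptions",
                 "-Djava.library.path=/usr/local/lib")
         .getOrCreate())

# Confirm the setting reached the Spark configuration.
print(spark.conf.get("spark.executor.extraJavaOptions"))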


From: Christopher Piggott 
Sent: Tuesday, December 26, 2017 8:00:56 AM
To: user@spark.apache.org
Subject: Spark 2.2.1 worker invocation

I need to set java.library.path to get access to some native code.  Following 
directions, I made a spark-env.sh:

#!/usr/bin/env bash
export 
LD_LIBRARY_PATH="/usr/local/lib/libcdfNativeLibrary.so:/usr/local/lib/libcdf.so:${LD_LIBRARY_PATH}"
export SPARK_WORKER_OPTS=-Djava.library.path=/usr/local/lib
export SPARK_WORKER_MEMORY=2g

to no avail.  (I tried both with and without exporting the environment).  
Looking at how the worker actually starts up:

 /usr/lib/jvm/default/bin/java -cp /home/spark/conf/:/home/spark/jars/* 
-Xmx1024M -Dspark.driver.port=37219 
org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url 
spark://CoarseGrainedScheduler@10.1.1.1:37219
 --executor-id 113 --hostname 10.2.2.1 --cores 8 --app-id 
app-20171225145607-0003 --worker-url 
spark://Worker@10.2.2.1:35449



It doesn't seem to take any options.  I put an 'echo' in just to confirm that 
spark-env.sh is getting invoked (and it is).

So, just out of curiosity, I tried to troubleshoot this:



spark@node2-1:~$ grep -R SPARK_WORKER_OPTS *
conf/spark-env.sh:export 
SPARK_WORKER_OPTS=-Djava.library.path=/usr/local/lib
conf/spark-env.sh.template:# - SPARK_WORKER_OPTS, to set config properties 
only for the worker (e.g. "-Dx=y")


The variable doesn't seem to get referenced anywhere in the spark distribution. 
 I checked a number of other options in spark-env.sh.template and they didn't 
seem to be referenced either.  I expected to find them in various startup 
scripts.

I can probably "fix" my problem by hacking the lower-level startup scripts, but 
first I'd like to inquire about what's going on here.  How and where are these 
variables actually used?




Re: Passing an array of more than 22 elements in a UDF

2017-12-24 Thread Felix Cheung
Or use it with Scala 2.11?


From: ayan guha 
Sent: Friday, December 22, 2017 3:15:14 AM
To: Aakash Basu
Cc: user
Subject: Re: Passing an array of more than 22 elements in a UDF

Hi, I think you are on the right track. You can stuff all your params in a suitable
data structure like an array or dict and pass this structure as a single param to
your UDF.

On Fri, 22 Dec 2017 at 2:55 pm, Aakash Basu 
> wrote:
Hi,

I am using Spark 2.2 with Java; can anyone please suggest how to take more
than 22 parameters in a UDF? I mean, what if I want to pass all the parameters as
an array of integers?

Thanks,
Aakash.
--
Best Regards,
Ayan Guha


Re: [Spark R]: dapply only works for very small datasets

2017-11-28 Thread Felix Cheung
You can find more discussions in
https://issues.apache.org/jira/browse/SPARK-18924
And
https://issues.apache.org/jira/browse/SPARK-17634

I suspect the cost is linear - so partitioning the data into smaller chunks 
with more executors (one core each) running in parallel would probably help a 
bit.

Unfortunately this is an area where we could really use some improvements,
and I think it *should* be possible (hmm  
https://databricks.com/blog/2017/10/06/accelerating-r-workflows-on-databricks.html.
 ;)

_
From: Kunft, Andreas <andreas.ku...@tu-berlin.de>
Sent: Tuesday, November 28, 2017 3:11 AM
Subject: AW: [Spark R]: dapply only works for very small datasets
To: Felix Cheung <felixcheun...@hotmail.com>, <user@spark.apache.org>



Thanks for the fast reply.


I tried it locally, with 1 - 8 slots on a 8 core machine w/ 25GB memory as well 
as on 4 nodes with the same specifications.

When I shrink the data to around 100MB,

it runs in about 1 hour for 1 core and about 6 min with 8 cores.


I'm aware that the serDe takes time, but it seems there must be something else 
off considering these numbers.


____
Von: Felix Cheung <felixcheun...@hotmail.com>
Gesendet: Montag, 27. November 2017 20:20
An: Kunft, Andreas; user@spark.apache.org
Betreff: Re: [Spark R]: dapply only works for very small datasets

What's the number of executors and/or the number of partitions you are working with?

I'm afraid most of the problem is the serialization/deserialization
overhead between the JVM and R...


From: Kunft, Andreas <andreas.ku...@tu-berlin.de>
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user defined functions with R using the airline arrival 
performance dataset.

While the examples from the documentation for the `<-` apply operator work 
perfectly fine on a size ~9GB,

the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stackoverflow and even asked the question there as well, but till now 
the only answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0 and read the data from HDFS 2.7.1 and played with several 
DOPs.

Best
Andreas





Re: [Spark R]: dapply only works for very small datasets

2017-11-27 Thread Felix Cheung
What's the number of executors and/or the number of partitions you are working with?

I'm afraid most of the problem is the serialization/deserialization
overhead between the JVM and R...


From: Kunft, Andreas 
Sent: Monday, November 27, 2017 10:27:33 AM
To: user@spark.apache.org
Subject: [Spark R]: dapply only works for very small datasets


Hello,


I tried to execute some user defined functions with R using the airline arrival 
performance dataset.

While the examples from the documentation for the `<-` apply operator work 
perfectly fine on a size ~9GB,

the `dapply` operator fails to finish even after ~4 hours.


I'm using a function similar to the one from the documentation:


df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)

I checked Stackoverflow and even asked the question there as well, but till now 
the only answer I got was:
"Avoid using dapply, gapply"

So, am I missing some parameters, or is there a general limitation?
I'm using Spark 2.2.0 and read the data from HDFS 2.7.1 and played with several 
DOPs.

Best
Andreas



Re: using R with Spark

2017-09-24 Thread Felix Cheung
There are other approaches like this

Find Livy on the page
https://blog.rstudio.com/2017/01/24/sparklyr-0-5/

It will probably be best to follow up with sparklyr for any support questions.


From: Adaryl Wakefield <adaryl.wakefi...@hotmail.com>
Sent: Sunday, September 24, 2017 2:42:19 PM
To: user@spark.apache.org
Subject: RE: using R with Spark

>It is free for use might need r studio server depending on which spark master 
>you choose.
Yeah I think that’s where my confusion is coming from. I’m looking at a cheat 
sheet. For connecting to a Yarn Cluster the first step is;

  1.  Install RStudio Server or RStudio Pro on one of the existing edge nodes.

As a matter of fact, it looks like any instance where you’re connecting to a 
cluster requires the paid version of RStudio. All the links I google are 
suggesting this. And then there is this:
https://stackoverflow.com/questions/39798798/connect-sparklyr-to-remote-spark-connection

That’s about a year old, but I haven’t found anything that contradicts it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>


From: Georg Heiler [mailto:georg.kf.hei...@gmail.com]
Sent: Sunday, September 24, 2017 3:39 PM
To: Felix Cheung <felixcheun...@hotmail.com>; Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com>; user@spark.apache.org
Subject: Re: using R with Spark

No. It is free for use might need r studio server depending on which spark 
master you choose.
Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> 
schrieb am So. 24. Sep. 2017 um 22:24:
Both are free to use; you can use sparklyr from the R shell without RStudio 
(but you probably want an IDE)


From: Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com<mailto:adaryl.wakefi...@hotmail.com>>
Sent: Sunday, September 24, 2017 11:19:24 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: using R with Spark

There are two packages SparkR and sparklyr. Sparklyr seems to be the more 
useful. However, do you have to pay to use it? Unless I’m not reading this 
right, it seems you have to have the paid version of RStudio to use it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>




Re: using R with Spark

2017-09-24 Thread Felix Cheung
If you google it you will find posts or info on how to connect it to different 
cloud and hadoop/spark vendors.



From: Georg Heiler <georg.kf.hei...@gmail.com>
Sent: Sunday, September 24, 2017 1:39:09 PM
To: Felix Cheung; Adaryl Wakefield; user@spark.apache.org
Subject: Re: using R with Spark

No. It is free for use might need r studio server depending on which spark 
master you choose.
Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> 
schrieb am So. 24. Sep. 2017 um 22:24:
Both are free to use; you can use sparklyr from the R shell without RStudio 
(but you probably want an IDE)


From: Adaryl Wakefield 
<adaryl.wakefi...@hotmail.com<mailto:adaryl.wakefi...@hotmail.com>>
Sent: Sunday, September 24, 2017 11:19:24 AM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: using R with Spark

There are two packages SparkR and sparklyr. Sparklyr seems to be the more 
useful. However, do you have to pay to use it? Unless I’m not reading this 
right, it seems you have to have the paid version of RStudio to use it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net<http://www.massstreet.net/>
www.linkedin.com/in/bobwakefieldmba<http://www.linkedin.com/in/bobwakefieldmba>
Twitter: @BobLovesData<http://twitter.com/BobLovesData>




Re: using R with Spark

2017-09-24 Thread Felix Cheung
Both are free to use; you can use sparklyr from the R shell without RStudio 
(but you probably want an IDE)



From: Adaryl Wakefield 
Sent: Sunday, September 24, 2017 11:19:24 AM
To: user@spark.apache.org
Subject: using R with Spark

There are two packages SparkR and sparklyr. Sparklyr seems to be the more 
useful. However, do you have to pay to use it? Unless I’m not reading this 
right, it seems you have to have the paid version of RStudio to use it.

Adaryl "Bob" Wakefield, MBA
Principal
Mass Street Analytics, LLC
913.938.6685
www.massstreet.net
www.linkedin.com/in/bobwakefieldmba
Twitter: @BobLovesData




Re: graphframes on cluster

2017-09-20 Thread Felix Cheung
Could you include the code where it fails?
Generally the best way to use GraphFrames is to use the --packages option with
the spark-submit command.
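
For example (the package coordinate below is only an example; match it to your Spark and Scala versions): either pass the same coordinate to spark-submit with --packages, or, as a minimal PySpark sketch, set the equivalent configuration before the session starts so the jars are resolved and shipped to the executors as well as the driver. Note this only takes effect if set before the JVM is launched; under an already-running spark-submit, use --packages instead.

from pyspark.sql import SparkSession

# Example coordinate; pick the graphframes build matching your Spark/Scala version.
spark = (SparkSession.builder
         .config("spark.jars.packages",
                 "graphframes:graphframes:0.6.0-spark2.3-s_2.11")
         .getOrCreate())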


From: Imran Rajjad 
Sent: Wednesday, September 20, 2017 5:47:27 AM
To: user @spark
Subject: graphframes on cluster

Trying to run GraphFrames on a Spark cluster. Do I need to include the package
in the Spark context settings, or is only the driver program supposed to have the
GraphFrames libraries on its classpath? Currently the job crashes when an action
function is invoked on GraphFrames classes.

regards,
Imran

--
I.R


Re: Queries with streaming sources must be executed with writeStream.start()

2017-09-09 Thread Felix Cheung
What is newDS?
If it is a streaming Dataset/DataFrame (since you have writeStream there), then
there seems to be an issue preventing toJSON from working.


From: kant kodali 
Sent: Saturday, September 9, 2017 4:04:33 PM
To: user @spark
Subject: Queries with streaming sources must be executed with 
writeStream.start()

Hi All,

I have the following code and I am not sure what's wrong with it. I cannot
call dataset.toJSON() (which returns a Dataset)? I am using Spark 2.2.0, so I am
wondering if there is any workaround.


 Dataset ds = newDS.toJSON().map(()->{some function that returns a 
string});
 StreamingQuery query = ds.writeStream().start();
 query.awaitTermination();


Re: How to convert Row to JSON in Java?

2017-09-09 Thread Felix Cheung
toJSON on Dataset/DataFrame?
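
There is no Row.toJson(); the conversion happens at the Dataset/DataFrame level. A minimal sketch (shown in PySpark for brevity; toy data):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "name"])
# toJSON returns one JSON string per row.
for s in df.toJSON().collect():
    print(s)   # e.g. {"id":1,"name":"a"}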


From: kant kodali 
Sent: Saturday, September 9, 2017 4:15:49 PM
To: user @spark
Subject: How to convert Row to JSON in Java?

Hi All,

How do I convert a Row to JSON in Java? It would be nice to have a .toJson() method
in the Row class.

Thanks,
kant


Re: sparkR 3rd library

2017-09-04 Thread Felix Cheung
Can you include the code where you call spark.lapply?



From: patcharee 
Sent: Sunday, September 3, 2017 11:46:40 PM
To: spar >> user@spark.apache.org
Subject: sparkR 3rd library

Hi,

I am using spark.lapply to execute an existing R script in standalone
mode. This script calls a function 'rbga' from a third-party library, 'genalg'.
This rbga function works fine in the SparkR env when I call it directly, but
when I run it through spark.lapply I get the error

could not find function "rbga"
 at org.apache.spark.api.r.RRunner.compute(RRunner.scala:108)
 at org.apache.spark.api.r.BaseRRDD.compute(RRDD.scala:51)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala

Any ideas/suggestions?

BR, Patcharee


-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: With 2.2.0 PySpark is now available for pip install from PyPI :)

2017-07-12 Thread Felix Cheung
Awesome! Congrats!!


From: holden.ka...@gmail.com  on behalf of Holden Karau 

Sent: Wednesday, July 12, 2017 12:26:00 PM
To: user@spark.apache.org
Subject: With 2.2.0 PySpark is now available for pip install from PyPI :)

Hi wonderful Python + Spark folks,

I'm excited to announce that with Spark 2.2.0 we finally have PySpark published 
on PyPI (see https://pypi.python.org/pypi/pyspark / 
https://twitter.com/holdenkarau/status/885207416173756417). This has been a 
long time coming (previous releases included pip installable artifacts that for 
a variety of reasons couldn't be published to PyPI). So if you (or your 
friends) want to be able to work with PySpark locally on your laptop you've got 
an easier path getting started (pip install pyspark).

If you are setting up a standalone cluster your cluster will still need the 
"full" Spark packaging, but the pip installed PySpark should be able to work 
with YARN or an existing standalone cluster installation (of the same version).

Happy Sparking y'all!

Holden :)


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: How save streaming aggregations on 'Structured Streams' in parquet format ?

2017-06-19 Thread Felix Cheung
And perhaps the error message can be improved here?


From: Tathagata Das 
Sent: Monday, June 19, 2017 8:24:01 PM
To: kaniska Mandal
Cc: Burak Yavuz; user
Subject: Re: How save streaming aggregations on 'Structured Streams' in parquet 
format ?

That is not the right way to use watermark + append output mode. The
`withWatermark` must come before the aggregation. Something like this:

df.withWatermark("timestamp", "1 hour")
  .groupBy(window("timestamp", "30 seconds"))
  .agg(...)

Read more here - 
https://databricks.com/blog/2017/05/08/event-time-aggregation-watermarking-apache-sparks-structured-streaming.html
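
The same idea as a minimal, runnable PySpark sketch (the built-in rate source stands in for the real input; its timestamp column plays the role of the event-time column): the watermark is applied before the aggregation, which is what allows append mode with a file sink such as Parquet.

from pyspark.sql import SparkSession
from pyspark.sql.functions import window

spark = SparkSession.builder.getOrCreate()

# The rate source emits (timestamp, value) rows and is handy for experiments.
events = spark.readStream.format("rate").option("rowsPerSecond", 10).load()

agg = (events
       .withWatermark("timestamp", "1 minute")          # watermark first
       .groupBy(window("timestamp", "30 seconds"))      # then the aggregation
       .count())

query = (agg.writeStream
         .format("parquet")
         .option("path", "/tmp/summary/devices/")
         .option("checkpointLocation", "/tmp/parquet/checkpoints/")
         .outputMode("append")
         .start())
query.awaitTermination()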


On Mon, Jun 19, 2017 at 7:27 PM, kaniska Mandal 
> wrote:
Hi Burak,

Per your suggestion, I have specified
> deviceBasicAgg.withWatermark("eventtime", "30 seconds");
before invoking deviceBasicAgg.writeStream()...

But I am still facing ~

org.apache.spark.sql.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets;

I am Ok with 'complete' output mode.

=

I tried another approach: creating the parquet file from the in-memory dataset,
which seems to work.

But I need only the delta, not the cumulative count. Since 'append' mode does not
support streaming aggregation, I have to use the 'complete' outputMode.

StreamingQuery streamingQry = deviceBasicAgg.writeStream()

  .format("memory")

  .trigger(ProcessingTime.create("5 seconds"))

  .queryName("deviceBasicAggSummary")

  .outputMode("complete")

  .option("checkpointLocation", "/tmp/parquet/checkpoints/")

  .start();

streamingQry.awaitTermination();

Thread.sleep(5000);

while(true) {

Dataset deviceBasicAggSummaryData = spark.table("deviceBasicAggSummary");

deviceBasicAggSummaryData.toDF().write().parquet("/data/summary/devices/"+new 
Date().getTime()+"/");

}

=

So what's the best practice for 'low latency query on distributed data' using
Spark SQL and Structured Streaming?


Thanks

Kaniska


On Mon, Jun 19, 2017 at 11:55 AM, Burak Yavuz 
> wrote:
Hi Kaniska,

In order to use append mode with aggregations, you need to set an event time 
watermark (using `withWatermark`). Otherwise, Spark doesn't know when to output 
an aggregation result as "final".

Best,
Burak

On Mon, Jun 19, 2017 at 11:03 AM, kaniska Mandal 
> wrote:
Hi,

My goal is to ~
(1) either chain streaming aggregations in a single query OR
(2) run multiple streaming aggregations and save data in some meaningful format 
to execute low latency / failsafe OLAP queries

So my first choice is parquet format , but I failed to make it work !

I am using spark-streaming_2.11-2.1.1

I am facing the following error -
org.apache.spark.sql.AnalysisException: Append output mode not supported when 
there are streaming aggregations on streaming DataFrames/DataSets;

- for the following syntax

 StreamingQuery streamingQry = tagBasicAgg.writeStream()

  .format("parquet")

  .trigger(ProcessingTime.create("10 seconds"))

  .queryName("tagAggSummary")

  .outputMode("append")

  .option("checkpointLocation", "/tmp/summary/checkpoints/")

  .option("path", "/data/summary/tags/")

  .start();

But, parquet doesn't support 'complete' outputMode.

So is parquet supported only for batch queries , NOT for streaming queries ?

- note that console outputmode working fine !

Any help will be much appreciated.

Thanks
Kaniska






Re: problem initiating spark context with pyspark

2017-06-10 Thread Felix Cheung
Curtis, assuming you are running a somewhat recent Windows version, you would
not have access to C:\tmp in your command example:

winutils.exe ls -F C:\tmp\hive

Try changing the path to under your user directory.

Running Spark on Windows should work :)


From: Curtis Burkhalter 
Sent: Wednesday, June 7, 2017 7:46:56 AM
To: Doc Dwarf
Cc: user@spark.apache.org
Subject: Re: problem initiating spark context with pyspark

Thanks Doc, I saw this on another board yesterday, so I tried it by first
going to the directory where I've stored winutils.exe and then, as an admin,
running the command that you suggested, and I get this exception when checking
the permissions:

C:\winutils\bin>winutils.exe ls -F C:\tmp\hive
FindFileOwnerAndPermission error (1789): The trust relationship between this 
workstation and the primary domain failed.

I'm fairly new to the command line and to figuring out what the different
exceptions mean. Do you have any advice on what this error means and how I might
go about fixing it?

Thanks again


On Wed, Jun 7, 2017 at 9:51 AM, Doc Dwarf 
> wrote:
Hi Curtis,

I believe in windows, the following command needs to be executed: (will need 
winutils installed)

D:\winutils\bin\winutils.exe chmod 777 D:\tmp\hive



On 6 June 2017 at 09:45, Curtis Burkhalter 
> wrote:
Hello all,

I'm new to Spark and I'm trying to interact with it using Pyspark. I'm using 
the prebuilt version of spark v. 2.1.1 and when I go to the command line and 
use the command 'bin\pyspark' I have initialization problems and get the 
following message:

C:\spark\spark-2.1.1-bin-hadoop2.7> bin\pyspark
Python 3.6.0 |Anaconda 4.3.1 (64-bit)| (default, Dec 23 2016, 11:57:41) [MSC 
v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
17/06/06 10:30:14 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
17/06/06 10:30:21 WARN ObjectStore: Version information not found in metastore. 
hive.metastore.schema.verification is not enabled so recording the schema 
version 1.2.0
17/06/06 10:30:21 WARN ObjectStore: Failed to get database default, returning 
NoSuchObjectException
Traceback (most recent call last):
  File "C:\spark\spark-2.1.1-bin-hadoop2.7\python\pyspark\sql\utils.py", line 
63, in deco
return f(*a, **kw)
  File 
"C:\spark\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py",
 line 319, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o22.sessionState.
: java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)
at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)
at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:978)
... 13 more
Caused by: java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveExternalCatalog':
at 
org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$reflect(SharedState.scala:169)
at 

Re: "java.lang.IllegalStateException: There is no space for new record" in GraphFrames

2017-04-28 Thread Felix Cheung
Can you allocate more memory to the executor?

Also, please open an issue with GraphFrames on its GitHub.


From: rok 
Sent: Friday, April 28, 2017 1:42:33 AM
To: user@spark.apache.org
Subject: "java.lang.IllegalStateException: There is no space for new record" in 
GraphFrames

When running the connectedComponents algorithm in GraphFrames on a
sufficiently large dataset, I get the following error I have not encountered
before:

17/04/20 20:35:26 WARN TaskSetManager: Lost task 3.0 in stage 101.0 (TID
53644, 172.19.1.206, executor 40): java.lang.IllegalStateException: There is
no space for new record
at
org.apache.spark.util.collection.unsafe.sort.UnsafeInMemorySorter.insertRecord(UnsafeInMemorySorter.java:225)
at
org.apache.spark.sql.execution.UnsafeKVExternalSorter.(UnsafeKVExternalSorter.java:130)
at
org.apache.spark.sql.execution.UnsafeFixedWidthAggregationMap.destructAndCreateExternalSorter(UnsafeFixedWidthAggregationMap.java:244)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.agg_doAggregateWithKeys$(Unknown
Source)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at
org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:126)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at
org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


Any thoughts on how to avoid this?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-There-is-no-space-for-new-record-in-GraphFrames-tp28635.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe e-mail: user-unsubscr...@spark.apache.org



Re: how to create List in pyspark

2017-04-28 Thread Felix Cheung
Why not use the SQL functions explode and split?
They would perform better and be more stable than a UDF.
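
A minimal PySpark sketch of that suggestion, using the same README.md input as
the snippet quoted below:

from pyspark.sql.functions import split, explode

df = spark.read.text("./README.md")
# split produces an array column without going through a Python UDF
arr = df.select(split(df["value"], " ").alias("sentences"))
# explode flattens the array into one word per row, if that is what is needed
words = arr.select(explode(arr["sentences"]).alias("word"))
words.show()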


From: Yanbo Liang 
Sent: Thursday, April 27, 2017 7:34:54 AM
To: Selvam Raman
Cc: user
Subject: Re: how to create List in pyspark

​You can try with UDF, like the following code snippet:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType
df = spark.read.text("./README.md")
split_func = udf(lambda text: text.split(" "), ArrayType(StringType()))
df.withColumn("split_value", split_func("value")).show()

Thanks
Yanbo

On Tue, Apr 25, 2017 at 12:27 AM, Selvam Raman 
> wrote:

documentDF = spark.createDataFrame([

("Hi I heard about Spark".split(" "), ),

("I wish Java could use case classes".split(" "), ),

("Logistic regression models are neat".split(" "), )

], ["text"])


How can I achieve the same df while I am reading from the source?

doc = spark.read.text("/Users/rs/Desktop/nohup.out")

How can I create an array-type "sentences" column from doc (a DataFrame)?


The below one creates more than one column.

rdd.map(lambda rdd: rdd[0]).map(lambda row:row.split(" "))

--
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"



Re: Spark SQL - Global Temporary View is not behaving as expected

2017-04-22 Thread Felix Cheung
Cross-session in this context means multiple Spark sessions from the same Spark
context. Since you are running two shells, you have two different Spark
contexts.

Do you have to use a temp view? Could you create a table instead?
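
To illustrate the distinction, a minimal PySpark sketch (the same idea applies
in spark-shell), assuming an existing SparkSession `spark`:

spark.sql("select 1 as col1").createGlobalTempView("gView1")

s2 = spark.newSession()                            # new session, same SparkContext
s2.sql("select * from global_temp.gView1").show()  # visible here

# A second spark-shell is a separate driver process with its own SparkContext,
# so global_temp.gView1 is not visible there.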

_
From: Hemanth Gudela 
>
Sent: Saturday, April 22, 2017 12:57 AM
Subject: Spark SQL - Global Temporary View is not behaving as expected
To: >


Hi,

According to 
documentation,
 global temporary views are cross-session accessible.

But when I try to query a global temporary view from another spark shell like 
this-->
Instance 1 of spark-shell
--
scala> spark.sql("select 1 as col1").createGlobalTempView("gView1")

Instance 2 of spark-shell (while Instance 1 of spark-shell is still alive)
-
scala> spark.sql("select * from global_temp.gView1").show()
org.apache.spark.sql.AnalysisException: Table or view not found: 
`global_temp`.`gView1`
'Project [*]
+- 'UnresolvedRelation `global_temp`.`gView1`

I am expecting that a global temporary view created in shell 1 should be
accessible in shell 2, but it isn't!
Please correct me if I am missing something here.

Thanks (in advance),
Hemanth




Re: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ?

2017-04-21 Thread Felix Cheung
Not currently - how are you planning to use the output from word2vec?


From: Radhwane Chebaane 
Sent: Thursday, April 20, 2017 4:30:14 AM
To: user@spark.apache.org
Subject: [sparkR] [MLlib] : Is word2vec implemented in SparkR MLlib ?

Hi,

I've been experimenting with the Spark Word2Vec implementation in the
MLlib package with Scala and it was very nice.
I need to use the same algorithm in R, leveraging the power of Spark's
distributed execution with SparkR.
I have been looking on the mailing list and Stack Overflow for any Word2Vec
use case in SparkR, but no luck.

Is there any implementation of Word2vec in SparkR ? Is there any current work 
to support this feature in MLlib with R?

Thanks!
Radhwane Chebaane

--

[photo] Radhwane Chebaane
Distributed systems engineer, Mindlytix

Mail: radhw...@mindlytix.com 
Mobile: +33 695 588 906 

Skype: rad.cheb 
LinkedIn 




Re: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

2017-04-02 Thread Felix Cheung
Interesting!


From: Robert Yokota 
Sent: Sunday, April 2, 2017 9:40:07 AM
To: user@spark.apache.org
Subject: Graph Analytics on HBase with HGraphDB and Spark GraphFrames

Hi,

In case anyone is interested in analyzing graphs in HBase with Apache Spark 
GraphFrames, this might be helpful:

https://yokota.blog/2017/04/02/graph-analytics-on-hbase-with-hgraphdb-and-spark-graphframes/


Re: Getting exit code of pipe()

2017-02-12 Thread Felix Cheung
I mean that if you are running a script, then instead of only exiting with a
code it could also print something out.

Sounds like checkCode is what you want, though.
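
A minimal sketch of both options (the run.sh script and its output format are
hypothetical):

import os

rdd = sc.parallelize(["a", "b", "c"])

# Option 1: fail the job when run.sh exits with a non-zero status
out = rdd.pipe("./run.sh", env=os.environ, checkCode=True).collect()

# Option 2: have run.sh print a status token on each stdout line, e.g.
# "OK\tpayload" or "ERR\tpayload", and process the status here instead
tagged = rdd.pipe("./run.sh", env=os.environ).map(lambda line: line.split("\t", 1))
failures = tagged.filter(lambda kv: kv[0] == "ERR").count()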


_
From: Xuchen Yao <yaoxuc...@gmail.com<mailto:yaoxuc...@gmail.com>>
Sent: Sunday, February 12, 2017 8:33 AM
Subject: Re: Getting exit code of pipe()
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>


Cool that's exactly what I was looking for! Thanks!

How does one output the status to stdout? I mean, how does one capture the
status output of the pipe() command?

On Sat, Feb 11, 2017 at 9:50 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Do you want the job to fail if there is an error exit code?

You could set checkCode to True
spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe<http://spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe>

Otherwise maybe you want to output the status into stdout so you could process 
it individually.


_
From: Xuchen Yao <yaoxuc...@gmail.com<mailto:yaoxuc...@gmail.com>>
Sent: Friday, February 10, 2017 11:18 AM
Subject: Getting exit code of pipe()
To: <user@spark.apache.org<mailto:user@spark.apache.org>>



Hello Community,

I have the following Python code that calls an external command:

rdd.pipe('run.sh', env=os.environ).collect()

run.sh can either exit with status 1 or 0, how could I get the exit code from 
Python? Thanks!

Xuchen







Re: Getting exit code of pipe()

2017-02-11 Thread Felix Cheung
Do you want the job to fail if there is an error exit code?

You could set checkCode to True
spark.apache.org/docs/latest/api/python/pyspark.html?highlight=pipe#pyspark.RDD.pipe

Otherwise maybe you want to output the status into stdout so you could process 
it individually.


_
From: Xuchen Yao >
Sent: Friday, February 10, 2017 11:18 AM
Subject: Getting exit code of pipe()
To: >


Hello Community,

I have the following Python code that calls an external command:

rdd.pipe('run.sh', env=os.environ).collect()

run.sh can either exit with status 1 or 0, how could I get the exit code from 
Python? Thanks!

Xuchen




Re: Examples in graphx

2017-01-29 Thread Felix Cheung
Which graph database are you thinking about?
Here's one for Neo4j:

https://neo4j.com/blog/neo4j-3-0-apache-spark-connector/


From: Deepak Sharma 
Sent: Sunday, January 29, 2017 4:28:19 AM
To: spark users
Subject: Examples in graphx

Hi There,
Are there any examples of using GraphX along with any graph DB?
I am looking to persist the graph in a graph-based DB and then read it back in
Spark and process it using GraphX.

--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net


Re: Creating UUID using SparksSQL

2017-01-18 Thread Felix Cheung
spark.apache.org/docs/2.0.0/api/python/pyspark.sql.html#pyspark.sql.functions.monotonically_increasing_id

?
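
For example, a minimal PySpark sketch (the ids are unique and increasing, but
not consecutive and not hive-style UUIDs):

from pyspark.sql.functions import monotonically_increasing_id

df = spark.range(5).withColumn("row_id", monotonically_increasing_id())
df.show()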



From: Ninad Shringarpure 
Sent: Wednesday, January 18, 2017 11:23:15 AM
To: user
Subject: Creating UUID using SparksSQL

Hi Team,

Is there a standard way of generating a unique id for each row from Spark
SQL? I am looking for functionality similar to UUID generation in Hive.

Let me know if you need any additional information.

Thanks,
Ninad


Re: what does dapply actually do?

2017-01-18 Thread Felix Cheung
With Spark, the processing is performed lazily. This means nothing much really
happens until you call an "action" - collect() is one example. Another way to
trigger it is to write the output in a distributed manner - see write.df() in
R.

With SparkR dapply(), passing the data from Spark to R to be processed by your
UDF can have significant overhead. Could you provide more information on your
case?


_
From: Xiao Liu1 >
Sent: Wednesday, January 18, 2017 11:30 AM
Subject: what does dapply actually do?
To: >



Hi,
I'm really new and trying to learn sparkR. I have defined a relatively 
complicated user-defined function, and use dapply() to apply the function on a 
SparkDataFrame. It was very fast. But I am not sure what has actually been done 
by dapply(). Because when I used collect() to see the output, which is very 
simple, it took a long time to get the result. I suppose maybe I don't need to 
use collect(), but without using it, how can I output the final results, say, 
in a .csv file?
Thank you very much for the help.

Best Regards,
Xiao



From: Ninad Shringarpure >
To: user >
Date: 01/18/2017 02:24 PM
Subject: Creating UUID using SparksSQL





Hi Team,

Is there a standard way of generating a unique id for each row from Spark
SQL? I am looking for functionality similar to UUID generation in Hive.

Let me know if you need any additional information.

Thanks,
Ninad






Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
This is likely a factor of your Hadoop config and Spark rather than anything
specific to GraphFrames.

You might have better luck getting assistance if you could isolate the code to 
a simple case that manifests the problem (without GraphFrames), and repost.



From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Thursday, January 5, 2017 3:45:59 PM
To: Felix Cheung; d...@spark.apache.org
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Adding DEV mailing list to see if this is a defect with ConnectedComponent or 
if they can recommend any solution.

Thanks
Ankur

On Thu, Jan 5, 2017 at 1:10 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Yes, I did try it out and it chooses the local file system, as my checkpoint
location starts with s3n://

I am not sure how can I make it load the S3FileSystem.

On Thu, Jan 5, 2017 at 12:12 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org<mailto:user@spark.apache.org>

Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.d

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
Right, I'd agree, it seems to be only with delete.

Could you by chance run just the delete to see if it fails

FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(somepath), true)

From: Ankur Srivastava <ankur.srivast...@gmail.com>
Sent: Thursday, January 5, 2017 10:05:03 AM
To: Felix Cheung
Cc: user@spark.apache.org
Subject: Re: Spark GraphFrame ConnectedComponents

Yes it works to read the vertices and edges data from S3 location and is also 
able to write the checkpoint files to S3. It only fails when deleting the data 
and that is because it tries to use the default file system. I tried looking up 
how to update the default file system but could not find anything in that 
regard.

Thanks
Ankur

On Thu, Jan 5, 2017 at 12:55 AM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>



This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code
("ConnectedComponents.scala:339"). I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

  if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
// TODO: remove this after DataFrame.checkpoint is implemented
val out = s"${checkpointDir.get}/$iteration"
ee.write.parquet(out)
// may hit S3 eventually consistent issue
ee = sqlContext.read.parquet(out)

// remove previous checkpoint
if (iteration > checkpointInterval) {
  FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(s"${checkpointDir.get}/${iteration - 
checkpointInterval}"), true)
    }

System.gc() // hint Spark to clean shuffle directories
  }


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Do you have more of the exception stack?


___

Re: Spark GraphFrame ConnectedComponents

2017-01-05 Thread Felix Cheung
From the stack it looks to be an error from the explicit call to
hadoop.fs.FileSystem.

Is the URL scheme for s3n registered?
Does it work when you try to read from s3 from Spark?

_
From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 9:23 PM
Subject: Re: Spark GraphFrame ConnectedComponents
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>


This is the exact trace from the driver logs

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n:///8ac233e4-10f9-4eb3-aa53-df6d9d7ea7be/connected-components-c1dbc2b0/3,
 expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:645)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:80)
at 
org.apache.hadoop.fs.RawLocalFileSystem.deprecatedGetFileStatus(RawLocalFileSystem.java:529)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileLinkStatusInternal(RawLocalFileSystem.java:747)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:524)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:534)
at 
org.graphframes.lib.ConnectedComponents$.org$graphframes$lib$ConnectedComponents$$run(ConnectedComponents.scala:340)
at org.graphframes.lib.ConnectedComponents.run(ConnectedComponents.scala:139)
at GraphTest.main(GraphTest.java:31) --- Application Class
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at 
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

And I am running spark v 1.6.2 and graphframes v 0.3.0-spark1.6-s_2.10

Thanks
Ankur

On Wed, Jan 4, 2017 at 8:03 PM, Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>> wrote:
Hi

I am rerunning the pipeline to generate the exact trace, I have below part of 
trace from last run:

Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///
at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)
at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)
at 
org.apache.hadoop.fs.RawLocalFileSystem.getFileStatus(RawLocalFileSystem.java:516)
at org.apache.hadoop.fs.ChecksumFileSystem.delete(ChecksumFileSystem.java:528)

Also, I think the error is happening in this part of the code
("ConnectedComponents.scala:339"). I am referring to the code at
https://github.com/graphframes/graphframes/blob/master/src/main/scala/org/graphframes/lib/ConnectedComponents.scala

  if (shouldCheckpoint && (iteration % checkpointInterval == 0)) {
// TODO: remove this after DataFrame.checkpoint is implemented
val out = s"${checkpointDir.get}/$iteration"
ee.write.parquet(out)
// may hit S3 eventually consistent issue
ee = sqlContext.read.parquet(out)

// remove previous checkpoint
if (iteration > checkpointInterval) {
  FileSystem.get(sc.hadoopConfiguration)
.delete(new Path(s"${checkpointDir.get}/${iteration - 
checkpointInterval}"), true)
}

System.gc() // hint Spark to clean shuffle directories
  }


Thanks
Ankur

On Wed, Jan 4, 2017 at 5:19 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Do you have more of the exception stack?



From: Ankur Srivastava 
<ankur.srivast...@gmail.com<mailto:ankur.srivast...@gmail.com>>
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponent algorithm of GraphFrames but by 
default it needs a checkpoint directory. As I am running my spark cluster with 
S3 as the DFS and do not have access to HDFS file system I tried using a s3 
directory as checkpoint directory but I run into below exception:


Exception in thread "main"java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set checkpoint interval to -1 to avoid chec

Re: Spark GraphFrame ConnectedComponents

2017-01-04 Thread Felix Cheung
Do you have more of the exception stack?



From: Ankur Srivastava 
Sent: Wednesday, January 4, 2017 4:40:02 PM
To: user@spark.apache.org
Subject: Spark GraphFrame ConnectedComponents

Hi,

I am trying to use the ConnectedComponent algorithm of GraphFrames but by 
default it needs a checkpoint directory. As I am running my spark cluster with 
S3 as the DFS and do not have access to HDFS file system I tried using a s3 
directory as checkpoint directory but I run into below exception:


Exception in thread "main" java.lang.IllegalArgumentException: Wrong FS: 
s3n://, expected: file:///

at org.apache.hadoop.fs.FileSystem.checkPath(FileSystem.java:642)

at 
org.apache.hadoop.fs.RawLocalFileSystem.pathToFile(RawLocalFileSystem.java:69)

If I set the checkpoint interval to -1 to avoid checkpointing, the driver just
hangs after 3 or 4 iterations.

Is there some way I can set the default FileSystem to S3 for Spark or any other 
option?

Thanks
Ankur



Re: Issue with SparkR setup on RStudio

2017-01-02 Thread Felix Cheung
Perhaps it is with

spark.sql.warehouse.dir="E:/Exp/"

that you have in the sparkConfig parameter.

Unfortunately the exception stack is fairly far away from the actual error, but
off the top of my head, spark.sql.warehouse.dir and HADOOP_HOME are the two
different pieces that are not set in the Windows tests.


_
From: Md. Rezaul Karim 
<rezaul.ka...@insight-centre.org<mailto:rezaul.ka...@insight-centre.org>>
Sent: Monday, January 2, 2017 7:58 AM
Subject: Re: Issue with SparkR setup on RStudio
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: spark users <user@spark.apache.org<mailto:user@spark.apache.org>>


Hello Cheung,

Happy New Year!

No, I did not configure Hive on my machine. I have even tried not setting
HADOOP_HOME, but I get the same error.



Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web:http://www.reza-analytics.eu/index.html<http://139.59.184.114/index.html>

On 29 December 2016 at 19:16, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Any reason you are setting HADOOP_HOME?

From the error it seems you are running into an issue with the Hive config,
likely when trying to load hive-site.xml. Could you try not setting HADOOP_HOME?



From: Md. Rezaul Karim 
<rezaul.ka...@insight-centre.org<mailto:rezaul.ka...@insight-centre.org>>
Sent: Thursday, December 29, 2016 10:24:57 AM
To: spark users
Subject: Issue with SparkR setup on RStudio

Dear Spark users,
I am trying to set up SparkR on RStudio to perform some basic data manipulations
and ML modeling. However, I am getting a strange error while creating a SparkR
session or DataFrame that says: java.lang.IllegalArgumentException Error while
instantiating 'org.apache.spark.sql.hive.HiveSessionState'.
According to the Spark documentation
at http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I
don't need to configure the Hive path or related variables.
I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]", 
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory = 
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()
Please note that HADOOP_HOME contains the 'winutils.exe' file. The details of
the error are as follows:

Error in handleErrors(returnStatus, conn) :  
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':

   at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

   at scala.collection.Traversabl



 Any kind of help would be appreciated.


Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web:http://www.reza-analytics.eu/index.html<http://139.59.184.114/index.html>





Re: How to load a big csv to dataframe in Spark 1.6

2016-12-31 Thread Felix Cheung
Hmm this would seem unrelated? Does it work on the same box without the 
package? Do you have more of the error stack you can share?


_
From: Raymond Xie <xie3208...@gmail.com<mailto:xie3208...@gmail.com>>
Sent: Saturday, December 31, 2016 8:09 AM
Subject: Re: How to load a big csv to dataframe in Spark 1.6
To: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Cc: <user@spark.apache.org<mailto:user@spark.apache.org>>


Hello Felix,

I followed the instruction and ran the command:

> $SPARK_HOME/bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0

and I received the following error message:
java.lang.RuntimeException: java.net.ConnectException: Call From 
xie1/192.168.112.150<http://192.168.112.150> to localhost:9000 failed on 
connection exception: java.net.ConnectException: Connection refused; For more 
details see:  http://wiki.apache.org/hadoop/ConnectionRefused

Any thoughts?




Sincerely yours,


Raymond

On Fri, Dec 30, 2016 at 10:08 PM, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
Have you tried the spark-csv package?

https://spark-packages.org/package/databricks/spark-csv



From: Raymond Xie <xie3208...@gmail.com<mailto:xie3208...@gmail.com>>
Sent: Friday, December 30, 2016 6:46:11 PM
To: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: How to load a big csv to dataframe in Spark 1.6

Hello,

I see there is usually this way to load a csv to dataframe:


sqlContext = SQLContext(sc)

Employee_rdd = sc.textFile("\..\Employee.csv")
   .map(lambda line: line.split(","))

Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

Employee_df.show()

However in my case my csv has 100+ fields, which means toDF() will be very 
lengthy.

Can anyone tell me a practical method to load the data?

Thank you very much.


Raymond






Re: Spark Graphx with Database

2016-12-30 Thread Felix Cheung
You might want to check out GraphFrames - you can load database data (as Spark
DataFrames) and build graphs with them:


https://github.com/graphframes/graphframes
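
A minimal PySpark sketch of that combination; the JDBC URL, credentials and
table names are hypothetical, and the tables are assumed to already expose
GraphFrames-friendly columns (`id` for vertices, `src`/`dst` for edges):

from graphframes import GraphFrame

url = "jdbc:postgresql://dbhost:5432/graphdb"
props = {"user": "dbuser", "password": "secret", "driver": "org.postgresql.Driver"}

v = spark.read.jdbc(url, table="vertices", properties=props)  # must have an `id` column
e = spark.read.jdbc(url, table="edges", properties=props)     # must have `src` and `dst`

g = GraphFrame(v, e)
g.inDegrees.show()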

_
From: balaji9058 >
Sent: Monday, December 26, 2016 9:27 PM
Subject: Spark Graphx with Database
To: >


Hi All,

I would like to know about Spark GraphX execution/processing with a
database. Yes, I understand Spark GraphX is in-memory processing, and to some
extent we can manage querying, but I would like to do much more complex queries
or processing. Please suggest the use case or steps for the same.




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Graphx-with-Database-tp28253.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Difference in R and Spark Output

2016-12-30 Thread Felix Cheung
Could you elaborate more on the huge difference you are seeing?



From: Saroj C 
Sent: Friday, December 30, 2016 5:12:04 AM
To: User
Subject: Difference in R and Spark Output

Dear All,
For the attached input file, there is a huge difference between the clusters
in R and Spark (ML). Any idea what could cause the difference?

Note we wanted to create five (5) clusters.

Please find the snippets in Spark and R

Spark

//Load the Data file

// Create K means Cluster
KMeans kmeans = new KMeans().setK(5).setMaxIter(500)

.setFeaturesCol("features").setPredictionCol("prediction");


In R

//Load the Data File into df

//Create the K Means Cluster

model <- kmeans(df, 5)



Thanks & Regards
Saroj

=-=-=
Notice: The information contained in this e-mail
message and/or attachments to it may contain
confidential or privileged information. If you are
not the intended recipient, any dissemination, use,
review, distribution, printing or copying of the
information contained in this e-mail message
and/or attachments to it are strictly prohibited. If
you have received this communication in error,
please notify us by reply e-mail or telephone and
immediately and permanently delete the message
and any attachments. Thank you


Re: How to load a big csv to dataframe in Spark 1.6

2016-12-30 Thread Felix Cheung
Have you tried the spark-csv package?

https://spark-packages.org/package/databricks/spark-csv
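
A minimal sketch with that package on Spark 1.6 (launch with
--packages com.databricks:spark-csv_2.11:1.5.0; column names come from the CSV
header, so no long toDF() call is needed):

from pyspark.sql import SQLContext

sqlContext = SQLContext(sc)
Employee_df = (sqlContext.read
    .format("com.databricks.spark.csv")
    .options(header="true", inferSchema="true")
    .load("Employee.csv"))
Employee_df.show()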



From: Raymond Xie 
Sent: Friday, December 30, 2016 6:46:11 PM
To: user@spark.apache.org
Subject: How to load a big csv to dataframe in Spark 1.6

Hello,

I see there is usually this way to load a csv to dataframe:


sqlContext = SQLContext(sc)

Employee_rdd = sc.textFile("\..\Employee.csv")
   .map(lambda line: line.split(","))

Employee_df = Employee_rdd.toDF(['Employee_ID','Employee_name'])

Employee_df.show()

However in my case my csv has 100+ fields, which means toDF() will be very 
lengthy.

Can anyone tell me a practical method to load the data?

Thank you very much.


Raymond



Re: Issue with SparkR setup on RStudio

2016-12-29 Thread Felix Cheung
Any reason you are setting HADOOP_HOME?

From the error it seems you are running into an issue with the Hive config,
likely when trying to load hive-site.xml. Could you try not setting HADOOP_HOME?



From: Md. Rezaul Karim 
Sent: Thursday, December 29, 2016 10:24:57 AM
To: spark users
Subject: Issue with SparkR setup on RStudio

Dear Spark users,
I am trying to set up SparkR on RStudio to perform some basic data manipulations
and ML modeling. However, I am getting a strange error while creating a SparkR
session or DataFrame that says: java.lang.IllegalArgumentException Error while
instantiating 'org.apache.spark.sql.hive.HiveSessionState'.
According to the Spark documentation at
http://spark.apache.org/docs/latest/sparkr.html#starting-up-sparksession, I
don't need to configure the Hive path or related variables.
I have the following source code:

SPARK_HOME = "C:/spark-2.1.0-bin-hadoop2.7"
HADOOP_HOME= "C:/spark-2.1.0-bin-hadoop2.7/bin/"

library(SparkR, lib.loc = c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib")))
sparkR.session(appName = "SparkR-DataFrame-example", master = "local[*]", 
sparkConfig = list(spark.sql.warehouse.dir="E:/Exp/", spark.driver.memory = 
"8g"), enableHiveSupport = TRUE)

# Create a simple local data.frame
localDF <- data.frame(name=c("John", "Smith", "Sarah"), age=c(19, 23, 18))
# Convert local data frame to a SparkDataFrame
df <- createDataFrame(localDF)
print(df)
head(df)
sparkR.session.stop()
Please note that HADOOP_HOME contains the 'winutils.exe' file. The details of
the error are as follows:

Error in handleErrors(returnStatus, conn) :  
java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionState':

   at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$reflect(SparkSession.scala:981)

   at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:110)

   at 
org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:109)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:67)

   at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:66)

   at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)

   at scala.collection.Iterator$class.foreach(Iterator.scala:893)

   at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)

   at 
scala.collection.IterableLike$class.foreach(IterableLike.scala:72)

   at scala.collection.AbstractIterable.foreach(Iterable.scala:54)

   at scala.collection.Traversabl



 Any kind of help would be appreciated.


Regards,
_
Md. Rezaul Karim BSc, MSc
PhD Researcher, INSIGHT Centre for Data Analytics
National University of Ireland, Galway
IDA Business Park, Dangan, Galway, Ireland
Web: http://www.reza-analytics.eu/index.html


Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
There is not a GraphLoader for GraphFrames but you could load and convert from 
GraphX:


http://graphframes.github.io/user-guide.html#graphx-to-graphframe


From: zjp_j...@163.com <zjp_j...@163.com>
Sent: Sunday, December 18, 2016 9:39:49 PM
To: Felix Cheung; user
Subject: Re: Re: GraphFrame not init vertices when load edges

I'm sorry, I didn't express it clearly. I am referring to the following bold,
underlined text.

 cite from http://spark.apache.org/docs/latest/graphx-programming-guide.html
" 
GraphLoader.edgeListFile<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]>
 provides a way to load a graph from a list of edges on disk. It parses an 
adjacency list of (source vertex ID, destination vertex ID) pairs of the 
following form, skipping comment lines that begin with #:

# This is a comment
2 1
4 1
1 2


It creates a Graph from the specified edges, automatically creating any 
vertices mentioned by edges."


Creating any vertices mentioned by the edges when building a Graph from the
specified edges is, I think, a good approach, but the
GraphLoader.edgeListFile<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.GraphLoader$@edgeListFile(SparkContext,String,Boolean,Int):Graph[Int,Int]>
load format does not allow setting edge attributes in the edge file. So I want
to know whether GraphFrames has any plan for this, or a better way.

Thannks





____
zjp_j...@163.com

From: Felix Cheung<mailto:felixcheun...@hotmail.com>
Date: 2016-12-19 12:57
To: zjp_j...@163.com<mailto:zjp_j...@163.com>; 
user<mailto:user@spark.apache.org>
Subject: Re: GraphFrame not init vertices when load edges
Or this is a better link:
http://graphframes.github.io/quick-start.html

_____
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, December 18, 2016 8:46 PM
Subject: Re: GraphFrame not init vertices when load edges
To: <zjp_j...@163.com<mailto:zjp_j...@163.com>>, user 
<user@spark.apache.org<mailto:user@spark.apache.org>>


Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com<mailto:zjp_j...@163.com> 
<zjp_j...@163.com<mailto:zjp_j...@163.com>>
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I found that GraphFrame does not initialize vertices by default when creating
edges. Is there any plan for this, or a better way? Thanks

val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")


zjp_j...@163.com<mailto:zjp_j...@163.com>




Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Or this is a better link:
http://graphframes.github.io/quick-start.html

_
From: Felix Cheung <felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>>
Sent: Sunday, December 18, 2016 8:46 PM
Subject: Re: GraphFrame not init vertices when load edges
To: <zjp_j...@163.com<mailto:zjp_j...@163.com>>, user 
<user@spark.apache.org<mailto:user@spark.apache.org>>


Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md



From: zjp_j...@163.com<mailto:zjp_j...@163.com> 
<zjp_j...@163.com<mailto:zjp_j...@163.com>>
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I found that GraphFrame does not initialize vertices by default when creating
edges. Is there any plan for this, or a better way? Thanks

val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")


zjp_j...@163.com<mailto:zjp_j...@163.com>




Re: GraphFrame not init vertices when load edges

2016-12-18 Thread Felix Cheung
Can you clarify?

Vertices should be another DataFrame as you can see in the example here: 
https://github.com/graphframes/graphframes/blob/master/docs/quick-start.md
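
If you only have edges, one way is to derive the vertices DataFrame from the
edge endpoints; a minimal PySpark sketch (Spark 2.x), assuming an edges
DataFrame `e` with `src` and `dst` columns like the one quoted below:

from pyspark.sql.functions import col
from graphframes import GraphFrame

v = (e.select(col("src").alias("id"))
     .union(e.select(col("dst").alias("id")))
     .distinct())

g = GraphFrame(v, e)
g.vertices.show()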



From: zjp_j...@163.com 
Sent: Sunday, December 18, 2016 6:25:50 PM
To: user
Subject: GraphFrame not init vertices when load edges

Hi,
I found that GraphFrame does not initialize vertices by default when creating
edges. Is there any plan for this, or a better way? Thanks

val e = sqlContext.createDataFrame(List(
  ("a", "b", "friend"),
  ("b", "c", "follow"),
  ("c", "b", "follow"),
  ("f", "c", "follow"),
  ("e", "f", "follow"),
  ("e", "d", "friend"),
  ("d", "a", "friend"),
  ("a", "e", "friend")
)).toDF("src", "dst", "relationship")


zjp_j...@163.com


Re: Spark Dataframe: Save to hdfs is taking long time

2016-12-15 Thread Felix Cheung
What is the format?



From: KhajaAsmath Mohammed 
Sent: Thursday, December 15, 2016 7:54:27 PM
To: user @spark
Subject: Spark Dataframe: Save to hdfs is taking long time

Hi,

I am having an issue while saving the dataframe back to HDFS. It's taking a
long time to run.


val results_dataframe = sqlContext.sql("select gt.*,ct.* from PredictTempTable 
pt,ClusterTempTable ct,GamificationTempTable gt where gt.vin=pt.vin and 
pt.cluster=ct.cluster")
results_dataframe.coalesce(numPartitions)
results_dataframe.persist(StorageLevel.MEMORY_AND_DISK)

dataFrame.write.mode(saveMode).format(format)
  .option(Codec, compressCodec) //"org.apache.hadoop.io.compress.snappyCodec"
  .save(outputPath)

It is taking a long time, and the total number of records for this dataframe is
4903764.

I even increased the number of partitions from 10 to 20, still no luck. Can
anyone help me in resolving this performance issue?

Thanks,

Asmath


Re: How to load edge with properties file useing GraphX

2016-12-15 Thread Felix Cheung
Have you checked out https://github.com/graphframes/graphframes?

It might be easier to work with DataFrames.
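
A minimal PySpark sketch of that route (Spark 2.x; the file names and the
extra property columns are hypothetical, while SrcId/DestId/VID come from your
description):

from graphframes import GraphFrame

e = (spark.read.csv("edges.csv", header=True, inferSchema=True)
     .withColumnRenamed("SrcId", "src")
     .withColumnRenamed("DestId", "dst"))
v = (spark.read.csv("vertices.csv", header=True, inferSchema=True)
     .withColumnRenamed("VID", "id"))

# any remaining columns simply become edge / vertex properties
g = GraphFrame(v, e)
g.edges.show()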



From: zjp_j...@163.com 
Sent: Thursday, December 15, 2016 7:23:57 PM
To: user
Subject: How to load edge with properties file useing GraphX

Hi,
   I want to load a edge file  and vertex attriInfos file as follow ,how can i 
use these two files create Graph ?
  edge file -> "SrcId,DestId,propertis...  "   vertex attriInfos file-> "VID, 
properties..."

   I learned about there have a GraphLoader object  that can load edge file 
with no properties  and then join Vertex properties to create Graph. So the 
issue is how to then attach edge properties.

   Thanks.


zjp_j...@163.com


Re: [GraphFrame, Pyspark] Weighted Edge in PageRank

2016-12-01 Thread Felix Cheung
That's correct - currently GraphFrame does not compute PageRank with weighted 
edges.


_
From: Weiwei Zhang >
Sent: Thursday, December 1, 2016 2:41 PM
Subject: [GraphFrame, Pyspark] Weighted Edge in PageRank
To: user >


Hi guys,

I am trying to compute the pagerank for the locations in the following dummy 
dataframe,

src  des  shared_gas_stations
 A   B   2
 A   C  10
 C   E   3
 D   E  12
 E   G   5
...

I have tried the function graphframe.pageRank(resetProbability=0.01, 
maxIter=20) in GraphFrame but it seems like this function doesn't take weighted 
edges. Maybe I am not using it correctly. How can I pass the weighted edges to 
this function? Also I am not sure if this function works for the undirected 
graph.


Thanks a lot!

- Weiwei




Re: PySpark to remote cluster

2016-11-30 Thread Felix Cheung
Spark 2.0.1 is running with a different py4j library than Spark 1.6.

You will probably run into other problems mixing versions though - is there a 
reason you can't run Spark 1.6 on the client?


_
From: Klaus Schaefers 
>
Sent: Wednesday, November 30, 2016 2:44 AM
Subject: PySpark to remote cluster
To: >


Hi,

I want to connect with a local Jupyter Notebook to a remote Spark cluster.
The Cluster is running Spark 2.0.1 and the Jupyter notebook is based on
Spark 1.6 and running in a docker image (Link). I try to init the
SparkContext like this:

import pyspark
sc = pyspark.SparkContext('spark://:7077')

However, this gives me the following exception:


ERROR:py4j.java_gateway:Error while sending or receiving.
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 746, in send_command
raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 626, in send_command
response = connection.send_command(command)
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 750, in send_command
raise Py4JNetworkError("Error while sending or receiving", e)
py4j.protocol.Py4JNetworkError: Error while sending or receiving

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 740, in send_command
answer = smart_decode(self.stream.readline()[:-1])
File "/opt/conda/lib/python3.5/socket.py", line 575, in readinto
return self._sock.recv_into(b)
ConnectionResetError: [Errno 104] Connection reset by peer
ERROR:py4j.java_gateway:An error occurred while trying to connect to the
Java server
Traceback (most recent call last):
File "/usr/local/spark/python/lib/py4j-0.9-src.zip/py4j/java_gateway.py",
line 746, in send_command
raise Py4JError("Answer from Java side is empty")
py4j.protocol.Py4JError: Answer from Java side is empty

...

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/opt/conda/lib/python3.5/site-packages/IPython/utils/PyColorize.py",
line 262, in format2
for atoken in generate_tokens(text.readline):
File "/opt/conda/lib/python3.5/tokenize.py", line 597, in _tokenize
raise TokenError("EOF in multi-line statement", (lnum, 0))
tokenize.TokenError: ('EOF in multi-line statement', (2, 0))


Is this error caused by the different spark versions?

Best,

Klaus




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/PySpark-to-remote-cluster-tp28147.html
Sent from the Apache Spark User List mailing list archive at 
Nabble.com.

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: How to propagate R_LIBS to sparkr executors

2016-11-17 Thread Felix Cheung
Have you tried
spark.executorEnv.R_LIBS?

spark.apache.org/docs/latest/configuration.html#runtime-environment

_
From: Rodrick Brown >
Sent: Wednesday, November 16, 2016 1:01 PM
Subject: How to propagate R_LIBS to sparkr executors
To: >


I'm having an issue with an R module not getting picked up on the slave nodes in
Mesos. I have the R_LIBS environment variable set, and for some reason it is
only set in the driver context and not in the executors. Is there a way to pass
environment variables down to the executor level in SparkR?

I'm using Mesos 1.0.1 and Spark 2.0.1

Thanks.


--

[Orchard Platform]

Rodrick Brown / Site Reliability Engineer
+1 917 445 6839 / 
rodr...@orchardplatform.com

Orchard Platform
101 5th Avenue, 4th Floor, New York, NY 10003
http://www.orchardplatform.com

Orchard Blog | Marketplace Lending 
Meetup


NOTICE TO RECIPIENTS: This communication is confidential and intended for the 
use of the addressee only. If you are not an intended recipient of this 
communication, please delete it immediately and notify the sender by return 
email. Unauthorized reading, dissemination, distribution or copying of this 
communication is prohibited. This communication does not constitute an offer to 
sell or a solicitation of an indication of interest to purchase any loan, 
security or any other financial product or instrument, nor is it an offer to 
sell or a solicitation of an indication of interest to purchase any products or 
services to any persons who are prohibited from receiving such information 
under applicable law. The contents of this communication may not be accurate or 
complete and are subject to change without notice. As such, Orchard App, Inc. 
(including its subsidiaries and affiliates, "Orchard") makes no representation 
regarding the accuracy or completeness of the information contained herein. The 
intended recipient is advised to consult its own professional advisors, 
including those specializing in legal, tax and accounting matters. Orchard does 
not provide legal, tax or accounting advice.




Re: Strongly Connected Components

2016-11-10 Thread Felix Cheung
It is possible it is dead. Could you check the Spark UI to see if there is any 
progress?


_
From: Shreya Agarwal >
Sent: Thursday, November 10, 2016 12:45 AM
Subject: RE: Strongly Connected Components
To: >


Bump. Anyone? Its been running for 10 hours now. No results.

From: Shreya Agarwal
Sent: Tuesday, November 8, 2016 9:05 PM
To: user@spark.apache.org
Subject: Strongly Connected Components

Hi,

I am running this on a graph with >5B edges and >3B vertices and have 2 questions -


  1.  What is the optimal number of iterations?
  2.  I am running it for 1 iteration right now on a beefy 100 node cluster, 
with 300 executors each having 30GB RAM and 5 cores. I have persisted the graph 
to MEMORY_AND_DISK. And it has been running for 3 hours already. Any ideas on 
how to speed this up?

Regards,
Shreya




Re: Issue Running sparkR on YARN

2016-11-09 Thread Felix Cheung
It may be that the Spark executor is running as a different user and it can't
see where Rscript is.

You might want to try putting the Rscript path on PATH.

Also, please see this for the config property that sets the R command to use:
https://spark.apache.org/docs/latest/configuration.html#sparkr



_
From: ian.malo...@tdameritrade.com
Sent: Wednesday, November 9, 2016 12:12 PM
Subject: Issue Running sparkR on YARN
To: >


Hi,

I'm trying to run sparkR (1.5.2) on YARN and I get:

java.io.IOException: Cannot run program "Rscript": error=2, No such file or 
directory

This strikes me as odd, because I can go to each node and various users and 
type Rscript and it works. I've done this on each node and spark-env.sh as 
well: export R_HOME=/path/to/R

This is how I'm setting it on the nodes (/etc/profile.d/path_edit.sh):

export R_HOME=/app/hdp_app/anaconda/bin/R
PATH=$PATH:/app/hdp_app/anaconda/bin

Any ideas?

Thanks,

Ian

-
To unsubscribe e-mail: 
user-unsubscr...@spark.apache.org





Re: Substitute Certain Rows a data Frame using SparkR

2016-10-19 Thread Felix Cheung
It's a bit less concise but this works:

> a <- as.DataFrame(cars)
> head(a)
  speed dist
1 4 2
2 4 10
3 7 4
4 7 22
5 8 16
6 9 10

> b <- withColumn(a, "speed", ifelse(a$speed > 15, a$speed, 3))
> head(b)
  speed dist
1 3 2
2 3 10
3 3 4
4 3 22
5 3 16
6 3 10

I think your example could be something we support though. Please feel free to 
open a JIRA for that.
_
From: shilp >
Sent: Monday, October 17, 2016 7:38 AM
Subject: Substitute Certain Rows a data Frame using SparkR
To: >


I have a SparkR data frame and I want to replace certain rows of a column which
satisfy a certain condition with some value. If it was a plain R data frame then
I would do something as follows: df$Column1[df$Column1 == "Value"] = "NewValue"
How would I perform a similar operation on a SparkR data frame?





Re: Is Spark 2.0 master node compatible with Spark 1.5 work node?

2016-09-18 Thread Felix Cheung
Well, uber jar works in YARN, but not with standalone ;)





On Sun, Sep 18, 2016 at 12:44 PM -0700, "Chris Fregly" 
<ch...@fregly.com<mailto:ch...@fregly.com>> wrote:

you'll see errors like this...

"java.lang.RuntimeException: java.io.InvalidClassException: 
org.apache.spark.rpc.netty.RequestMessage; local class incompatible: stream 
classdesc serialVersionUID = -2221986757032131007, local class serialVersionUID 
= -5447855329526097695"

...when mixing versions of spark.

i'm actually seeing this right now while testing across Spark 1.6.1 and Spark 
2.0.1 for my all-in-one, hybrid cloud/on-premise Spark + Zeppelin + Kafka + 
Kubernetes + Docker + One-Click Spark ML Model Production Deployments 
initiative documented here:

https://github.com/fluxcapacitor/pipeline/wiki/Kubernetes-Docker-Spark-ML

and check out my upcoming meetup on this effort either in-person or online:

http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/events/233978839/

we're throwing in some GPU/CUDA just to sweeten the offering!  :)

On Sat, Sep 10, 2016 at 2:57 PM, Holden Karau 
<hol...@pigscanfly.ca<mailto:hol...@pigscanfly.ca>> wrote:
I don't think a 2.0 uber jar will play nicely on a 1.5 standalone cluster.


On Saturday, September 10, 2016, Felix Cheung 
<felixcheun...@hotmail.com<mailto:felixcheun...@hotmail.com>> wrote:
You should be able to get it to work with 2.0 as uber jar.

What type cluster you are running on? YARN? And what distribution?





On Sun, Sep 4, 2016 at 8:48 PM -0700, "Holden Karau" <hol...@pigscanfly.ca> 
wrote:

You really shouldn't mix different versions of Spark between the master and 
worker nodes; if you're going to upgrade, upgrade all of them. Otherwise you may 
get very confusing failures.

On Monday, September 5, 2016, Rex X <dnsr...@gmail.com> wrote:
I wish to use the pivot table feature of data frames, which has been available since 
Spark 1.6, but the current cluster runs Spark 1.5. Can we install 
Spark 2.0 on the master node to work around this?

Thanks!


--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau



--
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau




--
Chris Fregly
Research Scientist @ PipelineIO<http://pipeline.io>
Advanced Spark and TensorFlow 
Meetup<http://www.meetup.com/Advanced-Spark-and-TensorFlow-Meetup/>
San Francisco | Chicago | Washington DC






Re: SparkR error: reference is ambiguous.

2016-09-10 Thread Felix Cheung
Could you provide more information on how df in your example is created?
Could you also include the output of printSchema(df)?

This example works:
> c <- createDataFrame(cars)
> c
SparkDataFrame[speed:double, dist:double]
> c$speed <- c$dist*0
> c
SparkDataFrame[speed:double, dist:double]
> head(c)
  speed dist
1     0    2
2     0   10
3     0    4
4     0   22
5     0   16
6     0   10


_
From: Bedrytski Aliaksandr >
Sent: Friday, September 9, 2016 9:13 PM
Subject: Re: SparkR error: reference is ambiguous.
To: xingye >
Cc: >


Hi,

Can you use full-string queries in SparkR?
Like (in Scala):

df1.registerTempTable("df1")
df2.registerTempTable("df2")
val df3 = sqlContext.sql("SELECT * FROM df1 JOIN df2 ON df1.ra = df2.ra")

Explicitly mentioning table names in the query often solves ambiguity problems.
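
A rough SparkR equivalent of the above (a sketch; the function names vary by version, 
e.g. createOrReplaceTempView and sql() in Spark 2.0 vs. registerTempTable and 
sql(sqlContext, ...) before that):

createOrReplaceTempView(df1, "df1")
createOrReplaceTempView(df2, "df2")
# qualifying the join columns with table names avoids the ambiguous-reference error
df3 <- sql("SELECT * FROM df1 JOIN df2 ON df1.ra = df2.ra")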

Regards
--
  Bedrytski Aliaksandr
  sp...@bedryt.ski



On Fri, Sep 9, 2016, at 19:33, xingye wrote:

Not sure whether this is the right distribution list to ask questions on. 
If not, can someone point me to a distribution list where I can find someone to help?


I kept getting a "reference is ambiguous" error when implementing some SparkR 
code.


1. When I tried to assign values to a column using an existing column:

df$c_mon<- df$ra*0

  1.  16/09/09 15:11:28 ERROR RBackendHandler: col on 3101 failed
  2.  Error in invokeJava(isStatic = FALSE, objId$id, methodName, ...) :
  3.org.apache.spark.sql.AnalysisException: Reference 'ra' is ambiguous, 
could be: ra#8146, ra#13501.;

2. When I joined two Spark dataframes using the key:

df3<-join(df1, df2, df1$ra == df2$ra, "left")

  1.  16/09/09 14:48:07 WARN Column: Constructing trivially true equals 
predicate, 'ra#8146 = ra#8146'. Perhaps you need to use aliases.

Actually, "ra" is the column name; I don't know why SparkR keeps raising 
errors about ra#8146 or ra#13501.

Can someone help?

Thanks





Re: questions about using dapply

2016-09-10 Thread Felix Cheung
You might need MARGIN capitalized, this example works though:

c <- as.DataFrame(cars)
# rename the columns to c1, c2
c <- selectExpr(c, "speed as c1", "dist as c2")
cols_in <- dapplyCollect(c,
  function(x) {
    apply(x[, paste("c", 1:2, sep = "")], MARGIN = 2,
          FUN = function(y) { y %in% c(61, 99) })
  })
# dapplyCollect does not require the schema parameter
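
If collecting to the driver is not an option because the data is large, dapply itself 
should also work, provided an output schema is supplied and the function returns an R 
data.frame (a sketch under those assumptions, reusing the renamed columns above):

schema <- structType(structField("c1", "boolean"), structField("c2", "boolean"))
cols_in <- dapply(c, function(x) {
  # apply() returns a matrix, so convert back to a data.frame for dapply
  as.data.frame(apply(x[, paste("c", 1:2, sep = "")], MARGIN = 2,
                      FUN = function(y) { y %in% c(61, 99) }))
}, schema)
head(cols_in)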


_
From: xingye >
Sent: Friday, September 9, 2016 10:35 AM
Subject: questions about using dapply
To: >



I have a question about using UDFs in SparkR. I'm converting some R code into 
SparkR.


* The original R code is :

cols_in <- apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = 
"%in%", c(61, 99))


* If I use dapply and put the original apply function as a function for dapply,

cols_in <-dapply(df,

function(x) {apply(x[, paste("cr_cd", 1:12, sep = "")], Margin=2, function(y){ 
y %in% c(61, 99)})},

schema )

The error shows Error in match.fun(FUN) : argument "FUN" is missing, with no 
default


* If I use spark.lapply, it still shows the error. It seems in spark, the 
column cr_cd1 is ambiguous.

cols_in <-spark.lapply(df[, paste("cr_cd", 1:12, sep = "")], function(x){ x 
%in% c(61, 99)})

 16/09/08 ERROR RBackendHandler: select on 3101 failed Error in 
invokeJava(isStatic = FALSE, objId$id, methodName, ...) : 
org.apache.spark.sql.AnalysisException: Reference 'cr_cd1' is ambiguous, could 
be: cr_cd1#2169L, cr_cd1#17787L.;



  *   If I use dapplyCollect, it works, but it will lead to memory issues if the data 
is big. How can dapply work in my case?

wrapper = function(df){

out = apply(df[, paste("cr_cd", 1:12, sep = "")], MARGIN = 2, FUN = "%in%", 
c(61, 99))

return(out)

}

cols_in <-dapplyCollect(df,wrapper)




Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
How are you calling dirs()? What would be x? Is dat a SparkDataFrame?

With SparkR, the i in dat[i, 4] should be a logical expression over the rows, e.g.
df[df$age %in% c(19, 30), 1:2]
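
Applied to the data frame shown below, a row selection would look something like this 
(a sketch only; the filter values are made up):

# select rows with a Column condition, not with a positional index
sub <- dat[dat$latitude > 52.3523 & dat$longitude < 9.882, c("latitude", "longitude")]
head(sub)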





On Sat, Sep 10, 2016 at 11:02 AM -0700, "Bene" 
> wrote:

Here are a few code snippets:

The data frame looks like this:

  kfz                zeit      datum latitude longitude
1   # 2015-02-09 07:18:33 2015-02-09 52.35234  9.881965
2   # 2015-02-09 07:18:34 2015-02-09 52.35233  9.881970
3   # 2015-02-09 07:18:35 2015-02-09 52.35232  9.881975
4   # 2015-02-09 07:18:36 2015-02-09 52.35232  9.881972
5   # 2015-02-09 07:18:37 2015-02-09 52.35231  9.881973
6   # 2015-02-09 07:18:38 2015-02-09 52.35231  9.881978

I call this function with a number (position in the data frame) and a data
frame:

dirs <- function(x, dat){
  direction(startLat = dat[x,4], endLat = dat[x+1,4], startLon = dat[x,5],
endLon = dat[x+1,5])
}

Here I get the "object of type 'S4' is not subsettable" error. This function calls
another function, which does the actual calculation:

direction <- function(startLat, endLat, startLon, endLon){
  startLat <- degrees.to.radians(startLat);
  startLon <- degrees.to.radians(startLon);
  endLat <- degrees.to.radians(endLat);
  endLon <- degrees.to.radians(endLon);
  dLon <- endLon - startLon;

  dPhi <- log(tan(endLat / 2 + pi / 4) / tan(startLat / 2 + pi / 4));
  if (abs(dLon) > pi) {
if (dLon > 0) {
  dLon <- -(2 * pi - dLon);
} else {
  dLon <- (2 * pi + dLon);
}
  }
  bearing <- radians.to.degrees((atan2(dLon, dPhi) + 360 )) %% 360;
  return (bearing);
}


Anything more you need?






Re: Assign values to existing column in SparkR

2016-09-10 Thread Felix Cheung
If you want to set a column to 0 (essentially removing and replacing the existing 
one), you need to put a Column on the right-hand side:


> df <- as.DataFrame(iris)
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
> df$Petal_Length <- 0
Error: class(value) == "Column" || is.null(value) is not TRUE
> df$Petal_Length <- lit(0)
> head(df)
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5            0         0.2  setosa
2          4.9         3.0            0         0.2  setosa
3          4.7         3.2            0         0.2  setosa
4          4.6         3.1            0         0.2  setosa
5          5.0         3.6            0         0.2  setosa
6          5.4         3.9            0         0.4  setosa
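
An expression built from another column is also a Column, so derived assignments pass 
the same check (a sketch only; Sepal_Width is picked arbitrarily from the example above):

df$Petal_Length <- df$Sepal_Width * 0   # a Column expression, equivalent to lit(0) here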

_
From: Deepak Sharma >
Sent: Friday, September 9, 2016 12:29 PM
Subject: Re: Assign values to existing column in SparkR
To: xingye >
Cc: >


Data frames are immutable in nature, so I don't think you can directly assign 
or change values in a column.

Thanks
Deepak

On Fri, Sep 9, 2016 at 10:59 PM, xingye 
> wrote:

I have some questions about assigning values to a Spark dataframe. I want to 
assign values to an existing column of a Spark dataframe, but if I assign the 
value directly, I get the following error.

  1.  df$c_mon<-0
  2.  Error: class(value) == "Column" || is.null(value) is not TRUE

Is there a way to solve this?



--
Thanks
Deepak
www.bigdatabig.com
www.keosha.net




Re: SparkR API problem with subsetting distributed data frame

2016-09-10 Thread Felix Cheung
Could you include code snippets you are running?





On Sat, Sep 10, 2016 at 1:44 AM -0700, "Bene" 
> wrote:

Hi,

I am having a problem with the SparkR API. I need to subset a distributed
data frame so I can extract single values from it on which I can then do
calculations.

Each row of my df has two integer values; I am creating a vector of new
values calculated as a series of sin, cos, and tan functions of these two
values. Does anyone have an idea how to do this in SparkR?

So far I tried subsetting with [], [[]], subset(), but mostly I get the
error

object of type 'S4' is not subsettable

Is there any way to do such a thing in SparkR? Any help would be greatly
appreciated! Also let me know if you need more information, code etc.
Thanks!





