Re: Spark SQL in R?

2019-06-08 Thread Felix Cheung
I don’t think you should get a hive-site.xml from the internet.

It should contain connection information for a running Hive metastore. If you
don’t have a Hive metastore service because you are running locally (from a
laptop?), then you don’t really need it; Spark can work with its own.
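
For example, a minimal sketch (in Scala here, though the same idea applies when
the session is started from SparkR or sparklyr): without a hive-site.xml on the
classpath, a local session with Hive support falls back to an embedded Derby
metastore created in the working directory.

import org.apache.spark.sql.SparkSession

object LocalHiveExample {
  def main(args: Array[String]): Unit = {
    // No hive-site.xml needed: Spark creates its own embedded Derby metastore
    // (a metastore_db directory) plus a local spark-warehouse directory.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("local-hive-example")
      .enableHiveSupport()
      .getOrCreate()

    spark.sql("CREATE DATABASE IF NOT EXISTS learnsql")
    spark.sql("SHOW DATABASES").show()

    spark.stop()
  }
}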




From: ya 
Sent: Friday, June 7, 2019 8:26:27 PM
To: Rishikesh Gawade; felixcheun...@hotmail.com; user@spark.apache.org
Subject: Spark SQL in R?

Dear Felix, Rishikesh, and list,

Thank you very much for your previous help. So far I have tried two ways to 
trigger Spark SQL: one is to use R with the sparklyr and SparkR libraries; the 
other is to use the SparkR shell that ships with Spark. I am not connecting to 
a remote Spark cluster, but to a local one. Both attempts failed, with or 
without hive-site.xml. I suspect the hive-site.xml content I found online is 
not appropriate for this case, since the Spark session cannot be initialized 
after adding it. My questions are:

1. Is there an example of the hive-site.xml content for this case?

2. I used the sql() function to call Spark SQL; is this the right way to do it?

###
##Here is the content of the hive-site.xml:##
###



<configuration>
  <property>
    <name>javax.jdo.option.ConnectionURL</name>
    <value>jdbc:mysql://192.168.76.100:3306/hive?createDatabaseIfNotExist=true</value>
    <description>JDBC connect string for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionDriverName</name>
    <value>com.mysql.jdbc.Driver</value>
    <description>Driver class name for a JDBC metastore</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionUserName</name>
    <value>root</value>
    <description>username to use against metastore database</description>
  </property>
  <property>
    <name>javax.jdo.option.ConnectionPassword</name>
    <value>123</value>
    <description>password to use against metastore database</description>
  </property>
</configuration>






##Here is what happened in R:##


> library(sparklyr) # load sparklyr package
> sc=spark_connect(master="local",spark_home="/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7")
>  # connect sparklyr with spark
> sql('create database learnsql')
Error in sql("create database learnsql") : could not find function "sql"
> library(SparkR)

Attaching package: ‘SparkR’

The following object is masked from ‘package:sparklyr’:

collect

The following objects are masked from ‘package:stats’:

cov, filter, lag, na.omit, predict, sd, var, window

The following objects are masked from ‘package:base’:

as.data.frame, colnames, colnames<-, drop, endsWith, intersect, rank, rbind,
sample, startsWith, subset, summary, transform, union

> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized
> Sys.setenv(SPARK_HOME='/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7')
> sparkR.session(sparkHome=Sys.getenv('/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7'))
Spark not found in SPARK_HOME:
Spark package found in SPARK_HOME: 
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7
Launching java with spark-submit command 
/Users/ya/Downloads/soft/spark-2.4.3-bin-hadoop2.7/bin/spark-submit   
sparkr-shell 
/var/folders/d8/7j6xswf92c3gmhwy_lrk63pmgn/T//Rtmpz22kK9/backend_port103d4cfcfd2c
19/06/08 11:14:57 WARN NativeCodeLoader: Unable to load native-hadoop library 
for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use 
setLogLevel(newLevel).
Error in handleErrors(returnStatus, conn) :

…… hundreds of lines of log output and errors here ……

> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized



###
##Here is what happened in the SparkR shell:##


Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Error while instantiating 
'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at 
org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1107)
at 
org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:145)
at 
org.apache.spark.sql.SparkSession$$anonfun$sessionState$2.apply(SparkSession.scala:144)
at scala.Option.getOrElse(Option.scala:121)
at 
org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:144)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:141)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:80)
at 
org.apache.spark.sql.api.r.SQLUtils$$anonfun$setSparkContextSessionConf$2.apply(SQLUtils.scala:79)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.Iterator$class.foreach(Iterator.sca
> sql('create database learnsql')
Error in getSparkSession() : SparkSession not initialized



Thank you very much.

YA







On June 8, 2019, at 1:44 AM, Rishikesh Gawade 

[ANNOUNCE] Apache Bahir 2.3.3 Released

2019-06-08 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic
platforms, extending their reach with a diversity of streaming
connectors and SQL data sources.
The Apache Bahir community is pleased to announce the release of
Apache Bahir 2.3.3 which provides the following extensions for Apache
Spark 2.3.3:

   - Apache CouchDB/Cloudant SQL data source
   - Apache CouchDB/Cloudant Streaming connector
   - Akka Streaming connector
   - Akka Structured Streaming data source
   - Google Cloud Pub/Sub Streaming connector
   - Cloud PubNub Streaming connector (new)
   - MQTT Streaming connector
   - MQTT Structured Streaming data source (new sink)
   - Twitter Streaming connector
   - ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the
latest release go to:

https://bahir.apache.org

For more details on how to use Apache Bahir extensions in your
application please visit our documentation page

   https://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/




[ANNOUNCE] Apache Bahir 2.2.3 Released

2019-06-08 Thread Luciano Resende
Apache Bahir provides extensions to multiple distributed analytic
platforms, extending their reach with a diversity of streaming
connectors and SQL data sources.
The Apache Bahir community is pleased to announce the release of
Apache Bahir 2.2.3 which provides the following extensions for Apache
Spark 2.2.3:

   - Apache CouchDB/Cloudant SQL data source
   - Apache CouchDB/Cloudant Streaming connector
   - Akka Streaming connector
   - Akka Structured Streaming data source
   - Google Cloud Pub/Sub Streaming connector
   - Cloud PubNub Streaming connector (new)
   - MQTT Streaming connector
   - MQTT Structured Streaming data source (new sink)
   - Twitter Streaming connector
   - ZeroMQ Streaming connector (new enhanced implementation)

For more information about Apache Bahir and to download the
latest release go to:

https://bahir.apache.org

For more details on how to use Apache Bahir extensions in your
application please visit our documentation page

   https://bahir.apache.org/docs/spark/overview/

The Apache Bahir PMC

-- 
Luciano Resende
http://people.apache.org/~lresende
http://twitter.com/lresende1975
http://lresende.blogspot.com/




Re: Spark 2.2 With Column usage

2019-06-08 Thread anbutech
Thanks Jacek Laskowski sir, but I didn't get the point here.

Please advise: is the below what you are expecting?

dataset1.as("t1")
  .join(dataset3.as("t2"),
    col("t1.col1") === col("t2.col1"), JOINTYPE.Inner)
  .join(dataset4.as("t3"),
    col("t3.col1") === col("t1.col1"), JOINTYPE.Inner)
  .select(col("id"), lit(referenceFiltered))
  .selectExpr(
    "id"
  )



--
Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/




Re: Spark logging questions

2019-06-08 Thread Jacek Laskowski
Hi,

What are "the spark driver and executor threads information" and "spark
application logging"?

Spark uses log4j, so set up the logging levels appropriately and you should be
done.
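
For example, from inside the application the driver's log level can be raised
at runtime (a minimal sketch; how executor log4j.properties files are shipped
depends on your deployment):

import org.apache.spark.sql.SparkSession

object LogLevelExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("log-level-example").getOrCreate()

    // Raises the driver-side log4j root logger to DEBUG; executor logging is
    // normally controlled by the log4j.properties distributed with the job.
    spark.sparkContext.setLogLevel("DEBUG")

    // ... application code ...

    spark.stop()
  }
}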

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Fri, Jun 7, 2019 at 1:13 PM test test  wrote:

> Hello,
>
> How can we dump the spark driver and executor threads information in spark
> application logging.?
>
>
> PS: submitting spark job using spark submit
>
> Regards
> Rohit
>


Re: Spark 2.2 With Column usage

2019-06-08 Thread Jacek Laskowski
Hi,

> val referenceFiltered = dataset2.filter(.dataDate == date).filter.someColumn).select("id").toString
> .withColumn("new_column",lit(referenceFiltered))

That won't work since lit is a function (adapter) to convert Scala values
to Catalyst expressions.

Unless I'm mistaken, in your case, what you really need is to replace
`withColumn` with `select("id")` itself and you're done.

As I write this, I realize I'm describing exactly what you already have, and
I'm feeling confused about what is actually missing.
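
For example, a rough sketch of what I have in mind, with made-up data (not your
actual datasets) and the constant "124567" you mention below:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, lit}

object LitExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("lit-example").getOrCreate()
    import spark.implicits._

    val df = Seq((1L, "a"), (2L, "b")).toDF("id", "column1")

    // lit() turns a plain Scala value into a Catalyst literal, so the same
    // constant ends up in every row of the new column.
    val tagged = df.withColumn("new_column", lit("124567"))

    // If only the id (and the constant) are needed, a plain select is enough.
    tagged.select(col("id"), col("new_column")).show()

    spark.stop()
  }
}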

Pozdrawiam,
Jacek Laskowski

https://about.me/JacekLaskowski
The Internals of Spark SQL https://bit.ly/spark-sql-internals
The Internals of Spark Structured Streaming
https://bit.ly/spark-structured-streaming
The Internals of Apache Kafka https://bit.ly/apache-kafka-internals
Follow me at https://twitter.com/jaceklaskowski



On Sat, Jun 8, 2019 at 6:05 AM anbutech  wrote:

> Hi Sir,
>
> Could you please advise to fix the below issue in the withColumn in the
> spark 2.2 scala 2.11 joins
>
> def processing(spark:SparkSession,
>
> dataset1:Dataset[Reference],
>
> dataset2:Dataset[DataCore],
>
> dataset3:Dataset[ThirdPartyData] ,
>
> dataset4:Dataset[OtherData]
>
> date:String):Dataset[DataMerge] {
>
> val referenceFiltered = dataset2.filter(.dataDate ==
> date).filter.someColumn).select("id").toString
>
> dataset1.as("t1)
>
> join(dataset3.as("t2"),
>
> col(t1.col1) === col(t2.col1), JOINTYPE.Inner )
>
> .join(dataset4.as("t3"), col(t3.col1) === col(t1.col1),
>
> JOINTYPE.Inner)
>
> .withColumn("new_column",lit(referenceFiltered))
>
> .selectexpr(
>
> "id", ---> want to get this value
>
> "column1,
>
> "column2,
>
> "column3",
>
> "column4" )
>
> }
>
> how do i get the String value ,let say the value"124567"
> ("referenceFiltered") inside the withColumn?
>
> im getting the withColumn output as "id:BigInt" . I want to get the same
> value for all the records.
>
> Note:
>
> I have asked not use cross join in the code. Any other way to fix this
> issue.
>
>
>
> --
> Sent from: http://apache-spark-user-list.1001560.n3.nabble.com/
>
> -
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org
>
>