Hello Everyone,
I am trying to set up a yarn cluster with three nodes (one master and two
workers).
I followed this tutorial :
https://linode.com/docs/databases/hadoop/how-to-install-and-set-up-hadoop-cluster/
I also tried to run the YARN example at the end of that tutorial with the
wordcount job.
Hi there,
I'm trying to run Spark on EKS. I created an EKS cluster, added nodes, and am
now trying to submit a Spark job from an EC2 instance.
I ran the following commands to set up access:
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role --clusterrole=admin
--serviceaccount=defaul
Hi Marcelo,
Maybe spark.sql.functions.explode gives what you need?
// Bruno
> Le 6 juin 2019 à 16:02, Marcelo Valle a écrit :
>
> Generating the city id (child) is easy, monotonically increasing id worked
> for me.
>
> The problem is the country (parent) which has to be in both countrie
Hello,
I'm running the Thrift server with PostgreSQL persistence for the Hive
metastore. I'm using Postgres 9.6 and Spark 2.4.3 in this environment.
When I start the Thrift server I get lots of errors while creating the schema,
and it happens every time I reach Postgres, like:
19/06/06 15:51:59 WARN Datastore
Hi Magnus, Thanks for replying.
To be honest, I didn't quite get the partition solution, but indeed, I was
trying to figure out a way of solving this with data frames only, without
rejoining.
I can't have a global list of countries in my real scenario, as the real
scenario is not reference data; countries was just an exa
Hi all,
We are facing a challenge where a simple use case seems non-trivial to
implement in structured streaming: an aggregation should be calculated,
and then some other aggregations should further aggregate on top of the
first aggregation. Something like:
1st aggregation: val df = dfIn.groupBy(a,b,c,d).
Generating the city id (child) is easy, monotonically increasing id worked
for me.
The problem is the country (parent) which has to be in both countries and
cities data frames.
On Thu, 6 Jun 2019 at 14:57, Magnus Nilsson wrote:
> Well, you could do a repartition on cityname/nrOfCities and use
Well, you could do a repartition on cityname/nrOfCities and use the
spark_partition_id function or the mapPartitionsWithIndex dataframe method
to add a city id column. Then just split the dataframe into two subsets. Be
careful of hash collisions on the repartition key though, or more than one
city mi
Great!
Thanks a lot.
Best,
Pablo.
Akshay,
First of all, thanks for the answer. I *am* using monotonically increasing
id, but that's not my problem.
My problem is that I want to output 2 tables from 1 data frame: 1 parent
table with an id for the group by, and 1 child table carrying the parent id
without the group by.
I was able to solve this
Hi,
This has been fixed here: https://github.com/apache/spark/pull/23546. It will
be available in Spark 3.0.0.
Best,
Stavros
On Wed, Jun 5, 2019 at 11:18 PM pacuna wrote:
> I'm trying to run a sample code that reads a file from s3 so I need the aws
> sdk and aws hadoop dependencies.
> If I assem
Additionally, there is a "uuid" function available as well, if that helps
your use case.
Akshay Bhardwaj
+91-97111-33849
On Thu, Jun 6, 2019 at 3:18 PM Akshay Bhardwaj <
akshay.bhardwaj1...@gmail.com> wrote:
> Hi Marcelo,
>
> If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use this
>
Hi Marcelo,
If you are using Spark 2.3+ and the Dataset API/Spark SQL, you can use the
built-in function "monotonically_increasing_id" in Spark.
A little tweaking with Spark SQL built-in functions can enable you to
achieve this without having to write code or define RDDs with map/reduce
functions.
Aksh