Maybe you could try "--conf spark.sql.statistics.fallBackToHdfs=true"
On 2019/05/11 01:54:27, V0lleyBallJunki3 wrote:
> Hello,
> I have set spark.sql.autoBroadcastJoinThreshold=1GB and I am running the
> spark job. However, my application is failing with:
>
> at
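For reference, a minimal sketch of applying both settings from this thread on a SparkSession; the 1 GB value mirrors the report above, and the app name is illustrative:

from pyspark.sql import SparkSession

# autoBroadcastJoinThreshold is in bytes; fallBackToHdfs lets Spark estimate
# table sizes from HDFS when catalog statistics are missing.
spark = (SparkSession.builder
         .appName("broadcast-join-test")  # illustrative name
         .config("spark.sql.autoBroadcastJoinThreshold", 1024 * 1024 * 1024)
         .config("spark.sql.statistics.fallBackToHdfs", "true")
         .getOrCreate())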
Answering my own question: it looks like this can be done by implementing a
SparkListener with the method
def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit
since the SparkListenerExecutorAdded event carries that info.
--
Nick
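If a listener is more than you need, a hedged alternative to the approach above is the Spark monitoring REST API, which exposes each executor's id and host:port. The sketch assumes the driver UI on its default port 4040; adjust host/port for your deployment:

import requests  # third-party HTTP client, assumed available

base = "http://localhost:4040/api/v1"  # driver UI host/port; adjust as needed
app_id = requests.get(base + "/applications").json()[0]["id"]
executors = requests.get("%s/applications/%s/executors" % (base, app_id)).json()
for ex in executors:
    print(ex["id"], ex["hostPort"])    # executor id and host:port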
I threw together a quick example that replicates what you see, then looked
at the physical plan:
from pyspark.sql.functions import *
from pyspark.sql.types import *
from pyspark.sql import Row
df = spark.createDataFrame([Row(list_names=['a', 'b', 'c', 'd'],
name=None), Row(list_names=['a', 'b',
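The snippet above is cut off in the archive; a runnable completion, with the second Row's values guessed for illustration, that also prints the physical plan mentioned above:

from pyspark.sql import Row

# assumes an active SparkSession named `spark`
df = spark.createDataFrame([
    Row(list_names=['a', 'b', 'c', 'd'], name=None),
    Row(list_names=['a', 'b'], name='a'),  # guessed; original message truncated
])
df.explain()  # prints the physical plan referred to above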
Kumar sp, collect() brings all the data represented by the RDD/DataFrame into
the memory of the single machine acting as the driver. You will run out of
memory if the underlying RDD/DataFrame represents a large volume of data
distributed across several machines.
If your data is huge even
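A sketch of the usual workaround (df and the "key" column are hypothetical): aggregate first so only a small result ever reaches the driver, or skip collect() entirely and join the aggregate back:

from pyspark.sql import functions as F

counts = df.groupBy("key").count()  # aggregation stays distributed

# Option 1: collect only the small aggregate, not the raw data
key_to_count = {row["key"]: row["count"] for row in counts.collect()}

# Option 2: never touch the driver; join the counts back instead
enriched = df.join(F.broadcast(counts), on="key", how="left")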
I have a use case where I am using collect().toMap (grouping by a certain
column, finding the count, and creating a map keyed on that column) and then
using that map to enable some further calculations.
I am getting out-of-memory errors. Is there any alternative to .collect() for
creating a structure like a Map or some
Hi,
I'm using Spark 2.3 and looking for a Java API to fetch the list of
executors. I need host and ID info for the executors.
Thanks for any pointers,
--
Nick
Hi All,
Is there a better way to drop duplicates and keep the first record based on a
sorted column?
Simply sorting the DataFrame and then dropping duplicates is quite slow!
--
Regards,
Rishi Shah
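One commonly suggested alternative for the question above is a window with row_number(), which keeps the first row per key without a global sort followed by dropDuplicates (the "key" and "ts" column names are hypothetical):

from pyspark.sql import functions as F
from pyspark.sql.window import Window

w = Window.partitionBy("key").orderBy(F.col("ts").asc())
first_rows = (df.withColumn("rn", F.row_number().over(w))
                .filter(F.col("rn") == 1)
                .drop("rn"))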
Hi All,
I am using the "or" operator "|" in a withColumn clause on a DataFrame in
pyspark. However, it looks like it always evaluates all the conditions
regardless of the first condition being true. Please find a sample below:
contains = udf(lambda s, arr : s in arr, BooleanType())
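For what it's worth, Catalyst makes no short-circuit guarantee for "|" or "&" the way Python's own "or"/"and" have. The common workaround is to guard the UDF with when()/otherwise(), sketched here with hypothetical column names; even this is not a hard guarantee for UDFs in every Spark version:

from pyspark.sql import functions as F

df = df.withColumn(
    "flag",
    F.when(F.col("arr").isNull(), F.lit(False))       # cheap check first
     .otherwise(contains(F.col("s"), F.col("arr"))))  # UDF only in the fallback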
> Where exactly would I see the start/end offset values per batch, is that
> in the spark logs?
Yes, it's in the Spark logs. Please see this:
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reading-metrics-interactively
On Mon, May 13, 2019 at 10:53 AM Austin
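Besides the logs, the same offsets are exposed programmatically on the query handle; a minimal sketch (the console sink is illustrative):

q = df.writeStream.format("console").start()  # any running streaming query

progress = q.lastProgress     # dict for the most recent micro-batch, or None
if progress:
    for src in progress["sources"]:
        print(src["startOffset"], src["endOffset"])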
Hi Akshay,
Thanks very much for the reply!
1) The topics have 12 partitions (both input and output)
2-3) I read that "trigger" is used for micro-batching, but if you would like
the stream to truly process as a "stream" as quickly as possible, should you
leave it unset? In any case, I am using
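On 2-3: if no trigger is specified, Structured Streaming starts the next micro-batch as soon as the previous one finishes, so leaving it unset already gives the fastest cadence. A sketch of both variants (console sink is illustrative):

# Default: micro-batches run back-to-back as fast as they complete
q1 = df.writeStream.format("console").start()

# Explicit trigger: fire a micro-batch at most every 10 seconds
q2 = (df.writeStream.format("console")
        .trigger(processingTime="10 seconds")
        .start())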
Hello,
I tested the Spark Fair Scheduler and found that it did not behave as I
expected.
According to the Spark docs, the Fair Scheduler assigns tasks in a
round-robin fashion:
https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application
In my understanding, if there are 2
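For anyone reproducing the test, a minimal fair-scheduler setup within one application, following the doc linked above (the pool name is illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.scheduler.mode", "FAIR")
         .getOrCreate())

# Jobs submitted from this thread are scheduled in the named pool
spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")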
We use Spark 2.3.1; this version uses hive-jdbc-1.2.1.spark2.jar to connect
to the Spark Thriftserver.
Our Hive version is 2.3.3, and we use hive-jdbc-2.3.3.jar to connect to the
Hive server.
We want to reduce the number of client jar files, so we use
hive-jdbc-2.3.3.jar as the client jar, for hive