Re: spark.sql.autoBroadcastJoinThreshold not taking effect

2019-05-13 Thread Lantao Jin
Maybe you could try "--conf spark.sql.statistics.fallBackToHdfs=true". On 2019/05/11 01:54:27, V0lleyBallJunki3 wrote: > Hello, I have set spark.sql.autoBroadcastJoinThreshold=1GB and I am running the spark job. However, my application is failing with: at
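A rough sketch of the suggestion (Scala; the table name is a placeholder): spark.sql.statistics.fallBackToHdfs lets Spark estimate a table's size from its files on HDFS when catalog statistics are missing, so the broadcast threshold can actually match.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      // 1 GB threshold, expressed in bytes
      .config("spark.sql.autoBroadcastJoinThreshold", (1024L * 1024 * 1024).toString)
      // fall back to the HDFS file size when catalog stats are missing
      .config("spark.sql.statistics.fallBackToHdfs", "true")
      .getOrCreate()

    // Alternatively, compute catalog statistics up front so the size is known:
    spark.sql("ANALYZE TABLE small_table COMPUTE STATISTICS")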

Re: Getting List of Executor Id's

2019-05-13 Thread Afshartous, Nick
Answering my own question. Looks like this can be done by implementing SparkListener with the method def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit, as the SparkListenerExecutorAdded object has the info. -- Nick
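A minimal sketch of that listener (the class name and logging are illustrative):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerExecutorAdded}

    class ExecutorTracker extends SparkListener {
      override def onExecutorAdded(executorAdded: SparkListenerExecutorAdded): Unit = {
        // SparkListenerExecutorAdded carries both the executor id and its host
        val id   = executorAdded.executorId
        val host = executorAdded.executorInfo.executorHost
        println(s"Executor added: id=$id host=$host")
      }
    }

    // Register it on an existing SparkContext:
    sc.addSparkListener(new ExecutorTracker)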

Re: [Pyspark 2.3] Logical operators (and/or) in pyspark

2019-05-13 Thread Nicholas Hakobian
I threw together a quick example that replicates what you see, then looked at the physical plan: from pyspark.sql.functions import * from pyspark.sql.types import * from pyspark.sql import Row df = spark.createDataFrame([Row(list_names=['a', 'b', 'c', 'd'], name=None), Row(list_names=['a', 'b',
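The rest of the example is cut off above; a rough Scala sketch of the same experiment (the data here is reconstructed, not the original message's) shows the point: Spark does not guarantee short-circuit evaluation of "||" around a UDF, and explain() makes that visible in the physical plan.

    import org.apache.spark.sql.functions._
    import spark.implicits._

    val df = Seq(
      (Seq("a", "b", "c", "d"), null.asInstanceOf[String]),
      (Seq("a", "b"), "a")
    ).toDF("list_names", "name")

    val containsUdf = udf((s: String, arr: Seq[String]) => arr.contains(s))

    // Both operands of "||" show up in the plan: the UDF may still run for
    // rows where the left-hand condition is already true.
    df.withColumn("hit", isnull($"name") || containsUdf($"name", $"list_names"))
      .explain()

    // A null-safe built-in sidesteps the UDF entirely:
    df.withColumn("hit", expr("array_contains(list_names, name)")).explain()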

Re: Error using .collect()

2019-05-13 Thread Shahab Yunus
Kumar sp, collect() brings all the data represented by the RDD/DataFrame into the memory of the single machine acting as the driver. You will run out of memory if the underlying RDD/DataFrame represents a large volume of data distributed across several machines. If your data is huge even
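If the collected map is only used to enrich further calculations, one alternative (a rough Scala sketch; column names are placeholders) is to keep the counts as a DataFrame and join them back, instead of materializing a driver-side Map:

    import org.apache.spark.sql.functions.broadcast

    // Stays distributed -- no driver-side Map is built:
    val counts = df.groupBy("key").count()

    // broadcast() is only a hint; drop it if the counts are large.
    val enriched = df.join(broadcast(counts), Seq("key"), "left")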

Error using .collect()

2019-05-13 Thread Kumar sp
I have a use case where I am using collect().toMap (grouping by a certain column, computing a count, and creating a map keyed by that column) and use that map to enable some further calculations. I am getting out-of-memory errors; is there any alternative to .collect() for creating a structure like a Map or some

Getting List of Executor Id's

2019-05-13 Thread Afshartous, Nick
Hi, I am using Spark 2.3 and looking for a Java API to fetch the list of executors. I need host and id info for the executors. Thanks for any pointers, -- Nick

[pyspark 2.3] drop_duplicates, keep first record based on sorted records

2019-05-13 Thread Rishi Shah
Hi All, Is there a better way to drop duplicates and keep the first record based on a sorted column? Simple sorting on the dataframe and then dropping duplicates is quite slow! -- Regards, Rishi Shah
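One common approach (a rough Scala sketch; column names are placeholders) is a window with row_number, which shuffles and sorts per key rather than globally sorting the whole dataframe:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions.row_number
    import spark.implicits._

    // First row per key, by descending sort column:
    val w = Window.partitionBy("key").orderBy($"sort_col".desc)

    val deduped = df
      .withColumn("rn", row_number().over(w))
      .filter($"rn" === 1)
      .drop("rn")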

[Pyspark 2.3] Logical operators (and/or) in pyspark

2019-05-13 Thread Rishi Shah
Hi All, I am using the "or" operator "|" in a withColumn clause on a DataFrame in pyspark. However, it looks like it always evaluates all the conditions regardless of the first condition being true. Please find a sample below: contains = udf(lambda s, arr : s in arr, BooleanType())

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-13 Thread Gabor Somogyi
> Where exactly would I see the start/end offset values per batch, is that in the spark logs? Yes, it's in the Spark logs. Please see this: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reading-metrics-interactively On Mon, May 13, 2019 at 10:53 AM Austin
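From the linked section, the idea is roughly this (Scala sketch; the query setup is assumed):

    // Each StreamingQueryProgress includes, per source, the startOffset and
    // endOffset of the most recent micro-batch (printed as JSON):
    println(query.lastProgress)

    // The last few progress updates are kept in memory as well:
    query.recentProgress.foreach(println)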

Re: Structured Streaming Kafka - Weird behavior with performance and logs

2019-05-13 Thread Austin Weaver
Hi Akshay, Thanks very much for the reply! 1) The topics have 12 partitions (both input and output). 2-3) I read that "trigger" is used for micro-batching, but if you would like the stream to truly process as a "stream" as quickly as possible, should you leave it unset? In any case, I am using
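On the trigger point, as I read the docs: leaving the trigger unset means each micro-batch starts as soon as the previous one finishes, while a processing-time trigger fixes the cadence. A rough Scala sketch (sink and options are placeholders):

    import org.apache.spark.sql.streaming.Trigger

    // Default (no trigger): the next micro-batch starts as soon as the
    // previous one completes.
    val q1 = df.writeStream.format("console").start()

    // Fixed cadence: one micro-batch every 30 seconds.
    val q2 = df.writeStream
      .trigger(Trigger.ProcessingTime("30 seconds"))
      .format("console")
      .start()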

Spark Fair Scheduler does not work correctly

2019-05-13 Thread Yuta Morisawa
Hello, I tested the Spark Fair Scheduler and found that it did not work as expected. According to the Spark docs, the Fair Scheduler assigns tasks in a round-robin fashion. https://spark.apache.org/docs/latest/job-scheduling.html#scheduling-within-an-application In my understanding, if there are 2
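For reference, fair scheduling has two easy-to-miss requirements: jobs only round-robin when submitted concurrently from separate threads, and each pool itself schedules FIFO unless its schedulingMode is set to FAIR in the allocation file. A rough Scala sketch (the path and pool name are placeholders):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .config("spark.scheduler.mode", "FAIR")
      .config("spark.scheduler.allocation.file", "/path/to/fairscheduler.xml")
      .getOrCreate()

    // Jobs submitted from this thread go into "pool1"; submit concurrent
    // jobs from other threads (and pools) to see round-robin scheduling.
    spark.sparkContext.setLocalProperty("spark.scheduler.pool", "pool1")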

spark jar version problem

2019-05-13 Thread wangyongqiang0...@163.com
We use Spark version 2.3.1, which uses hive-jdbc-1.2.1.spark2.jar to connect to the Spark Thriftserver. Our Hive version is 2.3.3, and we use hive-jdbc-2.3.3.jar to connect to the Hive server. We want to reduce the number of client jar files, so we use hive-jdbc-2.3.3.jar as the client jar; for hive
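Whether the newer driver can talk to the older Thriftserver comes down to Thrift protocol compatibility, which is easiest to verify with a minimal JDBC client (Scala sketch; host, port, and credentials are placeholders):

    import java.sql.DriverManager

    Class.forName("org.apache.hive.jdbc.HiveDriver")
    val conn = DriverManager.getConnection(
      "jdbc:hive2://thriftserver-host:10000/default", "user", "")
    val rs = conn.createStatement().executeQuery("SELECT 1")
    while (rs.next()) println(rs.getInt(1))
    conn.close()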