Dependency management using HTTPS in Spark on Kubernetes

2020-05-12 Thread Pradeepta Choudhury
Hey guys, I have an external API from which I can download the main jar. When I do a spark-submit ...all confs... https:api.call.com/somefile.jar, it gives an error saying that the file already exists in the tmp directory and that the file content doesn't match. How can I fix this? Do I need to use an

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-12 Thread Jeff Evans
It sounds like you're expecting the XPath expression to evaluate embedded Spark SQL expressions? From the documentation, there appears to be no reason to expect that to work. On Tue, May 12, 2020 at 2:09 PM Chetan Khatri wrote: >
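To illustrate the point about embedded expressions: an XPath engine receives the path as a plain string and does not look up names from the surrounding query or program. The sketch below uses Python's standard-library `xml.etree.ElementTree` as a stand-in rather than Spark's `xpath_int`; the XML document and the `wanted_id` variable are invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical document, invented for this illustration.
root = ET.fromstring("<items><item id='7'>42</item><item id='8'>99</item></items>")

wanted_id = "7"

# The variable name written inside the path is just text to the XPath engine,
# so this looks for an item whose id is literally "wanted_id" and finds nothing.
literal_match = root.findall("./item[@id='wanted_id']")

# Substituting the value into the string before evaluation gives a path
# the engine can actually match.
built_path = "./item[@id='{}']".format(wanted_id)
substituted_match = root.findall(built_path)

values = [int(node.text) for node in substituted_match]
```

Whether the same substitution is possible in a given Spark SQL query depends on whether the value is known when the query text is built; that is an assumption of this sketch and not something the thread confirms.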

to_avro/from_avro inserts extra values from Kafka

2020-05-12 Thread Alex Nastetsky
Hi all, I create a dataframe, convert it to Avro with to_avro and write it to Kafka. Then I read it back out with from_avro. (Not using Schema Registry.) The problem is that the values skip every other field in the result. I expect: +-++-+---+ |firstName|lastName|color|

RE: [Spark SQL][reopen SPARK-16951]:Alternative implementation of NOT IN to Anti-join

2020-05-12 Thread Shuang, Linna1
Hi Talebzadeh, Thank you for your reply. The background is to use a common benchmark (here we use TPC-H) to compare different platforms’ performance. Our current solutions are: 1. remove Q16 from the test; 2. rewrite Q16 without using “NOT IN”. Neither solution is perfect. For solution 2, which is

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-12 Thread Chetan Khatri
Thank you for the clarification. What do you suggest to achieve this use case? On Tue, May 12, 2020 at 5:35 PM Jeff Evans wrote: > It sounds like you're expecting the XPath expression to evaluate embedded > Spark SQL expressions? From the documentation >

Re: dynamic executor scalling spark on kubernetes client mode

2020-05-12 Thread Steven Stetzler
Oh, thanks for mentioning that, it looks like dynamic allocation on Kubernetes works in client mode in Spark 3.0.0. I just had to set the following configurations: spark.dynamicAllocation.enabled=true spark.dynamicAllocation.shuffleTracking.enabled=true to enable dynamic allocation and disable
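The two settings named in this message can be collected and expanded into `--conf` arguments for `spark-shell` or `spark-submit`. The following is a minimal sketch in Python using only the standard library; the helper names `dynamic_allocation_conf` and `to_conf_args` are hypothetical, and anything else a real cluster needs (master URL, container image, and so on) is deliberately left out.

```python
from typing import Dict, List


def dynamic_allocation_conf() -> Dict[str, str]:
    """Return the two settings quoted in the message above (reported for Spark 3.0.0)."""
    return {
        "spark.dynamicAllocation.enabled": "true",
        "spark.dynamicAllocation.shuffleTracking.enabled": "true",
    }


def to_conf_args(conf: Dict[str, str]) -> List[str]:
    """Expand a settings dict into repeated `--conf key=value` arguments."""
    args: List[str] = []
    for key, value in conf.items():
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    # Prints the flags on one line, ready to paste after `spark-shell` or `spark-submit`.
    print(" ".join(to_conf_args(dynamic_allocation_conf())))
```

Keeping the settings in a dict makes it easy to add or override entries before the command line is assembled.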

Re: dynamic executor scalling spark on kubernetes client mode

2020-05-12 Thread Steven Stetzler
Hi all, I am interested in this as well. My use-case could benefit from dynamic executor scaling but we are restricted to using client mode since we are only using Spark shells. Could anyone help me understand the barriers to getting dynamic executor scaling to work in client mode on Kubernetes?

Re: [PySpark] Tagging descriptions

2020-05-12 Thread Rishi Shah
Thanks, ZHANG! Please find details below. Number of rows: ~25B. Row size: somewhere around 3-5 MB (the data is Parquet-formatted, so we only need to worry about the columns to be tagged). Average length of the text to be parsed: ~300. Unfortunately, I don't have sample data or a regex that I can share

Re: dynamic executor scalling spark on kubernetes client mode

2020-05-12 Thread Pradeepta Choudhury
Hey guys, I was able to run dynamic scaling in both cluster and client mode. I will document it and send it over this weekend. On Tue 12 May, 2020, 1:26 PM Roland Johann, wrote: > Hi all, > > don’t want to interrupt the conversation but are keen where I can find > information regarding dynamic

unsubscribe

2020-05-12 Thread Kiran B
Thank you, Kiran,

Re: [PySpark] Tagging descriptions

2020-05-12 Thread ZHANG Wei
May I get some requirement details? Such as: 1. The row count and one row data size 2. The avg length of text to be parsed by RegEx 3. The sample format of text to be parsed 4. The sample of current RegEx -- Cheers, -z On Mon, 11 May 2020 18:40:49 -0400 Rishi Shah wrote: > Hi All, > > I

Re: java.lang.OutOfMemoryError Spark Worker

2020-05-12 Thread Hrishikesh Mishra
Configuration: Driver memory we tried: 2GB / 4GB / 5GB. Executor memory we tried: 4GB / 5GB. We even reduced *spark.memory.fraction* to 0.2 (we are not using cache). VM memory: 32 GB and 8 cores. We tried for SPARK_WORKER_MEMORY: 30GB / 24GB; SPARK_WORKER_CORES: 32 (because jobs are not CPU bound)

Re: [Spark SQL][reopen SPARK-16951]:Alternative implementation of NOT IN to Anti-join

2020-05-12 Thread Mich Talebzadeh
Hi Linna, Please provide a background to it and your solution. The assumption is that there is a solution, as suggested. Thanks, Dr Mich Talebzadeh LinkedIn * https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

Re: GroupState limits

2020-05-12 Thread Srinivas V
If you are talking about the total number of objects the state can hold, that depends on the executor memory you have on your cluster, apart from the rest of the memory required for processing. The state is stored in HDFS and retrieved while processing the next events. If you maintain a million objects with

Re: XPATH_INT behavior - XML - Function in Spark

2020-05-12 Thread Chetan Khatri
Can someone please help? Thanks in advance. On Mon, May 11, 2020 at 5:29 PM Chetan Khatri wrote: > Hi Spark Users, > > I want to parse xml coming in the query columns and get the value, I am > using *xpath_int* which works as per my requirement but When I am > embedding in the Spark SQL query

Re: dynamic executor scalling spark on kubernetes client mode

2020-05-12 Thread Roland Johann
Hi all, I don’t want to interrupt the conversation, but I am keen to know where I can find information regarding dynamic allocation on Kubernetes. As far as I know, the docs just point to future work. Thanks a lot, Roland > On 12.05.2020 at 09:25, Steven Stetzler wrote: > > Hi all, > > I am