Re: Incremental Value dependents on another column of Data frame Spark

2023-05-24 Thread Enrico Minack
Hi, given your dataset: val df=Seq( (1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"), (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"), (7, 20230523, "M01"), (8, 20230523, "M01"), (9, 20230523, "M02"), (10, 20230523, "M02"), (11, 20230523, "M02"), (12, 20230523
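A minimal PySpark sketch of one common reading of this requirement (an assumption, not necessarily the original ask: a counter that increments every time the value in the third column changes when rows are ordered by the first column):

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame(
        [(1, 20230523, "M01"), (2, 20230523, "M01"), (3, 20230523, "M01"),
         (4, 20230523, "M02"), (5, 20230523, "M02"), (6, 20230523, "M02"),
         (7, 20230523, "M01"), (8, 20230523, "M01"), (9, 20230523, "M02")],
        ["id", "dt", "code"])

    # Note: a global orderBy window pulls all rows into a single partition;
    # for large data a partition column should be added.
    w = Window.orderBy("id")
    # Flag is 1 whenever the code differs from the previous row, 0 otherwise;
    # the running sum of that flag yields the incremental value.
    changed = F.lag("code").over(w).isNull() | (F.lag("code").over(w) != F.col("code"))
    df.withColumn("counter", F.sum(F.when(changed, 1).otherwise(0)).over(w)).show()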

Re: Incremental Value dependents on another column of Data frame Spark

2023-05-23 Thread Raghavendra Ganesh
Given, you are already stating the above can be imagined as a partition, I can think of mapPartitions iterator. val inputSchema = inputDf.schema val outputRdd = inputDf.rdd.mapPartitions(rows => new SomeClass(rows)) val outputDf = sparkSession.createDataFrame(outputRdd, inputSchema.add("coun
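A PySpark rendering of the same mapPartitions idea (a sketch under assumed names: input_df, a "code" grouping column, and an existing SparkSession named spark):

    from pyspark.sql.types import IntegerType

    def add_counter(rows):
        # The counter restarts for every partition, which is fine if each group
        # is already confined to a single partition, as stated above.
        prev, counter = None, 0
        for r in rows:
            if r["code"] != prev:
                counter += 1
                prev = r["code"]
            yield list(r) + [counter]

    out_schema = input_df.schema.add("counter", IntegerType())
    output_df = spark.createDataFrame(input_df.rdd.mapPartitions(add_counter), out_schema)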

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread ashok34...@yahoo.com.INVALID
Thanks great Rauf. Regards On Tuesday, 23 May 2023 at 13:18:55 BST, Rauf Khan wrote: Hi, PartitionBy() is analogous to group by: all rows that will have the same value in the specified column will form one window. The data will be shuffled to form groups. Regards Raouf On Fri, May 12,

Re: Shuffle with Window().partitionBy()

2023-05-23 Thread Rauf Khan
Hi, PartitionBy() is analogous to group by: all rows that will have the same value in the specified column will form one window. The data will be shuffled to form groups. Regards Raouf On Fri, May 12, 2023, 18:48 ashok34...@yahoo.com.INVALID wrote: > Hello, > > In Spark windowing does call wi
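A small sketch of what that looks like in practice (df, "dept" and "salary" are placeholder names): the physical plan for a partitioned window contains an Exchange, i.e. a shuffle, just like a groupBy on the same column would.

    from pyspark.sql import Window, functions as F

    w = Window.partitionBy("dept").orderBy("salary")
    ranked = df.withColumn("rank", F.rank().over(w))
    ranked.explain()  # look for Exchange hashpartitioning(dept, ...) in the plan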

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
the basics here first > My thoughts: Spark replicates the partitions among multiple nodes. If one > executor fails, it moves the processing over to the other executor. > However, if the data is lost, it re-executes the processing that generated > the data, > and might have to go back to th

Re: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Mich Talebzadeh
Hi Maksym. Let us understand the basics here first. My thoughts: Spark replicates the partitions among multiple nodes. If one executor fails, it moves the processing over to the other executor. However, if the data is lost, it re-executes the processing that generated the data, and might have to go

RE: Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-22 Thread Maksym M
Hey vaquar, The link doesn't explain the crucial detail we're interested in - does the executor re-use the data that exists on a node from a previous executor and if not, how can we configure it to do so? We are not running on kubernetes, so EKS/Kubernetes-specific advice isn't very re

Re: Spark shuffle and inevitability of writing to Disk

2023-05-17 Thread Mich Talebzadeh
Ok, I did a bit of a test that shows that the shuffle does spill to memory then to disk if my assertion is valid. The sample code I wrote is as follows: import sys from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql import func
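A rough sketch of that kind of test (not the original code; sizes, paths and column names are arbitrary). Run it with a deliberately small executor memory, e.g. spark-submit --executor-memory 1g, then check the spill (memory/disk) metrics of the shuffle stage in the Spark UI:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("spill-test").getOrCreate()
    # A wide shuffle: 50M rows hashed into 1,000 groups.
    big = spark.range(0, 50_000_000).withColumn("key", F.col("id") % 1000)
    big.groupBy("key").agg(F.collect_list("id").alias("ids")) \
       .write.mode("overwrite").parquet("/tmp/spill_test")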

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-17 Thread vaquar khan
host is >> about to get evicted, a new host is created and the EBS volume is attached >> to it >> >> When Spark assigns a new executor to the newly created instance, it >> basically can recover all the shuffle files that are already persisted in >> the migrat

RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
Hi, For clarification, are those 12 / 14 minutes cumulative CPU time or wall clock time? How many executors executed those 1 / 375 tasks? Cheers, Enrico Original message From: Shashank Rao Date: 16.05.23 19:48 (GMT+01:00) To: user@spark.apache.org Subject: Understandi

Re: [spark-core] Can executors recover/reuse shuffle files upon failure?

2023-05-15 Thread Mich Talebzadeh
use is EBS migration which basically means if a host is > about to get evicted, a new host is created and the EBS volume is attached > to it > > When Spark assigns a new executor to the newly created instance, it > basically can recover all the shuffle files that are already persisted

Re: Error while merge in delta table

2023-05-12 Thread Farhan Misarwala
Hi Karthick, If you have confirmed that the incompatibility between Delta and spark versions is not the case, then I would say the same what Jacek said earlier, there’s not enough “data” here. To further comment on it, we would need to know more on how you are structuring your multi threaded PySp

Re: Error while merge in delta table

2023-05-12 Thread Karthick Nk
Hi Farhan, Thank you for your response. I am using Databricks with 11.3.x-scala2.12. Here I am overwriting all the tables in the same database in concurrent threads, but when I do it in an iterative manner it works fine. For example, I have 200 tables in the same database and I am overwriting the

Re: Error while merge in delta table

2023-05-11 Thread Farhan Misarwala
Hi Karthick, I think I have seen this before and this probably could be because of an incompatibility between your spark and delta versions. Or an incompatibility between the delta version you are using now vs the one you used earlier on the existing table you are merging with. Let me know if th

Re: Error while merge in delta table

2023-05-11 Thread Jacek Laskowski
Hi Karthick, Sorry to say it but there's not enough "data" to help you. There should be something more above or below this exception snippet you posted that could pinpoint the root cause. Regards, Jacek Laskowski "The Internals Of" Online Books Follow me on http

RE: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-10 Thread Vijay B
Please see if this works -- aggregate array into map of element of count SELECT aggregate(array(1,2,3,4,5), map('cnt',0), (acc,x) -> map('cnt', acc.cnt+1)) as array_count thanks Vijay On 2023/05/05 19:32:04 Yong Zhang wrote: > Hi, This is on Spark 3.1 environment. > > For some reason, I can
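For reference, a per-element-count sketch in plain Spark SQL (assuming Spark 3.x higher-order functions and an existing SparkSession named spark; quadratic in the array length, but it avoids explode):

    counts = spark.sql("""
        SELECT map_from_entries(
                 transform(
                   array_distinct(arr),
                   x -> struct(x, size(filter(arr, e -> e = x)))
                 )
               ) AS elem_counts
        FROM (SELECT array('a', 'b', 'a', 'c', 'b', 'a') AS arr)
    """)
    counts.show(truncate=False)  # {a -> 3, b -> 2, c -> 1}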

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
what I am trying to understand. The >>>>> golden rule "The DAG overlaps won't run several times for one action" seems >>>>> not to be apocryphal. If you can shed some light on this matter I would >>>>> appreciate it >>>>

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
ne for the value; otherwise, start the element with the count of 1. Of course, the above code won't work in Spark SQL. * As I said, I am NOT running in either a Scala or PySpark session, but in pure Spark SQL. * Is it possible to do the above logic in Spark SQL, without using "explo

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Nitin Siwach
do not happen again. >>>>>> >>>>>> However, In my case here I am calling just one action. Within the >>>>>> purview of one action Spark should not rerun the overlapping parts of the >>>>>> DAG. I do not understand why the fi

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-09 Thread Yong Zhang
alue; otherwise, start the element with a count of 0. Of course, the above code won't work in Spark SQL. * As I said, I am NOT running in either a Scala or PySpark session, but in pure Spark SQL. * Is it possible to do the above logic in Spark SQL, without using "exploding"?

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-09 Thread Mich Talebzadeh
laimed. The author will in no case be liable for any monetary damages >>>>>> arising from such loss, damage or destruction. >>>>>> >>>>>> >>>>>> >>>>>> >>>>>> On Sun, 7 May 2023 at 14:13, Nitin

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
reciate you guys helping me out with this :) >>>> >>>> On Sun, May 7, 2023 at 12:23 PM Winston Lai >>>> wrote: >>>> >>>>> When your memory is not sufficient to keep the cached data for your >>>>> jobs in two different sta

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
a spill >>>> may be triggered when Spark writes your data from memory to disk. >>>> >>>> One way to check is to read the Spark UI. When Spark caches the data, you >>>> will see a little green dot connected to the blue rectangle in the Spark >>>> UI. If you see this

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
, likely Spark spill >>> the data after your first job and read it again in the second run. You can >>> also confirm it in other metrics from Spark UI. >>> >>> That is my personal understanding based on what I have read and seen on >>> my job runs. If there is any mistake, be free to correct me.

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Mich Talebzadeh
and read it again in the second run. You can >> also confirm it in other metrics from Spark UI. >> >> That is my personal understanding based on what I have read and seen on >> my job runs. If there is any mistake, be free to correct me. >> >> Thank You &

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-07 Thread Nitin Siwach
> *From:* Nitin Siwach > *Sent:* Sunday, May 7, 2023 12:22:32 PM > *To:* Vikas Kumar > *Cc:* User > *Subject:* Re: Does spark read the same file twice, if two stages are > using the same DataFrame? > > Thank you tons, Vikas :). That makes so much sens

Re: Does spark read the same file twice, if two stages are using the same DataFrame?

2023-05-06 Thread Winston Lai
M To: Vikas Kumar Cc: User Subject: Re: Does spark read the same file twice, if two stages are using the same DataFrame? Thank you tons, Vikas :). That makes so much sense now I'm in learning phase and was just browsing through various concepts of spark with self made small examples. It di

Re: Can Spark SQL (not DataFrame or Dataset) aggregate array into map of element of count?

2023-05-06 Thread Mich Talebzadeh
you can create a DF from your SQL result set and work with that in Python the way you want. ## you don't need all these: import findspark findspark.init() from pyspark.sql import SparkSession from pyspark import SparkContext from pyspark.sql import SQLContext from pyspark.sql.functions import udf, col, curren

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-06 Thread Mich Talebzadeh
So what are you intending to do with the resultset produced? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-05 Thread Marco Costantini
Hi Mich, Thank you. Ah, I want to avoid bringing all data to the driver node. That is my understanding of what will happen in that case. Perhaps, I'll trigger a Lambda to rename/combine the files after PySpark writes them. Cheers, Marco. On Thu, May 4, 2023 at 5:25 PM Mich Talebzadeh wrote: >

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
you can try
df2.coalesce(1).write.mode("overwrite").json("/tmp/pairs.json")
hdfs dfs -ls /tmp/pairs.json
Found 2 items
-rw-r--r--   3 hduser supergroup          0 2023-05-04 22:21 /tmp/pairs.json/_SUCCESS
-rw-r--r--   3 hduser supergroup         96 2023-05-04 22:21 /tmp/pairs.json/part-0-21f1

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Marco Costantini
Hi Mich, Thank you. Are you saying this satisfies my requirement? On the other hand, I am smelling something going on. Perhaps the Spark 'part' files should not be thought of as files, but rather pieces of a conceptual file. If that is true, then your approach (of which I'm well aware) makes sense

Re: Write DataFrame with Partition and choose Filename in PySpark

2023-05-04 Thread Mich Talebzadeh
AWS S3, or Google gs are hadoop compatible file systems (HCFS) , so they do sharding to improve read performance when writing to HCFS file systems. Let us take your code for a drive import findspark findspark.init() from pyspark.sql import SparkSession from pyspark.sql.functions import struct fro

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Marco Costantini
Hi Enrico, What a great answer. Thank you. Seems like I need to get comfortable with the 'struct' and then I will be golden. Thank you again, friend. Marco. On Thu, May 4, 2023 at 3:00 AM Enrico Minack wrote: > Hi, > > You could rearrange the DataFrame so that writing the DataFrame as-is > prod

Re: Write custom JSON from DataFrame in PySpark

2023-05-04 Thread Enrico Minack
Hi, You could rearrange the DataFrame so that writing the DataFrame as-is produces your structure:
df = spark.createDataFrame([(1, "a1"), (2, "a2"), (3, "a3")], "id int, datA string")
+---+----+
| id|datA|
+---+----+
|  1|  a1|
|  2|  a2|
|  3|  a3|
+---+----+
df2 = df.select(df.id, struct(df.d
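A hedged sketch of where that rearrangement is typically heading (not necessarily the original reply's continuation): nest the payload columns in a struct so that writing the frame as-is yields the desired JSON shape.

    from pyspark.sql.functions import struct

    df2 = df.select(df.id, struct(df.datA).alias("payload"))
    df2.write.mode("overwrite").json("/tmp/custom_json")
    # each record looks like: {"id":1,"payload":{"datA":"a1"}}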

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-05-02 Thread Trường Trần Phan An
Hi all, I have written a program and overridden two events onStageCompleted and onTaskEnd. However, these two events do not provide information on when a Task/Stage is completed. What I want to know is which Task corresponds to which stage of a DAG (the Spark history server only tells me how many

Re: Change column values using several when conditions

2023-05-01 Thread Bjørn Jørgensen
you can check if the value exists by using distinct before you loop over the dataset. On Mon, 1 May 2023 at 10:38, marc nicole wrote: > Hello > > I want to change values of a column in a dataset according to a mapping > list that maps original values of that column to other new values. Each > elem
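One map-expression way to do the remapping without looping (a sketch; df, the column name "col" and the mapping are hypothetical, and values missing from the mapping are kept unchanged):

    from itertools import chain
    from pyspark.sql import functions as F

    mapping = {"a": "x", "b": "y"}  # hypothetical old -> new values
    mapping_expr = F.create_map([F.lit(v) for v in chain(*mapping.items())])
    df2 = df.withColumn("col", F.coalesce(mapping_expr[F.col("col")], F.col("col")))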

Re: Tensorflow on Spark CPU

2023-04-30 Thread Sean Owen
_co...@yahoo.com> wrote: > I re-test with cifar10 example and below is the result . can advice why > lesser num_slot is faster compared with more slots? > > num_slots=20 > > 231 seconds > > > num_slots=5 > > 52 seconds > > > num_slot=1 > > 34 seco

Re: Tensorflow on Spark CPU

2023-04-29 Thread second_co...@yahoo.com.INVALID
I re-tested with the cifar10 example and below is the result. Can you advise why a lower num_slots is faster compared with more slots? num_slots=20: 231 seconds num_slots=5: 52 seconds num_slot=1: 34 seconds The code is at https://gist.github.com/cometta/240bbc549155e22f80f6ba670c9a2e32 Do you

Re: Tensorflow on Spark CPU

2023-04-29 Thread Sean Owen
You don't want to use CPUs with Tensorflow. If it's not scaling, you may have a problem that is far too small to distribute. On Sat, Apr 29, 2023 at 7:30 AM second_co...@yahoo.com.INVALID wrote: > Anyone successfully run native tensorflow on Spark ? i tested example at > https://github.com/tenso

Re: ***pyspark.sql.functions.monotonically_increasing_id()***

2023-04-28 Thread Winston Lai
Hi Karthick, A few points that may help you: As stated in the URL you posted, "The function is non-deterministic because its result depends on partition IDs." Hence, the generated ID is dependent on partition IDs. Based on the code snippet you provided, I didn't see the partition columns you sel
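A quick way to see that dependence (a minimal sketch assuming an existing SparkSession named spark): the generated value encodes the partition ID in its upper bits, so IDs jump between partitions.

    from pyspark.sql import functions as F

    df = spark.range(6).repartition(3)
    df.select(
        F.spark_partition_id().alias("pid"),
        F.monotonically_increasing_id().alias("gen_id"),
    ).show()
    # gen_id is consecutive within a partition but jumps by a large offset across partitions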

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Abhishek Singla
Thanks, Mich, for acknowledging. Yes, I am providing the checkpoint path; I omitted it here in the code snippet. I believe this is due to Spark version 3.1.x - this config is there only in versions greater than 3.2.x On Thu, Apr 27, 2023 at 9:26 PM Mich Talebzadeh wrote: > Is this all of your wr

Re: config: minOffsetsPerTrigger not working

2023-04-27 Thread Mich Talebzadeh
Is this all of your writeStream? df.writeStream() .foreachBatch(new KafkaS3PipelineImplementation(applicationId, appConfig)) .start() .awaitTermination(); What happened to the checkpoint location? option('checkpointLocation', checkpoint_path). example checkpoint_path = "file:///ss
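A minimal PySpark sketch of where those options go (broker, topic, paths and the callback are placeholders; per the thread, minOffsetsPerTrigger is only honoured on newer Spark versions):

    stream_df = (spark.readStream.format("kafka")
                 .option("kafka.bootstrap.servers", "broker:9092")
                 .option("subscribe", "events")
                 .option("minOffsetsPerTrigger", 100000)
                 .load())

    (stream_df.writeStream
              .foreachBatch(process_batch)  # hypothetical callback(batch_df, batch_id)
              .option("checkpointLocation", "s3a://my-bucket/checkpoints/app1")
              .start()
              .awaitTermination())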

Re: What is the best way to organize a join within a foreach?

2023-04-27 Thread Amit Joshi
Hi Marco, I am not sure if you will get access to the data frame inside the foreach, as the Spark context used to be non-serializable, if I remember correctly. One thing you can do: use the cogroup operation on both datasets. This will help you have (key, iter(v1), iter(v2)). And then use for each partition f

RE: Spark Kubernetes Operator

2023-04-26 Thread Aldo Culquicondor
We are welcoming contributors, as announced in the Kubernetes WG Batch https://docs.google.com/document/d/1XOeUN-K0aKmJJNq7H07r74n-mGgSFyiEDQ3ecwsGhec/edit#bookmark=id.gfgjt0nmbgjl If you are interested, you can find us in slack.k8s.io #wg-batch or ping @mwielgus on github/slack. Thanks On 2023/

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Again, one try is worth many opinions. Try it, gather metrics from the Spark UI and see how it performs. On Wed, 26 Apr 2023 at 14:57, Marco Costantini < marco.costant...@rocketfncl.com> wrote: > Thanks team, > Email was just an example. The point was to illustrate that some actions > could be chain

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Marco Costantini
Thanks team, Email was just an example. The point was to illustrate that some actions could be chained using Spark's foreach. In reality, this is an S3 write and a Kafka message production, which I think is quite reasonable for spark to do. To answer Ayan's first question: yes, all of a user's orders,

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Indeed very valid points by Ayan. How is email going to handle 1000s of records? As a solution architect I tend to replace users by customers, and for each order there must be products, a sort of many-to-many relationship. If I was a customer I would also be interested in product details as well. sendi

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread ayan guha
Adding to what Mich said, 1. Are you trying to send statements of all orders to all users? Or the latest order only? 2. Sending email is not a good use of spark. Instead, I suggest to use a notification service or function. Spark should write to a queue (kafka, sqs...pick your choice here). Bes

Re: What is the best way to organize a join within a foreach?

2023-04-26 Thread Mich Talebzadeh
Well OK, in a nutshell you want the result set for every user prepared and emailed to that user, right? This is a form of ETL where those result sets need to be posted somewhere. Say you create a table based on the result set prepared for each user. You may have many raw target tables at the end of th

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Hi Mich, First, thank you for that. Great effort put into helping. Second, I don't think this tackles the technical challenge here. I understand the windowing as it serves those ranks you created, but I don't see how the ranks contribute to the solution. Third, the core of the challenge is about p

Re: unsubscribe

2023-04-25 Thread santhosh Gandhe
To remove your address from the list, send a message to: On Mon, Apr 24, 2023 at 10:41 PM wrote: > unsubscribe

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, First thoughts. foreach() is an action operation that iterates/loops over each element in the dataset, meaning it is cursor based. That is different from operating over the dataset as a set, which is far more efficient. So in your case, if I understand it correctly, you want to get order f

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, Great idea. I have done it. Those files are attached. I'm interested to know your thoughts. Let's imagine this same structure, but with huge amounts of data as well. Please and thank you, Marco. On Tue, Apr 25, 2023 at 12:12 PM Mich Talebzadeh wrote: > Hi Marco, > > Let us start s

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Hi Marco, Let us start simple, Provide a csv file of 5 rows for the users table. Each row has a unique user_id and one or two other columns like fictitious email etc. Also for each user_id, provide 10 rows of orders table, meaning that orders table has 5 x 10 rows for each user_id. both as comm

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Marco Costantini
Thanks Mich, I have not but I will certainly read up on this today. To your point that all of the essential data is in the 'orders' table; I agree! That distills the problem nicely. Yet, I still have some questions on which someone may be able to shed some light. 1) If my 'orders' table is very l

Re: What is the best way to organize a join within a foreach?

2023-04-25 Thread Mich Talebzadeh
Have you thought of using windowing functions to achieve this? Effectively all your information is in the orders table. HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United King

Re: Use Spark Aggregator in PySpark

2023-04-24 Thread Enrico Minack
Hi, For an aggregating UDF, use spark.udf.registerJavaUDAF(name, className). Enrico Am 23.04.23 um 23:42 schrieb Thomas Wang: Hi Spark Community, I have implemented a custom Spark Aggregator (a subclass to |org.apache.spark.sql.expressions.Aggregator|). Now I'm trying to use it in a PySpa
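A sketch of that suggestion from the PySpark side (class, table and column names are placeholders; the JAR containing the class must be on the classpath, e.g. via --jars):

    spark.udf.registerJavaUDAF("my_agg", "com.example.MyAggregator")
    spark.sql("SELECT group_id, my_agg(value) FROM my_table GROUP BY group_id").show()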

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Thomas Wang
Thanks Raghavendra, Could you be more specific about how I can use ExpressionEncoder()? More specifically, how can I conform to the return type of Encoder>? Thomas On Sun, Apr 23, 2023 at 9:42 AM Raghavendra Ganesh wrote: > For simple array types setting encoder to ExpressionEncoder() should w

Re: Spark Aggregator with ARRAY input and ARRAY output

2023-04-23 Thread Raghavendra Ganesh
For simple array types setting encoder to ExpressionEncoder() should work. -- Raghavendra On Sun, Apr 23, 2023 at 9:20 PM Thomas Wang wrote: > Hi Spark Community, > > I'm trying to implement a custom Spark Aggregator (a subclass to > org.apache.spark.sql.expressions.Aggregator). Correct me if I

Re: Partition by on dataframe causing a Sort

2023-04-20 Thread Nikhil Goyal
Is it possible to use MultipleOutputs and define a custom OutputFormat and then use `saveAsHadoopFile` to be able to achieve this? On Thu, Apr 20, 2023 at 1:29 PM Nikhil Goyal wrote: > Hi folks, > > We are writing a dataframe and doing a partitionby() on it. > df.write.partitionBy('col').parquet

Re: [Spark on SBT] Executor just keeps running

2023-04-18 Thread Dhruv Singla
You can reproduce the behavior in ordinary Scala code if you keep reduce in an object outside the main method. Hope it might help. On Mon, Apr 17, 2023 at 10:22 PM Dhruv Singla wrote: > Hi Team > I was trying to run spark using `sbt console` on the terminal. I am > able to build the projec

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
Thanks Elliot ! Let me check it out ! On Mon, 17 Apr, 2023, 10:08 pm Elliot West, wrote: > Hi Ankit, > > While not a part of Spark, there is a project called 'WaggleDance' that > can federate multiple Hive metastores so that they are accessible via a > single URI: https://github.com/ExpediaGroup

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Cheng Pan
There is a DSv2-based Hive connector in Apache Kyuubi[1] that supports connecting multiple HMS in a single Spark application. Some limitations - currently only supports Spark 3.3 - has a known issue when using w/ `spark-sql`, but OK w/ spark-shell and normal jar-based Spark application. [1] http

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Elliot West
Hi Ankit, While not a part of Spark, there is a project called 'WaggleDance' that can federate multiple Hive metastores so that they are accessible via a single URI: https://github.com/ExpediaGroup/waggle-dance This may be useful or perhaps serve as inspiration. Thanks, Elliot. On Mon, 17 Apr

Re: Spark Multiple Hive Metastore Catalog Support

2023-04-17 Thread Ankit Gupta
++ User Mailing List Just a reminder, anyone who can help on this. Thanks a lot ! Ankit Prakash Gupta On Wed, Apr 12, 2023 at 8:22 AM Ankit Gupta wrote: > Hi All > > The question is regarding the support of multiple Remote Hive Metastore > catalogs with Spark. Starting Spark 3, multiple catal

Re: Non string type partitions

2023-04-15 Thread Bjørn Jørgensen
I guess that it has to do with indexing and partitioning data to nodes. Have a look at data partitioning system design concept and key range partitions

Re: Non string type partitions

2023-04-15 Thread Charles vinodh
bumping this up again for suggestions?.. Is the official recommendation to not have *int* or *date* typed partition columns? On Wed, 12 Apr 2023 at 10:44, Charles vinodh wrote: > There are other distributed execution engines (like hive, trino) that do > support non-string data types for partiti

Re: Spark Kubernetes Operator

2023-04-14 Thread Yuval Itzchakov
I'm not running on GKE. I am wondering what's the long term strategy around a Spark operator. Operators are the de-facto way to run complex deployments. The Flink community now has an official community led operator, and I was wondering if there are any similar plans for Spark. On Fri, Apr 14, 202

Re: Spark Kubernetes Operator

2023-04-14 Thread Mich Talebzadeh
Hi, What exactly are you trying to achieve? Spark on GKE works fine and you can run Dataproc now on GKE https://www.linkedin.com/pulse/running-google-dataproc-kubernetes-engine-gke-spark-mich/?trackingId=lz12GC5dRFasLiaJm5qDSw%3D%3D Unless I misunderstood your point. HTH Mich Talebzadeh, Lead S

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-14 Thread Mich Talebzadeh
OK I managed to load the zipped Python file and the runner .py file onto s3 for AWS EKS to work. It is a bit of a nightmare compared to the same on the Google SDK, which is simpler. Anyhow you will require additional jar files to be added to $SPARK_HOME/jars. These two files will be picked up after you build

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-14 Thread Jacek Laskowski
Hi, Start with intercepting stage completions using SparkListenerStageCompleted [1]. That's Spark Core (jobs, stages and tasks). Go up the execution chain to Spark SQL with SparkListenerSQLExecutionStart [2] and SparkListenerSQLExecutionEnd [3], and correlate infos. You may want to look at how w

Re: How to create spark udf use functioncatalog?

2023-04-14 Thread Jacek Laskowski
Hi, I'm not sure I understand the question, but if your question is how to register (plug-in) your own custom FunctionCatalog, it's through spark.sql.catalog configuration property, e.g. spark.sql.catalog.catalog-name=com.example.YourCatalogClass spark.sql.catalog registers a CatalogPlugin that
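A minimal sketch of that configuration (the catalog name and class are placeholders):

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .config("spark.sql.catalog.my_catalog", "com.example.YourCatalogClass")
             .getOrCreate())
    # Spark instantiates the class and routes catalog lookups for "my_catalog" to it.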

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-13 Thread Trường Trần Phan An
Hi, Can you give me more details or give me a tutorial on "You'd have to intercept execution events and correlate them. Not an easy task yet doable"? Thanks. On Wed, Apr 12, 2023 at 21:04 Jacek Laskowski wrote: > Hi, > > tl;dr it's not possible to "reverse-engineer" tasks to function

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
Not sure I follow. If my output is my/path/output then the spark metadata will be written to my/path/output/_spark_metadata. All my data will also be stored under my/path/output so there's no way to split it? On Thu, Apr 13, 2023 at 1:14 PM "Yuri Oleynikov (יורי אולייניקוב)" < yur...@gmail.

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Yeah but can’t you use the following? 1. For data files: My/path/part- 2. For partitioned data: my/path/partition= Best regards On 13 Apr 2023, at 12:58, Yuval Itzchakov wrote: The problem is that specifying two lifecycle policies for the same path, the one with the shorter retention wins :( https://docs.

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuval Itzchakov
The problem is that when specifying two lifecycle policies for the same path, the one with the shorter retention wins :( https://docs.aws.amazon.com/AmazonS3/latest/userguide/lifecycle-configuration-examples.html#lifecycle-config-conceptual-ex4 "You might specify an S3 Lifecycle configuration in which

Re: _spark_metadata path issue with S3 lifecycle policy

2023-04-13 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
My naïve assumption is that specifying a lifecycle policy for _spark_metadata with a longer retention will solve the issue Best regards > On 13 Apr 2023, at 11:52, Yuval Itzchakov wrote: > > > Hi everyone, > > I am using Spark's FileStreamSink in order to write files to S3. On the S3 > bucket, I

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Maytas Monsereenusorn
Hi, I was wondering if it's not possible to determine tasks to functions, is it still possible to easily figure out which job and stage completed which part of the query from the UI? For example, in the SQL tab of the Spark UI, I am able to see the query and the Job IDs for that query. However, wh

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Mich Talebzadeh
Thanks! I will have a look. Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies Limited London United Kingdom view my Linkedin profile https://en.everybodywiki.com/Mich_Talebzadeh *Disclaimer:* Use i

Re: Accessing python runner file in AWS EKS kubernetes cluster as in local://

2023-04-12 Thread Bjørn Jørgensen
Yes, it looks inside the docker container's folders. It will work if you are using s3 or gs. On Wed, Apr 12, 2023, 18:02 Mich Talebzadeh wrote: > Hi, > > In my spark-submit to eks cluster, I use the standard code to submit to > the cluster as below: > > spark-submit --verbose \ >--master k8s://$

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Mich Talebzadeh
outdated as k8s is evolving. So if anyone is interested, > please support the project. > > -- > Lingzhe Sun > Hirain Technologies > > > *From:* Mich Talebzadeh > *Date:* 2023-04-11 02:06 > *To:* Rajesh Katkar > *CC:* user > *Subject

Re: How to determine the function of tasks on each stage in an Apache Spark application?

2023-04-12 Thread Jacek Laskowski
Hi, tl;dr it's not possible to "reverse-engineer" tasks to functions. In essence, Spark SQL is an abstraction layer over RDD API that's made up of partitions and tasks. Tasks are Scala functions (possibly with some Python for PySpark). A simple-looking high-level operator like DataFrame.join can

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread 孙令哲
Hi Rajesh, It's working fine, at least for now. But you'll need to build your own spark image using later versions. Lingzhe Sun Hirain Technologies Original: From: Rajesh Katkar Date: 2023-04-12 21:36:52 To: Lingzhe Sun Cc: Mich Talebzadeh, user Subject: Re: Re: spark str

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Yi Huang
has >> been inactive for quite long time. Kind of worried that this project might >> finally become outdated as k8s is evolving. So if anyone is interested, >> please support the project. >> >> -- >> Lingzhe Sun >> Hirain Techno

Re: Re: spark streaming and kinesis integration

2023-04-12 Thread Rajesh Katkar
f anyone is interested, > please support the project. > > -- > Lingzhe Sun > Hirain Technologies > > > *From:* Mich Talebzadeh > *Date:* 2023-04-11 02:06 > *To:* Rajesh Katkar > *CC:* user > *Subject:* Re: spark streaming and kinesis in

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-12 Thread Jacek Laskowski
Hi, You could use QueryExecutionListener or Spark listeners to intercept query execution events and extract whatever is required. That's what the web UI does (as it's simply a bunch of SparkListeners --> https://youtu.be/mVP9sZ6K__Y ;-)). Regards, Jacek Laskowski "The Internals Of" Online Boo

Re: Non string type partitions

2023-04-12 Thread Charles vinodh
There are other distributed execution engines (like hive, trino) that do support non-string data types for partition columns such as date and integer. Any idea why this restriction exists in Spark? .. On Tue, 11 Apr 2023 at 20:34, Chitral Verma wrote: > Because the name of the directory cannot

Re: Re: spark streaming and kinesis integration

2023-04-11 Thread Lingzhe Sun
active for quite long time. Kind of worried that this project might finally become outdated as k8s is evolving. So if anyone is interested, please support the project. Lingzhe Sun Hirain Technologies From: Mich Talebzadeh Date: 2023-04-11 02:06 To: Rajesh Katkar CC: user Subject: Re: spark str

Re: [SparkSQL, SparkUI, RESTAPI] How to extract the WholeStageCodeGen ids from SparkUI

2023-04-11 Thread Chitral Verma
try explain codegen on your DF and then parse the string On Fri, 7 Apr, 2023, 3:53 pm Chenghao Lyu, wrote: > Hi, > > The detailed stage page shows the involved WholeStageCodegen Ids in its > DAG visualization from the Spark UI when running a SparkSQL. (e.g., under > the link > node:18088/histor

Re: Non string type partitions

2023-04-11 Thread Chitral Verma
Because the name of the directory cannot be an object, it has to be a string to create partitioned dirs like "date=2023-04-10" On Tue, 11 Apr, 2023, 8:27 pm Charles vinodh, wrote: > > Hi Team, > > We are running into the below error when we are trying to run a simple > query a partitioned table
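A tiny illustration (df and the column name are placeholders): whatever the column's type, the partition value is rendered as text in the directory name.

    df.write.partitionBy("date").parquet("/tmp/events")
    # produces directories like /tmp/events/date=2023-04-10/part-...parquet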

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
Just to clarify, a major benefit of k8s in this case is to host your Spark applications in the form of containers in an automated fashion so that one can easily deploy as many instances of the application as required (autoscaling). From below: https://price2meet.com/gcp/docs/dataproc_docs_concepts

Re: spark streaming and kinesis integration

2023-04-10 Thread Mich Talebzadeh
What I said was this "In so far as I know k8s does not support spark structured streaming?" So it is an open question. I just recalled it. I have not tested myself. I know structured streaming works on Google Dataproc cluster but I have not seen any official link that says Spark Structured Streami

Re: spark streaming and kinesis integration

2023-04-10 Thread Rajesh Katkar
Do you have any link or ticket which justifies that k8s does not support spark streaming ? On Thu, 6 Apr, 2023, 9:15 pm Mich Talebzadeh, wrote: > Do you have a high level diagram of the proposed solution? > > In so far as I know k8s does not support spark structured streaming? > > Mich Talebzade

Re: Troubleshooting ArrayIndexOutOfBoundsException in long running Spark application

2023-04-09 Thread Andrew Redd
remove On Wed, Apr 5, 2023 at 8:06 AM Mich Talebzadeh wrote: > OK Spark Structured Streaming. > > How are you getting messages into Spark? Is it Kafka? > > This to me indicates that the message is incomplete or has another value in > Json > > HTH > > Mich Talebzadeh, > Lead Solutions Architect/E

Re: spark streaming and kinesis integration

2023-04-06 Thread Rajesh Katkar
The use case is: we want to read/write to Kinesis streams using k8s. Officially I could not find a connector or reader for Kinesis from Spark like it has for Kafka. Checking here if anyone has used the Kinesis and Spark streaming combination? On Thu, 6 Apr, 2023, 7:23 pm Mich Talebzadeh, wrote: > Hi Raj

RE: spark streaming and kinesis integration

2023-04-06 Thread Jonske, Kurt
kar Cc: u...@spark.incubator.apache.org Subject: Re: spark streaming and kinesis integration ⚠ [EXTERNAL EMAIL]: Use Caution Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Do you have a high level diagram of the proposed solution? In so far as I know k8s does not support spark structured streaming? Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view my Linkedin profile

Re: spark streaming and kinesis integration

2023-04-06 Thread Mich Talebzadeh
Hi Rajesh, What is the use case for Kinesis here? I have not used it personally, Which use case it concerns https://aws.amazon.com/kinesis/ Can you use something else instead? HTH Mich Talebzadeh, Lead Solutions Architect/Engineering Lead Palantir Technologies London United Kingdom view m

Re: Potability of dockers built on different cloud platforms

2023-04-05 Thread Mich Talebzadeh
The whole idea of creating a docker container is to have a deployable, self-contained utility. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings. The
