Re: 7368396 - Apache Spark 3.5.1 (Support)

2024-06-07 Thread Sadha Chilukoori
Hi Alex, Spark is an open source software available under Apache License 2.0 ( https://www.apache.org/licenses/), further details can be found here in the FAQ page (https://spark.apache.org/faq.html). Hope this helps. Thanks, Sadha On Thu, Jun 6, 2024, 1:32 PM SANTOS SOUZA, ALEX wrote: >

Re: Kubernetes cluster: change log4j configuration using uploaded `--files`

2024-06-06 Thread Mich Talebzadeh
The issue you are encountering is due to the order of operations when Spark initializes the JVM for driver and executor pods. The JVM options (-Dlog4j2.configurationFile) are evaluated when the JVM starts, but the --files option copies the files after the JVM has already started. Hence, the log4j
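A minimal sketch of the kind of workaround this implies, assuming the log4j2 file is already present inside the driver/executor images (e.g. baked into the image or mounted via a ConfigMap at a hypothetical /opt/spark/conf/my-log4j.xml), so the path is valid at JVM startup:

```python
# Hedged sketch, not the poster's exact setup: the config file is assumed to already
# exist in the pod filesystem when the JVM starts, instead of relying on --files.
from pyspark.sql import SparkSession

LOG4J_PATH = "file:/opt/spark/conf/my-log4j.xml"  # hypothetical mount point

spark = (
    SparkSession.builder
    .appName("log4j-on-k8s")
    # JVM flags are evaluated at JVM startup, so they must point at a path that
    # exists then, not at a file that --files will only copy later.
    .config("spark.driver.extraJavaOptions", f"-Dlog4j2.configurationFile={LOG4J_PATH}")
    .config("spark.executor.extraJavaOptions", f"-Dlog4j2.configurationFile={LOG4J_PATH}")
    .getOrCreate()
)
```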

7368396 - Apache Spark 3.5.1 (Support)

2024-06-06 Thread SANTOS SOUZA, ALEX
Hey guys! I am part of the team responsible for software approval at EMBRAER S.A. We are currently in the process of approving the Apache Spark 3.5.1 software and are verifying the licensing of the application. Therefore, I would like to kindly request you to answer the questions below. -What

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Also, can I take my lower bound starting from 1, or is it the index? On Thu, Jun 6, 2024 at 8:42 PM Perez wrote: > Thanks again Mich. It gives the clear picture but I have again couple of > doubts: > > 1) I know that there will be multiple threads that will be executed with > 10 segment sizes each

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks again Mich. It gives a clear picture but I again have a couple of doubts: 1) I know that there will be multiple threads that will be executed with 10 segment sizes each until the upper bound is reached, but I didn't get this part of the code exactly: segments = [(i, min(i + segment_size,
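For what it's worth, a hedged reconstruction of how that segments list is typically built (illustrative bounds, not the code from the thread):

```python
# Build (start, end) ranges that each task/thread can load independently.
lower_bound = 1      # illustrative
upper_bound = 100    # illustrative
segment_size = 10

segments = [
    (i, min(i + segment_size, upper_bound))
    for i in range(lower_bound, upper_bound, segment_size)
]
# [(1, 11), (11, 21), ..., (91, 100)] -- the last segment is clipped to the upper bound
print(segments)
```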

Re: [SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-06 Thread Someshwar Kale
As a fix, you may consider adding a transformer to rename columns (perhaps replace all columns with dot to underscore) and use the renamed columns in your pipeline as below- val renameColumn = new RenameColumn().setInputCol("location.longitude").setOutputCol("location_longitude") val si = new
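A rough PySpark sketch of the same idea (the original is Scala): flatten/rename the dotted fields to underscore names before handing them to an MLlib stage. The nested `location` schema is assumed from the thread.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.appName("rename-dotted-cols").getOrCreate()
df = spark.createDataFrame(
    [(1, (10.0, 20.0))], "id int, location struct<longitude:double, latitude:double>"
)

# Flatten the nested fields into top-level columns without dots in their names.
flat = df.select(
    "id",
    F.col("location.longitude").alias("location_longitude"),
    F.col("location.latitude").alias("location_latitude"),
)

assembler = VectorAssembler(
    inputCols=["location_longitude", "location_latitude"], outputCol="features"
)
assembler.transform(flat).show(truncate=False)
```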

Kubernetes cluster: change log4j configuration using uploaded `--files`

2024-06-06 Thread Jennifer Wirth
Hi, I am trying to change the log4j configuration for jobs submitted to a k8s cluster (using submit) The my-log4j.xml is uploaded using --files ,./my-log4j.xml and the file in the working directory of the driver/exec pods. I added D-flags using the extra java options (and tried many different

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
well you can dynamically determine the upper bound by first querying the database to find the maximum value of the partition column and use it as the upper bound for your partitioning logic. def get_max_value(spark, mongo_config, column_name): max_value_df =
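The preview cuts off mid-function; a hedged sketch of the idea against a generic JDBC source (not the poster's Glue/MongoDB setup) might look like:

```python
def get_max_value(spark, jdbc_url, table, column_name, props):
    # Push the aggregation down to the database via a subquery "table", then use the
    # result as upperBound for the partitioned read. Names here are placeholders.
    subquery = f"(SELECT MAX({column_name}) AS max_value FROM {table}) AS max_t"
    return (
        spark.read.jdbc(url=jdbc_url, table=subquery, properties=props)
        .collect()[0]["max_value"]
    )

# Example: upper = get_max_value(spark, "jdbc:postgresql://dbhost:5432/mydb",
#                                "public.orders", "order_id",
#                                {"user": "app", "password": "secret"})
```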

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Perez
Thanks, Mich, for your response. However, I have multiple doubts as below: 1) I am trying to load the data for the incremental batch, so I am not sure what my upper bound would be. So what can we do? 2) As each thread loads the desired segment size's data into a dataframe, if I want to aggregate

Re: Do we need partitioning while loading data from JDBC sources?

2024-06-06 Thread Mich Talebzadeh
Yes, partitioning and parallel loading can significantly improve the performance of data extraction from JDBC sources or databases like MongoDB. This approach can leverage Spark's distributed computing capabilities, allowing you to load data in parallel, thus speeding up the overall data loading
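A minimal sketch of such a partitioned JDBC load, with placeholder connection details:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partitioned-jdbc-load").getOrCreate()

df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:postgresql://dbhost:5432/mydb")   # placeholder
    .option("dbtable", "public.orders")                    # placeholder
    .option("user", "app").option("password", "secret")    # placeholders
    .option("partitionColumn", "order_id")  # must be numeric, date, or timestamp
    .option("lowerBound", "1")
    .option("upperBound", "1000000")         # e.g. the max value queried beforehand
    .option("numPartitions", "10")           # 10 parallel queries, one per segment
    .load()
)
df.write.mode("overwrite").parquet("s3a://my-bucket/orders/")  # placeholder sink
```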

[SPARK-48423] Unable to save ML Pipeline to azure blob storage

2024-06-05 Thread Chhavi Bansal
Hello team, I was exploring how to save an ML pipeline to Azure blob storage, but was set back by an issue where it complains of `fs.azure.account.key` not being found in the configuration even when I have provided the values in the pipelineModel.option(key1,value1) field. I considered raising a

[SPARK-48463] Mllib Feature transformer failing with nested dataset (Dot notation)

2024-06-05 Thread Chhavi Bansal
Hello team, I was exploring feature transformations exposed via MLlib on a nested dataset, and encountered an error while applying any transformer to a column with dot-notation naming. I thought of raising a ticket on Spark https://issues.apache.org/jira/browse/SPARK-48463, where I have mentioned the

Re: Terabytes data processing via Glue

2024-06-05 Thread Perez
Thanks Nitin and Russel for your responses. Much appreciated. On Mon, Jun 3, 2024 at 9:47 PM Russell Jurney wrote: > You could use either Glue or Spark for your job. Use what you’re more > comfortable with. > > Thanks, > Russell Jurney @rjurney >

Do we need partitioning while loading data from JDBC sources?

2024-06-05 Thread Perez
Hello experts, I was just wondering if I could leverage the below thing to expedite the loading of the data process in Spark. def extract_data_from_mongodb(mongo_config): df = glueContext.create_dynamic_frame.from_options( connection_type="mongodb", connection_options=mongo_config ) return df

Inquiry Regarding Security Compliance of Apache Spark Docker Image

2024-06-05 Thread Tonmoy Sagar
Dear Apache Team, I hope this email finds you well. We are a team from Ernst and Young LLP - India, dedicated to providing innovative supply chain solutions for a diverse range of clients. Our team recently encountered a pivotal use case necessitating the utilization of PySpark for a project

Re: Classification request

2024-06-04 Thread Dirk-Willem van Gulik
Actually - that answer may oversimplify things / be rather incorrect depending on the exact question of the entity that asks and the exact situation (who ships what code from where). For this reason it is probably best to refer the original poster to:

Re: Classification request

2024-06-04 Thread Artemis User
Sara, Apache Spark is open source under Apache License 2.0 (https://github.com/apache/spark/blob/master/LICENSE).  It is not under export control of any country!  Please feel free to use, reproduce and distribute, as long as your practice is compliant with the license. Having said that, some

Classification request

2024-06-04 Thread VARGA, Sara
Hello, my name is Sara and I am working in export control at MTU. In order to ensure export compliance, we would like to be informed about the export control classification for your items: Apache Spark 3.0.1 Apache Spark 3.2.1 Please be so kind and use the attached Supplier Export Control

[ANNOUNCE] Announcing Apache Spark 4.0.0-preview1

2024-06-03 Thread Wenchen Fan
Hi all, To enable wide-scale community testing of the upcoming Spark 4.0 release, the Apache Spark community has posted a preview release of Spark 4.0. This preview is not a stable release in terms of either API or functionality, but it is meant to give the community early access to try the code

Re: Terabytes data processing via Glue

2024-06-03 Thread Russell Jurney
You could use either Glue or Spark for your job. Use what you’re more comfortable with. Thanks, Russell Jurney @rjurney russell.jur...@gmail.com LI FB datasyndrome.com On Sun, Jun 2, 2024 at 9:59 PM

Re: Terabytes data processing via Glue

2024-06-02 Thread Perez
Hello, Can I get some suggestions? On Sat, Jun 1, 2024 at 1:18 PM Perez wrote: > Hi Team, > > I am planning to load and process around 2 TB historical data. For that > purpose I was planning to go ahead with Glue. > > So is it ok if I use glue if I calculate my DPUs needed correctly? or >

[ANNOUNCE] Apache Kyuubi released 1.9.1

2024-06-02 Thread Cheng Pan
Hi all, The Apache Kyuubi community is pleased to announce that Apache Kyuubi 1.9.1 has been released! This release brings support for Apache Spark 4.0.0-preview1. Apache Kyuubi is a distributed and multi-tenant gateway to provide serverless SQL on data warehouses and lakehouses. Kyuubi

Terabytes data processing via Glue

2024-06-01 Thread Perez
Hi Team, I am planning to load and process around 2 TB of historical data. For that purpose I was planning to go ahead with Glue. So is it OK to use Glue if I calculate the DPUs needed correctly, or should I go with EMR? This will be a one-time activity. TIA

[apache-spark][spark-dataframe] DataFrameWriter.partitionBy does not guarantee previous sort result

2024-05-31 Thread leeyc0
I have a dataset that has the following schema: (timestamp, partitionKey, logValue) I want to have the dataset sorted by timestamp, but written to files in the following directory layout: outputDir/partitionKey/files The output file only contains logValue, that is, timestamp is used for sorting
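One commonly suggested workaround (a sketch under the schema described above, not a guaranteed contract): repartition by the partition key and sort within partitions right before the write, so each task writes whole keys and each per-key file keeps its rows in timestamp order.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sorted-partitioned-write").getOrCreate()
df = spark.read.parquet("s3a://my-bucket/input/")  # placeholder source with (timestamp, partitionKey, logValue)

(
    df.repartition("partitionKey")                        # each task holds complete partition keys
      .sortWithinPartitions("partitionKey", "timestamp")  # rows for a key arrive in timestamp order
      .select("partitionKey", "logValue")                 # only logValue ends up in the file contents
      .write.partitionBy("partitionKey")
      .mode("overwrite")
      .parquet("outputDir")
)
```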

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Amin Mosayyebzadeh
I am reading from a single file: df = spark.read.text("s3a://test-bucket/testfile.csv") On Fri, May 31, 2024 at 5:26 AM Mich Talebzadeh wrote: > Tell Spark to read from a single file > > data = spark.read.text("s3a://test-bucket/testfile.csv") > > This clarifies to Spark that you are dealing

Re: [s3a] Spark is not reading s3 object content

2024-05-31 Thread Mich Talebzadeh
Tell Spark to read from a single file data = spark.read.text("s3a://test-bucket/testfile.csv") This clarifies to Spark that you are dealing with a single file and avoids any bucket-like interpretation. HTH Mich Talebzadeh, Technologist | Architect | Data Engineer | Generative AI | FinCrime

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-30 Thread Subhasis Mukherjee
Regarding making the Spark writer faster: if you are (or can be) on Databricks, check this out. It is just out of the oven at Databricks. https://www.databricks.com/blog/announcing-general-availability-liquid-clustering

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
I will work on the first two possible causes. For the third one, which I guess is the real problem, Spark treats the testfile.csv object with the url s3a://test-bucket/testfile.csv as a bucket to access _spark_metadata with url s3a://test-bucket/testfile.csv/_spark_metadata testfile.csv is an

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
ok some observations - Spark job successfully lists the S3 bucket containing testfile.csv. - Spark job can retrieve the file size (33 Bytes) for testfile.csv. - Spark job fails to read the actual data from testfile.csv. - The printed content from testfile.csv is an empty list. -

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Amin Mosayyebzadeh
The code should read the testfile.csv file from S3 and print the content. It only prints an empty list although the file has content. I have also checked our custom S3 storage (Ceph based) logs and I see only LIST operations coming from Spark; there is no GET object operation for testfile.csv. The

Re: [s3a] Spark is not reading s3 object content

2024-05-30 Thread Mich Talebzadeh
Hello, Overall, the exit code of 0 suggests a successful run of your Spark job. Analyze the intended purpose of your code and verify the output or Spark UI for further confirmation. 24/05/30 01:23:43 INFO SparkContext: SparkContext is stopping with exitCode 0. what to check 1. Verify

Re: [s3a] Spark is not reading s3 object content

2024-05-29 Thread Amin Mosayyebzadeh
Hi Mich, Thank you for the help and sorry about the late reply. I ran your provided code but I got "exitCode 0". Here is the complete output: === 24/05/30 01:23:38 INFO SparkContext: Running Spark version 3.5.0 24/05/30 01:23:38 INFO SparkContext: OS info Linux,

[Spark on k8s] A issue of k8s resource creation order

2024-05-29 Thread Tao Yang
Hi, team! I have a Spark on k8s issue which I posted at https://stackoverflow.com/questions/78537132/spark-on-k8s-resource-creation-order Need help

Tox and Pyspark

2024-05-29 Thread Perez
Hi Team, I need help with this https://stackoverflow.com/questions/78547676/tox-with-pyspark

Re: OOM concern

2024-05-29 Thread Perez
Thanks Mich for the detailed explanation. On Tue, May 28, 2024 at 9:53 PM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage > varies. For a 100 GB data transfer, the speed difference between Glue and > EMR might not be significant, especially

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-28 Thread Gera Shegalov
I agree with the previous answers that (if requirements allow it) it is much easier to just orchestrate a copy either in the same app or sync externally. A long time ago and not for a Spark app we were solving a similar usecase via

Re: OOM concern

2024-05-28 Thread Russell Jurney
If Glue lets you take a configuration based approach, and you don't have to operate any servers as with EMR... use Glue. Try EMR if that is troublesome. Russ On Tue, May 28, 2024 at 9:23 AM Mich Talebzadeh wrote: > Russell mentioned some of these issues before. So in short your mileage >

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Russell mentioned some of these issues before. So in short your mileage varies. For a 100 GB data transfer, the speed difference between Glue and EMR might not be significant, especially considering the benefits of Glue's managed service aspects. However, for much larger datasets or scenarios

Re: OOM concern

2024-05-28 Thread Perez
Thanks Mich. Yes, I agree on the costing part, but how would the data transfer speed be impacted? Is it because Glue takes some time to initialize underlying resources and then process the data? On Tue, May 28, 2024 at 2:23 PM Mich Talebzadeh wrote: > Your mileage varies as usual > > Glue with

Re: OOM concern

2024-05-28 Thread Mich Talebzadeh
Your mileage varies as usual Glue with DPUs seems like a strong contender for your data transfer needs based on the simplicity, scalability, and managed service aspects. However, if data transfer speed is critical or costs become a concern after testing, consider EMR as an alternative. HTH Mich

Re: OOM concern

2024-05-27 Thread Perez
Thank you everyone for your response. I am not getting any errors as of now. I am just trying to choose the right tool for my task which is data loading from an external source into s3 via Glue/EMR. I think Glue job would be the best fit for me because I can calculate DPUs needed (maybe keeping

Re: Spark Protobuf Deserialization

2024-05-27 Thread Sandish Kumar HN
Did you try using to_protobuf and from_protobuf ? https://spark.apache.org/docs/latest/sql-data-sources-protobuf.html On Mon, May 27, 2024 at 15:45 Satyam Raj wrote: > Hello guys, > We're using Spark 3.5.0 for processing Kafka source that contains protobuf > serialized data. The format is as
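A minimal sketch of that usage, assuming a descriptor file produced with `protoc --descriptor_set_out` and the message name from this thread; broker, topic and paths are placeholders, and the spark-protobuf package must be on the classpath.

```python
from pyspark.sql import SparkSession
from pyspark.sql.protobuf.functions import from_protobuf

spark = SparkSession.builder.appName("protobuf-decode").getOrCreate()

raw = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "requests")                   # placeholder
    .load()
)

# Decode the Kafka value bytes into a struct using the compiled descriptor set.
decoded = raw.select(
    from_protobuf(raw.value, "Request", descFilePath="/tmp/request.desc").alias("req")
)
decoded.select("req.sent_ts", "req.event").printSchema()
```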

Re: OOM concern

2024-05-27 Thread Russell Jurney
If you’re using EMR and Spark, you need to choose nodes with enough RAM to accommodate any given partition in your data or you can get an OOM error. Not sure if this job involves a reduce, but I would choose a single 128GB+ memory-optimized instance and then adjust parallelism via the Spark

Re: OOM concern

2024-05-27 Thread Meena Rajani
What exactly is the error? Is it erroring out while reading the data from db? How are you partitioning the data? How much memory currently do you have? What is the network time out? Regards, Meena On Mon, May 27, 2024 at 4:22 PM Perez wrote: > Hi Team, > > I want to extract the data from DB

Re: [Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Mich Talebzadeh
When you use applyInPandasWithState, Spark processes each input row as it arrives, regardless of whether certain columns, such as the timestamp column, contain NULL values. This behavior is useful where you want to handle incomplete or missing data gracefully within your stateful processing logic.

Spark Protobuf Deserialization

2024-05-27 Thread Satyam Raj
Hello guys, We're using Spark 3.5.0 for processing Kafka source that contains protobuf serialized data. The format is as follows: message Request { long sent_ts = 1; Event[] event = 2; } message Event { string event_name = 1; bytes event_bytes = 2; } The event_bytes contains the data for

[Spark SQL]: Does Spark support processing records with timestamp NULL in stateful streaming?

2024-05-27 Thread Juan Casse
I am using applyInPandasWithState in PySpark 3.5.0. I noticed that records with timestamp==NULL are processed (i.e., trigger a call to the stateful function). And, as you would expect, does not advance the watermark. I am taking advantage of this in my application. My question: Is this a

OOM concern

2024-05-27 Thread Perez
Hi Team, I want to extract the data from DB and just dump it into S3. I don't have to perform any transformations on the data yet. My data size would be ~100 GB (historical load). Choosing the right DPUs(Glue jobs) should solve this problem right? Or should I move to EMR. I don't feel the need

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Shay Elbaz
Seen this before; had a very(!) complex plan behind a DataFrame, to the point where any additional transformation went OOM on the driver. A quick and ugly solution was to break the plan - convert the DataFrame to rdd and back to DF at certain points to make the plan shorter. This has obvious
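A hedged sketch of the two plan-shortening tricks mentioned in this thread, with a synthetic deep transformation chain standing in for the real workload:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("shorten-plan").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # placeholder path

df = spark.range(1_000_000).withColumnRenamed("id", "value")
for i in range(50):                      # stand-in for a very deep transformation chain
    df = df.selectExpr(f"value + {i} AS value")

df_cut = df.checkpoint(eager=True)       # option 1: materialize and truncate the lineage
df_cut2 = spark.createDataFrame(df.rdd, df.schema)  # option 2: RDD round-trip, keeps the schema
```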

Re: Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Mich Talebzadeh
A few ideas off the top of my head for how to go about solving the problem: 1. Try with subsets: Try reproducing the issue with smaller subsets of your data to pinpoint the specific operation causing the memory problems. 2. Explode or Flatten Nested Structures: If your DataFrame schema

Subject: [Spark SQL] [Debug] Spark Memory Issue with DataFrame Processing

2024-05-27 Thread Gaurav Madan
Dear Community, I'm reaching out to seek your assistance with a memory issue we've been facing while processing certain large and nested DataFrames using Apache Spark. We have encountered a scenario where the driver runs out of memory when applying the `withColumn` method on specific DataFrames

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Sorry, I thought I gave an explanation. The issue you are encountering with incorrect record numbers in the "Shuffle Write Size/Records" column in the Spark DAG UI when data is read from cache/persist is a known limitation. This discrepancy arises due to the way Spark handles and reports shuffle

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Just to further clarify that the Shuffle Write Size/Records column in the Spark UI can be misleading when working with cached/persisted data because it reflects the shuffled data size and record count, not the entire cached/persisted data. So it is fair to say that this is a limitation of the

Re: BUG :: UI Spark

2024-05-26 Thread Mich Talebzadeh
Yep, the Spark UI's "Shuffle Write Size/Records" column can sometimes show incorrect record counts *when data is retrieved from cache or persisted data*. This happens because the record count reflects the number of records written to disk for shuffling, and not the actual number of records in the

Re: BUG :: UI Spark

2024-05-26 Thread Sathi Chowdhury
Can you please explain how you realized it’s wrong? Did you check CloudWatch for the same metrics and compare? Also, are you using df.cache() and expecting that shuffle read/write will go away? Sent from Yahoo Mail for iPhone On Sunday, May 26, 2024, 7:53 AM, Prem Sahoo wrote: Can anyone

Re: BUG :: UI Spark

2024-05-26 Thread Prem Sahoo
Can anyone please assist me ? On Fri, May 24, 2024 at 12:29 AM Prem Sahoo wrote: > Does anyone have a clue ? > > On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > >> Hello Team, >> in spark DAG UI , we have Stages tab. Once you click on each stage you >> can view the tasks. >> >> In each

Re: Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread Mich Talebzadeh
Something like this in Python from pyspark.sql import SparkSession # Configure Spark Session with JDBC URLs spark_conf = SparkConf() \ .setAppName("SparkCatalogMultipleSources") \ .set("hive.metastore.uris", "thrift://hive1-metastore:9080,thrift://hive2-metastore:9080") jdbc_urls =
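A hedged sketch of that direction: one Hive metastore through the built-in session catalog plus MySQL registered as a DataSource V2 JDBC catalog, joined in a single query. A second Hive metastore has no built-in catalog plugin, and all connection details below are placeholders.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("multi-catalog-join")
    .config("hive.metastore.uris", "thrift://hive1-metastore:9083")  # placeholder
    # Register MySQL as an extra catalog named "mysql".
    .config("spark.sql.catalog.mysql",
            "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
    .config("spark.sql.catalog.mysql.url", "jdbc:mysql://mysql-host:3306")  # placeholder
    .config("spark.sql.catalog.mysql.driver", "com.mysql.cj.jdbc.Driver")
    .enableHiveSupport()
    .getOrCreate()
)

# Hive table via the session catalog, MySQL table via the extra catalog, joined in one query.
spark.sql("""
    SELECT h.*, m.*
    FROM spark_catalog.default.table1 AS h
    JOIN mysql.mydb.table1 AS m
      ON h.id = m.id
""").show()
```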

Can Spark Catalog Perform Multimodal Database Query Analysis

2024-05-24 Thread ????
I have two clusters, hive1 and hive2, as well as a MySQL database. Can I use the Spark Catalog to register them, or can I only use one catalog at a time? Can multiple catalogs be joined across databases? select * from hive1.table1 join hive2.table2 join mysql.table1 where

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-24 Thread Anil Dasari
It appears that structured streaming and Dstream have entirely different microbatch metadata representation Can someone assist me in finding the following Dstream microbatch metadata equivalent in Structured streaming. 1. microbatch timestamp : structured streaming foreachBatch gives batchID

Re: BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Does anyone have a clue ? On Thu, May 23, 2024 at 11:40 AM Prem Sahoo wrote: > Hello Team, > in spark DAG UI , we have Stages tab. Once you click on each stage you can > view the tasks. > > In each task we have a column "ShuffleWrite Size/Records " that column > prints wrong data when it gets

Re: [s3a] Spark is not reading s3 object content

2024-05-23 Thread Mich Talebzadeh
Could be a number of reasons First test reading the file with a cli aws s3 cp s3a://input/testfile.csv . cat testfile.csv Try this code with debug option to diagnose the problem from pyspark.sql import SparkSession from pyspark.sql.utils import AnalysisException try: # Initialize Spark

BUG :: UI Spark

2024-05-23 Thread Prem Sahoo
Hello Team, in the Spark DAG UI we have a Stages tab. Once you click on each stage you can view the tasks. In each task we have a column "Shuffle Write Size/Records"; that column prints wrong data when it gets the data from cache/persist. It typically will show the wrong record number though the data

[s3a] Spark is not reading s3 object content

2024-05-23 Thread Amin Mosayyebzadeh
I am trying to read an s3 object from a local S3 storage (Ceph based) using Spark 3.5.1. I see it can access the bucket and list the files (I have verified it on Ceph side by checking its logs), even returning the correct size of the object. But the content is not read. The object url is:
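A hedged guess at the S3A settings that usually matter for a non-AWS, Ceph-backed endpoint (endpoint and credentials are placeholders): point fs.s3a.endpoint at the local gateway and enable path-style access so the bucket name is not resolved as a virtual-host prefix.

```python
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("s3a-ceph-read")
    .config("spark.hadoop.fs.s3a.endpoint", "http://ceph-rgw.local:8080")  # placeholder
    .config("spark.hadoop.fs.s3a.access.key", "ACCESS_KEY")                # placeholder
    .config("spark.hadoop.fs.s3a.secret.key", "SECRET_KEY")                # placeholder
    .config("spark.hadoop.fs.s3a.path.style.access", "true")
    .config("spark.hadoop.fs.s3a.connection.ssl.enabled", "false")
    .getOrCreate()
)

df = spark.read.text("s3a://test-bucket/testfile.csv")
df.show(truncate=False)   # forces a GET of the object content, not just a LIST
```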

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
You are right. - Another question on migration: is there a way to get the microbatch id during the microbatch dataset `transform` operation like in RDD transform? I am attempting to implement the following pseudo functionality with structured streaming. In this approach, recordCategoriesMetadata

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
With regard to this sentence *Offset Tracking with Structured Streaming:: While storing offsets in an external storage with DStreams was necessary, SSS handles this automatically through checkpointing. The checkpoints include the offsets processed by each micro-batch. However, you can still

Remote File change detection in S3 when spark queries are running and parquet files in S3 changes

2024-05-22 Thread Raghvendra Yadav
Hello, We are hoping someone can help us understand the spark behavior for scenarios listed below. Q. *Will spark running queries fail when S3 parquet object changes underneath with S3A remote file change detection enabled? Is it 100%? * Our understanding is that S3A has a

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
Hi Anil, Ok let us put the complete picture here * Current DStreams Setup:* - Data Source: Kafka - Processing Engine: Spark DStreams - Data Transformation with Spark - Sink: S3 - Data Format: Parquet - Exactly-Once Delivery (Attempted): You're attempting exactly-once

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
The right way to associate microbatches when committing to external storage is to use the microbatch id that you can get in foreachBatch. That microbatch id guarantees that the data produced in the batch is always the same no matter any recomputations (assuming all processing logic is

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Thanks Das, Mich. Mich, We process data from Kafka and write it to S3 in Parquet format using DStreams. To ensure exactly-once delivery and prevent data loss, our process records micro-batch offsets to an external storage at the end of each micro-batch in foreachRDD, which is then used when the

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at StreamingQuery.lastProgress or use the QueryProgressListener. Both of these
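A minimal sketch of both approaches (broker, topic and paths are placeholders): record per-source offsets from StreamingQuery.lastProgress, or key external writes on the batch id handed to foreachBatch.

```python
import json
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("offset-tracking").getOrCreate()

stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker:9092")  # placeholder
    .option("subscribe", "events")                     # placeholder
    .load()
)

def write_batch(batch_df, batch_id):
    # batch_id stays the same across retries of a micro-batch, so it can key an
    # idempotent write to external storage.
    batch_df.write.mode("append").parquet(f"s3a://bucket/out/batch={batch_id}")  # placeholder

query = (
    stream.writeStream.foreachBatch(write_batch)
    .option("checkpointLocation", "s3a://bucket/chk")  # placeholder
    .start()
)

# After at least one batch has completed, lastProgress carries the per-source offsets.
progress = query.lastProgress
if progress:
    print(json.dumps(progress["sources"][0]["endOffset"], indent=2))
```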

[ANNOUNCE] Apache Celeborn 0.4.1 available

2024-05-22 Thread Nicholas Jiang
Hi all, Apache Celeborn community is glad to announce the new release of Apache Celeborn 0.4.1. Celeborn is dedicated to improving the efficiency and elasticity of different map-reduce engines and provides an elastic, high-efficient service for intermediate data including shuffle data, spilled

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Mich Talebzadeh
OK, to understand better: your current model relies on streaming data input through a Kafka topic, Spark does some ETL, and you send to a sink, a database or file storage like HDFS etc.? Your current architecture relies on Direct Streams (DStream) and RDDs and you want to move to Spark Structured

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread ashok34...@yahoo.com.INVALID
Hello, what options are you considering yourself? On Wednesday 22 May 2024 at 07:37:30 BST, Anil Dasari wrote: Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in structure

Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Anil Dasari
Hello, We are on Spark 3.x and using Spark dstream + kafka and planning to use structured streaming + Kafka. Is there an equivalent of Dstream HasOffsetRanges in Structured Streaming to get the microbatch end offsets to checkpoint in our external checkpoint store? Thanks in advance. Regards

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
I am looking for a writer/committer optimization which can make the Spark write faster. On Tue, May 21, 2024 at 9:15 PM eab...@163.com wrote: > Hi, > I think you should write to HDFS then copy file (parquet or orc) from > HDFS to MinIO. > > -- > eabour > > > *From:*

Re: Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread eab...@163.com
Hi, I think you should write to HDFS then copy file (parquet or orc) from HDFS to MinIO. eabour From: Prem Sahoo Date: 2024-05-22 00:38 To: Vibhor Gupta; user Subject: Re: EXT: Dual Write to HDFS and MinIO in faster way On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: Hello Vibhor,

Re: A handy tool called spark-column-analyser

2024-05-21 Thread ashok34...@yahoo.com.INVALID
Great work. Very handy for identifying problems, thanks. On Tuesday 21 May 2024 at 18:12:15 BST, Mich Talebzadeh wrote: A colleague kindly pointed out about giving an example of output which will be added to README Doing analysis for column Postcode Json formatted output {    "Postcode":

Re: A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
A colleague kindly pointed out about giving an example of output, which will be added to the README. Doing analysis for column Postcode. JSON formatted output: { "Postcode": { "exists": true, "num_rows": 93348, "data_type": "string", "null_count": 21921,

Re: EXT: Dual Write to HDFS and MinIO in faster way

2024-05-21 Thread Prem Sahoo
On Tue, May 21, 2024 at 6:58 AM Prem Sahoo wrote: > Hello Vibhor, > Thanks for the suggestion. > I am looking for some other alternatives where the same > dataframe can be written to two destinations without re-execution and cache > or persist. > > Can someone help me in scenario 2

A handy tool called spark-column-analyser

2024-05-21 Thread Mich Talebzadeh
I just wanted to share a tool I built called *spark-column-analyzer*. It's a Python package that helps you dig into your Spark DataFrames with ease. Ever spend ages figuring out what's going on in your columns? Like, how many null values are there, or how many unique entries? Built with data
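Not the package itself, but a rough sketch of the kind of per-column stats it reports (row count, null count, distinct count), run against a placeholder dataset:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("column-stats").getOrCreate()
df = spark.read.parquet("s3a://bucket/property-data/")  # placeholder source

num_rows = df.count()
stats = {}
for column, dtype in df.dtypes:
    # One aggregation per column: how many nulls, how many distinct values.
    row = df.agg(
        F.sum(F.col(column).isNull().cast("int")).alias("null_count"),
        F.countDistinct(column).alias("distinct_count"),
    ).collect()[0]
    stats[column] = {
        "num_rows": num_rows,
        "data_type": dtype,
        "null_count": row["null_count"],
        "distinct_count": row["distinct_count"],
    }
print(stats)
```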

Re: pyspark dataframe join with two different data type

2024-05-17 Thread Karthick Nk
Hi All, I have tried to get the same result with PySpark and with a SQL query by creating a tempView; I was able to achieve it with SQL, whereas I have to do it in the PySpark code itself. Could you help on this? incoming_data = [["a"], ["b"], ["d"]] column_names = ["column1"] df =

Request for Assistance: Adding User Authentication to Apache Spark Application

2024-05-16 Thread NIKHIL RAJ SHRIVASTAVA
Dear Team, I hope this email finds you well. My name is Nikhil Raj, and I am currently working with Apache Spark for one of my projects , where through the help of a parquet file we are creating an external table in Spark. I am reaching out to seek assistance regarding user authentication for

Re: pyspark dataframe join with two different data type

2024-05-15 Thread Karthick Nk
Thanks Mich, I have tried this solution, but I want all the columns from the dataframe df_1; if I explode df_1 I am getting only the data column. But the result should have all the columns from df_1 with distinct results like below. Results in *df:* +---+ |column1| +---+ |

How to provide a Zstd "training mode" dictionary object

2024-05-15 Thread Saha, Daniel
Hi, I understand that Zstd compression can optionally be provided a dictionary object to improve performance. See “training mode” here https://facebook.github.io/zstd/ Does Spark surface a way to provide this dictionary object when writing/reading data? What about for intermediate shuffle

Query Regarding UDF Support in Spark Connect with Kubernetes as Cluster Manager

2024-05-15 Thread Nagatomi Yasukazu
Hi Spark Community, I have a question regarding the support for User-Defined Functions (UDFs) in Spark Connect, specifically when using Kubernetes as the Cluster Manager. According to the Spark documentation, UDFs are supported by default for the shell and in standalone applications with

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Mich Talebzadeh
You can use a combination of explode and distinct before joining. from pyspark.sql import SparkSession from pyspark.sql.functions import explode # Create a SparkSession spark = SparkSession.builder \ .appName("JoinExample") \ .getOrCreate() sc = spark.sparkContext # Set the log level to
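The preview is truncated; a hedged reconstruction of the approach, with data mirroring the examples quoted elsewhere in the thread:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import explode

spark = SparkSession.builder.appName("JoinExample").getOrCreate()

df_incoming = spark.createDataFrame([["a"], ["b"], ["d"]], ["column1"])
df_arrays = spark.createDataFrame([(1, ["a", "b"]), (2, ["d"])], ["id", "values"])

# Explode the array column into one row per element, drop duplicates, then join
# on the scalar column instead of using array_contains.
exploded = df_arrays.select("id", explode("values").alias("column1")).distinct()
result = df_incoming.join(exploded, on="column1", how="inner")
result.show()
```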

Re: pyspark dataframe join with two different data type

2024-05-14 Thread Karthick Nk
Hi All, Does anyone have any idea or suggestion of an alternate way to achieve this scenario? Thanks. On Sat, May 11, 2024 at 6:55 AM Damien Hawes wrote: > Right now, with the structure of your data, it isn't possible. > > The rows aren't duplicates of each other. "a" and "b" both exist in

Display a warning in EMR welcome screen

2024-05-11 Thread Abhishek Basu
Hi Team, for an Elastic MapReduce (EMR) cluster, it would be great if there were a warning message that logging should be handled carefully and INFO or DEBUG should be enabled only when required. This logging thing took my whole day, and finally I discovered that it’s exploding the HDFS and not

Re: [spark-graphframes]: Generating incorrect edges

2024-05-11 Thread Nijland, J.G.W. (Jelle, Student M-CS)
Hi all, The issue is solved. I conducted a lot more testing and built checkers to verify at which size it's going wrong. When checking for specific edges, I could construct successful graphs up to 261k records. When verifying all edges created, it breaks somewhere in the 200-250k records. I

unsubscribe

2024-05-10 Thread J UDAY KIRAN
unsubscribe

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible. The rows aren't duplicates of each other. "a" and "b" both exist in the array. So Spark is correctly performing the join. It looks like you need to find another way to model this data to get what you want to achieve. Are the values

Re: pyspark dataframe join with two different data type

2024-05-10 Thread Karthick Nk
Hi Mich, Thanks for the solution, but I am getting duplicate results by using array_contains. I have explained the scenario below; could you help me on how we can achieve this? I have tried different ways but could not achieve it. For example data = [ ["a"], ["b"], ["d"], ]

Spark 3.5.x on Java 21?

2024-05-08 Thread Stephen Coy
Hi everyone, We’re about to upgrade our Spark clusters from Java 11 and Spark 3.2.1 to Spark 3.5.1. I know that 3.5.1 is supposed to be fine on Java 17, but will it run OK on Java 21? Thanks, Steve C This email contains confidential information of and is the copyright of Infomedia. It

Re: [Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-08 Thread Mich Talebzadeh
you may consider - Increase Watermark Retention: Consider increasing the watermark retention duration. This allows keeping records for a longer period before dropping them. However, this might increase processing latency and violate at-least-once semantics if the watermark lags behind real-time.

Spark not creating staging dir for insertInto partitioned table

2024-05-07 Thread Sanskar Modi
Hi Folks, I wanted to check why spark doesn't create staging dir while doing an insertInto on partitioned tables. I'm running below example code – ``` spark.sql("set hive.exec.dynamic.partition.mode=nonstrict") val rdd = sc.parallelize(Seq((1, 5, 1), (2, 1, 2), (4, 4, 3))) val df =

[Spark Streaming]: Save the records that are dropped by watermarking in spark structured streaming

2024-05-07 Thread Nandha Kumar
Hi Team, We are trying to use *spark structured streaming *for our use case. We will be joining 2 streaming sources(from kafka topic) with watermarks. As time progresses, the records that are prior to the watermark timestamp are removed from the state. For our use case, we want to *store

unsubscribe

2024-05-07 Thread Wojciech Bombik
unsubscribe

unsubscribe

2024-05-06 Thread Moise
unsubscribe

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Mich Talebzadeh
Hi Kartrick, Unfortunately Materialised views are not available in Spark as yet. I raised Jira [SPARK-48117] Spark Materialized Views: Improve Query Performance and Data Management - ASF JIRA (apache.org) as a feature request. Let me think

Re: ********Spark streaming issue to Elastic data**********

2024-05-06 Thread Karthick Nk
Thanks Mich, can you please confirm whether my understanding is correct? First, we have to create the materialized view based on the mapping details we have by using multiple tables as source (since we have multiple join conditions from different tables). From the materialised view we can stream the
