Re: deciding Spark tasks & optimization resource

2022-08-29 Thread Gibson
Hello Rajat, Look up the Spark *pipelining* concept: any sequence of operations that feed data directly into each other without needing a shuffle will be packed into a single stage, i.e. select -> filter -> select (SparkSQL); map -> filter -> map (RDD); for any operation that requires shuffling
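To make the pipelining idea concrete, here is a minimal toy sketch in plain Python (no Spark required). It is not Spark's scheduler: the function name split_into_stages and the two operation sets are hypothetical, chosen only for illustration, and the model simplifies things by treating each shuffle-requiring operation as the first step of a new stage.

```python
from typing import List

# Assumption for this sketch: a small, hand-picked classification of operations.
# Spark itself decides stage boundaries from shuffle dependencies in the plan.
NARROW_OPS = {"select", "filter", "map"}         # can be pipelined, no shuffle
SHUFFLE_OPS = {"groupBy", "join", "repartition"}  # need data moved across partitions


def split_into_stages(ops: List[str]) -> List[List[str]]:
    """Group consecutive narrow operations into one stage.

    A shuffle operation closes the current stage and opens a new one.
    """
    stages: List[List[str]] = [[]]
    for op in ops:
        if op in SHUFFLE_OPS:
            # Shuffle boundary: start a new stage beginning with this operation.
            stages.append([op])
        elif op in NARROW_OPS:
            # Narrow operation: pipeline it into the current stage.
            stages[-1].append(op)
        else:
            raise ValueError(f"unknown operation: {op}")
    # Drop the empty leading stage if the chain started with a shuffle operation.
    return [stage for stage in stages if stage]


if __name__ == "__main__":
    print(split_into_stages(["select", "filter", "select"]))
    print(split_into_stages(["map", "filter", "groupBy", "map"]))
```

In this model a chain such as select -> filter -> select stays in one stage, while inserting a groupBy splits the chain in two. On a real job, the physical plan from explain() or the stages tab of the Spark UI shows the actual boundaries.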

Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
o/documentation/reference/stable/connectors/mysql.html>
>> to read Write Ahead Logs (WAL) and send to Kafka
>> - Kafka Connect to write to cloud storage -> Hive
>> - OR
>> - Spark streaming to parse WAL -> Storage -> Hive

Re: Spark streaming - Data Ingestion

2022-08-17 Thread Gibson
If you have space for a message log like, then you should try:
MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS -> Hive
On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai wrote:
> Dear sir
>
> I have tried a lot on this could you help me with this?
>
> Data ingestion from

Spark Convert Column to String

2022-07-16 Thread Gibson
Hi Folks, I have created a UDF that queries a Confluent Schema Registry for a schema, which is then used within a Dataset select with the from_avro function to decode an Avro-encoded value (reading from a bunch of Kafka topics): Dataset recordDF = df.select(