Re: Structured Streaming Microbatch Semantics

2021-03-05 Thread Roland Johann
> h or can the records also be split into multiple batches?
>
> Best,
> Rico.

-- Roland Johann Data Architect/Data Engineer phenetic

unsubscribe

2021-02-26 Thread Roland Johann
unsubscribe

Unsubscribe

2021-02-24 Thread Roland Johann
unsubscribe

-- Roland Johann Data Architect/Data Engineer phenetic GmbH Lütticher Straße 10, 50674 Köln, Germany Mobile: +49 172 365 26 46 Mail: roland.joh...@phenetic.io Web: phenetic.io Commercial register: Amtsgericht Köln (HRB 92595) Managing directors: Roland Johann, Uwe Reimann

Re: Convert Seq[Any] to Seq[String]

2020-12-19 Thread Roland Johann
>>> var atrb = ListBuffer[(String,String,String)]()
>>>
>>> for((key,value) <- aMap){
>>>   atrb += ((key, value._1, value._2))
>>> }
>>>
>>> var newCol = atrb.head.productIterator.toList.toSeq
>>>
>>> Please someone help me

Re: dynamic executor scalling spark on kubernetes client mode

2020-05-12 Thread Roland Johann
Hi all, I don’t want to interrupt the conversation, but I am keen to know where I can find information regarding dynamic allocation on Kubernetes. As far as I know, the docs just point to future work. Thanks a lot, Roland
> On 12.05.2020 at 09:25, Steven Stetzler wrote:
>
> Hi all,
>
> I am

Re: Left Join at SQL query gets planned as inner join

2020-04-30 Thread Roland Johann
> Software Developer IV
> Customer Knowledge Platform
> From: Roland Johann
> Sent: Thursday, April 30, 2020 8:30:05 AM
> To: randy clinton
> Cc: Roland Johann; user
> Subject: Re: Left Join at SQL query gets planned as inner join
>
> Notice: This emai

Re: Left Join at SQL query gets planned as inner join

2020-04-30 Thread Roland Johann
lter(year = 2020 and month = 4 and day = 29)
> p_DF = p_DF.filter(year = 2020 and month = 4 and day = 29 and event_id is null)
>
> output = s_DF.join(p_DF, event_id == source_event_id, left)
>
> On Thu, Apr 30, 2020 at 11:06 AM Roland Johann wrote:
> Hi All,

Left Join at SQL query gets planned as inner join

2020-04-30 Thread Roland Johann
la dsl lead to the same execution plan. Can someone point me to docs about the internals of this topic in Spark? The official docs about SQL in general are not that verbose. Thanks in advance and stay safe! Roland Johann

Re: [pyspark] Load a master data file to spark ecosystem

2020-04-25 Thread Roland Johann
'somefile'))
>> lines = spark.sparkContext.textFile("log_file")
>> converted_lines_rdd = lines.map(lambda l: process_logline(l, tree_val))
>> log_line_rdd = spark.createDataFrame(converted_lines_rdd)
>> log_line_rdd.show()
>>
>> Basically

Re: 30000 partitions vs 1000 partitions with Coalescing

2020-04-24 Thread Roland Johann
Hi Adnan, coalescing involves network shuffle to other executors. How many executors are configured for that job? Best regards Roland Johann Software Developer/Data Engineer

Re: Standard practices for building dashboards for spark processed data

2020-02-25 Thread Roland Johann
ent?
> 4. Since the pipeline is going to run in Kubernetes, I am trying to avoid InfluxDB as the time-series database and go with Prometheus. Is this approach correct?
>
> Thanks,
> Ani

-- Roland Johann

Structured Streaming Kafka change maxOffsetsPerTrigger won't apply

2019-11-20 Thread Roland Johann
Hi All, changing maxOffsetsPerTrigger and restarting the job does not change the batch size. This is a problem, as we currently use a trigger duration of 5 minutes, which consumes only 100k messages while the offset lag is in the billions. Decreasing the trigger duration also affects the micro-batch size -

Re: Delta with intelligent upsett

2019-11-01 Thread Roland Johann
If the dataset contains a column like changed_at/created_at you can use this as watermark and filter out rows that have changed_at/created_at before the watermark. Best Regards Roland Johann

Re: Need help regarding logging / log4j.properties

2019-10-31 Thread Roland Johann
itten?
> I have checked in the YARN logs but couldn't find the messages I have written in the Java file.
> Request your help please, as I am a little confused and know that there is something very silly which I am missing.
>
> Thanks in advance!
>
> Debu

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread Roland Johann
e default security groups, ran my job again but the same exception pops up :-( ...
> All traffic is open on the security groups now.
>
> Jochen
>
> On Fri, Oct 4, 2019 at 17:37, Roland Johann <roland.joh...@phenetic.io> wrote:
>
>> These are dynamic port ranges an

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread Roland Johann
> We have indeed custom security groups. Can you tell me where exactly I need to be able to access what?
> For example, is it from the master instance to the driver instance? And which port should be open?
>
> Jochen
>
> On Fri, Oct 4, 2019 at 17:14, Roland Johann wrote:

Re: Spark job fails because of timeout to Driver

2019-10-04 Thread Roland Johann
.spark.deploy.yarn.ApplicationMaster$.main(ApplicationMaster.scala:803)
>>> at org.apache.spark.deploy.yarn.ApplicationMaster.main(ApplicationMaster.scala)
>>> {code}
>>>
>>> It actually goes wrong at this line:
>>> https://github.com/ap

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Roland Johann
I want to add that the major Hadoop distributions also offer additional encryption possibilities (for example, Ranger from Hortonworks). Roland Johann

Re: PGP Encrypt using spark Scala

2019-08-26 Thread Roland Johann
tps://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/TransparentEncryption.html> - obviously if you don’t have to use PGP. Using encryption at the storage layer simplifies your application and architecture and you don’t need to reinvent the wheel. Kind Regards Roland Johann

Re: [External]Re: error while connecting to azure blob storage

2019-08-23 Thread Roland Johann
ng you use Hadoop 2.7.7. Best Regards Roland Johann

Re: error while connecting to azure blob storage

2019-08-23 Thread Roland Johann
Hi Krishna, there seems to be no attachment. In addition, you should NEVER post private credentials to public forums. Please renew the credentials of your storage account as soon as possible! Best Regards Roland Johann