Query regarding kafka version
Hi Team, I am using Spark 2.2. Can I use Kafka version 2.5 in my Spark Streaming application? Thanks & Regards, Renu Yadav
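In general, newer Kafka brokers accept connections from older clients, so a Spark 2.2 Streaming job using the bundled 0.10 integration can usually talk to a 2.5 broker; swapping a 2.5 kafka-clients jar underneath the Spark 2.2 integration is the riskier part. A minimal sketch of the dependencies involved, assuming an sbt build (versions shown are illustrative, not a recommendation):

  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-streaming"            % "2.2.0" % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.2.0",
    // Forcing a newer client under the old integration may or may not work;
    // newer kafka-clients versions can change behaviour the 2.2 integration does not expect.
    "org.apache.kafka"  % "kafka-clients"              % "2.5.0"
  )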
Re: Updating spark-env.sh per application
Hi Mich,

In spark-env.sh, SPARK_DIST_CLASSPATH is set. I want to override this variable at runtime, as I want to exclude one library from it.

On Fri, 7 May 2021, 6:51 pm Mich Talebzadeh, wrote:
> Hi,
>
> Environment variables are read in when spark-submit kicks off. What exactly do you need to refresh at the application level?
>
> HTH
>
> On Fri, 7 May 2021 at 11:34, Renu Yadav wrote:
>> Hi Team,
>>
>> Is it possible to override a variable from spark-env.sh at the application level?
>>
>> Thanks & Regards,
>> Renu Yadav
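One possible approach, sketched here under the assumption of a YARN deployment and untested against this particular setup: override the environment variable for the application master and executors, and/or ask Spark to prefer the application's own jars over the distribution's. Whether the launcher honours a per-application SPARK_DIST_CLASSPATH depends on how the distribution wires it into container launch, so this needs verifying on the cluster.

  spark-submit \
    --conf spark.yarn.appMasterEnv.SPARK_DIST_CLASSPATH="/desired/classpath/*" \
    --conf spark.executorEnv.SPARK_DIST_CLASSPATH="/desired/classpath/*" \
    --conf spark.driver.userClassPathFirst=true \
    --conf spark.executor.userClassPathFirst=true \
    ...usual application arguments...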
Re: Updating spark-env.sh per application
Hi Team,

Is it possible to override a variable from spark-env.sh at the application level?

Thanks & Regards,
Renu Yadav

On Fri, May 7, 2021 at 12:16 PM Renu Yadav wrote:
> Hi Team,
>
> Is it possible to override a variable from spark-env.sh at the application level?
>
> Thanks & Regards,
> Renu Yadav
Spark streaming giving error for version 2.4
Hi Team,

I have upgraded my Spark Streaming application from 2.2 to 2.4 but am getting the error below. The dependency is spark-streaming-kafka-0-10_2.11, version 2.4.0 (Scala 2.11).

Any idea?

Exception in thread "main" java.lang.AbstractMethodError
        at org.apache.spark.util.ListenerBus$class.$init$(ListenerBus.scala:34)
        at org.apache.spark.streaming.scheduler.StreamingListenerBus.<init>(StreamingListenerBus.scala:30)
        at org.apache.spark.streaming.scheduler.JobScheduler.<init>(JobScheduler.scala:57)
        at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:184)
        at org.apache.spark.streaming.StreamingContext.<init>(StreamingContext.scala:85)

Thanks & Regards,
Renu Yadav
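This particular AbstractMethodError in ListenerBus is the classic symptom of mixed Spark versions on the classpath, i.e. an artifact compiled against a different Spark release than the one the job runs on. A sketch of keeping all Spark artifacts on the same version, assuming an sbt build (2.4.0 shown for illustration):

  val sparkVersion = "2.4.0"
  libraryDependencies ++= Seq(
    "org.apache.spark" %% "spark-core"                 % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming"            % sparkVersion % "provided",
    "org.apache.spark" %% "spark-streaming-kafka-0-10" % sparkVersion
  )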
Re: How to upgrade kafka client in spark_streaming_kafka 2.2
Ok, thanks for the clarification. I will try to migrate my project to Structured Streaming.

Regards,
Renu

On Fri, Mar 12, 2021 at 7:38 PM Gabor Somogyi wrote:
> Mainly bugfixes and nothing breaking, AFAIK.
>
> As a side note, there have been intentions to close down DStreams and discontinue it as-is.
> That hasn't happened yet, but it is on the road, so I strongly recommend migrating to Structured Streaming...
> We simply can't support two streaming engines for a huge amount of time.
>
> G
>
> On Fri, Mar 12, 2021 at 3:02 PM Renu Yadav wrote:
>> Hi Gabor,
>>
>> It seems like it is better to upgrade my Spark version.
>>
>> Are there major changes in terms of streaming from Spark 2.2 to Spark 2.4?
>>
>> PS: I am using the KafkaUtils API to create the stream.
>>
>> Thanks & Regards,
>> Renu Yadav
>>
>> On Fri, Mar 12, 2021 at 7:25 PM Renu Yadav wrote:
>>> Thanks Gabor,
>>> This is very useful.
>>>
>>> Regards,
>>> Renu Yadav
>>>
>>> On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi wrote:
>>>> A Kafka client upgrade is not a trivial change and may or may not work, since new versions can contain incompatible API and/or behavior changes.
>>>> I've collected how Spark evolved in terms of the Kafka client and gathered the breaking changes to make our lives easier.
>>>> Have a look, and based on that you can make your choice:
>>>> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>>>>
>>>> As a general suggestion, it would be best to upgrade Spark as-is, because we've added many fixes one can otherwise run into...
>>>>
>>>> Hope this helps!
>>>>
>>>> G
>>>>
>>>> On Fri, Mar 12, 2021 at 9:45 AM Renu Yadav wrote:
>>>>> Hi Team,
>>>>> I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
>>>>>
>>>>> Thanks & Regards,
>>>>> Renu Yadav
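For anyone making the same move, a minimal Structured Streaming sketch of reading from Kafka (requires the spark-sql-kafka-0-10 artifact; the broker address, topic and checkpoint path below are placeholders):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder.appName("KafkaStructuredStreaming").getOrCreate()

  // Kafka source: key/value arrive as binary columns, so cast before use.
  val messages = spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")   // placeholder brokers
    .option("subscribe", "my_topic")                     // placeholder topic
    .load()
    .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)")

  val query = messages.writeStream
    .format("console")
    .option("checkpointLocation", "/tmp/checkpoints/kafka-demo")  // placeholder path
    .start()

  query.awaitTermination()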
Re: How to upgrade kafka client in spark_streaming_kafka 2.2
Hi Gabor,

It seems like it is better to upgrade my Spark version.

Are there major changes in terms of streaming from Spark 2.2 to Spark 2.4?

PS: I am using the KafkaUtils API to create the stream.

Thanks & Regards,
Renu Yadav

On Fri, Mar 12, 2021 at 7:25 PM Renu Yadav wrote:
> Thanks Gabor,
> This is very useful.
>
> Regards,
> Renu Yadav
>
> On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi wrote:
>> A Kafka client upgrade is not a trivial change and may or may not work, since new versions can contain incompatible API and/or behavior changes.
>> I've collected how Spark evolved in terms of the Kafka client and gathered the breaking changes to make our lives easier.
>> Have a look, and based on that you can make your choice:
>> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>>
>> As a general suggestion, it would be best to upgrade Spark as-is, because we've added many fixes one can otherwise run into...
>>
>> Hope this helps!
>>
>> G
>>
>> On Fri, Mar 12, 2021 at 9:45 AM Renu Yadav wrote:
>>> Hi Team,
>>> I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
>>>
>>> Thanks & Regards,
>>> Renu Yadav
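For context, a minimal sketch of the KafkaUtils-based DStream creation mentioned above, using the spark-streaming-kafka-0-10 API (broker address, group id and topic are placeholders):

  import org.apache.kafka.common.serialization.StringDeserializer
  import org.apache.spark.SparkConf
  import org.apache.spark.streaming.{Seconds, StreamingContext}
  import org.apache.spark.streaming.kafka010.KafkaUtils
  import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
  import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

  val ssc = new StreamingContext(new SparkConf().setAppName("KafkaDStream"), Seconds(10))

  val kafkaParams = Map[String, Object](
    "bootstrap.servers"  -> "broker1:9092",              // placeholder brokers
    "key.deserializer"   -> classOf[StringDeserializer],
    "value.deserializer" -> classOf[StringDeserializer],
    "group.id"           -> "example-group",             // placeholder group id
    "auto.offset.reset"  -> "latest",
    "enable.auto.commit" -> (false: java.lang.Boolean)
  )

  val stream = KafkaUtils.createDirectStream[String, String](
    ssc, PreferConsistent, Subscribe[String, String](Seq("my_topic"), kafkaParams))

  stream.map(record => (record.key, record.value)).print()

  ssc.start()
  ssc.awaitTermination()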
Re: How to upgrade kafka client in spark_streaming_kafka 2.2
Thanks Gabor,
This is very useful.

Regards,
Renu Yadav

On Fri, Mar 12, 2021 at 5:36 PM Gabor Somogyi wrote:
> A Kafka client upgrade is not a trivial change and may or may not work, since new versions can contain incompatible API and/or behavior changes.
> I've collected how Spark evolved in terms of the Kafka client and gathered the breaking changes to make our lives easier.
> Have a look, and based on that you can make your choice:
> https://gist.github.com/gaborgsomogyi/3476c32d69ff2087ed5d7d031653c7a9
>
> As a general suggestion, it would be best to upgrade Spark as-is, because we've added many fixes one can otherwise run into...
>
> Hope this helps!
>
> G
>
> On Fri, Mar 12, 2021 at 9:45 AM Renu Yadav wrote:
>> Hi Team,
>> I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?
>>
>> Thanks & Regards,
>> Renu Yadav
How to upgrade kafka client in spark_streaming_kafka 2.2
Hi Team,
I am using Spark 2.2 and spark-streaming-kafka 2.2, which points to kafka-clients 0.10. How can I upgrade the Kafka client to Kafka 2.2.0?

Thanks & Regards,
Renu Yadav
Hbase in spark
Has anybody implemented bulk load into HBase using Spark? I need help optimizing its performance. Please help. Thanks & Regards, Renu Yadav
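A rough sketch of the usual HFile-based bulk load from Spark, assuming HBase 1.x APIs; the table name, column family/qualifier, input path and staging directory are all placeholders, and the parsing logic is purely illustrative:

  import org.apache.hadoop.fs.Path
  import org.apache.hadoop.hbase.{HBaseConfiguration, KeyValue, TableName}
  import org.apache.hadoop.hbase.client.ConnectionFactory
  import org.apache.hadoop.hbase.io.ImmutableBytesWritable
  import org.apache.hadoop.hbase.mapreduce.{HFileOutputFormat2, LoadIncrementalHFiles}
  import org.apache.hadoop.hbase.util.Bytes
  import org.apache.hadoop.mapreduce.Job
  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(new SparkConf().setAppName("HBaseBulkLoad"))

  val hbaseConf = HBaseConfiguration.create()
  val conn = ConnectionFactory.createConnection(hbaseConf)
  val tableName = TableName.valueOf("my_table")            // placeholder table
  val table = conn.getTable(tableName)
  val regionLocator = conn.getRegionLocator(tableName)

  // Configure the output format so HFiles match the table's regions and compression.
  val job = Job.getInstance(hbaseConf)
  HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator)

  // Build (rowkey, KeyValue) pairs; HFileOutputFormat2 needs them sorted by row key.
  val hfileRdd = sc.textFile("hdfs:///input/path")          // placeholder input
    .map { line =>
      val Array(rowKey, value) = line.split(",", 2)
      (rowKey, value)
    }
    .sortByKey()
    .map { case (rowKey, value) =>
      (new ImmutableBytesWritable(Bytes.toBytes(rowKey)),
        new KeyValue(Bytes.toBytes(rowKey), Bytes.toBytes("cf"),
          Bytes.toBytes("col"), Bytes.toBytes(value)))
    }

  val stagingDir = "hdfs:///tmp/hfile-staging"              // placeholder staging dir
  hfileRdd.saveAsNewAPIHadoopFile(stagingDir,
    classOf[ImmutableBytesWritable], classOf[KeyValue],
    classOf[HFileOutputFormat2], job.getConfiguration)

  // Hand the generated HFiles over to the region servers.
  new LoadIncrementalHFiles(job.getConfiguration).doBulkLoad(
    new Path(stagingDir), conn.getAdmin, table, regionLocator)

Writing pre-sorted HFiles and handing them to the region servers skips the normal write path (WAL and memstore), which is usually where the performance win comes from.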
Re: spark task scheduling delay
Any suggestions?

On Wed, Jan 20, 2016 at 6:50 PM, Renu Yadav <yren...@gmail.com> wrote:
> Hi,
>
> I am facing a task scheduling delay issue in Spark 1.4.
>
> Suppose I have 1600 tasks running; 1550 of them run fine, but for the remaining 50 I see scheduling delay, even though the input size of these tasks is the same as for the other 1550.
>
> Please suggest some solution.
>
> Thanks & Regards
> Renu Yadav
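Two settings that are often worth experimenting with when only a handful of tasks straggle, sketched below with illustrative values (not tuned recommendations): speculative execution re-launches slow tasks elsewhere, and a lower locality wait stops the scheduler holding tasks back while it waits for a local slot.

  spark-submit \
    --conf spark.speculation=true \
    --conf spark.speculation.multiplier=1.5 \
    --conf spark.locality.wait=1s \
    ...usual application arguments...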
Scheduler delay in spark 1.4
Hi,

I am working on Spark 1.4 and running a Spark job on a YARN cluster. When few other jobs are running, my Spark job completes very smoothly, but when many small jobs run on the cluster, my job starts showing scheduler delay at the end of each stage.

PS: I am running my Spark job in a high-priority queue.

Please suggest some solution.

Thanks & Regards,
Renu Yadav
load multiple directories using dataframe load
Hi,

I am using DataFrames and want to load ORC files from multiple directories, like this:

  hiveContext.read.format("orc").load("mypath/3660,myPath/3661")

but it is not working. Please suggest how to achieve this.

Thanks & Regards,
Renu Yadav
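One workaround on this Spark version, sketched under the assumption that hiveContext is the HiveContext already in use: a comma inside a single path string is not treated as two paths, so read each directory separately and union the results (newer Spark releases also let DataFrameReader.load take several path arguments).

  // Read each directory on its own and union the resulting DataFrames.
  val paths = Seq("mypath/3660", "myPath/3661")
  val df = paths
    .map(p => hiveContext.read.format("orc").load(p))
    .reduce(_ unionAll _)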
orc read issue in spark
Hi,

I am using Spark 1.4.1 and saving an ORC file using:

  df.write.format("orc").save("outputlocation")

The output location is about 440 GB in size. While reading it back with:

  df.read.format("orc").load("outputlocation").count

it has 2618 partitions. The count operation runs fine up to about 2500 tasks but then hits delay scheduling, which results in slow performance.

*If anyone has any idea on this, please do reply, as this is very urgent.*

Thanks in advance.

Regards,
Renu Yadav
Data Locality Issue
Hi,

I am working on Spark 1.4, reading an ORC table using a DataFrame and converting that DF to an RDD.

In the Spark UI I observe that around 50% of the tasks are running with locality level ANY and very few are LOCAL.

What would be the possible reason for this? Please help. I have even changed the locality settings.

Thanks & Regards,
Renu Yadav
Re: Data Locality Issue
What are the parameters on which locality depends?

On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav <yren...@gmail.com> wrote:
> Hi,
>
> I am working on Spark 1.4, reading an ORC table using a DataFrame and converting that DF to an RDD.
>
> In the Spark UI I observe that around 50% of the tasks are running with locality level ANY and very few are LOCAL.
>
> What would be the possible reason for this? Please help. I have even changed the locality settings.
>
> Thanks & Regards,
> Renu Yadav
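For reference, the delay-scheduling waits that decide which locality level a task is launched at are the ones below (the values shown are the usual defaults; the per-level settings fall back to spark.locality.wait). Beyond these, locality mostly depends on where the HDFS blocks sit relative to the executors YARN happened to allocate.

  spark-submit \
    --conf spark.locality.wait=3s \
    --conf spark.locality.wait.process=3s \
    --conf spark.locality.wait.node=3s \
    --conf spark.locality.wait.rack=3s \
    ...usual application arguments...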
Re: spark 1.4 GC issue
I have tried with G1 GC. Please, if anyone can share their GC settings.

At the code level I am:
1. reading an ORC table using a DataFrame
2. mapping the DF to an RDD of my case class
3. converting that RDD to a paired RDD
4. applying combineByKey
5. saving the result to an ORC file

Please suggest.

Regards,
Renu Yadav

On Fri, Nov 13, 2015 at 8:01 PM, Renu Yadav <yren...@gmail.com> wrote:
> I am using Spark 1.4 and my application is spending much of its time in GC, around 60-70% of the time for each task.
>
> I am using the parallel GC.
> Please help, somebody, as soon as possible.
>
> Thanks,
> Renu
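As a starting point only, a commonly cited set of G1 options for executors (illustrative, not tuned for this workload; GC time this high usually also points at shuffle memory pressure rather than the collector alone):

  spark-submit \
    --conf "spark.executor.extraJavaOptions=-XX:+UseG1GC -XX:InitiatingHeapOccupancyPercent=35 -XX:ConcGCThreads=4 -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
    ...usual application arguments...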
spark 1.4 GC issue
I am using Spark 1.4 and my application is spending much of its time in GC, around 60-70% of the time for each task.

I am using the parallel GC.
Please help, somebody, as soon as possible.

Thanks,
Renu
Rdd Partitions issue
I am reading Parquet files from a directory which has 400 files of at most 180 MB each, so while reading I would expect 400 partitions, as the split size is 256 MB in my case. But it is creating 787 partitions. Why is that? Please help.

Thanks,
Renu
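One possible explanation, assuming the effective split size is actually the 128 MB HDFS block size rather than the intended 256 MB:

  splits per 180 MB file = ceil(180 / 128) = 2
  400 files x 2 splits   = 800 partitions

which is close to the 787 observed; files smaller than one block contribute only a single split, which would account for the gap.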
Change Orc split size
Hi,

I am reading data from a Hive ORC table using Spark SQL, which is taking 256 MB as the split size. How can I change this size?

Thanks,
Renu
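A sketch of the knobs that usually control this, assuming a HiveContext named hiveContext is doing the read; which setting takes effect depends on the Spark/Hive versions and the ORC reader in use, so it is worth verifying on the cluster (134217728 bytes = 128 MB, shown purely as an example value):

  hiveContext.setConf("mapreduce.input.fileinputformat.split.maxsize", "134217728")
  hiveContext.setConf("mapred.max.split.size", "134217728")
  // The ETL strategy makes ORC split generation honour the sizes above more predictably.
  hiveContext.setConf("hive.exec.orc.split.strategy", "ETL")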
How is driver memory utilized
Hi,

I have a query regarding driver memory: for which tasks is driver memory used?

Please help.
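In broad terms (a summary, not an exhaustive list): the driver holds the DAG and task-scheduling state, any results pulled back by actions such as collect() or take(), and broadcast variables while they are being built. The settings usually involved, with illustrative values:

  spark-submit \
    --driver-memory 8g \
    --conf spark.driver.maxResultSize=2g \
    ...usual application arguments...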
Fwd: Spark job failed
---------- Forwarded message ----------
From: Renu Yadav <yren...@gmail.com>
Date: Mon, Sep 14, 2015 at 4:51 PM
Subject: Spark job failed
To: d...@spark.apache.org

I am getting the error below while running a Spark job:

storage.DiskBlockObjectWriter: Uncaught exception while reverting partial writes to file /data/vol5/hadoop/yarn/local/usercache/renu_yadav/appcache/application_1438196554863_31545/spark-4686a622-82be-418e-a8b0-1653458bc8cb/22/temp_shuffle_8c437ba7-55d2-4520-80ec-adcfe932b3bd
java.io.FileNotFoundException: /data/vol5/hadoop/yarn/local/usercache/renu_yadav/appcache/application_1438196554863_31545/spark-4686a622-82be-418e-a8b0-1653458bc8cb/22/temp_shuffle_8c437ba7-55d2-4520-80ec-adcfe932b3bd (No such file or directory)

I am processing 1.3 TB of data. The transformations are: read from Hadoop -> map to (key, value) -> coalesce(2000) -> groupByKey, then sort each group by server_ts, select the most recent record, and save the result as Parquet.

The command is:

  spark-submit --class com.test.Myapp --master yarn-cluster \
    --driver-memory 16g --executor-memory 20g --executor-cores 5 --num-executors 150 \
    --files /home/renu_yadav/fmyapp/hive-site.xml \
    --conf spark.yarn.preserve.staging.files=true \
    --conf spark.shuffle.memoryFraction=0.6 \
    --conf spark.storage.memoryFraction=0.1 \
    --conf SPARK_SUBMIT_OPTS="-XX:MaxPermSize=768m" \
    --conf spark.akka.timeout=40 \
    --conf spark.locality.wait=10 \
    --conf spark.yarn.executor.memoryOverhead=8000 \
    --conf SPARK_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
    --conf spark.reducer.maxMbInFlight=96 \
    --conf spark.shuffle.file.buffer.kb=64 \
    --conf spark.core.connection.ack.wait.timeout=120 \
    --jars /usr/hdp/2.2.6.0-2800/hive/lib/datanucleus-api-jdo-3.2.6.jar,/usr/hdp/2.2.6.0-2800/hive/lib/datanucleus-core-3.2.10.jar,/usr/hdp/2.2.6.0-2800/hive/lib/datanucleus-rdbms-3.2.9.jar \
    myapp_2.10-1.0.jar

Cluster configuration:
20 nodes
32 cores per node
125 GB RAM per node

Please help.

Thanks & Regards,
Renu Yadav
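A sketch, with an illustrative schema, of selecting the most recent record per key using reduceByKey instead of groupByKey: only one record per key survives the shuffle rather than the whole group being materialised, which can noticeably reduce shuffle pressure on a 1.3 TB input.

  case class Event(key: String, serverTs: Long, payload: String)   // hypothetical schema

  // events: RDD[Event], built as in the job described above
  val latestPerKey = events
    .map(e => (e.key, e))
    .reduceByKey((a, b) => if (a.serverTs >= b.serverTs) a else b)
    .values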