Fixed byte array issue

2023-11-01 Thread KhajaAsmath Mohammed
Hi, I am facing a fixed byte array issue when reading a spark dataframe. spark.sql.parquet.enableVectorizedReader = false solves my issue but it causes a significant performance hit. Any resolution for this? Thanks, Asmath
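A minimal Scala sketch of a narrower workaround, assuming the error comes from a column type (for example DECIMAL stored as a fixed-length byte array) that the vectorized reader cannot handle: turn the reader off only around the affected read instead of for the whole job. The path is hypothetical.

    import org.apache.spark.sql.SparkSession

    object FixedByteArrayRead {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("fixed-byte-array-read").getOrCreate()

        // Disable the vectorized reader only for the problematic dataset, force the scan,
        // then restore it so other reads keep the faster code path.
        spark.conf.set("spark.sql.parquet.enableVectorizedReader", "false")
        val problem = spark.read.parquet("/data/problem_table").cache()
        problem.count()   // Spark reads lazily, so materialize while the setting is off
        spark.conf.set("spark.sql.parquet.enableVectorizedReader", "true")

        problem.createOrReplaceTempView("problem_table")
      }
    }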

Maximum executors in EC2 Machine

2023-10-23 Thread KhajaAsmath Mohammed
Hi, I am running a spark job on a spark EC2 machine which has 40 cores. Driver and executor memory is 16 GB. I am using local[*] but I still get only one executor (the driver). Is there a way to get more executors with this config? I am not using yarn or mesos in this case. Only one machine which is

Re: Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread KhajaAsmath Mohammed
fsets. I assume you were using > file-based instead of Kafka-based of data sources. Are the incoming data > generated in mini-batch files or in a single large file? Have you had this > type of problem before? > >> On 7/21/22 1:02 PM, KhajaAsmath Mohammed wrote: >> Hi,

Spark Structured Streaming -- Cannot consume next messages

2022-07-21 Thread KhajaAsmath Mohammed
Hi, I am seeing weird behavior in our spark structured streaming application where the offsets are not getting picked up by the streaming job. If I delete the checkpoint directory and run the job again, I can see the data for the first batch but it is not picking up new offsets again from the next

Spark streaming / confluent Kafka- messages are empty

2022-06-09 Thread KhajaAsmath Mohammed
Hi, I am trying to read data from confluent Kafka using the avro schema registry. Messages are always empty and the stream always shows empty records. Any suggestions on this please? Thanks, Asmath

REGEX Spark - Dataframe

2021-06-26 Thread KhajaAsmath Mohammed
Hi, What is the equivalent function using dataframes in spark? I was able to make it work with spark sql but am looking to use dataframes instead. df11=self.spark.sql("""SELECT transaction_card_bin,(CASE WHEN transaction_card_bin REGEXP '^5[1-5][\d]*' THEN "MC" WHEN transaction_card_bin REGEXP
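A minimal Scala sketch of the DataFrame-side equivalent (the sample values and the second pattern are assumptions): rlike() is the DataFrame counterpart of SQL REGEXP, combined with when/otherwise for the CASE logic; the PySpark API mirrors this.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.when

    object RegexDataFrame {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().master("local[*]").appName("regex-df").getOrCreate()
        import spark.implicits._

        // Hypothetical sample values for illustration.
        val df = Seq("5212345678", "4111111111", "6011000000").toDF("transaction_card_bin")

        val labelled = df.withColumn("card_brand",
          when($"transaction_card_bin".rlike("^5[1-5]\\d*"), "MC")
            .when($"transaction_card_bin".rlike("^4\\d*"), "VISA")   // assumed second pattern
            .otherwise("OTHER"))

        labelled.show(false)
      }
    }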

S3 Access Issues - Spark

2021-05-18 Thread KhajaAsmath Mohammed
Hi, I have written a sample spark job that reads the data residing in hbase. I keep getting the below error; any suggestions to resolve this please? Caused by: java.lang.IllegalArgumentException: AWS Access Key ID and Secret Access Key must be specified by setting the fs.s3.awsAccessKeyId and
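A sketch of one way to supply the credentials the error asks for (bucket path and environment-variable names are assumptions; on EMR/EC2 an IAM instance profile avoids putting keys in the job at all). The property names follow the URI scheme: s3:// uses fs.s3.*, s3a:// uses fs.s3a.*.

    import org.apache.spark.sql.SparkSession

    object S3Read {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("s3-read").getOrCreate()
        val hadoopConf = spark.sparkContext.hadoopConfiguration

        // Reading the keys from the environment here is just for illustration.
        hadoopConf.set("fs.s3.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
        hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
        // s3a equivalents: fs.s3a.access.key / fs.s3a.secret.key

        val df = spark.read.parquet("s3://my-bucket/exported-hbase-data")   // hypothetical path
        df.show(5)
      }
    }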

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-15 Thread KhajaAsmath Mohammed
Thanks everyone. I was able to resolve this. Here is what I did: just passed the conf file using the --files option. The mistake I made was reading the json conf file before creating the spark session. Reading it after creating the spark session fixed it. Thanks once again for your valuable suggestions
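A minimal Scala sketch of the fix described above (the thread's code is PySpark, but the flow is the same; file name and paths are assumptions): ship the config with --files conf.json, build the SparkSession first, then resolve the localized copy.

    import org.apache.spark.SparkFiles
    import org.apache.spark.sql.SparkSession
    import scala.io.Source

    object ConfAfterSession {
      def main(args: Array[String]): Unit = {
        // Create the session before touching the shipped file.
        val spark = SparkSession.builder().appName("conf-after-session").getOrCreate()

        // With spark-submit --files /appl/common/ftp/conf.json the file is localized into the
        // container's working directory in cluster mode, so the bare name usually resolves;
        // fall back to SparkFiles.get otherwise.
        val localPath =
          if (new java.io.File("conf.json").exists) "conf.json"
          else SparkFiles.get("conf.json")

        val confJson = Source.fromFile(localPath).mkString
        // ... parse confJson (e.g. with json4s) and drive the job from it ...
        println(confJson.take(200))
      }
    }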

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-14 Thread KhajaAsmath Mohammed
7g --executor-memory 7g /appl/common/ftp/ftp_event_data.py /appl/common/ftp/conf.json 2021-05-10 7 On Fri, May 14, 2021 at 6:19 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Sorry my bad, it did not resolve the issue. I still have the same issue. > can anyone please

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-14 Thread KhajaAsmath Mohammed
Sorry my bad, it did not resolve the issue. I still have the same issue. can anyone please guide me. I was still running as a client instead of a cluster. On Fri, May 14, 2021 at 5:05 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > You are right. It worked but I st

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-14 Thread KhajaAsmath Mohammed
You are right. It worked but I still don't understand why I need to pass that to all executors. On Fri, May 14, 2021 at 5:03 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > I am using json only to read properties before calling spark session. I > don't know why we

Re: [EXTERNAL] Urgent Help - Py Spark submit error

2021-05-14 Thread KhajaAsmath Mohammed
cal FS) > /appl/common/ftp/conf.json > > > > > > *From: *KhajaAsmath Mohammed > *Date: *Friday, May 14, 2021 at 4:50 PM > *To: *"user @spark" > *Subject: *[EXTERNAL] Urgent Help - Py Spark submit error > > > > /appl/common/ftp/conf.json >

Urgent Help - Py Spark submit error

2021-05-14 Thread KhajaAsmath Mohammed
Hi, I am having a weird situation where the below command works when the deploy mode is a client and fails if it is a cluster. spark-submit --master yarn --deploy-mode client --files /etc/hive/conf/hive-site.xml,/etc/hadoop/conf/core-site.xml,/etc/hadoop/conf/hdfs-site.xml --driver-memory 70g

spark hbase

2021-04-20 Thread KhajaAsmath Mohammed
Hi, I have tried multiple ways to use hbase-spark and none of them works as expected. SHC and the hbase-spark library load all the data on the executors and it runs forever. https://ramottamado.dev/how-to-use-hbase-fuzzyrowfilter-in-spark/ The above link has the solution that I am looking for

Spark submit hbase issue

2021-04-14 Thread KhajaAsmath Mohammed
Hi, Spark submit is connecting to localhost instead of the zookeeper mentioned in hbase-site.xml. The same program works in the IDE, which picks up hbase-site.xml. What am I missing in spark submit? > >  > spark-submit --driver-class-path >

Override Jackson jar - Spark Submit

2021-04-14 Thread KhajaAsmath Mohammed
Hi, I am having similar issue as mentioned in the below link but was not able to resolve. any other solutions please? https://stackoverflow.com/questions/57329060/exclusion-of-dependency-of-spark-core-in-cdh Thanks, Asmath

Spark2.4 json Jackson errors

2021-04-13 Thread KhajaAsmath Mohammed
Hi, I am having an issue when running custom applications on spark2.4. I was able to run successfully in the windows IDE but cannot run this in emr spark2.4. I get a jsonmethods not found error. I have included json4s in the uber jar but still I get this error. Any solution to resolve this? Thanks, Asmath

Re: Spark Session error with 30s

2021-04-13 Thread KhajaAsmath Mohammed
I was able to resolve this by changing the hdfs-site.xml as I mentioned in my initial thread. Thanks, Asmath > On Apr 12, 2021, at 8:35 PM, Peng Lei wrote: > >  > Hi KhajaAsmath Mohammed > Please check the configuration of "spark.speculation.interval"

Re: Spark Session error with 30s

2021-04-12 Thread KhajaAsmath Mohammed
scala:2520) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession.scala:936) at org.apache.spark.sql.SparkSession$Builder$$anonfun$7.apply(SparkSession Thanks, Asmath On Mon, Apr 12, 2021 at 2:20 PM KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > I a

Re: Spark Session error with 30s

2021-04-12 Thread KhajaAsmath Mohammed
give it "30s". > Is it a time property somewhere that really just wants MS or something? But > most time properties (all?) in Spark should accept that type of input anyway. > Really depends on what property has a problem and what is setting it. > >> On Mon, Apr

Spark Session error with 30s

2021-04-12 Thread KhajaAsmath Mohammed
Hi, I am getting a weird error when running a spark job in the emr cluster. The same program runs on my local machine. Is there anything that I need to do to resolve this? 21/04/12 18:48:45 ERROR SparkContext: Error initializing SparkContext. java.lang.NumberFormatException: For input string: "30s" I tried
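A sketch of one workaround, under the assumption (common for this error on EMR) that the offending value is an HDFS time property such as dfs.client.datanode-restart.timeout set to "30s" in hdfs-site.xml, which some Hadoop client versions parse with Long.parseLong. Overriding it with a plain number through Spark's hadoop-conf prefix avoids editing the cluster files.

    import org.apache.spark.sql.SparkSession

    object ThirtySecondsWorkaround {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("emr-job")
          // spark.hadoop.* entries are copied into the Hadoop Configuration used by the job.
          .config("spark.hadoop.dfs.client.datanode-restart.timeout", "30")
          .getOrCreate()

        spark.range(10).count()   // placeholder action
        spark.stop()
      }
    }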

Spark Hbase Hive error in EMR

2021-04-09 Thread KhajaAsmath Mohammed
Hi, I am trying to connect hbase which sits on top of hive as external table. I am getting below exception. Am I missing anything to pass here? 21/04/09 18:08:11 INFO ZooKeeper: Client environment:user.dir=/ 21/04/09 18:08:11 INFO ZooKeeper: Initiating client connection,

Re: Rdd - zip with index

2021-03-25 Thread KhajaAsmath Mohammed
or any monetary damages arising from such > loss, damage or destruction. > > > >> On Wed, 24 Mar 2021 at 01:19, KhajaAsmath Mohammed >> wrote: >> Hi, >> >> I have 10gb file that should be loaded into spark dataframe. This file is >> csv with header

Re: Rdd - zip with index

2021-03-24 Thread KhajaAsmath Mohammed
Thanks Mich. I understood what I am supposed to do now, will try these options. I still don't understand how spark will split the large file. I have a 10 GB file which I want to split automatically after reading. I can split and load the file before reading but it is a very big requirement

Re: Rdd - zip with index

2021-03-23 Thread KhajaAsmath Mohammed
format(“avro”).save(...) > > >> On 24 Mar 2021, at 03:19, KhajaAsmath Mohammed >> wrote: >> >> Hi, >> >> I have 10gb file that should be loaded into spark dataframe. This file is >> csv with header and we were using rdd.zipwithindex to get colu

Rdd - zip with index

2021-03-23 Thread KhajaAsmath Mohammed
Hi, I have a 10gb file that should be loaded into a spark dataframe. This file is csv with a header and we were using rdd.zipwithindex to get column names and convert to avro accordingly. I am assuming this is taking a long time because only one executor runs and it never achieves parallelism. Is there an easy
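A minimal Scala sketch of an alternative, assuming the zipWithIndex pass is only there to pull the header out: let the CSV reader take the column names from the first line, which keeps the 10 GB read split across block-sized partitions. Paths are hypothetical.

    import org.apache.spark.sql.SparkSession

    object CsvToAvro {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("csv-to-avro").getOrCreate()

        val df = spark.read
          .option("header", "true")        // first line supplies the column names
          .option("inferSchema", "false")  // keep strings unless an explicit schema is given
          .csv("/data/large_input.csv")    // hypothetical HDFS path

        // Avro output: built-in "avro" format in Spark 2.4+, spark-avro package on older versions.
        df.write.format("avro").save("/data/large_input_avro")
      }
    }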

Re: Repartition or Coalesce not working

2021-03-22 Thread KhajaAsmath Mohammed
Thanks Sean. I just realized it. Let me try that. On Mon, Mar 22, 2021 at 12:31 PM Sean Owen wrote: > You need to do something with the result of repartition. You haven't > changed textDF > > On Mon, Mar 22, 2021, 12:15 PM KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wr
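A small Scala sketch of Sean's point (the thread's code is Python, but the API is the same; partition count and paths are assumptions): repartition returns a new DataFrame rather than modifying textDF in place, so the result has to be assigned and used.

    import org.apache.spark.sql.SparkSession

    object RepartitionFix {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("repartition-fix").getOrCreate()
        val textDF = spark.read.text("/data/large_file")   // stand-in for the thread's dataframe

        // repartition() does not mutate textDF; keep and use the returned DataFrame.
        val repartitioned = textDF.repartition(200)        // assumed target partition count
        println(s"partitions: ${repartitioned.rdd.getNumPartitions}")
        repartitioned.write.mode("overwrite").parquet("/data/output")   // hypothetical sink
      }
    }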

Repartition or Coalesce not working

2021-03-22 Thread KhajaAsmath Mohammed
Hi, I have a use case where there are large files in hdfs. Size of the file is 3 GB. It is an existing code in production and I am trying to improve the performance of the job. Sample Code: textDF=dataframe ( This is dataframe that got created from hdfs path) logging.info("Number of

Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-21 Thread KhajaAsmath Mohammed
; transaction batch size. > > > KhajaAsmath Mohammed wrote on Wed, Oct 21, 2020 at 12:19 PM: >> Yes. Changing back to latest worked but I still see the slowness compared to >> flume. >> >> Sent from my iPhone >> >>>> On Oct 20, 2020, at 10:21 PM, lec ssmi wrote: &

Re: Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread KhajaAsmath Mohammed
2:19 PM wrote: >> Are you getting any output? Streaming jobs typically run forever, and keep >> processing data as it comes in the input. If a streaming job is working >> well, it will typically generate output at a certain cadence >> >> >> >> From: KhajaAs

Spark Structured streaming - Kafka - slowness with query 0

2020-10-20 Thread KhajaAsmath Mohammed
Hi, I have started using spark structured streaming for reading data from kafka and the job is very slow. The number of output rows keeps increasing in query 0 and the job is running forever. Any suggestions for this please? [image: image.png] Thanks, Asmath

Spark JDBC- OAUTH example

2020-09-30 Thread KhajaAsmath Mohammed
Hi, I am looking for some information on how to read a database which has oauth authentication with spark-jdbc. Any links that point to this approach would be really helpful. Thanks, Asmath

Fwd: Time stamp in Kafka

2020-08-15 Thread KhajaAsmath Mohammed
Hi, > > We have a producer application that has written data to Kafka topic. > > We are reading the data from Kafka topic using spark streaming but the time > stamp on Kafka is 1969-12-31 format for all the data. > > Is there a way to fix this while reading ? > > Thanks, > Asmath >

Spark Structured streaming 2.4 - Kill and deploy in yarn

2020-08-10 Thread KhajaAsmath Mohammed
Hi , I am looking for some information on how to gracefully kill the spark structured streaming kafka job and redeploy it. How to kill a spark structured job in YARN? any suggestions on how to kill gracefully? I was able to monitor the job from SQL tab but how can I kill this job when deployed

Application Upgrade - structured streaming

2020-07-10 Thread KhajaAsmath Mohammed
Hi, When doing an application upgrade for spark structured streaming, do we need to delete the checkpoint or does it start consuming offsets from the point we left off? For the kafka source we need to use the option "StartingOffsets" with a json string like """ {"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}
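A minimal Scala sketch of how that option is wired in (broker, topics and paths are assumptions; the offsets are the example values quoted above). Note that startingOffsets only applies on a fresh start: if the checkpoint directory is kept across the upgrade, the query resumes from the offsets recorded there, so the checkpoint normally does not need to be deleted unless the query itself changed incompatibly.

    import org.apache.spark.sql.SparkSession

    object StartingOffsetsExample {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("upgrade-offsets").getOrCreate()

        val stream = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   // assumed broker list
          .option("subscribe", "topicA,topicB")
          // Honoured only when no checkpoint exists; otherwise the committed offsets win.
          .option("startingOffsets", """{"topicA":{"0":23,"1":-1},"topicB":{"0":-2}}""")
          .load()

        stream.writeStream
          .format("console")
          .option("checkpointLocation", "/checkpoints/upgrade-offsets")   // hypothetical path
          .start()
          .awaitTermination()
      }
    }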

Re: Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-06 Thread KhajaAsmath Mohammed
iew, especially the gap between highest > offset and committed offset. > > Hope this helps. > > Thanks, > Jungtaek Lim (HeartSaVioR) > > > On Mon, Jul 6, 2020 at 2:53 AM Gabor Somogyi > wrote: > >> In 3.0 the community just added it. >> >> On Sun, 5 Jul 2020

Spark structured streaming -Kafka - deployment / monitor and restart

2020-07-05 Thread KhajaAsmath Mohammed
Hi, We are trying to move our existing code from spark dstreams to structured streaming for one of the old application which we built few years ago. Structured streaming job doesn’t have streaming tab in sparkui. Is there a way to monitor the job submitted by us in structured streaming ? Since

Loop through Dataframes

2019-10-06 Thread KhajaAsmath Mohammed
Hi, What is the best approach to loop through 3 dataframes in scala based on some keys instead of using collect. Thanks, Asmath

Re: Structured Streaming Error

2019-05-22 Thread KhajaAsmath Mohammed
artingOffsets contains specific offsets, you must specify all > TopicPartitions. > > BR, > G > > > On Tue, May 21, 2019 at 9:16 PM KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Hi, >> >> I am getting below errror when running sampl

Structured Streaming Error

2019-05-21 Thread KhajaAsmath Mohammed
Hi, I am getting the below error when running a sample streaming app. Does anyone have a resolution for this? JSON OFFSET {"test":{"0":0,"1":0,"2":0,"3":0,"4":0,"5":0}} - Herreee root |-- key: string (nullable = true) |-- value: string (nullable = true) |-- topic: string (nullable = true) |--

Spark SQL JDBC teradata syntax error

2019-05-03 Thread KhajaAsmath Mohammed
Hi I have followed link https://community.teradata.com/t5/Connectivity/Teradata-JDBC-Driver-returns-the-wrong-schema-column-nullability/m-p/77824 to connect teradata from spark. I was able to print schema if I give table name instead of sql query. I am getting below error if I give query(code

Spark SQL Teradata load is very slow

2019-05-02 Thread KhajaAsmath Mohammed
Hi, I have a teradata table which has more than 2.5 billion records and the data size is around 600 GB. I am not able to pull it efficiently using spark SQL and it has been running for more than 11 hours. Here is my code. val df2 = sparkSession.read.format("jdbc") .option("url",
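A sketch of a partitioned JDBC read (URL, column name, bounds and partition count are assumptions): a single-connection read of 2.5 billion rows is serial, while partitionColumn/lowerBound/upperBound/numPartitions make Spark open parallel connections, each reading one slice.

    import org.apache.spark.sql.SparkSession

    object TeradataPartitionedRead {
      def main(args: Array[String]): Unit = {
        val sparkSession = SparkSession.builder().appName("teradata-read").getOrCreate()

        val df2 = sparkSession.read.format("jdbc")
          .option("url", "jdbc:teradata://td-host/DATABASE=mydb")   // hypothetical URL
          .option("driver", "com.teradata.jdbc.TeraDriver")
          .option("dbtable", "mydb.big_table")                      // hypothetical table
          .option("user", "user")
          .option("password", "password")
          // Parallel slices; the column must be numeric or date-like and reasonably uniform.
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "2500000000")
          .option("numPartitions", "100")
          .load()

        df2.write.mode("overwrite").parquet("/data/teradata_export")   // hypothetical sink
      }
    }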

Java Heap Space error - Spark ML

2019-03-22 Thread KhajaAsmath Mohammed
Hi, I am getting the below exception when using Spark KMeans. Any solutions from the experts would be really helpful. val kMeans = new KMeans().setK(reductionCount).setMaxIter(30) val kMeansModel = kMeans.fit(df) The error occurs when calling kMeans.fit. Exception in thread "main"

Streaming Tab in Kafka Structured Streaming

2019-02-18 Thread KhajaAsmath Mohammed
Hi, I am new to structured streaming but used dstreams a lot. The difference I saw in the spark UI is the streaming tab for dstreams. Is there a way to know how many records and batches were executed in structured streaming, and also any option on how to see the streaming tab? Thanks, Asmath

Transaction Example for spark streaming in Spark 2.2

2018-03-22 Thread KhajaAsmath Mohammed
Hi Cody, I am following your example to implement the exactly once semantics and also utilize storing the offsets in a database. The question I have is how to use hive instead of traditional datastores. The write to hive will succeed even if there is an issue with saving offsets into the DB. Could you please

Dynamic allocation Spark Streaming

2018-03-06 Thread KhajaAsmath Mohammed
Hi, I have enabled dynamic allocation for spark streaming application but the number of containers always shows as 2. Is there a way to get more when job is running? Thanks, Asmath

Joins in spark for large tables

2018-02-28 Thread KhajaAsmath Mohammed
Hi, Is there any best approach to reduce shuffling in spark. I have two tables and both of them are large. any suggestions? I saw only about broadcast but that will not work in my case. Thanks, Asmath

Re: Efficient way to compare the current row with previous row contents

2018-02-12 Thread KhajaAsmath Mohammed
I am also looking for the same answer. Will this work in streaming application too ?? Sent from my iPhone > On Feb 12, 2018, at 8:12 AM, Debabrata Ghosh wrote: > > Georg - Thanks ! Will you be able to help me with a few examples please. > > Thanks in advance again ! >

Structure streaming to hive with kafka 0.9

2018-02-01 Thread KhajaAsmath Mohammed
Hi, Could anyone please share example of on how to use spark structured streaming with kafka and write data into hive. Versions that I do have are Spark 2.1 on CDH5.10 Kafka 0.9 Thanks, Asmath Sent from my iPhone - To

Spark Streaming checkpoint

2018-01-29 Thread KhajaAsmath Mohammed
Hi, I have written spark streaming job to use the checkpoint. I have stopped the streaming job for 5 days and then restart it today. I have encountered weird issue where it shows as zero records for all cycles till date. is it causing data loss? [image: Inline image 1] Thanks, Asmath

Production Critical : Data loss in spark streaming

2018-01-22 Thread KhajaAsmath Mohammed
Hi, I have been using the spark streaming with kafka. I have to restart the application daily due to kms issue and after restart the offsets are not matching with the point I left. I am creating checkpoint directory with val streamingContext = StreamingContext.getOrCreate(checkPointDir, () =>
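A minimal Scala sketch of the getOrCreate pattern referenced above (paths and batch interval are assumptions). Two caveats relevant to restarts: everything that defines the DStream graph must be built inside the factory function, and a checkpoint only restores the same application code, so many teams also store Kafka offsets themselves for stronger guarantees.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object CheckpointedStream {
      private val checkPointDir = "/checkpoints/kafka-stream"   // hypothetical path

      // getOrCreate calls this only when no checkpoint exists; otherwise the graph is restored.
      private def createContext(): StreamingContext = {
        val conf = new SparkConf().setAppName("kafka-stream")
        val ssc = new StreamingContext(conf, Seconds(30))        // assumed batch interval
        ssc.checkpoint(checkPointDir)
        // ... create the Kafka direct stream and its transformations here ...
        ssc
      }

      def main(args: Array[String]): Unit = {
        val ssc = StreamingContext.getOrCreate(checkPointDir, createContext _)
        ssc.start()
        ssc.awaitTermination()
      }
    }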

Gracefully shutdown spark streaming application

2018-01-21 Thread KhajaAsmath Mohammed
Hi, Could anyone please provide your thoughts on how to kill spark streaming application gracefully. I followed link of http://why-not-learn-something.blogspot.in/2016/05/apache-spark-streaming-how-to-do.html https://github.com/lanjiang/streamingstopgraceful I played around with having either
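A short Scala sketch of the two approaches discussed in the linked posts (marker path and batch interval are assumptions): either let Spark finish in-flight batches on SIGTERM via configuration, or stop the context explicitly when an external marker appears.

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    object GracefulStop {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("graceful-stream")
          // Option 1: finish in-flight batches when the JVM receives SIGTERM.
          .set("spark.streaming.stopGracefullyOnShutdown", "true")

        val ssc = new StreamingContext(conf, Seconds(30))   // assumed batch interval
        // ... build the stream here ...
        ssc.start()

        // Option 2: poll for an external stop marker and stop explicitly (path is hypothetical).
        val fs = org.apache.hadoop.fs.FileSystem.get(ssc.sparkContext.hadoopConfiguration)
        val marker = new org.apache.hadoop.fs.Path("/tmp/stop_graceful-stream")
        var stopped = false
        while (!stopped && !ssc.awaitTerminationOrTimeout(60 * 1000)) {
          if (fs.exists(marker)) {
            ssc.stop(stopSparkContext = true, stopGracefully = true)
            stopped = true
          }
        }
      }
    }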

Re: Spark Stream is corrupted

2018-01-18 Thread KhajaAsmath Mohammed
Any solutions for this problem please? Sent from my iPhone > On Jan 17, 2018, at 10:39 PM, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > wrote: > > Hi, > > I have created a streaming object from checkpoint but it always throws up an > error as stream cor

Spark Stream is corrupted

2018-01-17 Thread KhajaAsmath Mohammed
Hi, I have created a streaming object from checkpoint but it always throws an error that the stream is corrupted when I restart the spark streaming job. Any solution for this? private def createStreamingContext( sparkCheckpointDir: String, sparkSession: SparkSession, batchDuration: Int, config:

Re: Spark Streaming not reading missed data

2018-01-16 Thread KhajaAsmath Mohammed
...@gmail.com> wrote: > It could be a missing persist before the checkpoint > > > On 16. Jan 2018, at 22:04, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > wrote: > > > > Hi, > > > > Spark streaming job from kafka is not picking the messages and is alw

Spark Streaming not reading missed data

2018-01-16 Thread KhajaAsmath Mohammed
Hi, Spark streaming job from kafka is not picking the messages and is always taking the latest offsets when streaming job is stopped for 2 hours. It is not picking up the offsets that are required to be processed from checkpoint directory. any suggestions on how to process the old messages too

Null pointer exception in checkpoint directory

2018-01-16 Thread KhajaAsmath Mohammed
Hi, I keep getting null pointer exception in the spark streaming job with checkpointing. any suggestions to resolve this. Exception in thread "pool-22-thread-9" java.lang.NullPointerException at org.apache.spark.streaming.CheckpointWriter$CheckpointWriteHandler.run(Checkpoint.scala:233) at

Re: What does Blockchain technology mean for Big Data? And how Hadoop/Spark will play role with it?

2017-12-18 Thread KhajaAsmath Mohammed
I am looking for same answer too .. will wait for response from other people Sent from my iPhone > On Dec 18, 2017, at 10:56 PM, Gaurav1809 wrote: > > Hi All, > > Will Bigdata tools & technology work with Blockchain in future? Any possible > use cases that anyone is

Spark ListenerBus

2017-12-06 Thread KhajaAsmath Mohammed
Hi, I am running spark sql job and it completes without any issues. I am getting errors as ERROR: SparkListenerBus has already stopped! Dropping event SparkListenerExecutorMetricsUpdate after completion of job. could anyone share your suggestions on how to avoid it. Thanks, Asmath

Re: JDK1.8 for spark workers

2017-11-29 Thread KhajaAsmath Mohammed
This didn't work. I tried it but no luck. On Wed, Nov 29, 2017 at 7:49 PM, Vadim Semenov <vadim.seme...@datadoghq.com> wrote: > You can pass `JAVA_HOME` environment variable > > `spark.executorEnv.JAVA_HOME=/usr/lib/jvm/java-1.8.0` > > On Wed, Nov 29, 2017 at 10:54 AM,

JDK1.8 for spark workers

2017-11-29 Thread KhajaAsmath Mohammed
Hi, I am running the cloudera version of spark2.1 and our cluster is on JDK1.7. For some of the libraries I need JDK1.8; is there a way to run the Spark workers on JDK1.8 without upgrading? I was able to run the driver in JDK 1.8 by setting the path but not the workers. 17/11/28 20:22:27 WARN

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
[image: Inline image 1] This is what we are on. On Wed, Nov 22, 2017 at 12:33 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > We use oracle JDK. we are on unix. > > On Wed, Nov 22, 2017 at 12:31 PM, Georg Heiler <georg.kf.hei...@gmail.com> > wrote: > &

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
; on centos 7.3 > > Are these installed? > > KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> schrieb am Mi. 22. Nov. > 2017 um 19:29: > >> I passed keytab, renewal is enabled by running the script every eight >> hours. User gets renewed by the script every eight hours. &

Re: Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
I passed keytab, renewal is enabled by running the script every eight hours. User gets renewed by the script every eight hours. On Wed, Nov 22, 2017 at 12:27 PM, Georg Heiler <georg.kf.hei...@gmail.com> wrote: > Did you pass a keytab? Is renewal enabled in your kdc? > KhajaAsm

Spark Streaming Kerberos Issue

2017-11-22 Thread KhajaAsmath Mohammed
Hi, I have written a spark streaming job and the job runs successfully for more than 36 hours. After around 36 hours the job fails with a kerberos issue. Any solution on how to resolve it? org.apache.spark.SparkException: Task failed while writing rows. at

Spark Streaming Hive Dynamic Partitions Issue

2017-11-22 Thread KhajaAsmath Mohammed
Hi, I am able to write data into hive tables from spark streaming. The job ran successfully for 37 hours and then I started getting errors with task failures as below. The hive table has data up until the tasks failed. Job aborted due to stage failure: Task 0 in stage 691.0 failed 4 times, most recent

Spark Streaming in Wait mode

2017-11-17 Thread KhajaAsmath Mohammed
Hi, I am running spark streaming job and it is not picking up the next batches but the job is still shows as running on YARN. is this expected behavior if there is no data or waiting for data to pick up? I am almost behind 4 hours of batches (30 min interval) [image: Inline image 1] [image:

Struct Type

2017-11-17 Thread KhajaAsmath Mohammed
Hi, I have the following schema in a dataframe and I want to extract the key which matches MaxSpeed from the array, and its corresponding value. |-- tags: array (nullable = true) ||-- element: struct (containsNull = true) |||-- key: string (nullable = true) |||--
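A Scala sketch of one way to pull that key/value pair out on Spark 2.x (the source path and the "id" key column are assumptions): explode the array of structs, keep only the MaxSpeed entries, and select the value.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, explode}

    object MaxSpeedTag {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("tags-extract").getOrCreate()
        val df = spark.read.parquet("/data/with_tags")   // hypothetical source with the schema above

        // Explode the key/value structs, filter for MaxSpeed, and carry an identifying column
        // so the value can be joined back onto the original rows if needed.
        val maxSpeed = df
          .select(col("id"), explode(col("tags")).as("tag"))
          .where(col("tag.key") === "MaxSpeed")
          .select(col("id"), col("tag.value").as("max_speed"))

        maxSpeed.show(5, truncate = false)
      }
    }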

Re: Spark Streaming Job completed without executing next batches

2017-11-16 Thread KhajaAsmath Mohammed
Here is screenshot . Status shows finished but it should be running for next batch to pick up the data. [image: Inline image 1] On Thu, Nov 16, 2017 at 10:01 PM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Hi, > > I have scheduled spark streaming job to run e

Spark Streaming Job completed without executing next batches

2017-11-16 Thread KhajaAsmath Mohammed
Hi, I have scheduled a spark streaming job to run every 30 minutes and it was running fine for 32 hours, then suddenly I see a status of Finished instead of running (since it always runs in the background and shows up in the resource manager). Am I doing anything wrong here? How come the job finished without

Restart Spark Streaming after deployment

2017-11-15 Thread KhajaAsmath Mohammed
Hi, I am new in the usage of spark streaming. I have developed one spark streaming job which runs every 30 minutes with checkpointing directory. I have to implement minor change, shall I kill the spark streaming job once the batch is completed using yarn application -kill command and update the

Spark Streaming in Spark 2.1 with Kafka 0.9

2017-11-09 Thread KhajaAsmath Mohammed
Hi, I am not successful when using spark 2.1 with Kafka 0.9; can anyone please share a code snippet to use it? val sparkSession: SparkSession = runMode match { case "local" => SparkSession.builder.config(sparkConfig).getOrCreate case "yarn" =>

Spark Streaming Small files in Hive

2017-10-29 Thread KhajaAsmath Mohammed
Hi, I am using spark streaming to write data back into hive with the below code snippet eventHubsWindowedStream.map(x => EventContent(new String(x))) .foreachRDD(rdd => { val sparkSession = SparkSession .builder.enableHiveSupport.getOrCreate import
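A sketch of one common mitigation, continuing the snippet above (EventContent and the stream are the thread's own definitions; the table name and file count are assumptions): shrink each micro-batch to a few partitions right before the insert so every batch adds a handful of files instead of one per task. Periodic compaction of the table is the usual second half of the answer.

    eventHubsWindowedStream.map(x => EventContent(new String(x)))
      .foreachRDD { rdd =>
        val sparkSession = SparkSession.builder.enableHiveSupport.getOrCreate()
        import sparkSession.implicits._

        rdd.toDF()                      // assumes EventContent is a case class
          .coalesce(4)                  // assumed files-per-batch target
          .write
          .mode("append")
          .insertInto("db.events")      // hypothetical Hive table (must already exist)
      }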

Re: Structured Stream in Spark

2017-10-27 Thread KhajaAsmath Mohammed
+ "/ETL").partitionBy("creationTime").start()* to > *format("console").start().* > > On Fri, Oct 27, 2017 at 8:41 AM, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Hi TathagataDas, >> >> I was trying to us

Re: Structured Stream in Spark

2017-10-27 Thread KhajaAsmath Mohammed
/spark/sql/examples/EventHubsStructuredStreamingExample.scala is the code snippet I have used to connect to eventhub Thanks, Asmath On Thu, Oct 26, 2017 at 9:39 AM, KhajaAsmath Mohammed < mdkhajaasm...@gmail.com> wrote: > Thanks TD. > > On Wed, Oct 25, 2017 at 6:42 PM

Re: Structured streaming with event hubs

2017-10-27 Thread KhajaAsmath Mohammed
com> wrote: > Does event hub support structured streaming at all yet? > > On Fri, 27 Oct 2017 at 1:43 pm, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Hi, >> >> Could anyone share if there is any code snippet on how to use spark >> s

Structured streaming with event hubs

2017-10-26 Thread KhajaAsmath Mohammed
Hi, Could anyone share if there is any code snippet on how to use spark structured streaming with event hubs? Thanks, Asmath Sent from my iPhone

Re: Structured Stream in Spark

2017-10-26 Thread KhajaAsmath Mohammed
a look at my talk - https://spark-summit.org/ > 2017/speakers/tathagata-das/ > > On Wed, Oct 25, 2017 at 9:29 PM, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Thanks Subhash. >> >> Have you ever used zero data loss concept with streaming. I am

Re: Structured Stream in Spark

2017-10-25 Thread KhajaAsmath Mohammed
; On Wed, Oct 25, 2017 at 4:08 PM, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Hi Sriram, >> >> Thanks. This is what I was looking for. >> >> one question, where do we need to specify the checkpoint directory in >> case of structured

Re: Structured Stream in Spark

2017-10-25 Thread KhajaAsmath Mohammed
n > use. The following might help: > > https://databricks.com/blog/2017/02/23/working-complex- > data-formats-structured-streaming-apache-spark-2-1.html > > I hope this helps. > > Thanks, > Subhash > > On Wed, Oct 25, 2017 at 2:59 PM, KhajaAsmath Mohammed <

Structured Stream in Spark

2017-10-25 Thread KhajaAsmath Mohammed
Hi, Could anyone provide suggestions on how to parse json data from kafka and load it back into hive? I have read about structured streaming but didn't find any examples. Is there any best practice on how to read it and parse it with structured streaming for this use case? Thanks, Asmath
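A minimal Scala sketch of that use case (schema, broker, topic and paths are all assumptions): parse the Kafka value column with from_json, then append with a file sink into a directory that an external Hive table points at.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, from_json}
    import org.apache.spark.sql.types._

    object KafkaJsonToHive {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("kafka-json-to-hive").getOrCreate()

        // Assumed schema of the JSON payload.
        val schema = new StructType()
          .add("id", LongType)
          .add("event_time", TimestampType)
          .add("payload", StringType)

        val parsed = spark.readStream
          .format("kafka")
          .option("kafka.bootstrap.servers", "broker1:9092")   // assumed broker list
          .option("subscribe", "events")                        // assumed topic
          .load()
          .select(from_json(col("value").cast("string"), schema).as("j"))
          .select("j.*")

        // File sink backing an external Hive table (paths are hypothetical).
        parsed.writeStream
          .format("parquet")
          .option("path", "/warehouse/events")
          .option("checkpointLocation", "/checkpoints/kafka-json-to-hive")
          .start()
          .awaitTermination()
      }
    }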

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
ssion.sql(deltaDSQry) Here is the code and also properties used in my project. On Tue, Oct 17, 2017 at 3:38 PM, Sebastian Piu <sebastian@gmail.com> wrote: > Can you share some code? > > On Tue, 17 Oct 2017, 21:11 KhajaAsmath Mohammed, <mdkhajaasm...@gmail.com>

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
he action that is causing the > shuffle as that one will take the value you've set > > On Tue, Oct 17, 2017 at 8:40 PM KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> Yes still I see more number of part files and exactly the number I have >> defined d

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
>> On Oct 17, 2017 1:47 PM, "KhajaAsmath Mohammed" <mdkhajaasm...@gmail.com> >> wrote: >> I tried repartitions but spark.sql.shuffle.partitions is taking up >> precedence over repartitions or coalesce. how to get the lesser number of >> files

Re: Spark - Partitions

2017-10-17 Thread KhajaAsmath Mohammed
w.persistentsys.com > <http://www.persistentsys.com/>* > > > -- > *From:* KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > *Sent:* 13 October 2017 09:35 > *To:* user @spark > *Subject:* Spark - Partitions > > Hi, > > I am reading hive query

Spark - Partitions

2017-10-12 Thread KhajaAsmath Mohammed
Hi, I am reading a hive query and writing the data back into hive after doing some transformations. I have changed the setting spark.sql.shuffle.partitions to 2000 and since then the job completes fast, but the main problem is I am getting 2000 files for each partition and the size of each file is 10 MB. Is there a
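A sketch of one common fix (source query, partition column and target file count are assumptions): keep spark.sql.shuffle.partitions high for the heavy transformations, but collapse the result to a small number of write tasks just before the insert so each Hive partition ends up with a few larger files.

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.col

    object FewerOutputFiles {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().enableHiveSupport().appName("compact-output").getOrCreate()

        val transformed = spark.sql("SELECT * FROM source_table")   // stand-in for the real query

        // Repartitioning by the Hive partition column before the write keeps each partition's
        // rows together, so files are not fragmented across thousands of shuffle tasks.
        transformed
          .repartition(10, col("partition_date"))    // assumed column and task count
          .write.mode("overwrite")
          .insertInto("db.target_table")             // hypothetical table
      }
    }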

cannot cast to double from spark row

2017-09-14 Thread KhajaAsmath Mohammed
Hi, I am getting the below error when trying to cast a column value from a spark dataframe to double. Any ideas? I tried many solutions but none of them worked. java.lang.ClassCastException: java.lang.String cannot be cast to java.lang.Double 1. row.getAs[Double](Constants.Datapoint.Latitude) 2.

java heap space

2017-09-03 Thread KhajaAsmath Mohammed
Hi, I am getting a java.lang.OutOfMemoryError: Java heap space error whenever I run the spark sql job. I came to the conclusion that the issue is because of the number of files spark is reading. I am reading 37 partitions and each partition has around 2000 files with a file size of more than 128 MB. 37*2000 files

add arraylist to dataframe

2017-08-29 Thread KhajaAsmath Mohammed
Hi, I am initializing an arraylist before iterating through the map method. I always get the list size as zero after the map operation. How do I add values to a list inside the map method of a dataframe? Any suggestions? val points = new
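A short Scala sketch of why the list stays empty and a driver-side alternative (source path, column names and the Point type are assumptions): the function passed to map runs on executors against a serialized copy of the list, so nothing mutated there is visible on the driver. Collect or aggregate the values instead of mutating shared state.

    import org.apache.spark.sql.SparkSession

    object CollectPoints {
      final case class Point(lat: Double, lon: Double)   // assumed element type

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder().appName("collect-points").getOrCreate()
        val df = spark.read.parquet("/data/points")       // hypothetical source with lat/lon columns

        // Build the collection on the driver from the query result instead of appending to a
        // driver-side list inside map() (which only mutates executor-local copies).
        val points: Array[Point] = df
          .select("lat", "lon")
          .collect()
          .map(r => Point(r.getDouble(0), r.getDouble(1)))

        println(s"collected ${points.length} points")
      }
    }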

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
n Sun, Aug 20, 2017 at 1:35 PM, ayan guha <guha.a...@gmail.com> wrote: > Just curious - is your dataset partitioned on your partition columns? > > On Mon, 21 Aug 2017 at 3:54 am, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> We are in cloude

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
Hive 2 with Tez+llap - these are much more performant with much more > features. > > Alternatively you can try to register the df as temptable and do a insert > into the Hive table from the temptable using Spark sql ("insert into table > hivetable select * from temptable&qu

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
On 20. Aug 2017, at 18:07, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > wrote: > > we are using parquet tables, is it causing any performance issue? > > On Sun, Aug 20, 2017 at 9:09 AM, Jörn Franke <jornfra...@gmail.com> wrote: > >> Improving the performance of H

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
is the default format that it > writes to Hive. One issue for the slow storing into a hive table could be > that it writes by default to csv/gzip or csv/bzip2 > > > On 20. Aug 2017, at 15:52, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > wrote: > > > > Yes we

Re: Spark hive overwrite is very very slow

2017-08-20 Thread KhajaAsmath Mohammed
the performance is? > > In which Format do you expect Hive to write? Have you made sure it is in this > format? It could be that you use an inefficient format (e.g. CSV + bzip2). > >> On 20. Aug 2017, at 03:18, KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> >> wrot

Spark hive overwrite is very very slow

2017-08-19 Thread KhajaAsmath Mohammed
Hi, I have written a spark sql job on spark2.0 using scala. It just pulls the data from a hive table, adds extra columns, removes duplicates and then writes it back to hive again. In the spark ui, it is taking almost 40 minutes to write 400 GB of data. Is there anything that I need to

Re: GC overhead exceeded

2017-08-18 Thread KhajaAsmath Mohammed
you are using YARN? > > -Pat > > From: KhajaAsmath Mohammed <mdkhajaasm...@gmail.com> > Date: Friday, August 18, 2017 at 5:30 AM > To: Pralabh Kumar <pralabhku...@gmail.com> > Cc: "user @spark" <user@spark.apache.org> > Subject: Re: GC overhead ex

Persist performance in Spark

2017-08-18 Thread KhajaAsmath Mohammed
Hi, I am using persist before inserting dataframe data back into hive. This step is adding 8 minutes to my total execution time. Is there a way to reduce the total time without resulting in out of memory issues? Here is my code. val datapoint_df: Dataset[Row] =

Re: GC overhead exceeded

2017-08-18 Thread KhajaAsmath Mohammed
> On Aug 17, 2017, at 11:43 PM, Pralabh Kumar <pralabhku...@gmail.com> wrote: > > what's is your exector memory , please share the code also > >> On Fri, Aug 18, 2017 at 10:06 AM, KhajaAsmath Mohammed >> <mdkhajaasm...@gmail.com> wrote: >> >> HI, >&g

GC overhead exceeded

2017-08-17 Thread KhajaAsmath Mohammed
HI, I am getting below error when running spark sql jobs. This error is thrown after running 80% of tasks. any solution? spark.storage.memoryFraction=0.4 spark.sql.shuffle.partitions=2000 spark.default.parallelism=100 #spark.eventLog.enabled=false #spark.scheduler.revive.interval=1s

Re: Write only one output file in Spark SQL

2017-08-11 Thread KhajaAsmath Mohammed
te() call, and the results in HDFS. What are all the > files you see? > > On Fri, Aug 11, 2017 at 1:10 PM, KhajaAsmath Mohammed < > mdkhajaasm...@gmail.com> wrote: > >> tempTable = union_df.registerTempTable("tempRaw") >> >> create = hc.sql('CREATE TABLE IF NOT E
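A minimal sketch of the usual way to get a single data file, continuing from the snippet above (union_df is the thread's own DataFrame; the table name is hypothetical), with the caveat that coalesce(1) funnels the whole write through one task and only makes sense for small results:

    // One partition at write time -> one data file in the target table/directory.
    union_df.coalesce(1)
      .write
      .mode("overwrite")
      .saveAsTable("default.target_table")   // hypothetical table name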
