[Help]: Strange Issue: Debug Spark Dataframe code

2016-04-15 Thread Divya Gehlot
Hi, I am using Spark 1.5.2 with Scala 2.10. Is there any option other than "explain(true)" to debug Spark Dataframe code? I am facing a strange issue: I have a lookup dataframe and am using it to join another dataframe on different columns. I am getting an *Analysis exception* in the third join. When
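A minimal sketch of debugging aids beyond explain(true), using hypothetical lookupDF/otherDF names; an AnalysisException on the third of a chain of joins against the same lookup DataFrame is often an ambiguous-column error, which printing the plan and aliasing both sides makes visible:

    import org.apache.spark.sql.functions.col

    df.printSchema()   // inspect the resolved column names and types
    df.explain(true)   // parsed, analyzed, optimized and physical plans
    // alias both sides so repeated column names stay distinguishable:
    val joined = otherDF.alias("o").join(lookupDF.alias("l"), col("o.key") === col("l.key"))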

best fit - Dataframe and spark sql use cases

2016-05-09 Thread Divya Gehlot
Hi, I would like to know the use cases where DataFrames are the best fit and the use cases where Spark SQL is the best fit, based on one's experience. Thanks, Divya

[Error]: Save dataframe to CSV using Spark-csv in Spark 1.6

2016-07-24 Thread Divya Gehlot
Hi, I am getting the below error when trying to save a dataframe using Spark-CSV > > final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path) java.lang.NoSuchMethodError: > scala.Predef$.$conforms()Lscala/Predef$$less$colon$less; > at >
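A NoSuchMethodError on scala.Predef$.$conforms almost always means a Scala binary-version mismatch, e.g. a spark-csv artifact built for Scala 2.11 loaded into a Scala 2.10 Spark runtime (or vice versa). A minimal sketch, with illustrative version numbers; the _2.10/_2.11 suffix must match the Scala version Spark was built with:

    // submit with the matching artifact, e.g.:
    //   spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 ...
    import org.apache.spark.sql.DataFrame

    def saveAsCsv(df: DataFrame, outputPath: String): Unit = {
      df.write
        .format("com.databricks.spark.csv")
        .option("header", "true")
        .save(outputPath)
    }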

What does FileUtil.fullyDelete do?

2016-07-26 Thread Divya Gehlot
Hi, when I am using the FileUtil.copyMerge function: val file = "/tmp/primaryTypes.csv" FileUtil.fullyDelete(new File(file)) val destinationFile= "/tmp/singlePrimaryTypes.csv" FileUtil.fullyDelete(new File(destinationFile)) val counts = partitions. reduceByKey {case (x,y) => x +
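FileUtil.fullyDelete recursively deletes a local file or directory (returning true on success), which is why it is commonly called on the destination before a copyMerge. A minimal sketch of that pattern, assuming Hadoop 2.x (copyMerge was removed in Hadoop 3) and illustrative paths:

    import java.io.File
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(conf)

    val partsDir = "/tmp/primaryTypes.csv"        // directory of part-* files
    val merged   = "/tmp/singlePrimaryTypes.csv"  // single merged output file
    FileUtil.fullyDelete(new File(merged))        // clear any previous run

    // concatenate every file under partsDir into one destination file
    FileUtil.copyMerge(fs, new Path(partsDir), fs, new Path(merged), false, conf, null)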

Re: What does FileUtil.fullyDelete do?

2016-07-26 Thread Divya Gehlot
eUtil.html#fullyDelete(java.io.File) > > On Tue, Jul 26, 2016 at 12:09 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >> Resending to right list >> ------ Forwarded message -- >> From: "Divya Gehlot" <divya.htco...@gmail.c

Re: write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
2 AM > To: Rabin Banerjee <dev.rabin.baner...@gmail.com> > Cc: Divya Gehlot <divya.htco...@gmail.com>, "user @spark" < > user@spark.apache.org> > Subject: Re: write and call UDF in spark dataframe > > Hi Divya, > > There is already "from_unixtime&qu

getting null when calculating time diff with unix_timestamp + spark 1.6

2016-07-20 Thread Divya Gehlot
Hi, val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') - lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table"); Instead of the time difference in seconds I am getting null. Would really appreciate the help. Thanks, Divya
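A minimal sketch of a version that avoids the two most likely causes of the nulls: the query string above is not s-interpolated, so the literal text $timeFmt (rather than the pattern) is passed as the format and unix_timestamp returns null, and lag() needs an OVER clause. The format and ordering column are assumptions, and in Spark 1.6 window functions in SQL require a HiveContext:

    val timeFmt = "HH:mm:ss"
    val lags = sqlContext.sql(
      s"""SELECT *,
         |  unix_timestamp(time1, '$timeFmt') -
         |  lag(unix_timestamp(time2, '$timeFmt')) OVER (ORDER BY time1) AS time_diff
         |FROM df_table""".stripMargin)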

calculate time difference between consecutive rows

2016-07-20 Thread Divya Gehlot
I have a dataset of times as shown below: Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *Spark* helps in finding it, but it's giving me *null* in the result set. Would really appreciate the help.
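A minimal sketch with the DataFrame lag window function, assuming a "Time1" string column in HH:mm:ss format and, on Spark 1.6, a HiveContext (window functions need one there); note that the first row's lag is null by definition, which may be the null being seen:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val w  = Window.orderBy("Time1")
    val ts = unix_timestamp(col("Time1"), "HH:mm:ss")
    val withDiff = df
      .withColumn("prev_ts", lag(ts, 1).over(w))
      .withColumn("diff_seconds", ts - col("prev_ts"))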

add hours to from_unixtimestamp

2016-07-21 Thread Divya Gehlot
Hi, I need to add 8 hours to from_unixtime: df.withColumn(from_unixtime(col("unix_timestamp"),fmt)) as "date_time". I am trying a Joda-Time function: def unixToDateTime (unix_timestamp : String) : DateTime = { val utcTS = new DateTime(unix_timestamp.toLong * 1000L)+ 8.hours return utcTS }
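A minimal sketch of two ways to do the shift; the column name and format are assumptions. Since a unix timestamp counts seconds, adding 8 * 3600 before formatting avoids Joda entirely, and the Joda variant works with plain plusHours (the "+ 8.hours" syntax needs the nscala-time implicits):

    import org.apache.spark.sql.functions._

    val fmt = "yyyy-MM-dd HH:mm:ss"
    val shifted = df.withColumn("date_time",
      from_unixtime(col("unix_timestamp") + 8 * 3600, fmt))

    // or with Joda-Time:
    import org.joda.time.DateTime
    def unixToDateTime(unixTimestamp: String): DateTime =
      new DateTime(unixTimestamp.toLong * 1000L).plusHours(8)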

Re: Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
hanks, Divya On 18 July 2016 at 23:06, Jacek Laskowski <ja...@japila.pl> wrote: > See broadcast variable. > > Or (just a thought) do join between DataFrames. > > Jacek > > On 18 Jul 2016 9:24 a.m., "Divya Gehlot" <divya.htco...@gmail.com> wrote: &g

write and call UDF in spark dataframe

2016-07-20 Thread Divya Gehlot
Hi, could somebody share an example of writing and calling a UDF which converts a unix timestamp to a date time? Thanks, Divya
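A minimal sketch with illustrative names; as the reply in this thread notes, Spark's built-in from_unixtime already does this, so a UDF is only worthwhile for custom logic:

    import java.text.SimpleDateFormat
    import java.util.Date
    import org.apache.spark.sql.functions.{col, udf}

    // unix seconds (as a string column) -> formatted date-time string
    val toDateTime = udf { (ts: String) =>
      new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").format(new Date(ts.toLong * 1000L))
    }
    val out = df.withColumn("date_time", toDateTime(col("unix_ts")))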

difference between two consecutive rows of same column + spark + dataframe

2016-07-20 Thread Divya Gehlot
Hi, I have a dataset of times as shown below: Time1 07:30:23 07:34:34 07:38:23 07:39:12 07:45:20 I need to find the diff between two consecutive rows. I googled and found that the *lag* function in *Spark* helps in finding it, but it's giving me *null* in the result set. Would really appreciate

Dynamically get value based on Map key in Spark Dataframe

2016-07-18 Thread Divya Gehlot
Hi, I have created a map by reading a text file val keyValueMap = file_read.map(t => t.getString(0) -> t.getString(4)).collect().toMap Now I have another dataframe where I need to dynamically replace all the keys of the Map with their values val df_input = reading the file as dataframe val df_replacekeys
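A minimal sketch of the broadcast-variable approach suggested in the replies; the lookup column name is an assumption:

    import org.apache.spark.sql.functions.{col, udf}

    val bcMap = sc.broadcast(keyValueMap)   // ship the map once per executor
    val replaceKey = udf((k: String) => bcMap.value.getOrElse(k, k))
    val df_replaced = df_input.withColumn("key_col", replaceKey(col("key_col")))

For large lookup tables, the other suggestion in the thread — a join between DataFrames — scales better than a broadcast map.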

find two consecutive points

2016-07-15 Thread Divya Gehlot
Hi, I have a huge data set similar to the below: timestamp,fieldid,point_id 1468564189,89,1 1468564090,76,4 1468304090,89,9 1468304090,54,6 1468304090,54,4 I have a configuration file of consecutive points -- 1,9 4,6 i.e. 1 and 9 are consecutive points, and similarly 4,6 are consecutive points. Now I need

[Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
Hi, I am working with Spark 1.6 with Scala, using the DataFrame API. I have a use case where I need to compare two rows and add an entry in a new column based on a lookup table, for example my DF looks like: col1 col2 newCol1 street1 person1 street2 person1

Re: [Spark1.6]:compare rows and add new column based on lookup

2016-08-04 Thread Divya Gehlot
ri, Aug 5, 2016 at 12:16 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >> Hi, >> I am working with Spark 1.6 with scala and using Dataframe API . >> I have a use case where I need to compare two rows and add entry in the >> new column based on the lo

[Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
Hi, I have a use case where I need to use the or (||) operator in a filter condition. It seems it's not working: it's taking the condition before the operator and ignoring the other filter condition after the or operator. Has anybody faced a similar issue? Pseudo code:
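A minimal sketch of or-filters that evaluate both branches, with illustrative column names; with Column expressions || works directly, and in a SQL-string predicate the OR keyword is the safe spelling:

    import org.apache.spark.sql.functions.col

    val filtered  = df.filter(col("status") === "A" || col("city") === "X")
    // equivalent, via the named methods:
    val filtered2 = df.filter(col("status").equalTo("A").or(col("city").equalTo("X")))
    // as a SQL-string predicate:
    val filtered3 = df.filter("status = 'A' OR city = 'X'")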

[Spark 1.6]-increment value column based on condition + Dataframe

2016-08-09 Thread Divya Gehlot
Hi, I have a column having values like Value 30 12 56 23 12 16 12 89 12 5 6 4 8 I need to create another column that increments row by row and resets whenever col("value") crosses 30: newColValue 0 1 0 1 2 3 4 0 1 2 3 4 5 How can I create such an increment column? The grouping is happening based on some other cols
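A minimal sketch using window functions, reading the sample output as: the counter restarts at 0 on each row with value >= 30 and otherwise increments. The ordering column "ts" is an assumption (window functions need a deterministic order), and Spark 1.6 needs a HiveContext for them:

    import org.apache.spark.sql.expressions.Window
    import org.apache.spark.sql.functions._

    val resetFlag = when(col("value") >= 30, 1).otherwise(0)
    val byTime    = Window.orderBy("ts")

    // running count of resets defines the group each row belongs to
    val grouped = df.withColumn("grp", sum(resetFlag).over(byTime))
    val result  = grouped.withColumn("newColValue",
      row_number().over(Window.partitionBy("grp").orderBy("ts")) - 1)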

Re: [Spark1.6] Or (||) operator not working in DataFrame

2016-08-07 Thread Divya Gehlot
; https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw >> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>* >> >> >> http://talebzadehmich.wordpress.com >> >> *Disclaimer:* Use it at your own risk. An

Spark GraphFrames

2016-08-01 Thread Divya Gehlot
Hi, has anybody worked with GraphFrames? Please let me know, as I need to know the real-world scenarios where it can be used. Thanks, Divya

Create dataframe column from list

2016-07-22 Thread Divya Gehlot
Hi, can somebody help me with creating a dataframe column from a Scala list? Would really appreciate the help. Thanks, Divya
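A minimal sketch for Spark 1.6, where the local-Seq toDF implicit covers Product types but not bare strings, so going through an RDD keeps the conversion available:

    import sqlContext.implicits._

    val names  = List("a", "b", "c")
    val listDF = sc.parallelize(names).toDF("name")   // one-column DataFrame

    // Note: attaching the list as a new column of an existing DataFrame needs
    // explicit row alignment, e.g. df.rdd.zip(...) or a join on a shared key.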

[Error:] while reading S3 buckets in Spark 1.6 in spark-submit

2016-08-31 Thread Divya Gehlot
Hi, I am using Spark 1.6.1 on an EMR machine. I am trying to read S3 buckets in my Spark job. When I read them through the Spark shell I am able to, but when I try to package the job and run it with spark-submit I get the below error: 16/08/31 07:36:38 INFO ApplicationMaster: Registered signal

Re: Spark build 1.6.2 error

2016-08-31 Thread Divya Gehlot
Which java version are you using ? On 31 August 2016 at 04:30, Diwakar Dhanuskodi wrote: > Hi, > > While building Spark 1.6.2 , getting below error in spark-sql. Much > appreciate for any help. > > ERROR] missing or invalid dependency detected while loading class

Getting memory error when starting spark shell but not often

2016-09-06 Thread Divya Gehlot
Hi, I am using EMR 4.7 with Spark 1.6. Sometimes when I start the spark shell I get the below error: OpenJDK 64-Bit Server VM warning: INFO: > os::commit_memory(0x0005662c, 10632822784, 0) failed; error='Cannot > allocate memory' (errno=12) > # > # There is insufficient memory for the Java

[Spark submit] getting error when using the properties file parameter in spark-submit

2016-09-06 Thread Divya Gehlot
Hi, I am getting the below error if I try to use the properties file parameter in spark-submit: Exception in thread "main" java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated at

Re: [Spark submit] getting error when using the properties file parameter in spark-submit

2016-09-06 Thread Divya Gehlot
park Summit 2015 > <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/> > > <http://in.linkedin.com/in/sonalgoyal> > > > > On Tue, Sep 6, 2016 at 4:45 PM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > >&

[Spark-Submit:] Error while reading from s3n

2016-09-06 Thread Divya Gehlot
Hi, I am on EMR 4.7 with Spark 1.6.1. I am trying to read from s3n buckets in Spark. Option 1 : If I set up hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem") hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY")) hadoopConf.set("fs.s3.awsAccessKeyId",
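One likely mismatch in the snippet: the configuration keys use the fs.s3 prefix while the URIs being read are s3n, and the scheme in the property names has to match the scheme in the path. A minimal sketch with the s3n keys (bucket and path illustrative):

    val hadoopConf = sc.hadoopConfiguration
    hadoopConf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
    hadoopConf.set("fs.s3n.awsAccessKeyId", sys.env("AWS_ACCESS_KEY_ID"))
    hadoopConf.set("fs.s3n.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))

    val data = sc.textFile("s3n://my-bucket/path/")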

Re: Window Functions with SQLContext

2016-09-01 Thread Divya Gehlot
Hi Saurabh, I am also using Spark 1.6+, and when I didn't create a HiveContext it threw the same error. So you have to create a HiveContext to access window functions. Thanks, Divya On 1 September 2016 at 13:16, saurabh3d wrote: > Hi All, > > As per SPARK-11001
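A minimal sketch of the HiveContext setup this refers to:

    import org.apache.spark.sql.hive.HiveContext

    // In Spark 1.6, window functions require a HiveContext, not a plain SQLContext
    val hiveContext = new HiveContext(sc)
    val ranked = hiveContext.sql(
      "SELECT *, row_number() OVER (ORDER BY id) AS rn FROM some_table")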

difference between the --packages and --jars options in Spark

2016-09-01 Thread Divya Gehlot
Hi, I would like to know the difference between the --packages and --jars options in Spark. Thanks, Divya
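In short: --packages takes Maven coordinates and resolves the artifact plus its transitive dependencies from a repository, while --jars takes jar files and ships them as-is with no dependency resolution. Illustrative invocations, with assumed paths, class name, and versions:

    spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 --class app.Main app.jar
    spark-submit --jars /libs/spark-csv_2.10-1.5.0.jar,/libs/commons-csv-1.1.jar --class app.Main app.jar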

Re: [Error:] while reading S3 buckets in Spark 1.6 in spark-submit

2016-09-02 Thread Divya Gehlot
385981/how-to-access-s3a-files-from-apache-spark Is it really the issue ? Could somebody help me validate the above ? Thanks, Divya On 1 September 2016 at 16:59, Steve Loughran <ste...@hortonworks.com> wrote: > > On 1 Sep 2016, at 03:45, Divya Gehlot <divya.htco...@gmail.com> wrote

Calling udf in Spark

2016-09-08 Thread Divya Gehlot
Hi, is it necessary to import sqlContext.implicits._ whenever defining and calling a UDF in Spark? Thanks, Divya

Error while calling udf Spark submit

2016-09-08 Thread Divya Gehlot
Hi, I am on Spark 1.6.1. I am getting the below error when I try to call a UDF on my Spark Dataframe column. UDF: /* get the train line */ val deriveLineFunc :(String => String) = (str:String) => { val build_key = str.split(",").toList val getValue = if(build_key.length > 1)

Re: difference between the --packages and --jars options in Spark

2016-09-04 Thread Divya Gehlot
n Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot <divya.htco...@gmail.com> > wrote: > > Hi, > > > > Would like to know the difference between the --package and --jars > option in > > Spark . > > > > > > > > Thanks, > > Divya >

how to specify cores and executors to run spark jobs simultaneously

2016-09-14 Thread Divya Gehlot
Hi, I am on an EMR cluster and my cluster configuration is as below: Number of nodes including master node - 3; Memory: 22.50 GB; VCores Total: 16; Active Nodes: 2; Spark version - 1.6.1. Parameters set in spark-default.conf: spark.executor.instances 2 > spark.executor.cores 8 >
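An illustrative spark-defaults.conf for letting two jobs run side by side on a 2-node, 16-vcore cluster; the exact numbers are assumptions, the point being that executors pinned at 2 instances x 8 cores consume every core, so a second job can never be scheduled:

    spark.executor.instances  2
    spark.executor.cores      4
    spark.executor.memory     4g
    # or let YARN arbitrate instead of pinning instance counts:
    spark.dynamicAllocation.enabled  true
    spark.shuffle.service.enabled    true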

[Error:] Viewing Web UI on EMR cluster

2016-09-12 Thread Divya Gehlot
Hi, I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2. When I try to view any of the web UIs of the cluster, either Hadoop or Spark, I get the below error: " This site can’t be reached ". Is anybody using EMR and able to view the Web UI? Could you please share the steps. Would really

Ways to check Spark submit running

2016-09-13 Thread Divya Gehlot
Hi, somehow, for the time being, I am unable to view the Spark Web UI and the Hadoop Web UI. I am looking for other ways to check that my job is running fine, apart from repeatedly checking the current yarn logs. Thanks, Divya

Re: [Error:] Viewing Web UI on EMR cluster

2016-09-13 Thread Divya Gehlot
appreciate the help. Thanks, Divya On 13 September 2016 at 15:09, Divya Gehlot <divya.htco...@gmail.com> wrote: > Hi, > Thanks all for your prompt response. > I followed the instruction in the docs EMR SSH tunnel > <https://docs.aws.amazon.com/ElasticMapReduce/latest/Ma

Re: 1TB shuffle failed with executor lost failure

2016-09-19 Thread Divya Gehlot
The exit code 52 comes from org.apache.spark.util.SparkExitCode, and it is val OOM=52 - i.e. an OutOfMemoryError Refer https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala On 19 September 2016 at 14:57,

Spark Application Log

2016-09-21 Thread Divya Gehlot
Hi, I have initialised logging in my Spark app: /*Initialize Logging */ val log = Logger.getLogger(getClass.getName) Logger.getLogger("org").setLevel(Level.OFF) Logger.getLogger("akka").setLevel(Level.OFF) log.warn("Some text"+Somemap.size) When I run my spark job using spark-submit like
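A hedged note on a common cause: under spark-submit on YARN, driver and executor log4j output lands in the container logs rather than the console, and a custom log4j.properties has to be shipped explicitly. An illustrative invocation (file names and class are assumptions):

    spark-submit \
      --files /path/to/log4j.properties \
      --conf "spark.driver.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=log4j.properties" \
      --class app.Main app.jar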

Re: write.df is failing on Spark Cluster

2016-09-20 Thread Divya Gehlot
Spark version please? On 21 September 2016 at 09:46, Sankar Mittapally < sankar.mittapa...@creditvidya.com> wrote: > Yeah I can do all operations on that folder > > On Sep 21, 2016 12:15 AM, "Kevin Mellott" > wrote: >> Are you able to manually delete the folder below?

Re: Getting a TreeNode Exception while saving into Hadoop

2016-08-17 Thread Divya Gehlot
Can you please check the column order of all the datasets in the unionAll operations? Are they in the same order? On 9 August 2016 at 02:47, max square wrote: > Hey guys, > > I'm trying to save a Dataframe in CSV format after performing unionAll > operations on it. > But I get this

read multiple files

2016-09-27 Thread Divya Gehlot
Hi, the input data files for my spark job are generated every five minutes; the file names follow the epoch time convention as below: InputFolder/batch-147495960 InputFolder/batch-147495990 InputFolder/batch-147496020 InputFolder/batch-147496050 InputFolder/batch-147496080
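A minimal sketch of reading all such files in one call with a Hadoop glob (paths illustrative); globs also allow narrowing to a time range without listing files individually:

    val all = sc.textFile("InputFolder/batch-*")
    // e.g. only batches whose epoch prefix matches:
    val some = sc.textFile("InputFolder/batch-14749*")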

Re: how to extract arraytype data to file

2016-10-18 Thread Divya Gehlot
http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element Hope this helps Thanks, Divya On 19 October 2016 at 11:35, lk_spark wrote: > hi,all: > I want to read a json file and search it by sql . > the data struct

Re: tutorial for access elements of dataframe columns and column values of a specific rows?

2016-10-18 Thread Divya Gehlot
Can you please elaborate your use case ? On 18 October 2016 at 15:48, muhammet pakyürek wrote: > > > > > > -- > *From:* muhammet pakyürek > *Sent:* Monday, October 17, 2016 11:51 AM > *To:* user@spark.apache.org > *Subject:*

Re: Best practice for preprocessing feature with DataFrame

2016-11-16 Thread Divya Gehlot
Hi, You can use the Column functions provided by Spark API https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html Hope this helps . Thanks, Divya On 17 November 2016 at 12:08, 颜发才(Yan Facai) wrote: > Hi, > I have a sample, like: >

Re: converting hBaseRDD to DataFrame

2016-10-11 Thread Divya Gehlot
Hi Mich, you can create a dataframe from an RDD in the below manner also: val df = sqlContext.createDataFrame(rdd,schema) val df = sqlContext.createDataFrame(rdd) The below article may also help you: http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/ On 11 October 2016 at

Re: Is spark a right tool for updating a dataframe repeatedly

2016-10-17 Thread Divya Gehlot
If my understanding of your query is correct: in Spark, DataFrames are immutable; you can't update a dataframe. You have to create a new dataframe to "update" the current one. Thanks, Divya On 17 October 2016 at 09:50, Mungeol Heo wrote: > Hello, everyone. > > As
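A minimal sketch of the create-a-new-DataFrame "update" pattern this describes; the column name and condition are illustrative:

    import org.apache.spark.sql.functions._

    // df itself is untouched; updated is a new DataFrame with the changed column
    val updated = df.withColumn("status",
      when(col("status") === "old", lit("new")).otherwise(col("status")))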

Re: How many Spark streaming applications can be run at a time on a Spark cluster?

2016-12-14 Thread Divya Gehlot
It depends on the use case... Spark always depends on resource availability. As long as you have the resources to accommodate them, you can run as many Spark/Spark Streaming applications as you like. Thanks, Divya On 15 December 2016 at 08:42, shyla deshpande wrote: > How many Spark

Re: spark reshape hive table and save to parquet

2016-12-14 Thread Divya Gehlot
you can use udfs to do it http://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe Hope it will help. Thanks, Divya On 9 December 2016 at 00:53, Anton Kravchenko wrote: > Hello, > > I wonder if there is a way (preferably

Re: What is the deployment model for Spark Streaming? A specific example.

2016-12-17 Thread Divya Gehlot
I am not a pyspark person... But from the errors I could figure out that your Spark application is having memory issues. Are you collecting the results to the driver at any point of time, or have you configured too little memory for the nodes? And if you are using Dataframes then there is an issue raised in

query on Spark Log directory

2017-01-05 Thread Divya Gehlot
Hi, I am using an EMR machine and I can see that the Spark log directory has grown to 4G. File name: spark-history-server.out. Need advice on how I can reduce the size of the above-mentioned file. Is there a config property which can help me? Thanks, Divya

Re: How to get recent value in spark dataframe

2016-12-20 Thread Divya Gehlot
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html Hope this helps Thanks, Divya On 15 December 2016 at 12:49, Milin korath wrote: > Hi > > I have a spark data frame with following structure > > id flag price date > a 0

Re: Location for the additional jar files in Spark

2016-12-27 Thread Divya Gehlot
Hi Mich , Have you set SPARK_CLASSPATH in Spark-env.sh ? Thanks, Divya On 27 December 2016 at 17:33, Mich Talebzadeh wrote: > When one runs in Local mode (one JVM) on an edge host (the host user > accesses the cluster), it is possible to put additional jar file

Re: Has anyone managed to connect to Oracle via JDBC from Spark CDH 5.5.2

2016-12-21 Thread Divya Gehlot
Hi Mich, Can you try placing these jars in Spark Classpath. It should work . Thanks, Divya On 22 December 2016 at 05:40, Mich Talebzadeh wrote: > This works with Spark 2 with Oracle jar file added to > > $SPARK_HOME/conf/ spark-defaults.conf > > > > >

Spark job stopping abruptly

2017-03-07 Thread Divya Gehlot
Hi, I have a Spark standalone cluster on AWS EC2 and recently my Spark streaming jobs have been stopping abruptly. When I check the logs I find this: 17/03/07 06:09:39 INFO ProtocolStateActor: No response from remote. Handshake timed out or transport failure detector triggered. 17/03/07 06:09:39 ERROR

[Error] Python version mismatch in CDH cluster when running pyspark job

2017-06-16 Thread Divya Gehlot
Hi, I have a CDH cluster and am running a pyspark script in client mode. There are different Python versions installed on the client and worker nodes, and I was getting a Python version mismatch error. To resolve this issue I followed the below Cloudera document
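A hedged sketch of the usual fix (not necessarily what the linked document prescribes): point the driver and all executors at the same interpreter; the paths here are illustrative:

    export PYSPARK_PYTHON=/usr/bin/python2.7
    export PYSPARK_DRIVER_PYTHON=/usr/bin/python2.7
    spark-submit --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=/usr/bin/python2.7 script.py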

Re: Fastest way to drop useless columns

2018-05-31 Thread Divya Gehlot
You can try the dropDuplicates function: https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala On 31 May 2018 at 16:34, wrote: > Hi there ! > > I have a potentially large dataset ( regarding number of rows and cols ) > > And I want to find the fastest way

Spark Structured Streaming for Twitter Streaming data

2018-01-30 Thread Divya Gehlot
Hi, I am exploring Spark Structured Streaming. When I turned to the internet to understand it, I found it is more integrated with Kafka or other streaming tools like Kinesis. I couldn't find where we can use the Spark Streaming API for Twitter streaming data. Would be grateful if anybody has used

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
DataSource APIs to build streaming > sources are not public yet, and are in flux. > > 2. Use Kafka/Kinesis as an intermediate system: Write something simple > that uses Twitter APIs directly to read tweets and write them into > Kafka/Kinesis. And then just read from Kafka/Kinesis

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Divya Gehlot
Hi, I see. Does that mean Spark Structured Streaming doesn't work with Twitter streams? I can see people have used Kafka or other streaming tools and then used Spark to process the data in structured streaming. So the below doesn't work directly with a Twitter stream until I set up Kafka? > import

[Error :] RDD TO Dataframe Spark Streaming

2018-01-31 Thread Divya Gehlot
Hi, I am getting the below error when creating a DataFrame from a Twitter Streaming RDD: val sparkSession:SparkSession = SparkSession .builder .appName("twittertest2") .master("local[*]") .enableHiveSupport()

Re: [Spark Java] Add new column in DataSet based on existed column

2018-03-28 Thread Divya Gehlot
Hi, here is an example snippet in Scala:

    // Convert to a Date type
    val timestamp2datetype: (Column) => Column = (x) => { to_date(x) }
    df = df.withColumn("date", timestamp2datetype(col("end_date")))

Hope this helps! Thanks, Divya On 28 March 2018 at 15:16, Junfeng Chen

Re: Triggering SQL on AWS S3 via Apache Spark

2018-10-23 Thread Divya Gehlot
Hi Omer, here are a couple of solutions which you can implement for your use case: *Option 1:* you can mount the S3 bucket as a local file system. Here are the details: https://cloud.netapp.com/blog/amazon-s3-as-a-file-system *Option 2:* You can use Amazon Glue for your use case; here are the
