Hi,
I am using Spark 1.5.2 with Scala 2.10.
Is there any other option apart from "explain(true)" to debug Spark
DataFrame code?
I am facing a strange issue.
I have a lookup dataframe and am using it to join another dataframe on
different columns.
I am getting an *Analysis exception* in the third join.
When
Hi,
I would like to know the use cases where DataFrames are the best fit and the
use cases where Spark SQL is the best fit, based on one's experience.
Thanks,
Divya
Hi,
I am getting the below error when I am trying to save a dataframe using Spark-CSV
>
> final_result_df.write.format("com.databricks.spark.csv").option("header","true").save(output_path)
java.lang.NoSuchMethodError:
> scala.Predef$.$conforms()Lscala/Predef$$less$colon$less;
> at
>
Hi,
When I am using the FileUtil.copyMerge function
val file = "/tmp/primaryTypes.csv"
FileUtil.fullyDelete(new File(file))
val destinationFile= "/tmp/singlePrimaryTypes.csv"
FileUtil.fullyDelete(new File(destinationFile))
val counts = partitions.
reduceByKey {case (x,y) => x +
eUtil.html#fullyDelete(java.io.File)
>
> On Tue, Jul 26, 2016 at 12:09 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Resending to right list
>> ------ Forwarded message --
>> From: "Divya Gehlot" <divya.htco...@gmail.c
2 AM
> To: Rabin Banerjee <dev.rabin.baner...@gmail.com>
> Cc: Divya Gehlot <divya.htco...@gmail.com>, "user @spark" <
> user@spark.apache.org>
> Subject: Re: write and call UDF in spark dataframe
>
> Hi Divya,
>
> There is already "from_unixtime"
Hi,
val lags=sqlContext.sql("select *,(unix_timestamp(time1,'$timeFmt') -
lag(unix_timestamp(time2,'$timeFmt'))) as time_diff from df_table");
Instead of the time difference in seconds I am getting null.
Would really appreciate the help.
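For concreteness, a hedged sketch of what I think the query should look like
(the table, column names and format are placeholders): the s"" interpolator so
that $timeFmt is actually substituted, and an OVER clause since lag is a window
function (which, on 1.5.x, also needs a HiveContext). The first row would still
be null, as it has no previous row. Is this the right direction?

import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // window functions need HiveContext in 1.5.x
val timeFmt = "HH:mm:ss"                // assumed format

val lags = hiveContext.sql(
  s"""SELECT *,
     |       unix_timestamp(time1, '$timeFmt') -
     |       lag(unix_timestamp(time1, '$timeFmt')) OVER (ORDER BY time1) AS time_diff
     |FROM df_table""".stripMargin)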
Thanks,
Divya
I have a dataset of time as shown below :
Time1
07:30:23
07:34:34
07:38:23
07:39:12
07:45:20
I need to find the difference between two consecutive rows.
I googled and found that the *lag* function in *Spark* helps in finding it,
but it's giving me *null* in the result set.
Would really appreciate the help.
Hi,
I need to add 8 hours to from_unixtime:
df.withColumn("date_time", from_unixtime(col("unix_timestamp"), fmt))
I am trying a Joda-Time function:
def unixToDateTime(unix_timestamp: String): DateTime = {
  // "+ 8.hours" typically needs nscala-time (com.github.nscala_time.time.Imports._)
  val utcTS = new DateTime(unix_timestamp.toLong * 1000L) + 8.hours
  utcTS
}
Thanks,
Divya
On 18 July 2016 at 23:06, Jacek Laskowski <ja...@japila.pl> wrote:
> See broadcast variable.
>
> Or (just a thought) do join between DataFrames.
>
> Jacek
>
> On 18 Jul 2016 9:24 a.m., "Divya Gehlot" <divya.htco...@gmail.com> wrote:
Hi,
Could somebody share an example of writing and calling a UDF which converts a
unix timestamp to date time?
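For concreteness, a sketch along the lines I am hoping for (Spark 1.6-style API;
the dataframe and column names are made up, and I understand the built-in
from_unixtime function already covers the simple case):

import java.text.SimpleDateFormat
import java.util.Date
import org.apache.spark.sql.functions.{col, udf}

// convert an epoch-seconds string to a formatted date-time string
val unixToDateTime = udf { (ts: String) =>
  val fmt = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss")
  fmt.format(new Date(ts.toLong * 1000L))
}

val withDate = df.withColumn("date_time", unixToDateTime(col("unix_timestamp")))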
Thanks,
Divya
Hi,
I have a dataset of time as shown below :
Time1
07:30:23
07:34:34
07:38:23
07:39:12
07:45:20
I need to find the difference between two consecutive rows.
I googled and found that the *lag* function in *Spark* helps in finding it,
but it's giving me *null* in the result set.
Would really appreciate
Hi,
I have created a map by reading a text file
val keyValueMap = file_read.map(t => t.getString(0) ->
t.getString(4)).collect().toMap
Now I have another dataframe where I need to dynamically replace all the
keys of the map with their values.
val df_input = reading the file as dataframe
val df_replacekeys
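To make it concrete, a rough sketch of the kind of replacement I am after (the
column name "key_col" is made up; the idea would be to broadcast the lookup map
and apply it through a UDF) -- is this a reasonable approach?

import org.apache.spark.sql.functions.{col, udf}

val broadcastMap = sc.broadcast(keyValueMap)

// look the key up in the broadcast map, falling back to the original value
val replaceKeys = udf { (key: String) =>
  broadcastMap.value.getOrElse(key, key)
}

val df_replacekeys = df_input.withColumn("key_col", replaceKeys(col("key_col")))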
Hi,
I have a huge data set similar to the below:
timestamp,fieldid,point_id
1468564189,89,1
1468564090,76,4
1468304090,89,9
1468304090,54,6
1468304090,54,4
I have a configuration file of consecutive points --
1,9
4,6
i.e. 1 and 9 are consecutive points; similarly 4 and 6 are consecutive points.
Now I need
Hi,
I am working with Spark 1.6 with Scala and using the DataFrame API.
I have a use case where I need to compare two rows and add an entry in a
new column based on a lookup table
for example :
My DF looks like :
col1       col2       newCol1
street1 person1
street2 person1
ri, Aug 5, 2016 at 12:16 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>> Hi,
>> I am working with Spark 1.6 with scala and using Dataframe API .
>> I have a use case where I need to compare two rows and add entry in the
>> new column based on the lo
Hi,
I have a use case where I need to use the or [||] operator in a filter condition.
It seems it's not working: it's taking the condition before the operator and
ignoring the other filter condition after the or operator.
Has anybody faced a similar issue?
Pseudo code:
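For illustration, this is roughly the shape I mean (the column names and
literals are made up for the example):

import org.apache.spark.sql.functions.col

// both conditions in a single filter, each wrapped in parentheses
// (chaining .filter(...).filter(...) combines conditions with AND, not OR)
val filtered1 = df.filter((col("status") === "A") || (col("amount") > 100))

// the same predicate as a SQL expression string
val filtered2 = df.filter("status = 'A' OR amount > 100")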
Hi,
I have a column with values like
Value
30
12
56
23
12
16
12
89
12
5
6
4
8
I need to create another column
if col("value") > 30 1 else col("value") < 30
newColValue
0
1
0
1
2
3
4
0
1
2
3
4
5
How can I create such an increment column?
The grouping is happening based on some other columns.
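To make it concrete, a speculative sketch of the kind of thing I am trying
(it assumes an ordering column "ts" and a grouping column "grp", both made up,
that the counter resets whenever value is at least 30 as in the sample above,
and a HiveContext for the window functions):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val byOrder = Window.partitionBy("grp").orderBy("ts")

// 1 marks the start of a new run, 0 continues the current run
val startFlag = when(col("value") >= 30, 1).otherwise(0)

// a running sum of the start flags gives a run id for every row
val withRun = df.withColumn("run_id", sum(startFlag).over(byOrder))

// the 0-based position of the row inside its run is the increment column
val byRun  = Window.partitionBy(col("grp"), col("run_id")).orderBy(col("ts"))
val result = withRun.withColumn("newColValue", row_number().over(byRun) - 1)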
; https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>> *Disclaimer:* Use it at your own risk. An
Hi,
Has anybody worked with GraphFrames?
Please let me know, as I need to know the real-world scenarios where it can be
used.
Thanks,
Divya
Hi,
Can somebody help me with creating a dataframe column from a Scala list?
Would really appreciate the help.
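To make it concrete, is something like the below the right way (spark-shell
style; the list contents are made up)?

import sqlContext.implicits._   // the spark-shell sqlContext

val values = List(10, 20, 30)
val dfFromList = sc.parallelize(values).toDF("value")
dfFromList.show()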
Thanks ,
Divya
Hi,
I am using Spark 1.6.1 on an EMR machine.
I am trying to read S3 buckets in my Spark job.
When I read it through the Spark shell I am able to read it, but when I try to
package the job and run it via spark-submit I am getting the below error
16/08/31 07:36:38 INFO ApplicationMaster: Registered signal
Which Java version are you using?
On 31 August 2016 at 04:30, Diwakar Dhanuskodi wrote:
> Hi,
>
> While building Spark 1.6.2 , getting below error in spark-sql. Much
> appreciate for any help.
>
> ERROR] missing or invalid dependency detected while loading class
Hi,
I am using EMR 4.7 with Spark 1.6
Sometimes when I start the Spark shell I get the below error
OpenJDK 64-Bit Server VM warning: INFO:
> os::commit_memory(0x0005662c, 10632822784, 0) failed; error='Cannot
> allocate memory' (errno=12)
> #
> # There is insufficient memory for the Java
Hi,
I am getting the below error if I try to use the properties file parameter in
spark-submit
Exception in thread "main" java.util.ServiceConfigurationError:
org.apache.hadoop.fs.FileSystem: Provider
org.apache.hadoop.fs.s3a.S3AFileSystem could not be instantiated
at
park Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Tue, Sep 6, 2016 at 4:45 PM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
>
>&
Hi,
I am on EMR 4.7 with Spark 1.6.1
I am trying to read from s3n buckets in spark
Option 1 :
If I set up
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3.S3FileSystem")
hadoopConf.set("fs.s3.awsSecretAccessKey", sys.env("AWS_SECRET_ACCESS_KEY"))
hadoopConf.set("fs.s3.awsAccessKeyId",
Hi Saurabh,
I am also using a Spark 1.6+ version, and when I didn't create a HiveContext it
threw the same error.
So you have to create a HiveContext to access window functions.
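A minimal sketch of the pattern (spark-shell style on 1.6; the data and column
names are made up):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number
import org.apache.spark.sql.hive.HiveContext

val hiveContext = new HiveContext(sc)   // a plain SQLContext cannot resolve window functions in 1.x
import hiveContext.implicits._

val df = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3))).toDF("key", "value")
val w  = Window.partitionBy("key").orderBy("value")
df.withColumn("rn", row_number().over(w)).show()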
Thanks,
Divya
On 1 September 2016 at 13:16, saurabh3d wrote:
> Hi All,
>
> As per SPARK-11001
Hi,
Would like to know the difference between the --packages and --jars options
in Spark.
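For context, what I have gathered so far (happy to be corrected): --jars ships
jar files you already have (local paths, or HDFS/HTTP URIs) to the driver and
executors, with no dependency resolution, e.g.

  spark-submit --jars /path/to/libA.jar,/path/to/libB.jar ...

while --packages takes Maven coordinates and pulls the artifacts plus their
transitive dependencies from Maven Central (or the repositories given with
--repositories), e.g.

  spark-submit --packages com.databricks:spark-csv_2.10:1.5.0 ...

(the paths and coordinate above are only examples).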
Thanks,
Divya
385981/how-to-access-s3a-files-from-apache-spark
Is it really the issue?
Could somebody help me validate the above?
Thanks,
Divya
On 1 September 2016 at 16:59, Steve Loughran <ste...@hortonworks.com> wrote:
>
> On 1 Sep 2016, at 03:45, Divya Gehlot <divya.htco...@gmail.com> wrote
Hi,
Is it necessary to import sqlContext.implicits._ whenever we define and
call a UDF in Spark?
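For context, the UDF pattern I mean is roughly the below (column names made up);
my impression is that the udf/col imports are enough on their own, and
implicits._ is only needed for things like toDF and the $"col" syntax, but I
would like to confirm:

import org.apache.spark.sql.functions.{col, udf}

val toUpper = udf { (s: String) => if (s == null) null else s.toUpperCase }

val result = df.withColumn("upper_col", toUpper(col("some_col")))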
Thanks,
Divya
Hi,
I am on Spark 1.6.1
I am getting the below error when I am trying to call a UDF on my Spark DataFrame
column.
UDF
/* get the train line */
val deriveLineFunc: (String => String) = (str: String) => {
  val build_key = str.split(",").toList
  val getValue = if (build_key.length > 1)
n Thu, Sep 1, 2016 at 10:24 AM, Divya Gehlot <divya.htco...@gmail.com>
> wrote:
> > Hi,
> >
> > Would like to know the difference between the --package and --jars
> option in
> > Spark .
> >
> >
> >
> > Thanks,
> > Divya
>
Hi,
I am on an EMR cluster and my cluster configuration is as below:
Number of nodes including master node - 3
Memory:22.50 GB
VCores Total : 16
Active Nodes : 2
Spark version- 1.6.1
Parameter set in spark-default.conf
spark.executor.instances 2
> spark.executor.cores 8
>
Hi,
I am on EMR 4.7 with Spark 1.6.1 and Hadoop 2.7.2
When I am trying to view any of the web UIs of the cluster, either Hadoop or
Spark, I am getting the below error
"
This site can’t be reached
"
Is anybody using EMR and able to view the Web UI?
Could you please share the steps.
Would really
Hi,
Somehow, for the time being, I am unable to view the Spark Web UI and Hadoop Web UI.
Looking for other ways I can check my job is running fine, apart from
continuously checking the current yarn logs.
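One thing I am trying in the meantime (assuming SSH access to the master node)
is the YARN CLI, e.g.

  yarn application -list
  yarn application -status <application_id>
  yarn logs -applicationId <application_id>

to see the application state and, where log aggregation is enabled, the logs.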
Thanks,
Divya
appreciate the help.
Thanks,
Divya
On 13 September 2016 at 15:09, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Hi,
> Thanks all for your prompt response.
> I followed the instruction in the docs EMR SSH tunnel
> <https://docs.aws.amazon.com/ElasticMapReduce/latest/Ma
The exit code 52 comes from org.apache.spark.util.SparkExitCode, and it is
val OOM=52 - i.e. an OutOfMemoryError
Refer
https://github.com/apache/spark/blob/d6dc12ef0146ae409834c78737c116050961f350/core/src/main/scala/org/apache/spark/util/SparkExitCode.scala
On 19 September 2016 at 14:57,
Hi,
I have initialised the logging in my Spark app:
/* Initialize Logging */
import org.apache.log4j.{Level, Logger}   // assuming log4j
val log = Logger.getLogger(getClass.getName)
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
log.warn("Some text" + Somemap.size)
When I run my spark job using spark-submit like
Spark version plz ?
On 21 September 2016 at 09:46, Sankar Mittapally <
sankar.mittapa...@creditvidya.com> wrote:
> Yeah I can do all operations on that folder
>
> On Sep 21, 2016 12:15 AM, "Kevin Mellott"
> wrote:
>
>> Are you able to manually delete the folder below?
Can you please check the column order of all the data sets in the unionAll
operations? Are they in the same order?
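A hedged sketch of what I mean (dataframe names are placeholders): unionAll in
1.x matches columns by position, not by name, so you may need to reorder one
side with a select first:

import org.apache.spark.sql.functions.col

val ordered  = df2.select(df1.columns.map(col): _*)
val combined = df1.unionAll(ordered)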
On 9 August 2016 at 02:47, max square wrote:
> Hey guys,
>
> I'm trying to save Dataframe in CSV format after performing unionAll
> operations on it.
> But I get this
Hi,
The input data files for my Spark job are generated every five minutes; the
file names follow the epoch time convention as below:
InputFolder/batch-147495960
InputFolder/batch-147495990
InputFolder/batch-147496020
InputFolder/batch-147496050
InputFolder/batch-147496080
http://stackoverflow.com/questions/33864389/how-can-i-create-a-spark-dataframe-from-a-nested-array-of-struct-element
Hope this helps
Thanks,
Divya
On 19 October 2016 at 11:35, lk_spark wrote:
> hi,all:
> I want to read a json file and search it by sql .
> the data struct
Can you please elaborate on your use case?
On 18 October 2016 at 15:48, muhammet pakyürek wrote:
>
>
>
>
>
> --
> *From:* muhammet pakyürek
> *Sent:* Monday, October 17, 2016 11:51 AM
> *To:* user@spark.apache.org
> *Subject:*
Hi,
You can use the Column functions provided by Spark API
https://spark.apache.org/docs/1.6.2/api/java/org/apache/spark/sql/functions.html
Hope this helps .
Thanks,
Divya
On 17 November 2016 at 12:08, 颜发才(Yan Facai) wrote:
> Hi,
> I have a sample, like:
>
Hi Mich ,
You can create a dataframe from an RDD in the below manner also:
val df = sqlContext.createDataFrame(rdd,schema)
val df = sqlContext.createDataFrame(rdd)
The below article also may help you :
http://hortonworks.com/blog/spark-hbase-dataframe-based-hbase-connector/
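A slightly fuller sketch of the explicit-schema variant (all names and types
here are made up):

import org.apache.spark.sql.Row
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val rdd = sc.parallelize(Seq(Row("a", 1), Row("b", 2)))

val schema = StructType(Seq(
  StructField("name",  StringType,  nullable = true),
  StructField("count", IntegerType, nullable = true)
))

val df = sqlContext.createDataFrame(rdd, schema)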
On 11 October 2016 at
If my understanding is correct about your query:
in Spark, DataFrames are immutable; you can't update a dataframe.
You have to create a new dataframe to "update" the current dataframe.
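A tiny sketch to illustrate (the column name and values are made up): the
"update" is really a new dataframe derived from the old one:

import org.apache.spark.sql.functions.{col, when}

val updated = df.withColumn("status",
  when(col("status") === "old", "new").otherwise(col("status")))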
Thanks,
Divya
On 17 October 2016 at 09:50, Mungeol Heo wrote:
> Hello, everyone.
>
> As
It depends on the use case ...
Spark always depends on resource availability.
As long as you have the resources to accommodate them, you can run as many
Spark / Spark Streaming applications as you like.
Thanks,
Divya
On 15 December 2016 at 08:42, shyla deshpande
wrote:
> How many Spark
You can use UDFs to do it:
http://stackoverflow.com/questions/31615657/how-to-add-a-new-struct-column-to-a-dataframe
Hope it will help.
Thanks,
Divya
On 9 December 2016 at 00:53, Anton Kravchenko
wrote:
> Hello,
>
> I wonder if there is a way (preferably
I am not a PySpark person ..
But from the errors I could figure out that your Spark application is
having memory issues.
Are you collecting the results to the driver at any point of time, or have you
configured less memory for the nodes?
And if you are using DataFrames then there is an issue raised in
Hi ,
I am using an EMR machine and I could see the Spark log directory has grown
to 4G.
File name: spark-history-server.out
Need advice on how I can reduce the size of the above mentioned file.
Is there a config property which can help me?
Thanks,
Divya
https://jaceklaskowski.gitbooks.io/mastering-apache-spark/content/spark-sql-windows.html
Hope this helps
Thanks,
Divya
On 15 December 2016 at 12:49, Milin korath
wrote:
> Hi
>
> I have a spark data frame with following structure
>
> id flag price date
> a 0
Hi Mich ,
Have you set SPARK_CLASSPATH in spark-env.sh?
Thanks,
Divya
On 27 December 2016 at 17:33, Mich Talebzadeh
wrote:
> When one runs in Local mode (one JVM) on an edge host (the host user
> accesses the cluster), it is possible to put additional jar file
Hi Mich,
Can you try placing these jars in the Spark classpath?
It should work.
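For example, something along these lines in $SPARK_HOME/conf/spark-defaults.conf
(the jar name and path are placeholders for wherever the Oracle JDBC jar sits on
your system):

  spark.driver.extraClassPath    /path/to/ojdbc.jar
  spark.executor.extraClassPath  /path/to/ojdbc.jar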
Thanks,
Divya
On 22 December 2016 at 05:40, Mich Talebzadeh
wrote:
> This works with Spark 2 with Oracle jar file added to
>
> $SPARK_HOME/conf/spark-defaults.conf
>
>
>
>
>
Hi,
I have a Spark standalone cluster on AWS EC2 and recently my Spark streaming
jobs have been stopping abruptly.
When I checked the logs I found this:
17/03/07 06:09:39 INFO ProtocolStateActor: No response from remote.
Handshake timed out or transport failure detector triggered.
17/03/07 06:09:39 ERROR
Hi ,
I have a CDH cluster and am running a PySpark script in client mode.
There are different Python versions installed on the client and worker nodes and
I was getting a Python version mismatch error.
To resolve this issue I followed the below Cloudera document
You can try the dropDuplicates function:
https://github.com/spirom/LearningSpark/blob/master/src/main/scala/dataframe/DropDuplicates.scala
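A quick sketch of the two forms (the column names are invented): with no
arguments it compares whole rows, and with a subset of columns it dedupes on
just those:

val dedupAll  = df.dropDuplicates()
val dedupKeys = df.dropDuplicates(Seq("id", "date"))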
On 31 May 2018 at 16:34, wrote:
> Hi there !
>
> I have a potentially large dataset ( regarding number of rows and cols )
>
> And I want to find the fastest way
Hi,
I am exploring Spark structured streaming.
When I turned to the internet to understand it, I could see it is more
integrated with Kafka or other streaming tools like Kinesis.
I couldn't find where we can use the Spark streaming API for Twitter streaming
data.
Would be grateful if anybody has used
DataSource APIs to build streaming
> sources are not public yet, and are in flux.
>
> 2. Use Kafka/Kinesis as an intermediate system: Write something simple
> that uses Twitter APIs directly to read tweets and write them into
> Kafka/Kinesis. And then just read from Kafka/Kinesis
Hi ,
I see. Does that mean Spark structured streaming doesn't work with Twitter
streams?
I could see people used Kafka or other streaming tools and used Spark to
process the data in structured streaming.
The below doesn't work directly with a Twitter stream until I set up Kafka?
> import
Hi,
I am getting the below error when creating a DataFrame from a Twitter streaming RDD:
val sparkSession:SparkSession = SparkSession
.builder
.appName("twittertest2")
.master("local[*]")
.enableHiveSupport()
Hi ,
Here is an example snippet in Scala:
import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{col, to_date}

// Convert to a Date type
val timestamp2datetype: (Column) => Column = (x) => to_date(x)
// df needs to be a var for this reassignment
df = df.withColumn("date", timestamp2datetype(col("end_date")))
Hope this helps !
Thanks,
Divya
On 28 March 2018 at 15:16, Junfeng Chen
Hi Omer ,
Here are a couple of solutions which you can implement for your use case:
*Option 1 : *
you can mount the S3 bucket as a local file system.
Here are the details :
https://cloud.netapp.com/blog/amazon-s3-as-a-file-system
*Option 2 :*
You can use Amazon Glue for your use case
here are the