Hi,
Could someone please respond to this?
Thanks
Pankaj Bhootra
On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, wrote:
> Hello Team
>
> I am new to Spark and this question may be a duplicate of the issue
> highlighted here: https://issues.apache.org/jira/browse/SPARK-9347
from using csv files to parquet, but from my
hands-on experience so far, it seems that Parquet's read time is slower than CSV's.
This seems contradictory to the popular opinion that Parquet performs better in
terms of both computation and storage.
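A rough way to measure this, as a sketch only: it assumes a SparkSession named spark, made-up paths, and an "id" column. Parquet usually pays off when only a subset of columns is read (column pruning) and on large, compressed data; a full-row scan of a small file can genuinely read slower than CSV.

    // Time the same narrow query against both copies of the data (paths are placeholders).
    def timeMs[T](body: => T): Double = {
      val t0 = System.nanoTime(); body; (System.nanoTime() - t0) / 1e6
    }

    val csvMs = timeMs {
      spark.read.option("header", "true").csv("/data/events_csv").select("id").count()
    }
    val parquetMs = timeMs {
      spark.read.parquet("/data/events_parquet").select("id").count()
    }
    println(f"csv: $csvMs%.0f ms, parquet: $parquetMs%.0f ms")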
Thanks
Pankaj Bhootra
-- Forward
I have
temporarily used a UDF that accepts all these columns as parameters and creates
a JSON string, to add a column "value" for writing to Kafka.
Is there an easier and cleaner way to do the same?
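One cleaner option, as a sketch: Spark 2.1+ ships to_json and struct functions, so the "value" column can be built without a custom UDF. The DataFrame name, broker address and topic below are placeholders.

    import org.apache.spark.sql.functions.{col, struct, to_json}

    // Pack every column of df into a single JSON string column named "value".
    val forKafka = df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))

    forKafka.write
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092")  // placeholder broker
      .option("topic", "my-topic")                        // placeholder topic
      .save()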
Thanks,
Pankaj
  def run() = {
    println("In shutdown hook")
    // stop gracefully
    ssCtx.stop(true, true)
  }
})
}
}
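For reference, a minimal sketch of the same pattern using sys.addShutdownHook; ssCtx is assumed to be the application's StreamingContext, as in the fragment above. Newer Spark versions can also achieve this by setting spark.streaming.stopGracefullyOnShutdown=true.

    // ssCtx is assumed to be the StreamingContext created elsewhere in the job.
    sys.addShutdownHook {
      println("In shutdown hook")
      // stopSparkContext = true, stopGracefully = true: let in-flight batches finish first
      ssCtx.stop(stopSparkContext = true, stopGracefully = true)
    }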
Pankaj
On Fri, Dec 22, 2017 at 9:56 AM, Toy wrote:
> I'm trying to write a deployment job for a Spark application. Basically the
> job will send ya
Please make sure that you have enough memory available on the driver node. If
there is not enough free memory on the driver node, then your application won't
start.
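For example, driver memory is normally set at submit time; the class name, jar and memory sizes below are placeholders only.

    spark-submit \
      --class com.example.MyApp \
      --master yarn --deploy-mode cluster \
      --driver-memory 4g \
      --executor-memory 2g \
      my-app.jar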
Pankaj
From: vaquar khan <vaquar.k...@gmail.com>
Date: Saturday, June 10, 2017 at 5:02 PM
To: Abdul
(EventLoop.scala:48)
I see that there is a Spark ticket opened for the same
issue (https://issues.apache.org/jira/browse/SPARK-19547), but it has been marked
as INVALID. Can someone explain why this ticket is marked INVALID?
Thanks,
Pankaj
You may want to try using df2.na.fill(…)
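A minimal sketch of na.fill, with invented column names and replacement values:

    // Replace nulls per column: 0 for an assumed numeric column, "unknown" for a string one.
    val cleaned = df2
      .na.fill(0L, Seq("count"))
      .na.fill("unknown", Seq("category"))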
From: lk_spark
Date: Tuesday, 6 December 2016 at 3:05 PM
To: "user.spark"
Subject: how to add colum to dataframe
Hi all,
My Spark version is 2.0.
I have a parquet file with one column named url, whose type is string. I want to get a
substring from the url and add
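A sketch of one way to do this with withColumn, assuming a SparkSession named spark and a placeholder path; substring works by position, regexp_extract by pattern.

    import org.apache.spark.sql.functions.{col, regexp_extract, substring}

    val df = spark.read.parquet("/path/to/file.parquet")   // placeholder path

    // Fixed-position substring: the first 10 characters of url.
    val withPrefix = df.withColumn("url_prefix", substring(col("url"), 1, 10))

    // Or extract by pattern, e.g. the host part of the url.
    val withHost = df.withColumn("host", regexp_extract(col("url"), "https?://([^/]+)", 1))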
A large amount of time taken during execution is fine, but
the process should not fail.
4. What exactly is meant by an Akka timeout error during ALS job execution? (See the
configuration sketch below.)
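As a hedged aside, timeouts of this kind are often the first thing people work around by raising the RPC timeout settings. The property names depend on the Spark version and the values below are examples only, not a recommendation.

    // spark.akka.* applies to the older Akka-based RPC (pre-1.6); spark.network.timeout to 1.3+.
    val conf = new org.apache.spark.SparkConf()
      .setAppName("als-job")                 // placeholder name
      .set("spark.network.timeout", "600s")
      .set("spark.akka.timeout", "600")      // seconds, older builds only
      .set("spark.akka.frameSize", "512")    // MB, helps with large task results in ALS
    val sc = new org.apache.spark.SparkContext(conf)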
Regards,
Pankaj Rawat
The next thing you may want to check is whether the jar has been provided to all the
executors in your cluster. Most of the class-not-found errors were resolved for
me after making the required jars available in the SparkContext.
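For example (a sketch; the paths are placeholders):

    // At submit time (preferred): spark-submit --jars /path/dep1.jar,/path/dep2.jar ...
    // Or programmatically, once the SparkContext exists:
    sc.addJar("/path/to/dep1.jar")   // local, hdfs:// or http:// URIs are accepted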
Thanks.
From: Ted Yu <yuzhih...@gmail.com>
Date: Saturday, 12 March 2016
I am encountering the error below. Can somebody guide me?
Something similar is on this link:
https://github.com/elastic/elasticsearch-hadoop/issues/298
actor.MentionCrawlActor
java.io.NotSerializableException: actor.MentionCrawlActor
at
java.io.ObjectOutputStream.writeObject0(ObjectOutputStrea
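One common way out, sketched below with hypothetical helper names (createCrawlClient and process are stand-ins, not the poster's code): keep the non-serializable actor/client out of the shipped closure and construct it per partition on the executor.

    // Nothing non-serializable is captured by the closure; the client is built on the executor.
    rdd.foreachPartition { records =>
      val client = createCrawlClient()             // hypothetical factory
      records.foreach(record => client.process(record))   // hypothetical method
    }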
Nube Technologies <http://www.nubetech.co/>
> Check out Reifier at Spark Summit 2015
> <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/>
>
> <http://in.linkedin.com/in/sonalgoyal>
>
>
>
> On Wed, Aug 26
> System.out.println("Valid Records: " + validated.count());
> }
> Within the TimeSeriesData object I need to set the asset name for the reading, so
> I need the output of data.take(1) to be different for different files.
>
>
> Thank You.
>
> Best
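One possible approach, sketched with an assumed input location and file-naming convention: wholeTextFiles keeps the file path with the contents, so a per-file asset name can be derived from it.

    // Each element is (filePath, fileContents); derive the asset name from the path.
    val readings = sc.wholeTextFiles("/data/readings/*.csv").flatMap { case (path, contents) =>
      val assetName = path.split("/").last.stripSuffix(".csv")   // assumed naming convention
      contents.split("\n").map(line => (assetName, line))
    }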
could not achieve the same.
Does anybody have an idea how to do that?
Regards
Pankaj
--
tweet =>
  DBQuery.saveTweets(tweet))
// tweetsRDD.saveAsTextFile(location + timeMs + ".txt")
DBQuery.addTweetRDD(counter)
}
})
// Checkpoint directory to recover from failures
println("twee
ConfigurationProperty], new
> AvroSerializer[ConfigurationProperty]())
> kryo.register(classOf[Event], new AvroSerializer[Event]())
> }
>
> I encountered a similar error since several of the Avro core classes are
> not marked Serializable.
>
> HTH.
>
> Todd
>
> On Tue, May 5,
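For completeness, a sketch of the registrator pattern the snippet above refers to; AvroSerializer is assumed to be the project's own Kryo serializer for its Avro-generated classes, and the registrator's package name is a placeholder.

    import com.esotericsoftware.kryo.Kryo
    import org.apache.spark.serializer.KryoRegistrator

    // Register custom serializers so the Avro classes are not Java-serialized.
    class MyKryoRegistrator extends KryoRegistrator {
      override def registerClasses(kryo: Kryo): Unit = {
        kryo.register(classOf[ConfigurationProperty], new AvroSerializer[ConfigurationProperty]())
        kryo.register(classOf[Event], new AvroSerializer[Event]())
      }
    }

    // Wired up via configuration (the registrator class name is a placeholder):
    //   conf.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    //   conf.set("spark.kryo.registrator", "com.example.MyKryoRegistrator")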
Hi, I am using Spark 1.3.1 to read an Avro file stored on HDFS. The Avro
file was created using Avro 1.7.7, similar to the example mentioned in
http://www.infoobjects.com/spark-with-avro/
I am getting a NullPointerException on schema read. It could be an Avro
version mismatch. Has anybody had a simil
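Not the poster's code, but a generic sketch of reading Avro through the Hadoop input format, with a placeholder path; making sure the avro and avro-mapred versions on the classpath match the version that wrote the file is usually the first thing to check.

    import org.apache.avro.generic.GenericRecord
    import org.apache.avro.mapred.AvroKey
    import org.apache.avro.mapreduce.AvroKeyInputFormat
    import org.apache.hadoop.io.NullWritable

    // RDD of (AvroKey[GenericRecord], NullWritable) read via the new Hadoop API.
    val records = sc.newAPIHadoopFile(
      "hdfs:///path/to/file.avro",                 // placeholder path
      classOf[AvroKeyInputFormat[GenericRecord]],
      classOf[AvroKey[GenericRecord]],
      classOf[NullWritable])

    println(records.map { case (key, _) => key.datum().toString }.first())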
Hi,
I have a 3-node Spark cluster:
node1, node2 and node3.
I am running the command below on node1 to deploy the driver:
/usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class
com.fst.firststep.aggregator.FirstStepMessageProcessor --master
spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077 --de
loading data into hive tables.
Thanks,
Pankaj
http://spark.apache.org/docs/latest/
Follow this. It's easy to get started. Use a prebuilt version of Spark for
now :D
On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee <
asudipta.baner...@gmail.com> wrote:
>
>
> Hi Apache-Spark team ,
>
> What are the system requirements for installing Hadoop and A
.map(line => (line.split(",").length, line))
val groupedData = dataLengthRDD.groupByKey()
Now you can process groupedData, as it will have the arrays of length x grouped
together in one RDD.
groupByKey([numTasks]): when called on a dataset of (K, V) pairs, it returns a
dataset of (K, Iterable<V>) pairs.
I hope thi
Send me the current code here; I will fix it and send it back to you.
--
--
I just checked the post. Do you still need help?
I think getAs[Seq[String]] should help.
If you are still stuck, let me know.
--
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you call
counted.collect?
If you still face the same issue, please paste the stack trace here.
--
As per our telephone call, here is how we can fetch the count:
val tweetsCount = sql("SELECT COUNT(*) FROM tweets")
println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on
this Dataset\n\n")
--
That's great. I did not have access to the developer machine, so I sent you the
pseudo code only.
Happy to see it's working. If you need any more help related to Spark, let me
know anytime.
--
I suggest creating an uber jar instead.
Check my thread for the same:
http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html
Regards
-Pankaj
Linkedin
https://www.linkedin.com/profile/view?id=171566646
Good luck. Let me know if I can assist you further.
Regards
-Pankaj
Linkedin
https://www.linkedin.com/profile/view?id=171566646
Skype
pankaj.narang
--
= popularHashTags.flatMap(x => x.getAs[Seq[String]](0))
If you want, I can even take remote access of your machine to fix that.
Regards
Pankaj
Linkedin
https://www.linkedin.com/profile/view?id=171566646
Skype
pankaj.narang
--
If you need more help let me know
-Pankaj
Linkedin
https://www.linkedin.com/profile/view?id=171566646
Skype
pankaj.narang
--
{swimming,2}, {hiking,1}
Now hbmap.map { case (hobby, count) => (count, hobby) }.sortByKey(ascending = false).collect
will give you the hobbies sorted in descending order by their count.
This is pseudo code, but it should help you.
Regards
Pankaj
--
If you can paste the code here, I can certainly help.
Also, confirm the version of Spark you are using.
Regards
Pankaj
Infoshore Software
India
--
assandra-thrift" % "2.0.5"
libraryDependencies += "joda-time" % "joda-time" % "2.6"
and your error is Exception in thread "main" java.lang.NoSuchMethodError:
com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/T
ON RDD are saveAsObjectFile and saveAsTextFile.
Now you can read these files to show them on a web interface in any language
of your choice.
Regards
Pankaj
--
Do you assemble the uber jar?
You can use sbt assembly to build the jar and then run it. That should fix the
issue.
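A typical setup, as a sketch (the plugin version below is only an example):

    // project/plugins.sbt
    addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.10")

    // build.sbt: a common merge strategy for the usual META-INF duplicates
    assemblyMergeStrategy in assembly := {
      case PathList("META-INF", xs @ _*) => MergeStrategy.discard
      case _                             => MergeStrategy.first
    }
    // then: sbt assembly, and spark-submit the resulting *-assembly-*.jar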
--
Oops:
sqlContext.setConf("spark.sql.parquet.binaryAsString", "true")
This solved the issue. Important for everyone.
--
Also, it looks like when I store the Strings in Parquet and try to fetch
them using Spark code, I get a ClassCastException.
Below is how my array of strings is saved; each character's ASCII value is
present in an array of ints:
res25: Array[Seq[String]] = Array(ArrayBuffer(Array(104, 116, 116, 112, 58
lue.
Pankaj
--
Hi,
Suppose I keep a batch size of 3 minutes. In one batch there can be incoming
records with any timestamp, so it is difficult to keep track of when the 3-minute
interval started and ended. I am doing the output operation on worker nodes in
foreachPartition, not in the driver (foreachRDD), so I cannot use any
Hi,
My incoming messages have a timestamp as one field and I have to perform
aggregation over 3-minute time slices.
Message sample:
"Item ID"  "Item Type"  "timeStamp"
1          X            1-12-2014:12:01
1          X            1-12-2014:12:02
1          X
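This thread is about DStreams, but for the record, newer Spark (2.2+) can bucket by the event timestamp directly, so the slice boundaries do not depend on when a batch arrives. A sketch, assuming a DataFrame df with itemId, itemType and timeStamp (string) columns:

    import org.apache.spark.sql.functions.{col, count, to_timestamp, window}

    // Parse the event time and group into fixed 3-minute windows keyed on it.
    val bySlice = df
      .withColumn("ts", to_timestamp(col("timeStamp"), "d-M-yyyy:HH:mm"))
      .groupBy(window(col("ts"), "3 minutes"), col("itemId"), col("itemType"))
      .agg(count("*").as("events"))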
.
Now, I can't figure out why it should run successfully during this
time even if it could not find the SparkContext. I am sure there should be a good
reason behind this behavior. Does anyone have any idea on this?
Thanks,
Pankaj Channe
On Saturday, November 22, 2014, pankaj channe wrote:
>
>> Best Regards
>>
>> On Sat, Nov 22, 2014 at 8:39 AM, pankaj channe
>> wrote:
>>
>>> I have seen similar posts on this issue but could not find a solution.
>>> Apologies if this has been discussed here before.
>>>
>>> I am
un$1.apply(DiskBlockManager.scala:169)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311)
at
org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:169)
Note: I am building my jar locally, with the Spark dependency added in
pom.xml, and running it on a cluster running Spark.
-Pankaj
I think I should elaborate on the use case a little more.
We have a UI dashboard whose response time is quite fast, as all the data is
cached. Users query data based on a time range, and there is always new
data coming into the system at a predefined frequency, let's say every 1 hour.
As you said, I can uncache t