Re: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-09 Thread Pankaj Bhootra
Hi, Could someone please respond to this? Thanks Pankaj Bhootra On Sun, 7 Mar 2021, 01:22 Pankaj Bhootra, wrote: > Hello Team > > I am new to Spark and this question may be a possible duplicate of the > issue highlighted here: https://issues.apache.org/jira/browse/SPARK-9347 &

Fwd: [jira] [Commented] (SPARK-34648) Reading Parquet Files in Spark Extremely Slow for Large Number of Files?

2021-03-06 Thread Pankaj Bhootra
from using csv files to parquet, but from my hands-on experience so far, it seems that parquet's read time is slower than csv's. This seems contradictory to the popular opinion that parquet performs better in terms of both computation and storage. Thanks Pankaj Bhootra -- Forward

Structured Streaming to Kafka Topic

2019-03-06 Thread Pankaj Wahane
I have temporarily used a UDF that accepts all these columns as parameters and creates a json string, adding a column "value" for writing to Kafka. Is there an easier and cleaner way to do the same? Thanks, Pankaj
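For the question above, Spark's built-in `to_json` and `struct` functions (available since Spark 2.1) avoid a hand-rolled UDF entirely. A minimal sketch, assuming a DataFrame `df` with arbitrary columns and the `spark-sql-kafka` connector on the classpath; the broker address and topic name are illustrative placeholders, not from the original post:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, struct, to_json}

// Pack every column into a struct and serialize it to JSON, producing the
// single "value" column that the Kafka sink expects.
def writeToKafka(df: DataFrame): Unit = {
  df.select(to_json(struct(df.columns.map(col): _*)).alias("value"))
    .write
    .format("kafka")
    .option("kafka.bootstrap.servers", "host1:9092") // placeholder broker
    .option("topic", "events")                       // placeholder topic
    .save()
}
```

This keeps the serialization declarative and lets Catalyst optimize it, whereas a UDF is a black box to the optimizer.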

Re: [E] How to do stop streaming before the application got killed

2017-12-22 Thread Rastogi, Pankaj
def run() = { println("In shutdown hook") // stop gracefully ssCtx.stop(true, true) } }) } } Pankaj On Fri, Dec 22, 2017 at 9:56 AM, Toy wrote: > I'm trying to write a deployment job for Spark application. Basically the > job will send ya
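The fragment above shows the shutdown-hook pattern. A self-contained sketch of the same idea, assuming `ssc` is a started `StreamingContext`:

```scala
import org.apache.spark.streaming.StreamingContext

// Stop the StreamingContext gracefully (let in-flight batches finish)
// when the JVM receives a termination signal.
def installShutdownHook(ssc: StreamingContext): Unit = {
  sys.addShutdownHook {
    println("In shutdown hook")
    ssc.stop(stopSparkContext = true, stopGracefully = true)
  }
}
```

Alternatively, setting `spark.streaming.stopGracefullyOnShutdown=true` in the Spark configuration asks Spark itself to stop the streaming context gracefully on JVM shutdown, without a hand-written hook.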

Re: [E] Re: Spark Job is stuck at SUBMITTED when set Driver Memory > Executor Memory

2017-06-12 Thread Rastogi, Pankaj
Please make sure that you have enough memory available on the driver node. If there is not enough free memory on the driver node, then your application won't start. Pankaj From: vaquar khan mailto:vaquar.k...@gmail.com>> Date: Saturday, June 10, 2017 at 5:02 PM To: Abdul

SPARK-19547

2017-06-07 Thread Rastogi, Pankaj
(EventLoop.scala:48) I see that there is Spark ticket opened with the same issue(https://issues.apache.org/jira/browse/SPARK-19547) but it has been marked as INVALID. Can someone explain why this ticket is marked INVALID. Thanks, Pankaj

Re: how to add colum to dataframe

2016-12-06 Thread Pankaj Wahane
You may want to try using df2.na.fill(…) From: lk_spark Date: Tuesday, 6 December 2016 at 3:05 PM To: "user.spark" Subject: how to add colum to dataframe hi,all: my spark version is 2.0 I have a parquet file with one column named url whose type is string. I want to get a substring from the url and add
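Combining the two ideas in this exchange, here is a sketch for Spark 2.0: derive a new column from a substring of `url`, then fill any nulls with `na.fill`. The column name `host`, the regex, and the fill value are illustrative assumptions, not taken from the original thread:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, regexp_extract}

// Extract the host portion of the url into a new column, then replace
// nulls (rows where the pattern did not match) with a default value.
def addHostColumn(df: DataFrame): DataFrame =
  df.withColumn("host", regexp_extract(col("url"), "https?://([^/]+)", 1))
    .na.fill(Map("host" -> "unknown"))
```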

Execution error during ALS execution in spark

2016-03-31 Thread Pankaj Rawat
A large amount of time taken during execution is fine, but the process should not fail. 4. What exactly is meant by an Akka timeout error during ALS job execution? Regards, Pankaj Rawat

Re: Spark Streaming: java.lang.NoClassDefFoundError: org/apache/kafka/common/message/KafkaLZ4BlockOutputStream

2016-03-11 Thread Pankaj Wahane
Next thing you may want to check is if the jar has been provided to all the executors in your cluster. Most of the class not found errors got resolved for me after making required jars available in the SparkContext. Thanks. From: Ted Yu mailto:yuzhih...@gmail.com>> Date: Saturday, 12 March 2016

seriazable error in apache spark job

2015-12-17 Thread Pankaj Narang
I am encountering the below error. Can somebody guide? Something similar is on this link https://github.com/elastic/elasticsearch-hadoop/issues/298 actor.MentionCrawlActor java.io.NotSerializableException: actor.MentionCrawlActor at java.io.ObjectOutputStream.writeObject0(ObjectOutputStrea
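A `java.io.NotSerializableException` like the one above usually means a non-serializable object (here the actor) was captured in a closure that Spark ships to executors. One common fix, sketched below with a hypothetical `Crawler` class standing in for `MentionCrawlActor`, is to construct the object inside `foreachPartition` so it is created on each executor and never serialized:

```scala
import org.apache.spark.rdd.RDD

// Hypothetical non-serializable resource, standing in for the actor.
class Crawler {
  def crawl(mention: String): Unit = println(s"crawling $mention")
}

// Build the resource per partition on the executor instead of capturing
// a driver-side instance in the closure.
def crawlAll(rdd: RDD[String]): Unit =
  rdd.foreachPartition { mentions =>
    val crawler = new Crawler()
    mentions.foreach(crawler.crawl)
  }
```

Marking the offending field `@transient lazy val` in the enclosing class is another common remedy when the object must live in a serialized wrapper.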

Re: Question on take function - Spark Java API

2015-08-26 Thread Pankaj Wahane
Nube Technologies <http://www.nubetech.co/> > Check out Reifier at Spark Summit 2015 > <https://spark-summit.org/2015/events/real-time-fuzzy-matching-with-spark-and-elastic-search/> > > <http://in.linkedin.com/in/sonalgoyal> > > > > On Wed, Aug 26

Question on take function - Spark Java API

2015-08-25 Thread Pankaj Wahane
> System.out.println("Valid Records: " + validated.count()); > } > Within TimeSeriesData Object I need to set the asset name for the reading, so > I need output of data.take(1) to be different for different files. > > > Thank You. > > Best

Spark Streaming Restart at scheduled intervals

2015-08-10 Thread Pankaj Narang
could not achieve the same. Have anybody have idea how to do that ? Regards Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spark-Streaming-Restart-at-scheduled-intervals-tp24192.html Sent from the Apache Spark User List mailing list archive at

Out of memory with twitter spark streaming

2015-08-06 Thread Pankaj Narang
tweet=> DBQuery.saveTweets(tweet)) //tweetsRDD.saveAsTextFile(location+ timeMs)+ ".txt" DBQuery.addTweetRDD(counter) } }) // Checkpoint directory to recover from failures println("twee

Re: AvroFiles

2015-05-05 Thread Pankaj Deshpande
ConfigurationProperty], new > AvroSerializer[ConfigurationProperty]()) > kryo.register(classOf[Event], new AvroSerializer[Event]())) > } > > I encountered a similar error since several of the Avor core classes are > not marked Serializable. > > HTH. > > Todd > > On Tue, May 5,

AvroFiles

2015-05-05 Thread Pankaj Deshpande
Hi I am using Spark 1.3.1 to read an avro file stored on HDFS. The avro file was created using Avro 1.7.7. Similar to the example mentioned in http://www.infoobjects.com/spark-with-avro/ I am getting a NullPointerException on Schema read. It could be an avro version mismatch. Has anybody had a simil

Issue with deploye Driver in cluster mode

2015-02-26 Thread pankaj
Hi, I have a 3-node spark cluster: node1, node2 and node3. I am running the below command on node1 for deploying the driver: /usr/local/spark-1.2.1-bin-hadoop2.4/bin/spark-submit --class com.fst.firststep.aggregator.FirstStepMessageProcessor --master spark://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:7077 --de

Loading JSON dataset with Spark Mllib

2015-02-15 Thread pankaj channe
loading data into hive tables. Thanks, Pankaj

Re: Spark Team - Paco Nathan said that your team can help

2015-01-22 Thread Pankaj
http://spark.apache.org/docs/latest/ Follow this. Its easy to get started. Use prebuilt version of spark as of now :D On Thu, Jan 22, 2015 at 5:06 PM, Sudipta Banerjee < asudipta.baner...@gmail.com> wrote: > > > Hi Apache-Spark team , > > What are the system requirements installing Hadoop and A

Re: reading a csv dynamically

2015-01-21 Thread Pankaj Narang
.map(line => (line.split(",").length, line)) val groupedData = dataLengthRDD.groupByKey() Now you can process the groupedData as it will have arrays of length x in one RDD. groupByKey([numTasks]): when called on a dataset of (K, V) pairs, returns a dataset of (K, Iterable) pairs. I hope thi
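Assembled into one piece, the approach described above looks like this; `lines` is an assumed `RDD[String]` of raw CSV rows:

```scala
import org.apache.spark.rdd.RDD

// Key each CSV line by its field count, then group, so rows of the same
// arity land together and can be processed with a schema of that length.
def groupByArity(lines: RDD[String]): RDD[(Int, Iterable[String])] =
  lines
    .map(line => (line.split(",", -1).length, line)) // -1 keeps trailing empty fields
    .groupByKey()
```

Note the `-1` limit on `split`: without it, trailing empty fields are dropped and two rows with the same column count could be keyed differently.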

Re: Finding most occurrences in a JSON Nested Array

2015-01-21 Thread Pankaj Narang
send me the current code here. I will fix and send back to you -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21295.html Sent from the Apache Spark User List mailing list archive at Nabble.com. --

Re: Finding most occurrences in a JSON Nested Array

2015-01-19 Thread Pankaj Narang
I just checked the post. do you need help still ? I think getAs(Seq[String]) should help. If you are still stuck let me know. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p21252.html Sent from t

Re: How to compute RDD[(String, Set[String])] that include large Set

2015-01-19 Thread Pankaj Narang
Instead of counted.saveAsTextFile("/path/to/save/dir"), what happens if you call counted.collect? If you still face the same issue please paste the stacktrace here. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/How-to-compute-RDD-String-Set-String-that-inclu

Re: Spark SQL implementation error

2015-01-06 Thread Pankaj Narang
As per telephonic call see how we can fetch the count val tweetsCount = sql("SELECT COUNT(*) FROM tweets") println(f"\n\n\nThere are ${tweetsCount.collect.head.getLong(0)} Tweets on this Dataset\n\n") -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Spar

Re: Finding most occurrences in a JSON Nested Array

2015-01-06 Thread Pankaj Narang
That's great. I did not have access to the developer machine, so I sent you the pseudo code only. Happy to see it's working. If you need any more help related to Spark, let me know anytime. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrence

Re: Set EXTRA_JAR environment variable for spark-jobserver

2015-01-06 Thread Pankaj Narang
I suggest creating an uber jar instead. Check my thread for the same: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration-with-akka-http-akka-stream-td20926.html Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-06 Thread Pankaj Narang
Good luck. Let me know If I can assist you further Regards -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoSuchMethodError-com-typesafe-config-Config-getDuration

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
= popularHashTags.flatMap ( x => x.getAs[Seq[String]](0)) Even if you want I will take the remote of your machine to fix that Regards Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
If you need more help let me know -Pankaj Linkedin https://www.linkedin.com/profile/view?id=171566646 Skype pankaj.narang -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Finding-most-occurrences-in-a-JSON-Nested-Array-tp20971p20976.html Sent from the

Re: Finding most occurrences in a JSON Nested Array

2015-01-05 Thread Pankaj Narang
{swimming,2}, {hiking,1} Now hbmap.map{ case (hobby, count) => (count, hobby) }.sortByKey(ascending = false).collect will give you hobbies sorted descending by their count. This is pseudo code and should help you Regards Pankaj -- View this message in context: http://apache-spark-user-
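The pseudo code above, made concrete as a self-contained sketch; `hobbies` is an assumed `RDD[String]` of one hobby per record:

```scala
import org.apache.spark.rdd.RDD

// Count occurrences of each hobby, then swap (hobby, count) to
// (count, hobby) so sortByKey can order by frequency, descending.
def topHobbies(hobbies: RDD[String]): Array[(Int, String)] =
  hobbies
    .map(h => (h, 1))
    .reduceByKey(_ + _)                            // e.g. (swimming,2), (hiking,1)
    .map { case (hobby, count) => (count, hobby) }
    .sortByKey(ascending = false)
    .collect()
```

For large keyspaces, `rdd.top(n)(Ordering.by(_._1))` after the swap avoids a full sort when only the top few entries are needed.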

Re: saveAsTextFile

2015-01-03 Thread Pankaj Narang
If you can paste the code here I can certainly help. Also confirm the version of spark you are using Regards Pankaj Infoshore Software India -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/saveAsTextFile-tp20951p20953.html Sent from the Apache Spark

Re: NoSuchMethodError: com.typesafe.config.Config.getDuration with akka-http/akka-stream

2015-01-02 Thread Pankaj Narang
assandra-thrift" % "2.0.5" libraryDependencies += "joda-time" % "joda-time" % "2.6" and your error is Exception in thread "main" java.lang.NoSuchMethodError: com.typesafe.config.Config.getDuration(Ljava/lang/String;Ljava/util/concurrent/T

Re: Publishing streaming results to web interface

2015-01-02 Thread Pankaj Narang
ON RDD are saveAsObjectFile, saveAsFile * Now you can read these files to show them on web interface in any language of your choice Regards Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Publishing-streaming-results-to-web-interface

(send this email to subscribe)

2015-01-02 Thread Pankaj

Re: NoClassDefFoundError when trying to run spark application

2015-01-02 Thread Pankaj Narang
do you assemble the uber jar ? you can use sbt assembly to build the jar and then run. It should fix the issue -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/NoClassDefFoundError-when-trying-to-run-spark-application-tp20707p20944.html Sent from the Apache
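A minimal `build.sbt` sketch for the uber-jar approach suggested above, assuming the sbt-assembly plugin is enabled in `project/plugins.sbt`; all names and versions are illustrative, not from the original thread:

```scala
// build.sbt — mark Spark itself "provided" so it is not bundled into the
// assembly (the cluster supplies it), and resolve META-INF conflicts.
name := "spark-app"
scalaVersion := "2.11.12"

libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8" % "provided"

assemblyMergeStrategy in assembly := {
  case PathList("META-INF", _ @ _*) => MergeStrategy.discard
  case _                            => MergeStrategy.first
}
```

Running `sbt assembly` then produces a single jar containing the application and its non-provided dependencies, which is what `spark-submit` should be given.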

Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
oops, sqlContext.setConf("spark.sql.parquet.binaryAsString", "true") — this solved the issue. Important for everyone. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20936.html Sent from the Apache Spark U

Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
Also it looks like when I store the Strings in parquet and try to fetch them using spark code I get a ClassCastException. Below is how my array of strings is saved: each character's ascii value is present in an array of ints. res25: Array[Seq[String]] = Array(ArrayBuffer(Array(104, 116, 116, 112, 58

Re: Reading nested JSON data with Spark SQL

2015-01-01 Thread Pankaj Narang
lue. Pankaj -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Reading-nested-JSON-data-with-Spark-SQL-tp19310p20933.html Sent from the Apache Spark User List mailing list archive at Nabble.

Re: Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, suppose I keep a batch size of 3 minutes. In one batch there can be incoming records with any timestamp, so it is difficult to keep track of when the 3-minute interval started and ended. I am doing output operations on worker nodes in forEachPartition, not in the driver (forEachRdd), so I cannot use any

Time based aggregation in Real time Spark Streaming

2014-12-01 Thread pankaj
Hi, My incoming message has a timestamp as one field and I have to perform aggregation over a 3-minute time slice. Message sample:
Item ID   Item Type   timeStamp
1         X           1-12-2014:12:01
1         X           1-12-2014:12:02
1         X
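In modern Spark, event-time bucketing solves exactly the problem described above: records are grouped by a fixed 3-minute window over their own timestamp rather than by batch boundary. A sketch using the Structured Streaming `window` function (Spark 2.0+, newer than this 2014 thread), assuming a DataFrame with columns `itemId`, `itemType`, and a proper timestamp column `ts`:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, count, window}

// Bucket each record into the 3-minute window containing its event time,
// then count events per (window, item) combination.
def aggregateBySlice(events: DataFrame): DataFrame =
  events
    .groupBy(window(col("ts"), "3 minutes"), col("itemId"), col("itemType"))
    .agg(count("*").alias("events"))
```

The engine tracks window start/end for you, so no manual bookkeeping of interval boundaries is needed, and a watermark on `ts` can bound state for late data.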

Re: Spark streaming job failing after some time.

2014-11-24 Thread pankaj channe
. Now, I can't figure out why it would run successfully during this time even though it could not find SparkContext. I am sure there is a good reason behind this behavior. Does anyone have any idea on this? Thanks, Pankaj Channe On Saturday, November 22, 2014, pankaj channe wrote: >

Re: Spark streaming job failing after some time.

2014-11-22 Thread pankaj channe
>> Best Regards >> >> On Sat, Nov 22, 2014 at 8:39 AM, pankaj channe >> wrote: >> >>> I have seen similar posts on this issue but could not find solution. >>> Apologies if this has been discussed here before. >>> >>> I am

Spark streaming job failing after some time.

2014-11-21 Thread pankaj channe
un$1.apply(DiskBlockManager.scala:169) at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1311) at org.apache.spark.storage.DiskBlockManager$$anon$1.run(DiskBlockManager.scala:169) Note: I am building my jar on my local with spark dependency added in pom.xml and running it on cluster running spark. -Pankaj

Re: Use Case of mutable RDD - any ideas around will help.

2014-09-12 Thread pankaj arora
I think I should elaborate the use case a little more. So we have a UI dashboard whose response time is quite fast as all the data is cached. Users query data based on time range and there is always new data coming into the system at a predefined frequency, let's say 1 hour. As you said I can uncache t
