Cached data not showing up in Storage tab

2018-10-16 Thread Venkat Dabri
When I cache a variable, the data never shows up in the Storage tab; the tab is always blank. I have tried it in Zeppelin as well as spark-shell:

scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count

Nothing shows up in the Storage tab.
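For reference, a minimal spark-shell sequence that should make the cache visible, plus a driver-side check that does not depend on the UI (bucket name hypothetical; MEMORY_AND_DISK is already the Dataset default level):

    import org.apache.spark.storage.StorageLevel

    val classCount = spark.read.parquet("s3://bucket/classCount") // bucket hypothetical
    classCount.persist(StorageLevel.MEMORY_AND_DISK) // explicit level for clarity
    classCount.count()                               // caching happens on the first action
    // list what the driver thinks is persisted, independent of the UI:
    spark.sparkContext.getPersistentRDDs.foreach { case (id, rdd) =>
      println(s"RDD $id: ${rdd.getStorageLevel.description}")
    }

If getPersistentRDDs reports the RDD while the Storage tab stays blank, that would point at a UI or history-server issue rather than at caching itself.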

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Venkat Dabri
The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server
On Tue, Oct 16, 2018 at 8:06 AM Venkat Dabri wrote: > > I did tr

Re: Spark seems to think that a particular broadcast variable is large in size

2018-10-16 Thread Venkat Dabri
> > On Mon, Oct 15, 2018 at 11:53 AM Venkat Dabri wrote: >> >> I am trying to do a broadcast join on two tables. The size of the >> smaller table will vary based upon the parameters but the size of the >> larger table is close to 2TB. What I have noticed is that if I don't

Spark seems to think that a particular broadcast variable is large in size

2018-10-15 Thread Venkat Dabri
I am trying to do a broadcast join on two tables. The size of the smaller table will vary based upon the parameters, but the size of the larger table is close to 2TB. What I have noticed is that if I don't set spark.sql.autoBroadcastJoinThreshold to 10G, some of these operations do a SortMergeJoin
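For context, a hedged sketch of the two knobs involved (DataFrame names hypothetical): the auto-broadcast threshold the thread mentions, and the explicit broadcast hint, which forces a broadcast join regardless of the planner's size estimate:

    import org.apache.spark.sql.functions.broadcast

    // raise the auto-broadcast threshold (in bytes); -1 disables auto-broadcasting
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 10L * 1024 * 1024 * 1024)

    // or force the plan for a single join, whatever the size estimate says
    val joined = largeDf.join(broadcast(smallDf), Seq("joinKey"))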

Re: Best practices on how to multiple spark sessions

2018-09-17 Thread Venkat Ramakrishnan
Umesh, I found the following write-up, which deals with architecture and memory considerations in detail. There have been updates to memory management since it was written, but it should be a good start for you: https://0x0fff.com/spark-architecture/ Any additional sources of info are welcome from others too. - Venkat. On Sun, Sep

Re: java.lang.UnsupportedOperationException: No Encoder found for Set[String]

2018-08-16 Thread Venkat Dabri
We are using Spark 2.2.0. Is it possible to bring the ExpressionEncoder from 2.3.0 and related classes into my code base and use them? I see the changes in ExpressionEncoder between 2.3.0 and 2.2.0 are not large, but there might be many other classes underneath that have changed. On Thu, Aug 16
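A commonly suggested workaround on 2.2.0, rather than backporting 2.3.0 classes, is a Kryo-based encoder; a minimal sketch (assumes a spark-shell style SparkSession named spark):

    import org.apache.spark.sql.{Encoder, Encoders}

    // Kryo stores the Set[String] as an opaque binary column: the Dataset
    // compiles and runs, but you lose columnar access to the set's contents.
    implicit val setEncoder: Encoder[Set[String]] = Encoders.kryo[Set[String]]

    val ds = spark.createDataset(Seq(Set("a", "b"), Set("c")))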

Re: [Spark streaming] No assigned partition error during seek

2017-11-30 Thread venkat
Does this imply that we should not be adding Kafka clients in our jars? Thanks Venkat On Fri, 1 Dec 2017 at 06:45 venkat wrote: > Yes, I use the latest Kafka clients 0.11 to determine beginning offsets without > seek, and I also commit Kafka offsets externally. > I don't find the Spark async

Re: [Spark streaming] No assigned partition error during seek

2017-11-30 Thread venkat
Yes, I use the latest Kafka clients 0.11 to determine beginning offsets without seek, and I also commit Kafka offsets externally. I don't find the Spark async commit useful for our needs. Thanks Venkat On Fri, 1 Dec 2017 at 02:39 Cody Koeninger wrote: > You mentioned the 0.11 version; the lat
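For context, external offset management with the 0-10 direct stream usually looks roughly like this; `stream` is assumed to come from KafkaUtils.createDirectStream, and saveOffsets is a hypothetical helper writing to your own store:

    import org.apache.spark.streaming.kafka010.{HasOffsetRanges, OffsetRange}

    stream.foreachRDD { rdd =>
      // capture the exact offsets this batch covers before any transformation
      val ranges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
      // ... process rdd ...
      // then persist the ranges transactionally alongside your results:
      // saveOffsets(ranges)  // hypothetical helper
    }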

Re: Spark - Eclipse IDE - Maven

2015-07-29 Thread Venkat Reddy
Thanks Petar, I will purchase it. Thanks for the input. Thanks Siva On Tue, Jul 28, 2015 at 4:39 PM, Carol McDonald wrote: > I agree, I found this book very useful for getting started with Spark and > Eclipse > > On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic > wrote: > >> >> Sorry about

SparkSQL - Caching RDDs

2015-04-01 Thread Venkat, Ankam
Thanks! Regards, Venkat Ankam

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-23 Thread Venkat, Ankam
Spark Committers: Please advise the way forward for this issue. Thanks for your support. Regards, Venkat From: Venkat, Ankam Sent: Thursday, January 22, 2015 9:34 AM To: 'Frank Austin Nothaft'; 'user@spark.apache.org' Cc: 'Nick Allen' Subject: RE: How to 'Pipe' Binary Data in Apache Spark

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Venkat, Ankam
How much time would it take to port it? Spark committers: Please let us know your thoughts. Regards, Venkat From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Thursday, January 22, 2015 9:08 AM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-22 Thread Venkat, Ankam
What's your take on this? Regards, Venkat Ankam From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu] Sent: Wednesday, January 21, 2015 12:30 PM To: Venkat, Ankam Cc: Nick Allen; user@spark.apache.org Subject: Re: How to 'Pipe' Binary Data in Apache Spark Hi Venkat/Nick, The

RE: How to 'Pipe' Binary Data in Apache Spark

2015-01-21 Thread Venkat, Ankam
'/usr/local/bin/sox', '-t' 'wav', '-', '-n', 'stats'])).collect() <-- Does not work. Tried different options. AttributeError: 'function' object has no attribute 'read' Any suggestions? Regards, V

Processing .wav files in PySpark

2015-01-16 Thread Venkat, Ankam
wavfile = sc.textFile('hdfs://xxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(subprocess.call(['sox', '-t' 'wav', '-', '-n', 'stats']))

I tried different options like sc.binaryFiles and sc.pickleFile. Any thoughts
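rdd.pipe streams records as newline-delimited text, which is one reason piping raw audio bytes through it fails (and subprocess.call here runs eagerly on the driver rather than being handed to pipe). A sketch of a workaround, written in Scala since pipe's contract is the same there: read whole files as bytes with binaryFiles and drive sox per file (paths hypothetical; sox prints its 'stats' report on stderr, hence the redirect):

    import java.io.BufferedOutputStream
    import scala.io.Source
    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("wav-stats"))

    val stats = sc.binaryFiles("hdfs://xxx:8020/user/ab00855/*.wav").map {
      case (path, data) =>
        val proc = new ProcessBuilder("sox", "-t", "wav", "-", "-n", "stats")
          .redirectErrorStream(true) // sox writes the stats report to stderr
          .start()
        val stdin = new BufferedOutputStream(proc.getOutputStream)
        stdin.write(data.toArray()) // feed the raw bytes to sox's stdin
        stdin.close()
        val report = Source.fromInputStream(proc.getInputStream).mkString
        proc.waitFor()
        (path, report)
    }
    stats.collect().foreach { case (p, r) => println(s"$p\n$r") }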

RE: MLlib vs Madlib

2014-12-14 Thread Venkat, Ankam
perform large-scale text analytics, and I can store data on HDFS or on Pivotal Greenplum/HAWQ. Regards, Venkat Ankam From: Brian Dolan [mailto:buddha_...@yahoo.com] Sent: Sunday, December 14, 2014 10:02 AM To: Venkat, Ankam Cc: 'user@spark.apache.org' Subject: Re: MLlib vs Madlib MADLib (http:

MLlib vs Madlib

2014-12-14 Thread Venkat, Ankam
Can somebody throw light on MLlib vs Madlib? Which is better for machine learning, and are there any specific use-case scenarios where MLlib or Madlib will shine? Regards, Venkat Ankam

Re: dockerized spark executor on mesos?

2014-12-09 Thread Venkat Subramanian
We have dockerized the Spark master and worker(s) separately and are using them in our dev environment. We don't use Mesos, though; we run in standalone mode, but adding Mesos should not be that difficult, I think. Regards Venkat

Re: Spark SQL table Join, one task is taking long

2014-12-04 Thread Venkat Subramanian
automatically. * Table joins for huge table(s) are costly. Fact and dimension concepts from star schemas don't translate well to big data (Hadoop, Spark). It may be better to de-normalize and store huge tables to avoid joins; joins seem to be evil. (Have tried de-normalizing when using Cassandra

Re: Spark SQL table Join, one task is taking long

2014-12-02 Thread Venkat Subramanian
Bump up. Michael Armbrust, anybody from the Spark SQL team?

Scala Dependency Injection

2014-12-02 Thread Venkat Subramanian
This is more of a Scala question than a Spark question. Which dependency injection framework do you use for Scala when using Spark? Is http://scaldi.org/ recommended? Regards Venkat
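For what it's worth, a framework-free alternative many Spark codebases use is plain constructor injection, since anything referenced inside closures has to be serializable anyway; a small sketch (names hypothetical):

    // plain constructor injection: no framework, no runtime reflection,
    // and the dependency is explicitly Serializable for use in closures
    trait Tokenizer extends Serializable { def tokenize(s: String): Seq[String] }

    class WhitespaceTokenizer extends Tokenizer {
      def tokenize(s: String): Seq[String] = s.split("\\s+").toSeq
    }

    class Pipeline(tokenizer: Tokenizer) extends Serializable {
      def run(lines: org.apache.spark.rdd.RDD[String]) =
        lines.flatMap(tokenizer.tokenize)
    }

    // wiring happens once, at the edge of the application:
    // new Pipeline(new WhitespaceTokenizer).run(sc.textFile("..."))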

Spark SQL table Join, one task is taking long

2014-12-01 Thread Venkat Subramanian
Let me know if you have any suggestions. Regards, Venkat

RE: Spark Streaming with Python

2014-11-25 Thread Venkat, Ankam
Any idea how to resolve this? Regards, Venkat From: Venkat, Ankam Sent: Sunday, November 23, 2014 12:05 PM To: 'user@spark.apache.org' Subject: Spark Streaming with Python I am trying to run the network_wordcount.py example mentioned at https://github.com/apache/spark/blob/master/example

Python Logistic Regression error

2014-11-23 Thread Venkat, Ankam
"/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in parsePoint
    values = [float(s) for s in line.split(' ')]
ValueError: invalid literal for float(): 1:0.4551273600657362

Regards, Venkat
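The traceback points at the cause: the data is in LIBSVM index:value format (hence the literal 1:0.4551273600657362), while parsePoint splits on spaces and calls float() on each token. A sketch of loading such data with MLlib's built-in loader instead (path hypothetical; shown in Scala, though pyspark's MLUtils offers the same call):

    import org.apache.spark.mllib.util.MLUtils

    // loadLibSVMFile parses the index:value format into LabeledPoints
    val points = MLUtils.loadLibSVMFile(sc, "hdfs:///data/sample_libsvm_data.txt")
    points.take(1).foreach(println)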

Spark Streaming with Python

2014-11-23 Thread Venkat, Ankam
s/lib/network_wordcount.py", line 4, in
    from pyspark.streaming import StreamingContext
ImportError: No module named streaming

How to resolve this? Regards, Venkat

Re: Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-10-06 Thread Bharat Venkat
TD has addressed this. It should be available in 1.2.0. https://issues.apache.org/jira/browse/SPARK-3495 On Thu, Oct 2, 2014 at 9:45 AM, maddenpj wrote: > I am seeing this same issue. Bumping for visibility.

Re: Out of memory with Spark Streaming

2014-09-11 Thread Bharat Venkat
You could set "spark.executor.memory" to something bigger than the default (512mb) On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar < aniket.bhatna...@gmail.com> wrote: > I am running a simple Spark Streaming program that pulls in data from > Kinesis at a batch interval of 10 seconds, windows i
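A sketch of the two usual places to set it (the 2g value is illustrative):

    // in code, before the StreamingContext/SparkContext is created:
    val conf = new org.apache.spark.SparkConf()
      .setAppName("streaming-app")
      .set("spark.executor.memory", "2g")

    // or on the command line:
    //   spark-submit --executor-memory 2g ...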

Re: Low Level Kafka Consumer for Spark

2014-08-27 Thread Bharat Venkat
Hi Dibyendu, That would be great. One of the biggest drawbacks of Kafka utils, as well as your implementation, is that I am unable to scale out processing. I am relatively new to Spark and Spark Streaming; from what I read and what I observe with my deployment, having the RDD created on one rece
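For context, the receiver-based workaround at the time was to create several receivers and union them, spreading consumption across executors; a rough sketch (ZooKeeper address, group, topic, and sizing all hypothetical; `ssc` is an existing StreamingContext):

    import org.apache.spark.streaming.kafka.KafkaUtils

    val numReceivers = 4 // one receiver, i.e. one core, per stream
    val streams = (1 to numReceivers).map { _ =>
      KafkaUtils.createStream(ssc, "zk:2181", "my-group", Map("topic" -> 1))
    }
    val unified = ssc.union(streams)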

Kafka Spark Streaming job has an issue when the worker reading from Kafka is killed

2014-08-21 Thread Bharat Venkat
Hi, To test the resiliency of Kafka Spark streaming, I killed the worker reading from Kafka Topic and noticed that the driver is unable to replace the worker and the job becomes a rogue job that keeps running doing nothing from that point on. Is this a known issue? Are there any workarounds? He

UpdateStateByKey - How to improve performance?

2014-08-06 Thread Venkat Subramanian
The method

    def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]

takes a DStream[(K, V)] and produces a DStream[(K, S)] in Spark Streaming. We have an input DStream[(K, V)] that has 40,000 elements. We update on average 1,000 of them every 3 secon
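For reference, a minimal use of that signature (types illustrative; `pairs` is assumed to be a DStream[(String, Int)]). The performance catch behind the question is that the update function runs over every key in the state on each batch, not only the keys that received new data:

    // running count per key, carried across batches
    val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
      Some(state.getOrElse(0) + newValues.sum)

    val stateStream = pairs.updateStateByKey[Int](updateFunc)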

Re: streaming window not behaving as advertised (v1.0.1)

2014-08-01 Thread Venkat Subramanian
spending weeks trying to figure out what is happening here and trying out different things). This issue has been around from 0.9 to date (1.0.1) at least. Thanks, Venkat

Re: Spark SQL JDBC Connectivity

2014-07-30 Thread Venkat Subramanian
For the time being, we decided to take a different route. We created a REST API layer in our app and allowed SQL queries to be passed in via REST. Internally we pass that query to the Spark SQL layer over the RDD and return the results. With this, Spark SQL is supported for our RDDs via this REST API now
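A sketch of the pattern described, with later-1.x API names approximated (the case class, table name, and handler are hypothetical):

    import org.apache.spark.sql.SQLContext

    case class Event(id: Long, kind: String)

    val sqlContext = new SQLContext(sc) // `sc` is an existing SparkContext
    import sqlContext.implicits._

    sc.parallelize(Seq(Event(1, "click"))).toDF().registerTempTable("events")

    // the REST handler simply forwards the caller's SQL string:
    def handleQuery(sql: String): Array[String] =
      sqlContext.sql(sql).collect().map(_.toString)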

Partioner to process data in the same order for each key

2014-07-30 Thread Venkat Subramanian
not the right one for this use case? Or do we write our own partitioner? If we need to write a new partitioner, can someone give pseudocode for this use case to help us? Regards, Venkat
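A minimal custom Partitioner sketch: the same key always maps to the same partition, so all records for a key land together and can be processed sequentially within that partition (partition count illustrative):

    import org.apache.spark.Partitioner

    class KeyPartitioner(parts: Int) extends Partitioner {
      override def numPartitions: Int = parts
      override def getPartition(key: Any): Int = {
        val h = key.hashCode % parts
        if (h < 0) h + parts else h // hashCode can be negative
      }
    }

    // usage: pairRdd.partitionBy(new KeyPartitioner(8))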

Re: Spark SQL JDBC Connectivity and more

2014-06-09 Thread Venkat Subramanian
not possible out of the box with Shark. If you look at the code for SharkServer2, though, you'll see that it's just a standard HiveContext under the covers. If you modify this startup code, any SchemaRDD you register as a table in this context will be exposed over JDBC. [Venkat] Are you

Re: Spark SQL JDBC Connectivity and more

2014-05-29 Thread Venkat Subramanian
Thanks Michael. OK, will try SharkServer2. But I have some basic questions on a related area: 1) If I have a standalone Spark application that has already built an RDD, how can SharkServer2, or for that matter Shark, access 'that' RDD and do queries on it? All the examples I have seen for Shark, the

Spark SQL JDBC Connectivity

2014-05-28 Thread Venkat Subramanian
SQL work on DStreams (since the underlying structure is an RDD anyway), and can we expose the streaming DStream RDDs through JDBC via Spark SQL for real-time analytics? Any pointers on this will greatly help. Regards, Venkat
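On the first question, the usual pattern is to drop to the underlying RDDs inside foreachRDD and register each micro-batch for SQL; a hedged sketch using later-1.x API names (`dstream` is assumed to be a DStream of the case class below):

    import org.apache.spark.sql.SQLContext

    case class Reading(sensor: String, value: Double)

    dstream.foreachRDD { rdd =>
      val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
      import sqlContext.implicits._
      rdd.toDF().registerTempTable("readings")
      sqlContext.sql("SELECT sensor, AVG(value) FROM readings GROUP BY sensor").show()
    }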

unsubscribe

2014-05-18 Thread Venkat Krishnamurthy

Re: Spark on other parallel filesystems

2014-04-05 Thread Venkat Krishnamurthy
Christopher, just to clarify: by ‘load ops’ do you mean RDD actions that result in IO? Venkat From: Christopher Nguyen Reply-To: user@spark.apache.org Date: Saturday, April

Spark on other parallel filesystems

2014-04-04 Thread Venkat Krishnamurthy
HDFS substrate) for a start, or will that not work at all? I do know that it’s possible to implement Tachyon on Lustre and get the HDFS interface – just looking at other options. Venkat