When I cache a variable, the data never shows up in the
Storage tab; the Storage tab is always blank. I have tried it in
Zeppelin as well as in spark-shell.
scala> val classCount = spark.read.parquet("s3:// /classCount")
scala> classCount.persist
scala> classCount.count
Nothing shows up in the Storage tab.
The same problem is mentioned here:
https://forums.databricks.com/questions/117/why-is-my-rdd-not-showing-up-in-the-storage-tab-of.html
https://stackoverflow.com/questions/44792213/blank-storage-tab-in-spark-history-server
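For reference, the minimal sequence I would expect to populate the Storage tab of the live application UI (port 4040 by default; note the second link above is about the history server, which is a separate UI). This is a sketch only; the explicit storage level and the placeholder bucket are mine, not from the run above:
import org.apache.spark.storage.StorageLevel
val classCount = spark.read.parquet("s3://<bucket>/classCount") // bucket is a placeholder
classCount.persist(StorageLevel.MEMORY_AND_DISK)                // explicit storage level
classCount.count()                                              // action that materializes the cache
// Then check the Storage tab of the running application's UI.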
I am trying to do a broadcast join on two tables. The size of the
smaller table will vary based upon the parameters but the size of the
larger table is close to 2 TB. What I have noticed is that if I don't
set spark.sql.autoBroadcastJoinThreshold to 10G, some of these
operations do a SortMergeJoin
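For what it's worth, instead of raising spark.sql.autoBroadcastJoinThreshold globally, a broadcast hint on the smaller side forces the broadcast join explicitly. A sketch (the table paths and the join column are made up, not from the job above):
import org.apache.spark.sql.functions.broadcast
val large = spark.read.parquet("s3://<bucket>/large_table")   // the ~2 TB table
val small = spark.read.parquet("s3://<bucket>/small_table")   // size varies with parameters
// Hint Spark to broadcast the smaller side rather than falling back to a SortMergeJoin.
val joined = large.join(broadcast(small), Seq("join_key"))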
Umesh,
I found the following write-up, which covers Spark architecture and memory
considerations in detail. Memory management has seen updates since it was
written, but it should be a good start for you:
https://0x0fff.com/spark-architecture/
Additional sources of info from others are welcome too.
- Venkat.
On Sun, Sep
We are using Spark 2.2.0. Is it possible to bring ExpressionEncoder and
related classes from 2.3.0 into my code base and use them? The changes in
ExpressionEncoder between 2.2.0 and 2.3.0 look small, but there might be many
other classes underneath that have changed.
On Thu, Aug 16
Does this imply that we should not be adding Kafka clients in our jars?
Thanks
Venkat
Yes, I use the latest Kafka clients (0.11) to determine beginning offsets
without seek, and I also commit Kafka offsets externally.
I don't find the Spark async commit useful for our needs.
Thanks
Venkat
On Fri, 1 Dec 2017 at 02:39 Cody Koeninger wrote:
> You mentioned 0.11 version; the lat
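For anyone following along, the Spark async commit being discussed looks roughly like this with spark-streaming-kafka-0-10 (a sketch; the stream name directStream is assumed, not taken from the thread):
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges}
directStream.foreachRDD { rdd =>
  // Capture offset ranges before any shuffle or repartition of the RDD.
  val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
  // ... process the batch ...
  // Commit the offsets back to Kafka asynchronously once the batch succeeds.
  directStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}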
Thanks, Petar.
I will purchase it. Thanks for the input.
Thanks
Siva
On Tue, Jul 28, 2015 at 4:39 PM, Carol McDonald wrote:
> I agree, I found this book very useful for getting started with Spark and
> Eclipse.
>
> On Tue, Jul 28, 2015 at 11:10 AM, Petar Zecevic wrote:
>
>> Sorry about
Thanks!
Regards,
Venkat Ankam
Spark Committers: Please advise the way forward for this issue.
Thanks for your support.
Regards,
Venkat
From: Venkat, Ankam
Sent: Thursday, January 22, 2015 9:34 AM
To: 'Frank Austin Nothaft'; 'user@spark.apache.org'
Cc: 'Nick Allen'
Subject: RE: How to 'Pipe' Binary Data in Apache Spark
How much time would it take to port it?
Spark committers: Please let us know your thoughts.
Regards,
Venkat
From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu]
Sent: Thursday, January 22, 2015 9:08 AM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark
What's your take on this?
Regards,
Venkat Ankam
From: Frank Austin Nothaft [mailto:fnoth...@berkeley.edu]
Sent: Wednesday, January 21, 2015 12:30 PM
To: Venkat, Ankam
Cc: Nick Allen; user@spark.apache.org
Subject: Re: How to 'Pipe' Binary Data in Apache Spark
Hi Venkat/Nick,
The
>>> wavfile.pipe(subprocess.call(['/usr/local/bin/sox', '-t',
>>> 'wav', '-', '-n', 'stats'])).collect()  <-- Does not work. Tried different
>>> options.
AttributeError: 'function' object has no attribute 'read'
Any suggestions?
Regards,
Venkat
wavfile =
sc.textFile('hdfs://xxx:8020/user/ab00855/ext2187854_03_27_2014.wav')
wavfile.pipe(subprocess.call(['sox', '-t' 'wav', '-', '-n', 'stats']))
I tried different options like sc.binaryFiles and sc.pickleFile.
Any thoughts?
perform large-scale text analytics, and I can store data on HDFS or on
Pivotal Greenplum/HAWQ.
Regards,
Venkat Ankam
From: Brian Dolan [mailto:buddha_...@yahoo.com]
Sent: Sunday, December 14, 2014 10:02 AM
To: Venkat, Ankam
Cc: 'user@spark.apache.org'
Subject: Re: MLlib vs Madlib
MADLib (http:
Can somebody shed light on MLlib vs. MADlib?
Which is better for machine learning, and are there any specific use-case
scenarios in which MLlib or MADlib will shine?
Regards,
Venkat Ankam
We have dockerized the Spark master and worker(s) separately and are using
them in our dev environment. We don't use Mesos, though; we run in standalone
mode, but adding Mesos should not be that difficult, I think.
Regards
Venkat
automatically.
* Table joins for huge tables are costly. Fact and dimension concepts from
star schema don't translate well to big data (Hadoop, Spark). It may be
better to de-normalize and store huge tables to avoid joins; joins seem to
be evil. (I have tried de-normalizing when using Cassandra
Bumping this up.
Michael Armbrust, or anybody from the Spark SQL team?
This is more of a Scala question than a Spark question. Which dependency
injection framework do you use for Scala when using Spark? Is
http://scaldi.org/ recommended?
Regards
Venkat
Let me know if you have any suggestions.
Regards,
Venkat
Any idea how to resolve this?
Regards,
Venkat
From: Venkat, Ankam
Sent: Sunday, November 23, 2014 12:05 PM
To: 'user@spark.apache.org'
Subject: Spark Streaming with Python
I am trying to run the network_wordcount.py example mentioned at
https://github.com/apache/spark/blob/master/example
"/usr/lib/spark/examples/lib/mllib/logistic_regression.py", line 37, in
parsePoint
values = [float(s) for s in line.split(' ')]
ValueError: invalid literal for float(): 1:0.4551273600657362
Regards,
Venkat
s/lib/network_wordcount.py", line 4, in
from pyspark.streaming import StreamingContext
ImportError: No module named streaming.
How do I resolve this?
Regards,
Venkat
TD has addressed this. It should be available in 1.2.0.
https://issues.apache.org/jira/browse/SPARK-3495
On Thu, Oct 2, 2014 at 9:45 AM, maddenpj wrote:
> I am seeing this same issue. Bumping for visibility.
>
You could set "spark.executor.memory" to something bigger than the default
(512 MB).
On Thu, Sep 11, 2014 at 8:31 AM, Aniket Bhatnagar <
aniket.bhatna...@gmail.com> wrote:
> I am running a simple Spark Streaming program that pulls in data from
> Kinesis at a batch interval of 10 seconds, windows i
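For example, one way to set it when building the context (a sketch; the app name and the 2g value are illustrative and must fit within what the workers actually have):
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
val conf = new SparkConf()
  .setAppName("kinesis-wordcount")             // illustrative name
  .set("spark.executor.memory", "2g")          // default is 512 MB
val ssc = new StreamingContext(conf, Seconds(10))  // 10-second batches, as in the post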
Hi Dibyendu,
That would be great. One of the biggest drawbacks of Kafka utils, as well as
of your implementation, is that I am unable to scale out processing. I am
relatively new to Spark and Spark Streaming; from what I read and what I
observe with my deployment, having the RDD created on one receiver
Hi,
To test the resiliency of Kafka Spark Streaming, I killed the worker
reading from the Kafka topic and noticed that the driver is unable to replace
the worker, and the job becomes a rogue job that keeps running but does nothing
from that point on.
Is this a known issue? Are there any workarounds?
He
The method
def updateStateByKey[S: ClassTag](updateFunc: (Seq[V], Option[S]) => Option[S]): DStream[(K, S)]
takes a DStream[(K, V)] and produces a DStream[(K, S)] in Spark Streaming.
We have an input DStream[(K, V)] that has 40,000 elements. We update on
average 1,000 of them every 3 seconds
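As a concrete illustration of that signature, a running count per key would look roughly like this (a sketch; pairs is an assumed DStream[(String, Int)], not from the original job):
import org.apache.spark.streaming.dstream.DStream
ssc.checkpoint("hdfs://<namenode>/checkpoints")   // updateStateByKey requires checkpointing
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(newValues.sum + state.getOrElse(0))
val counts: DStream[(String, Int)] = pairs.updateStateByKey[Int](updateFunc)
// Note: the update function is invoked for every key held in state on every batch,
// not only for the ~1,000 keys that actually received new values.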
spending weeks trying to figure out
what is happening here and trying out different things).
This issue has been around from 0.9 until now (1.0.1) at least.
Thanks,
Venkat
For the time being, we decided to take a different route. We created a REST
API layer in our app and allow SQL queries to be passed in via REST. Internally
we pass that query to the Spark SQL layer over the RDD and return the
results. With this, Spark SQL is supported for our RDDs via this REST API
now
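Roughly, the internal path looks like this (a sketch using the Spark 1.x SQLContext API; the Event record type and the query are illustrative, not our actual code):
import org.apache.spark.sql.SQLContext
case class Event(id: Long, category: String, value: Double)    // hypothetical record type
val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD                              // implicit RDD -> SchemaRDD conversion
val events = sc.parallelize(Seq(Event(1L, "a", 1.0)))          // stand-in for our real RDD
events.registerTempTable("events")
// The query string arrives through the REST layer and is handed to Spark SQL as-is.
def runQuery(query: String) = sqlContext.sql(query).collect()
runQuery("SELECT category, COUNT(*) FROM events GROUP BY category")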
not the right one for this use case? Or do we write our
own partitioner? If we need to write a new partitioner, can someone give
pseudocode for this use case to help us?
Regards,
Venkat
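Since the question is what a custom partitioner would look like, here is a bare-bones skeleton (a sketch only; the keying rule is made up and would have to encode the real use case):
import org.apache.spark.Partitioner
// Routes keys to partitions by a custom rule instead of the default hashCode-based one.
class CategoryPartitioner(override val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = key match {
    case s: String => Math.floorMod(s.take(1).hashCode, numPartitions)  // illustrative rule
    case _         => 0
  }
}
// Usage: pairRDD.partitionBy(new CategoryPartitioner(16))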
not possible out of the box with Shark. If you look at the code for
SharkServer2, though, you'll see that it's just a standard HiveContext under
the covers. If you modify this startup code, any SchemaRDD you register as
a table in this context will be exposed over JDBC.
[Venkat] Are you
Thanks, Michael.
OK, I will try SharkServer2.
But I have some basic questions on a related area:
1) If I have a standalone Spark application that has already built an RDD,
how can SharkServer2, or for that matter Shark, access that RDD and run
queries on it? All the examples I have seen for Shark, the
SQL work
on DStreams (since the underlying structure is an RDD anyway), and can we
expose the streaming DStream RDDs through JDBC via Spark SQL for real-time
analytics? Any pointers on this will greatly help.
Regards,
Venkat
--
View this message in context:
http://apache-spark-user-list.1001560.n3
Christopher
Just to clarify - by ‘load ops’ do you mean RDD actions that result in IO?
Venkat
From: Christopher Nguyen <c...@adatao.com>
Reply-To: "user@spark.apache.org" <user@spark.apache.org>
Date: Saturday, April
HDFS substrate) for a start,
or will that not work at all? I do know that it’s possible to implement Tachyon
on Lustre and get the HDFS interface – just looking at other options.
Venkat