Do you have enough partitions of your RDDs to spread across all your
processing cores? Are all executors actually processing tasks?
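A quick way to check, as a sketch (the input path and the multiplier are assumptions):

val rdd = sc.textFile("hdfs:///data/input")             // hypothetical input
println(rdd.partitions.length)                          // current number of partitions
val wider = rdd.repartition(sc.defaultParallelism * 2)  // spread work across all cores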
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
Hi,
I have a DataFrame with several columns I'd like to explode. All of the
columns I have to explode have an ArrayBuffer type with some other type
inside.
I'd say that the following code is legitimate to use as an explode
function for any given ArrayBuffer; my assumption is that for any given
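For reference, a minimal generic sketch using the built-in explode function (df and the column names are assumptions, not the original poster's code):

import org.apache.spark.sql.functions.explode

// produces one output row per element of the "items" array column
val exploded = df.withColumn("item", explode(df("items")))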
Hi Richard,
The requirement is to get the latest records using a key; I think a hash map
is a good choice for this task.
As of now the data comes from a third party and we are not sure which record
is the latest, so a hash map was chosen.
If there is anything better than a hash map, please let me know.
Thanks
Sri
Hi, I am following the Spark Streaming stateful application example and wrote
a simple counting application with updateStateByKey.
val keyStateStream = actRegBatchCountStream.updateStateByKey(update, new
HashPartitioner(ssc.sparkContext.defaultParallelism), true, initKeyStateRDD)
This runs
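For reference, a minimal sketch of an `update` function matching that four-argument overload, which expects an iterator-based function (the key and count types are assumptions):

val update = (it: Iterator[(String, Seq[Long], Option[Long])]) =>
  it.map { case (key, newCounts, state) =>
    (key, newCounts.sum + state.getOrElse(0L))  // add this batch's counts to the running total
  }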
Well, it depends on exactly what algorithms are involved in Network Root Cause
analysis (not something I'm familiar with). GraphX provides a number of
out-of-the-box algorithms like PageRank, connected components, strongly
connected components, and label propagation, as well as an implementation of the
One other thought - you need to call SparkContext.setCheckpointDir, otherwise
nothing will happen.
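For example (the directory path is an assumption):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints")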
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
http://www.manning.com/books/spark-graphx-in-action
Let's start from some basics: maybe you need to split your data into more
partitions?
Spilling depends on your configuration when you create the graph (look for
the storage level parameters) and your global configuration.
In addition, your assumption of 64GB/100M is probably wrong, since Spark
divides memory
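As a sketch of those storage-level parameters at graph construction time (the path and partition count are assumptions):

import org.apache.spark.graphx.GraphLoader
import org.apache.spark.storage.StorageLevel

val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt",
  numEdgePartitions = 400,                               // more, smaller partitions
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK_SER,   // spill to disk instead of failing
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK_SER)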
Got it... I created a hashmap and saved it to a file; please follow the steps below:
val QuoteRDD = quotefile.map(x => x.split("\\|")).
  filter(line => line(0).contains("1017")).
  map(x => ((x(5) + x(4)), (x(5), x(4), x(1),
    if (x(15) == "B")
      (
        {if (x(25) == "") x(9) else
Hi,
I've been reviewing the Spark code and noticed that the `iterator` method
of RDD [1] checks whether the RDD has a non-NONE storage level and calls
the private method `computeOrReadCheckpoint` [2], which checks RDD
checkpointing.
Is there a doc on how StorageLevel, CacheManager and checkpointing
influence
You need to checkpoint before you materialize. You'll find you probably only
want to checkpoint every 100 or so iterations; otherwise the checkpointing
will slow down your application excessively.
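A sketch of that pattern (the step function, starting RDD, and iteration count are assumptions):

sc.setCheckpointDir("hdfs:///tmp/checkpoints")
var current = initialRdd                  // hypothetical starting RDD
for (i <- 1 to numIterations) {
  current = step(current)                 // one iteration of your algorithm
  if (i % 100 == 0) {
    current.checkpoint()                  // mark for checkpointing first...
    current.count()                       // ...then materialize to trigger it
  }
}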
-
Robin East
Spark GraphX in Action Michael Malak and Robin East
Manning Publications Co.
Can you show the output of df.printSchema? Just a guess, but I think I ran
into something similar with a column that was part of a path in Parquet.
E.g. we had an account_id in the Parquet file data itself which was of type
string, but we also named the files in the following manner
"Job has not accepted resources" is a well-known error message -- you can
search the Internet. 2 common causes come to mind:
1) you already have an application connected to the master -- by default a
driver will grab all resources so unless that application disconnects,
nothing else is allowed to
Do you need the HashMap for anything else besides writing out to a file? If
not, there is really no need to create one at all. You could just keep
everything as RDDs.
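For instance, a sketch staying entirely in RDDs (the key expression follows the earlier snippet; the output path is hypothetical):

val quotes = quotefile.map(_.split("\\|"))
  .filter(_(0).contains("1017"))
  .map(x => (x(5) + x(4), x.mkString("|")))    // (key, full record)
quotes.map { case (k, v) => s"$k|$v" }.saveAsTextFile("hdfs:///out/quotes")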
On Oct 10, 2015 11:31 AM, "kali.tumm...@gmail.com"
wrote:
> Got it ..., created hashmap and saved it to
There is a Spark-without-Hadoop version; you can use that to link with any
custom Hadoop version.
Raghav
On Oct 10, 2015 5:34 PM, "Steve Loughran" wrote:
>
> During development, I'd recommend giving Hadoop a version ending with
> -SNAPSHOT, and building spark with maven,
You should be able to achieve what you're looking for by using foldByKey to
find the latest record for each key. If you're relying on the order of
elements within the file to determine which ones are the "latest" (rather
than sorting by some field within the file itself), call zipWithIndex first
to
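A sketch of that approach, assuming records is an RDD[(String, String)] of (key, record) pairs in file order:

val indexed = records.zipWithIndex
  .map { case ((key, rec), idx) => (key, (idx, rec)) }
val latest = indexed
  .foldByKey((-1L, "")) { (a, b) => if (a._1 >= b._1) a else b }  // keep the highest index per key
  .mapValues(_._2)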
Hi, any idea how I can call percentile_approx using callUDF()? Please guide.
On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha wrote:
> I have a doubt Michael I tried to use callUDF in the following code it
> does not work.
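For reference, a sketch of calling a Hive UDAF through callUDF, assuming a HiveContext (percentile_approx is a Hive function) and hypothetical column names:

import org.apache.spark.sql.functions.{callUDF, col, lit}

// approximate 95th percentile of "value" per group
val result = df.groupBy(col("group_col"))
  .agg(callUDF("percentile_approx", col("value"), lit(0.95)))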
Thanks Richard, will give it a try tomorrow...
Thanks
Sri
> On 10 Oct 2015, at 19:15, Richard Eggert wrote:
>
> You should be able to achieve what you're looking for by using foldByKey to
> find the latest record for each key. If you're relying on
Hello,
Does Spark SQL support join order optimization as of the 1.5.1 release?
From the release notes, I did not see support for this feature, but figured
I would ask the users list to be sure.
Thanks
Raajay
> all sorts of optimizations like Tungsten
For Tungsten, please use the 1.5.1 release.
On Sat, Oct 10, 2015 at 6:24 PM, Alex Rovner
wrote:
> How many executors are you running with? How many nodes in your cluster?
>
>
> On Thursday, October 8, 2015, unk1102
How many executors are you running with? How many nodes in your cluster?
On Thursday, October 8, 2015, unk1102 wrote:
> Hi as recommended I am caching my Spark job dataframe as
> dataframe.persist(StorageLevels.MEMORY_AND_DISK_SER) but what I see in
> Spark
> job UI is
Hi Alex, thanks for the response. I am using 40 executors with 30 GB each,
including 5 GB memoryOverhead, and 4 cores. My cluster has around 100 nodes
with 30 GB and 8 cores each.
On Oct 11, 2015 06:54, "Alex Rovner" wrote:
> How many executors are you running with? How many nodes
Here is what df.schema.toString() prints:
DF Schema is ::StructType(StructField(batch_id,StringType,true))
I think you nailed the problem; this field is part of our HDFS file
path. We have partitioned our data on the basis of batch_id folders.
How did you get around it?
Thanks
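If the clash is between the path-derived batch_id and the string column in the file data, one possible workaround (an assumption, not confirmed in this thread) is to disable partition-column type inference so the directory value is kept as a string:

sqlContext.setConf("spark.sql.sources.partitionColumnTypeInference.enabled", "false")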
The choice of datastore is driven by your use case. In fact Spark can work
with multiple datastores too. Each datastore is optimised for certain kinds
of data.
e.g. HDFS is great for analytics and large data sets at rest. It is
scalable and very performant, but immutable. NoSQL databases
Hi All,
I changed my approach; now I am able to load data into a map and get
data out using the get command.
val QuoteRDD = quotefile.map(x => x.split("\\|")).
  filter(line => line(0).contains("1017")).
  map(x => ((x(5) + x(4)), (x(5), x(4), x(1),
    if (x(15) == "B")
      if
During development, I'd recommend giving Hadoop a version ending with
-SNAPSHOT and building Spark with Maven, as mvn knows to refresh the snapshot
every day.
You can do this in Hadoop with
mvn versions:set -DnewVersion=2.7.0.stevel-SNAPSHOT
If you are working on hadoop branch-2 or trunk directly, they
How are you determining how much time serialization is taking?
I made this change in a streaming app that relies heavily on updateStateByKey.
The memory consumption went up 3x on the executors, but I can't see any perf
improvement. Task execution time is the same, and the serialization state