Maybe your master or Zeppelin server is running out of memory, and the more data
it receives, the more memory swapping it has to do. Something to check.
On Wed, May 17, 2017 at 11:14 AM -0400, "Junaid Nasir" wrote:
I have a large data set of 1B records
Thanks. It looks like they posted the release just now because it wasn't
showing before.
On Fri, May 5, 2017 at 11:04 AM -0400, "Jules Damji" wrote:
Go to this link http://spark.apache.org/downloads.html
Cheers,
Jules
Hi
The website says it is released. Where can it be downloaded?
Thanks
So what was the answer?
Original message From: Andrew Holway
Date: 1/15/17 11:37 AM (GMT-05:00) To: Marco
Mistroni Cc: Neil Jonkers , User
Subject: Re: Running Spark on EMR
Darn. I didn't respond to the list. Sorry.
On Su
Has anyone got a good guide for getting a Spark master to talk to remote workers
inside Docker containers? I followed the tips I found by searching, but it still
doesn't work. Spark 1.6.2.
I exposed all the ports and tried to set the local IP inside the container to the
host IP, but Spark complains it can't bind the UI ports.
Thanks
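In case it helps, the usual Spark 1.x workaround is to pin every port Spark would
otherwise pick at random, publish exactly those ports from the container, and
advertise addresses via SPARK_LOCAL_IP / SPARK_PUBLIC_DNS in the container's
environment. A minimal sketch; the port numbers are illustrative:

import org.apache.spark.SparkConf

// Pin the normally random ports so the container can publish them.
val conf = new SparkConf()
  .set("spark.driver.port", "7001")
  .set("spark.fileserver.port", "7002")
  .set("spark.broadcast.port", "7003")
  .set("spark.blockManager.port", "7004")
// In the container environment, something like:
//   SPARK_LOCAL_IP=<container address>
//   SPARK_PUBLIC_DNS=<host address>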
Just replying for info since it's not identical to your request but in the same
spirit.
Darren
Original message From: Chetan Khatri
Date: 1/4/17 6:34 AM (GMT-05:00) To: Lars
Albertsson Cc: user , Spark Dev List
S
Date: 9/2/16 4:03 AM (GMT-05:00)
To: Mich Talebzadeh
Cc: Jakob Odersky, ayan guha, Tal Grynbaum, darren, kant kodali, Assaf Mendelson, user
Subject: Re: Scala Vs Python
Whatever benefits you may accrue from the rapid prototyping and coding in
Python will be offset against the tim
This topic is a concern for us as well. In the data science world no one uses
native Scala or Java by choice. It's R and Python, and Python is growing. Yet
in Spark, Python is third in line for feature support, if at all.
This is why we have decoupled from Spark in our project. It's really
unfortu
This is fantastic news.
Original message
From: Paolo Patierno
Date: 7/3/16 4:41 AM (GMT-05:00)
To: user@spark.apache.org
Subject: AMQP extension for Apache Spark Streaming (messaging/IoT)
Hi all,
I'm working on an AMQP exten
Original message
From: Malcolm Lockyer
Date: 05/30/2016 10:40 PM (GMT-05:00)
To: user@spark.apache.org
Subject: Re: Spark + Kafka processing trouble
On Tue, May 31, 2016 at 1:56 PM, Darren Govoni wrote:
> So you are calling a
So you are calling a SQL query (to a single database) within a Spark operation
distributed across your workers?
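Something shaped like the sketch below, say? (The connection string, query, and
the RDD's element type are illustrative assumptions.) Each partition opens its
own connection on the worker, since a connection made on the driver can't be
serialized:

import java.sql.DriverManager

// ids is assumed to be an RDD[Int] of keys to look up.
val enriched = ids.mapPartitions { part =>
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost:5432/app", "user", "secret")
  val stmt = conn.prepareStatement("SELECT name FROM users WHERE id = ?")
  val out = part.map { id =>
    stmt.setInt(1, id)
    val rs = stmt.executeQuery()
    val name = if (rs.next()) rs.getString(1) else ""
    rs.close()
    (id, name)
  }.toList          // materialize before closing the connection
  conn.close()
  out.iterator
}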
Original message
From: Malcolm Lockyer
Date: 05/30/2016 9:45 PM (GMT-05:00)
To: user@spark.apache.org
Su
Hi, I have a Python egg with a __main__.py in it. I am able to execute the egg
by itself fine.
Is there a way to just submit the egg to Spark and have it run? It seems an
external .py script is needed, which would be unfortunate if true.
Thanks
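For what it's worth, my understanding is that on Spark 1.x an external driver
script is indeed needed: spark-submit wants a primary .py file, and the egg
rides along on the Python path via --py-files. A sketch with hypothetical file
names:

spark-submit --py-files myapp.egg run_myapp.py

where run_myapp.py does nothing but import the egg's entry point and call it
(e.g. from myapp.__main__ import main; main()).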
M (GMT-05:00)
To: Darren Govoni , Jules Damji ,
Joshua Sorrell
Cc: user@spark.apache.org
Subject: Re: Does pyspark still lag far behind the Scala API in terms of
features
Plenty of people get their data in Parquet, Avro, or ORC files; or from a
database; or do their initial loading of u
Dataframes are essentially structured tables with schemas. So where does the
untyped data sit before it becomes structured, if not in a traditional RDD?
For us, almost all the processing comes before there is structure to it.
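To make that concrete, a minimal sketch (the path, format, and Record case
class are illustrative): the messy text sits in a plain RDD and gets cleaned
there; a schema appears only at the very end.

import org.apache.spark.sql.SQLContext
import scala.util.Try

case class Record(id: Long, value: String)

val raw = sc.textFile("hdfs:///data/raw")     // untyped, unstructured lines
val parsed = raw.flatMap { line =>
  Try {
    val Array(id, value) = line.split(",", 2)
    Record(id.trim.toLong, value.trim)
  }.toOption                                  // silently drop malformed lines
}
val sqlContext = new SQLContext(sc)
import sqlContext.implicits._
val df = parsed.toDF()                        // structure is imposed only here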
This might be hard to do. One generalization of this problem is
https://en.m.wikipedia.org/wiki/Longest_path_problem
Given a node (e.g. A), find the longest path. All interior relations are transitive
and can be inferred.
But finding a distributed Spark way of doing it in P time would be interesting.
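(For a DAG it is tractable. A sketch of one distributed approach with GraphX's
Pregel API; graph: Graph[_, _] is an assumed input, and on a cyclic graph this
relaxation would not converge:)

import org.apache.spark.graphx._

// Longest-path length ending at each vertex, by repeatedly relaxing edges.
val init = graph.mapVertices((_, _) => 0L)
val longest = init.pregel(0L, activeDirection = EdgeDirection.Out)(
  (id, dist, msg) => math.max(dist, msg),             // keep the longest distance seen
  triplet =>
    if (triplet.srcAttr + 1L > triplet.dstAttr)
      Iterator((triplet.dstId, triplet.srcAttr + 1L)) // extend the path by one edge
    else Iterator.empty,
  (a, b) => math.max(a, b)                            // merge parallel messages
)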
I meant to write 'last task in stage'.
Original message
From: Darren Govoni
Date: 02/16/2016 6:55 AM (GMT-05:00)
To: Abhishek Modi , user@spark.apache.org
Subject: RE: Unusually large deserialisation time
I think this is part of the bigger issue of serious deadlock conditions
occurring in Spark that many of us have posted about.
Would the task in question be the past task of a stage, by chance?
Original message
From: Abhishek Modi
From: Andy Max
Date: 02/11/2016 2:44 PM (GMT-05:00)
To: Darren Govoni
Cc: user@spark.apache.org
Subject: Re: Spark workers disconnecting on 1.5.2
No, ours are running in Docker containers spread across a few physical servers.
Databricks runs their service on AWS. Wonder if they are seeing this issue.
I see this too. Might explain some other serious problems we're having with
1.5.2.
Is your cluster in AWS?
Original message
From: Andy Max
Date: 02/11/2016 2:12 PM (GMT-05:00)
To: user@spark.apache.org
Subject: Spark w
From: "Sanders, Isaac B"
Date: 01/25/2016 8:59 AM (GMT-05:00)
To: Ted Yu
Cc: Darren Govoni , Renu Yadav , Muthu
Jayakumar , user@spark.apache.org
Subject: Re: 10hrs of Scheduler Delay
Is the thread dump the stack trace you are talking about? If so, I will see if
I can
Why not deploy it, then build a custom distribution with Scala 2.11 and just
overlay it?
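If it helps, the procedure documented for Spark 1.5/1.6 is roughly the
following (the Hadoop profile depends on your cluster):

./dev/change-scala-version.sh 2.11
mvn -Pyarn -Phadoop-2.4 -Dscala-2.11 -DskipTests clean package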
Original message
From: Nuno Santos
Date: 01/25/2016 7:38 AM (GMT-05:00)
To: user@spark.apache.org
Subject: Re: Launching EC2 ins
if I
only run 10 MB of it, it will succeed. This suggests a serious, fundamental scaling
problem.
The workers have plenty of resources.
Original message
From: "Sanders, Isaac B"
Date: 01/24/2016 2:54 PM (GMT-05:00)
To: Ren
To: Darren Govoni , "Sanders, Isaac B"
, Ted Yu
Cc: user@spark.apache.org
Subject: Re: 10hrs of Scheduler Delay
Does increasing the number of partitions help? You could try something 3
times what you currently have. Another trick I used was to partition the
problem int
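Concretely, the first suggestion is a one-liner; the multiplier is illustrative:

val repartitioned = rdd.repartition(rdd.partitions.size * 3)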
Me too. I had to shrink my dataset to get it to work. For us, at least, Spark
seems to have scaling issues.
Original message
From: "Sanders, Isaac B"
Date: 01/21/2016 11:18 PM (GMT-05:00)
To: Ted Yu
Cc: user@spark.apac
I've experienced this same problem. Always the last stage hangs. Indeterminate.
No errors in logs. I run Spark 1.5.2. Can't find an explanation. But it's
definitely a showstopper.
Original message
From: Ted Yu
Date: 01/2
Gotta roll your own. Look at Kafka and websockets, for example.
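A rough sketch of that route (the broker address, topic name, and the results
DStream are assumptions): push each micro-batch into a Kafka topic that a
websocket-backed dashboard subscribes to.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

results.foreachRDD { rdd =>
  rdd.foreachPartition { part =>
    val props = new Properties()
    props.put("bootstrap.servers", "broker:9092")
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)  // one producer per partition
    part.foreach(v => producer.send(new ProducerRecord("viz-topic", v.toString)))
    producer.close()
  }
}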
Original message
From: patcharee
Date: 01/20/2016 2:54 PM (GMT-05:00)
To: user@spark.apache.org
Subject: visualize data from spark streaming
Hi,
How to
I also would be interested in some best practices for making this work.
Where will the writeup be posted? On the Mesosphere website?
Original message
From: Sathish Kumaran Vairavelu
Date: 01/19/2016 7:00 PM (GMT-05:00)
To: T
What's the rationale behind that? It certainly limits the kind of flow logic we
can do in one statement.
Original message
From: David Russell
Date: 01/18/2016 10:44 PM (GMT-05:00)
To: charles li
Cc: user@spark.apache
Here's the executor trace.
Thread 58: Executor task launch worker-3 (RUNNABLE)
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:152)
java.net.SocketI
Hi,
I've had this nagging problem where a task hangs and then the
entire job hangs. Using pyspark, Spark 1.5.1.
The job output looks like this, and hangs after the last task:
..
15/12/29 17:00:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in
memory
I'll throw a thought in here.
Dataframes are nice if your data is uniform and clean, with a consistent schema.
However, in many big data problems this is seldom the case.
Original message
From: Chris Fregly
Date: 12/28/2015
I use Python too. I'm actually surprised it's not the primary language, since it
is by far more used in data science than Java and Scala combined.
If I had a second choice of scripting language for general apps, I'd want Groovy
over Scala.
Maybe this is helpful
https://github.com/lensacom/sparkit-learn/blob/master/README.rst
Original message
From: Mustafa Elbehery
Date: 12/06/2015 3:59 PM (GMT-05:00)
To: user
Subject: PySpark RDD with NumpyArray Structu
On its own this doesn't give me a direction to look in, without the actual logs
from $SPARK_HOME or the stderr from the worker UI.
Just IMHO; maybe someone knows what this means, but it seems like it
could be caused by a lot of things.
On 12/2/2015 6:48 PM, Darren Govoni wrote:
Hi all,
Wondering if
Hi all,
Wondering if someone can provide some insight into why this pyspark app is
just hanging. Here is the output.
...
15/12/03 01:47:05 INFO TaskSetManager: Starting task 21.0 in stage 0.0
(TID 21, 10.65.143.174, PROCESS_LOCAL, 1794787 bytes)
15/12/03 01:47:05 INFO TaskSetManager: Starting task 22
I agree 100%. Making the model requires large data and many CPUs;
using it does not.
This is a very useful side effect of ML models.
If MLlib can't use models outside Spark, that's a real shame.
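At minimum, for linear models the learned parameters can be pulled out and used
to score anywhere; a sketch, assuming model is a trained MLlib linear model such
as LogisticRegressionModel:

val coefficients = model.weights.toArray  // a plain Array[Double], usable outside Spark
val intercept = model.intercept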
Original message
From: "Kothuvat
Hi,
I read on this page
http://spark.apache.org/docs/latest/streaming-kafka-integration.html
about Python support for "receiverless" Kafka integration (Approach 2),
but it says it's incomplete as of version 1.4.
Has this been updated in version 1.5?
val aDstream = ...
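// Note: transform applies distinct() to each batch RDD independently,
// so elements are deduplicated only within a batch, never across batches.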
val distinctStream = aDstream.transform(_.distinct())
but the elements in distinctStream are not distinct.
Did I use it wrong?
On Wed, Mar 18, 2015 at 8:31 PM, Shao, Saisai wrote:
> From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar
> 18 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the
> final shuffle result.
>
Why is the shuffle result written to disk?
> As I said, did you think
I've already done that:
From the Spark UI, the Environment tab's Spark properties shows:
spark.shuffle.spill false
On Wed, Mar 18, 2015 at 6:34 PM, Akhil Das
wrote:
> I think you can disable it with spark.shuffle.spill=false
>
> Thanks
> Best Regards
>
> On Wed, Mar 18, 2015 at
Thanks, Shao
On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai wrote:
> Yeah, as I said, your job processing time is much larger than the sliding
> window, and streaming jobs are executed one by one in sequence, so the next
> job will wait until the first job is finished, so the total latency will be
sliding window is just 3 seconds, so you will
> process each 60 seconds' worth of data in 3 seconds; if the processing latency is
> larger than the sliding window, maybe your computation power cannot reach
> the qps you wanted.
>
>
>
> I think you need to identify the bottleneck
I use Spark Streaming to read messages from Kafka; the producer creates
about 1,500 messages per second.
import scala.util.hashing.MurmurHash3
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.kafka.KafkaUtils

def hash(x: String): Int = {
  MurmurHash3.stringHash(x)
}
val stream = KafkaUtils.createStream(ssc, zkQuorum, group, topicMap,
  StorageLevel.MEMORY_ONLY_SER).map(_._2)