Kinesis? What I would ultimately like to achieve is that the
flatMapGroupsWithState() call later in the pipeline should see the same
(partition) key internally for key lookups in the (RocksDB) state store, so
that data locality can be achieved.
Is this redundant, implicit, or not possible?
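The idea can be pictured with a small sketch in plain Python rather than Spark (the partitioner, key name, and partition count below are illustrative assumptions, not Spark internals): flatMapGroupsWithState shuffles rows by the grouping key, so as long as the upstream system and the state store hash the same key through the same partitioner, a given key's state always lives in one partition.

```python
def hash_key(key: str) -> int:
    # Stable string hash (Python's built-in hash() is salted per process,
    # so we roll a simple deterministic one for the sketch).
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic partitioner: the same key always maps to the same partition."""
    return hash_key(key) % num_partitions

# If the shuffle before the stateful operator and the state store both
# partition by the same key, state lookups for that key stay on one partition.
upstream_partition = partition_for("device-42", 8)
state_partition = partition_for("device-42", 8)
assert upstream_partition == state_partition
```

Whether the source (e.g. Kinesis shards) lines up with those partitions is a separate question; Spark will still shuffle by the grouping key regardless.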
Is there a way to figure out which IP or hostname of the data nodes is
returned from the name node to Spark? Or can you offer me a debugging approach?
On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer <russell.spit...@gmail.com> wrote:
Data locality can only occur if the Spark executor's IP address string matches
the preferred location returned by the file system. So this job would only have
local tasks if the datanode replicas for the files in question had the same IP
address as the Spark executors you are using. If they don't match, the tasks
fall back to non-local scheduling.
https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
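The string matching described above can be sketched in plain Python (this is an illustration of the idea, not Spark's actual scheduler code; the host strings are made up): node-level locality is decided by comparing the executor's registered host against the block's preferred locations.

```python
def locality_for(preferred_hosts: list, executor_host: str) -> str:
    """Classify task locality at node level the way the thread describes:
    an exact string match between the executor's host and one of the block's
    preferred locations yields NODE_LOCAL; otherwise the task is ANY."""
    return "NODE_LOCAL" if executor_host in preferred_hosts else "ANY"

# Datanode replicas report 10.0.0.5/10.0.0.6; an executor registered under a
# Docker bridge IP (172.17.0.2) never matches, so its tasks are ANY.
print(locality_for(["10.0.0.5", "10.0.0.6"], "10.0.0.5"))    # NODE_LOCAL
print(locality_for(["10.0.0.5", "10.0.0.6"], "172.17.0.2"))  # ANY
```

This is why container or NAT setups often lose locality: the comparison is on the literal host string, so host networking (or matching hostnames) is the usual fix.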
Hi,
I am using Spark 2.3.2 and facing issues with data locality: even after
setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL.
Can someone help me with this?
Thank you
It might have to do with your container IPs; it depends on your networking
setup. You might want to try host networking so that the containers share the
IP with the host.
On Wed, Dec 28, 2016 at 1:46 AM, Karamba wrote:
Hi Sun Rui,
thanks for answering!
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it.
This explains why the script that I configured in core-site.xml's
topology.script.file.name is not called by the Spark container. But at the
time of reading from HDFS in a Spark program, the script is called in my HDFS
namenode container.
Although the Spark task scheduler is aware of rack-level data locality, it
seems that only YARN implements the support for it. However, node-level
locality can still work for Standalone.
It is not necessary to copy the Hadoop config files into the Spark conf
directory. Set HADOOP_CONF_DIR to point at them instead.
Hi,
I am running a couple of Docker hosts, each with an HDFS node and a Spark
worker in a Spark standalone cluster.
In order to get data locality awareness, I would like to configure racks
for each host, so that a Spark worker container knows from which HDFS
node container it should load its data.
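For reference, a Hadoop rack topology script can be any executable that takes node IPs or hostnames as arguments and prints one rack path per argument; it is wired up via the topology.script.file.name property in core-site.xml. Below is a minimal sketch in Python; the IPs and rack names are made-up examples.

```python
#!/usr/bin/env python3
"""Minimal Hadoop topology script sketch.

Hadoop invokes this with one or more datanode IPs/hostnames as arguments
and expects one rack path per argument on stdout, whitespace-separated.
Unknown nodes conventionally map to /default-rack."""
import sys

# Made-up example mapping; in practice this is generated from your inventory.
RACKS = {
    "10.0.1.11": "/rack-a",
    "10.0.1.12": "/rack-a",
    "10.0.2.21": "/rack-b",
}

def rack_of(node: str) -> str:
    return RACKS.get(node, "/default-rack")

if __name__ == "__main__":
    print(" ".join(rack_of(arg) for arg in sys.argv[1:]))
```

Note that, per the reply above, rack-level scheduling in Spark itself appears to be YARN-only; in standalone mode this script mainly influences HDFS replica placement.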
On 5 June 2016 at 10:50, Marco Capuccini wrote:
Dear all,
Does Spark use data locality information from HDFS when running in standalone
mode? Or is running on YARN mandatory for that purpose? I can't find this
information in the docs, and on Google I am only finding contrasting opinions
on that.
Regards
Marco Capuccini
We are using Spark in two ways:
1. YARN with Spark support, with Kafka running along with the data nodes.
2. Spark master and workers running co-located with some of the Kafka brokers.
Data locality is important.
Regards
Diwakar
Sent from Samsung Mobile.
Fanoos,
Where do you want the solution to be deployed? On premise or in the cloud?
Regards
Diwakar
Original message From: "Yuval.Itzchakov"
Date: 07/02/2016 19:38 (GMT+05:30)
To: user@spark.apache.org
Subject: Re: Apache Spark data locality when integrating with Kafka
two clusters so you can,
again, benefit from low IO latency and high throughput.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-data-locality-when-integrating-with-Kafka-tp26165p26170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark can benefit from data locality and will try to launch tasks on the
node where the Kafka partition resides.
However, I think in production many organizations run a dedicated Kafka
cluster.
On Sat, Feb 6, 2016 at 11:27 PM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
Yes, to reduce network latency.
Original message From: fanooos
Date: 07/02/2016 09:24 (GMT+05:30)
To: user@spark.apache.org
Subject: Apache Spark data locality when integrating with Kafka
Dears
If I will use Kafka as a streaming source
…as far as I am aware, this is the same for different cluster managers.
Thanks
Saisai
On Thu, Jan 28, 2016 at 10:50 AM, Todd wrote:
Hi,
I am kind of confused about how data locality is honored when Spark is
running on YARN (client or cluster mode). Can someone please elaborate on
this? Thanks!
What are the parameters on which locality depends?
On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav wrote:
Hi,
I am working on Spark 1.4, reading an ORC table using a DataFrame and
converting that DF to an RDD.
In the Spark UI I observe that 50% of the tasks are running at locality
level ANY and very few at LOCAL.
What would be the possible reason for this?
Please help. I have even changed the locality settings.
Thanks
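For background, spark.locality.wait (and its per-level variants such as spark.locality.wait.rack) implement delay scheduling: the scheduler holds a task back at its best locality level and only relaxes to a worse level after the wait expires. A rough sketch of the idea in plain Python (the 3000 ms default matches Spark's documented spark.locality.wait; everything else is a simplification, not the real scheduler):

```python
# Locality levels from best to worst, as Spark reports them in the UI.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def allowed_level(ms_waited: int, wait_per_level_ms: int = 3000) -> str:
    """Return the worst locality level the scheduler will accept after
    having waited ms_waited without finding a slot at a better level.
    Each elapsed wait_per_level_ms relaxes the requirement by one level."""
    index = min(ms_waited // wait_per_level_ms, len(LEVELS) - 1)
    return LEVELS[index]

print(allowed_level(0))      # PROCESS_LOCAL
print(allowed_level(3500))   # NODE_LOCAL
print(allowed_level(99999))  # ANY
```

So a tiny wait like spark.locality.wait.rack=200 makes the scheduler give up on rack locality faster, not hold out for better placement; if tasks still show RACK_LOCAL, the executors most likely simply are not on the nodes holding the data.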
Hi Shane,
Tachyon provides an API to get the block locations of a file, which Spark
uses when scheduling tasks.
Hope this helps,
Calvin
On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane wrote:
Hi all,
I am looking into how Spark handles data locality with respect to Tachyon. My
main concern is how this is coordinated. Will it send a task based on a file
loaded from Tachyon to a node that it knows has that file locally, and how
does it know which nodes have what?
Kind regards,
Shane
…the spark.locality.wait default of 3 seconds.
-adrian
From: Cody Koeninger
Sent: Monday, September 21, 2015 10:19 PM
To: Ashish Soni
Cc: user
Subject: Re: Spark Streaming and Kafka MultiNode Setup - Data Locality
The direct stream already uses the Kafka leader for a given partition as
the preferred location.
I don't run Kafka on the same nodes as Spark, and I don't know anyone who
does, so that situation isn't particularly well tested.
On Mon, Sep 21, 2015 at 1:15 PM, Ashish Soni wrote:
Hi All,
Just wanted to find out if there are any benefits to installing Kafka
brokers and Spark nodes on the same machines.
Is it possible for Spark to pull data from Kafka locally, i.e. when the
broker or partition is on the same machine?
Thanks,
Ashish
Hi Sunil,
Have you seen this fix in Spark 1.5? It may address the locality issue:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil wrote:
Hello. I am seeing some unexpected issues with achieving HDFS data locality.
I expect the tasks to be executed only on the node which has the data, but
this is not happening (of course, unless the node is busy, in which case I
understand tasks can go to some other node). Could anyone help?
Hi Spark users and developers,
I have been trying to use spark-ec2. After I launched the Spark cluster
(1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job
where the data is stored in the ephemeral HDFS. No matter what I tried,
there is no data locality at all.
Hi guys,
I am running some SQL queries, and all my tasks are reported as either
NODE_LOCAL or PROCESS_LOCAL.
In the Hadoop world, the reduce tasks are RACK_LOCAL or NON_RACK_LOCAL
because they have to aggregate data from multiple hosts. In Spark, however,
even the aggregation stages are reported as local.
Hi,
We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of
Tuple2. At the end of the day, a daily job is launched, which works on the
outputs of the hourly jobs.
For data locality and speed, we wish that when the daily job launches, it
finds all instances of a given key at a single executor rather than fetching
them from others during the job.
Response inline.
On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun wrote:
(resending...)
I was thinking the same setup… But the more I think of this problem, the
more interesting it becomes.
If we allocate 50% of total memory to Tachyon statically, then the Mesos
benefits of dynamically scheduling resources go away altogether.
Can Tachyon be resource-managed by Mesos?
Hi,
I am fairly new to the Spark ecosystem and I have been trying to set up a
Spark-on-Mesos deployment. I can't seem to figure out the "best practices"
around HDFS and Tachyon. The documentation's data-locality section seems to
suggest that each of my Mesos slave nodes should also run an HDFS datanode.
This seems fine, but I can't seem to
Thanks.
bit1...@163.com
From: eric wong
Date: 2015-03-14 22:36
To: bit1...@163.com; user
Subject: Re: How does Spark honor data locality when allocating computing
resources for an application
You seem not to have noted the configuration variable "spreadOutApps" and
its accompanying comment.
It looks like the resource allocation policy here is that the Master will
assign as few workers as possible, so long as those few workers have enough
resources for the application.
My question is: Assume t
Hi, sparkers,
When I read the code for computing resource allocation for a newly
submitted application in the Master#schedule method, I got a question about
data locality:
// Pack each app into as few nodes as possible until we've assigned all its
cores
for (worker <- wo
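The two allocation policies being discussed can be sketched in plain Python (a simplification for illustration, not the actual Master.scala logic): with spreadOutApps enabled (the default) the Master spreads an app's cores round-robin across workers, and with it disabled it packs them onto as few workers as possible.

```python
def assign_cores(workers_free, cores_needed, spread_out=True):
    """Sketch of the standalone Master's two allocation policies.

    workers_free: list of free core counts per worker.
    spread_out=True  -> distribute cores round-robin across workers.
    spread_out=False -> pack cores onto as few workers as possible."""
    assigned = [0] * len(workers_free)
    remaining = cores_needed
    if spread_out:
        while remaining > 0:
            progress = False
            for i, free in enumerate(workers_free):
                if remaining == 0:
                    break
                if assigned[i] < free:
                    assigned[i] += 1
                    remaining -= 1
                    progress = True
            if not progress:
                break  # no free capacity left anywhere
        return assigned
    for i, free in enumerate(workers_free):
        take = min(free, remaining)
        assigned[i] = take
        remaining -= take
        if remaining == 0:
            break
    return assigned

print(assign_cores([4, 4, 4], 6, spread_out=True))   # [2, 2, 2]
print(assign_cores([4, 4, 4], 6, spread_out=False))  # [4, 2, 0]
```

Spreading tends to help data locality for HDFS reads (executors land on more nodes), while packing concentrates an app on fewer machines.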
Hi,
We wrote a Spark Streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that I'm not utilizing data locality,
because any worker might open any file without giving precedence to data
locality.
I can't
…with Mesos.
Looking at the logs again, it looks like the locality info between the
standalone and Mesos coarse-grained modes is very similar.
I must have been hallucinating earlier, thinking the data locality
information was somehow different.
So this whole thing might just simply be due to the fact
With regards to Mesos in fine-grained mode, do you have a feel for the
overhead of launching executors for every task? Of course, any perceived
slowdown will probably be very dependent on the workload. I just want to
have a feel for the possible overhead (e.g., a factor of 2 or 3 slowdown?).
If not a data locality issue, perhaps this overhead can be a factor in the
slowdown I observed, at
…a data locality issue might be the reason for the slowdown, but I can't
figure out why, especially for coarse-grained mode, as the executors
supposedly do not go away until job completion.
Any ideas?
Thanks,
Mike
You can also read about locality here in the docs:
http://spark.apache.org/docs/latest/tuning.html#data-locality
On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger wrote:
> No, not all RDDs have location information, and in any case tasks may be
> scheduled on non-local nodes if there i
…data is local, i.e. Node1 and Node2 (assuming Node1 and Node2 have enough
resources to execute the tasks)?
Gaurav
I am seeing skewed execution times. As far as I can tell, they are
attributable to differences in data locality: tasks with locality
PROCESS_LOCAL run fast; NODE_LOCAL, slower; and ANY, slowest.
This seems entirely as it should be. The question is: why the different
locality levels?
I am
Can anyone point me to a good primer on how Spark decides where to send
what task, how it distributes them, and how it determines data locality?
I'm trying a pretty simple task: it's doing a foreach over cached data,
accumulating some (relatively complex) values.
So I s
Thanks for your reply!
qinwei
From: Shao, Saisai
Sent: 2014-09-28 14:42
To: qinwei
Cc: user
Subject: RE: problem with data locality api
Hi,
The first conf is used for Hadoop to determine the locality distribution of
the HDFS file. The second conf is used for Spark. Though with the same name,
they are actually two different configurations.
Subject: problem with data locality api
Hi, everyone
I came across a problem with data locality. I found this example code in
《Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf》:
val locData = InputFormatInfo.computePreferredLocations(Seq(new
InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
Just an update on this: I have benchmarked on a cluster built with
spark-ec2 and again found that reading from HDFS is not much faster than
from S3 (about 20%).
Does anyone know how to check that data locality is being used by Spark on
my cluster?
Is it surprising that access to HDFS on local
Hi all,
I'm consistently finding that reading from HDFS is not appreciably faster
than reading from S3 using PySpark. How can I tell whether data locality is
being respected?
In this example, reading from HDFS is only about 10% faster than reading
the same file from S3. The files were pulled
Hi,
In standalone mode, how can we check that data locality is working as
expected when tasks are assigned?
Thanks!
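One concrete way to check, besides eyeballing the "Locality Level" column in the Spark UI, is to read the application's event log (written when spark.eventLog.enabled is set) and tally task locality levels. A sketch, assuming the SparkListenerTaskEnd JSON layout with a "Task Info" → "Locality" field that Spark's event logging writes (the two-line log below is synthetic):

```python
import json
from collections import Counter

def locality_counts(event_log_lines):
    """Count task locality levels from a Spark event log, which stores one
    JSON object per line. Only SparkListenerTaskEnd events carry task info."""
    counts = Counter()
    for line in event_log_lines:
        event = json.loads(line)
        if event.get("Event") == "SparkListenerTaskEnd":
            counts[event["Task Info"]["Locality"]] += 1
    return counts

# Synthetic two-line log for illustration; a real log has many more fields.
log = [
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Locality": "NODE_LOCAL"}}',
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Locality": "ANY"}}',
]
print(locality_counts(log))  # tallies of NODE_LOCAL and ANY tasks
```

A job reading local HDFS data should show mostly NODE_LOCAL or PROCESS_LOCAL tasks; a large ANY share suggests locality is not being achieved.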
On 23 Jul, 2014, at 12:49 am, Sandy Ryza wrote:
> On standalone there is still special handling for assigning tasks within
> executors. There just isn't special h
Please correct me if I'm wrong. Thank you for your patience!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: July 22, 2014 9:47
To: user@spark.apache.org
Subject: Re: data locality
This currently only works for YARN. The standalone default is to place an
executor on every node.
…hosts, how does Spark decide how many total executors to use for this
application?
Thanks again!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: Friday, July 18, 2014 3:44 PM
To: user@spark.apache.org
Subject: Re: data locality
Hi Haopu,
Spark will ask HDFS for file block locations and
…before it has any information about where the input data for the jobs is
located. If the executors occupy significantly fewer nodes than exist in the
cluster, it can be difficult for Spark to achieve data locality. The
workaround for this is an API that allows passing in a set of preferred
locations.
I have a standalone Spark cluster and an HDFS cluster which share some of
the nodes.
When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
HDFS for the location of each file block in order to pick the right worker
node?
How about a Spark cluster on YARN?
Thank you very much!
…fast, and some optimization like this is definitely done, I assume?
I suppose the only benefit with HDFS would be better fault tolerance, and
the ability to checkpoint and recover even if the master fails.
Cheers,
Nilesh
…set up HDFS on the same cluster as Spark, write the data from the actors to
HDFS, and then use HDFS as the input source for Spark Streaming. Does this
result in better performance due to data locality (with HDFS data
replication turned on)? I think performance should be almost the same with
actors,