Kinesis? What I would ultimately like to achieve is that the
flatMapGroupsWithState() that I call later in the pipeline uses the same
(partition) key internally for key lookups in the (RocksDB) state store, so
that data locality can be achieved.
Is this redundant, implicit, or not possible?
> Which IP or hostname of the data-nodes is returned
> from the name-node to Spark? Or can you offer me a debug approach?
>
>> On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer
>> <russell.spit...@gmail.com> wrote:
>>
Data locality can only occur if the Spark executor's IP address string matches
the preferred location returned by the file system. So this job would only have
local tasks if the datanode replicas for the files in question had the same IP
address as the Spark executors you are using. If they don't, you won't get
node-local tasks.
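In other words, the locality decision boils down to a host-string comparison. A minimal sketch of that idea (illustrative Python, not Spark's actual scheduler code; the IP addresses below are made up):

```python
# Illustrative sketch (not Spark's code): locality reduces to a string
# comparison between an executor's host and the preferred locations
# (datanode hosts) the file system reports for a block.
def locality_level(executor_host, preferred_hosts):
    """Classify a task the way the scheduler would, by exact host match."""
    return "NODE_LOCAL" if executor_host in preferred_hosts else "ANY"

# Hypothetical hosts: the block has replicas on 10.0.0.1 and 10.0.0.2.
preferred = ["10.0.0.1", "10.0.0.2"]
print(locality_level("10.0.0.1", preferred))  # NODE_LOCAL
print(locality_level("10.0.0.9", preferred))  # ANY
```

This is also a useful debug approach: compare the executor hosts shown in the Spark UI against the hosts HDFS reports for your blocks (e.g. via `hdfs fsck <file> -files -blocks -locations`) and check whether the strings can ever match.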
<https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache>
Hi,
I am using Spark 2.3.2 and I am facing issues due to data locality: even after
setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL.
Can someone help me with this?
Thank you
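For background, the spark.locality.wait.* settings drive delay scheduling: the scheduler waits a configured time at each locality level before falling back to a less local one. A toy model of that ladder (my own sketch, not Spark's implementation; the 3000 ms value stands in for the spark.locality.wait default):

```python
# Toy model of delay scheduling (illustrative, not Spark's code): after
# waiting long enough at one locality level, the scheduler is allowed to
# fall back to the next, less-local level.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def allowed_levels(waited_ms, wait_per_level_ms=3000):
    """Return the locality levels permitted after waiting waited_ms.

    wait_per_level_ms stands in for spark.locality.wait; Spark also
    supports per-level overrides like spark.locality.wait.rack.
    """
    idx = min(int(waited_ms // wait_per_level_ms), len(LEVELS) - 1)
    return LEVELS[: idx + 1]

print(allowed_levels(0))     # ['PROCESS_LOCAL']
print(allowed_levels(7000))  # ['PROCESS_LOCAL', 'NODE_LOCAL', 'RACK_LOCAL']
```

Note the wait only controls how long Spark holds out for a *better* level. If no executor ever shares a host with a data replica, no wait setting can produce anything better than RACK_LOCAL, which matches the symptom described above.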
Hi Sun Rui,
thanks for answering!
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it.
This explains why the script that I configured in core-site.xml
(topology.script.file.name) is not called by Spark.
Although the Spark task scheduler is aware of rack-level data locality, it
seems that only YARN implements the support for it. However, node-level
locality can still work for Standalone.
It is not necessary to copy the Hadoop config files into the Spark conf
directory. Set HADOOP_CONF_DIR to point at them instead.
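For reference, the rack script configured via topology.script.file.name is just an executable that Hadoop invokes with one or more IPs or hostnames and that must print one rack path per line. A minimal sketch (the subnet-to-rack mapping here is invented for illustration):

```python
#!/usr/bin/env python3
# Minimal sketch of a Hadoop rack topology script. Hadoop passes IPs or
# hostnames as arguments and expects one rack path per line on stdout.
# The subnet-to-rack mapping below is made up for illustration.
import sys

RACKS = {"10.0.1.": "/rack1", "10.0.2.": "/rack2"}

def rack_for(host):
    for prefix, rack in RACKS.items():
        if host.startswith(prefix):
            return rack
    return "/default-rack"  # Hadoop's conventional fallback rack

if __name__ == "__main__":
    for host in sys.argv[1:]:
        print(rack_for(host))
```

The script must be executable and resolve every address it is given; hosts it cannot map should fall into a default rack rather than producing no output.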
Hi,
I am running a couple of Docker hosts, each with an HDFS datanode and a Spark
worker, in a Spark standalone cluster.
In order to get data-locality awareness, I would like to configure racks
for each host, so that a Spark worker container knows from which HDFS
datanode container it should load its data.
Does Spark use data locality information from HDFS when running in
> standalone mode? Or is running on YARN mandatory for that purpose? I
> can't find this information in the docs, and on Google I am only finding
> contrasting opinions on that.
>
> Regards
> Marco Capuccini
>
Spark will know about the datanodes from
>> $HADOOP_HOME/etc/hadoop/slaves
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>> LinkedIn https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>
>> http://talebzadehmich.wordpress.com
>
> On 5 June 2016 at 10:50, Marco Capuccini <marco.capucc...@farmbio.uu.se>
> wrote:
Dear all,
Does Spark use data locality information from HDFS when running in standalone
mode? Or is running on YARN mandatory for that purpose? I can't find this
information in the docs, and on Google I am only finding contrasting opinions
on that.
Regards
Marco Capuccini
… benefit from low IO latency and high throughput.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-data-locality-when-integrating-with-Kafka-tp26165p26170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
Subject: Re: Apache Spark data locality when integrating with Kafka
I would definitely try to avoid hosting Kafka and Spark on the same
servers.
Kafka and Spark will be doing a lot of IO between them, so you'll want to
make the most of those resources and not share them on the same server.
>
> Sent from Samsung Mobile.
>
> Original message
> From: "Yuval.Itzchakov" <yuva...@gmail.com>
> Date: 07/02/2016 19:38 (GMT+05:30)
> To: user@spark.apache.org
> Cc:
> Subject: Re: Apache Spark data locality when integrating with Kafka
We are using Spark in two ways:
1. YARN with Spark support, with Kafka running along with the data nodes.
2. Spark master and workers running with some of the Kafka brokers.
Data locality is important.
Regards
Diwakar
Sent from Samsung Mobile.
Original message From: أنس
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-data-locality-when-integrating-with-Kafka-tp26165.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Yes. To reduce network latency.
Sent from Samsung Mobile.
Original message
From: fanooos <dev.fano...@gmail.com>
Date: 07/02/2016 09:24 (GMT+05:30)
To: user@spark.apache.org
Subject: Apache Spark data locality when integrating with Kafka
Dears,
Spark can benefit from data locality and will try to launch tasks on the
node where the Kafka partition resides.
However, I think in production many organizations run a dedicated Kafka
cluster.
On Sat, Feb 6, 2016 at 11:27 PM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
… this is the same for different cluster
managers.
Thanks
Saisai
On Thu, Jan 28, 2016 at 10:50 AM, Todd <bit1...@163.com> wrote:
Hi,
I am kind of confused about how data locality is honored when Spark is
running on YARN (client or cluster mode). Can someone please elaborate on
this? Thanks!
Hi,
I am working on Spark 1.4, reading an ORC table using a DataFrame and
converting that DF to an RDD.
In the Spark UI I observe that 50% of tasks are running at locality ANY and
very few at LOCAL levels.
What would be the possible reason for this?
Please help. I have even changed the locality settings.
Thanks
What are the parameters on which locality depends?
On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav wrote:
Hi Shane,
Tachyon provides an API to get the block locations of a file, which Spark
uses when scheduling tasks.
Hope this helps,
Calvin
On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane <shane.kinse...@aspect.com>
wrote:
Hi all,
I am looking into how Spark handles data locality wrt Tachyon. My main concern
is how this is coordinated. Will it send a task based on a file loaded from
Tachyon to a node that it knows has that file locally, and how does it know
which nodes have what?
Kind regards,
Shane
Hi all,
Just wanted to find out if there are benefits to installing Kafka
brokers and Spark nodes on the same machine.
Is it possible for Spark to pull data from Kafka if it is local to the
node, i.e. the broker or partition is on the same machine?
Thanks,
Ashish
seconds.
-adrian
From: Cody Koeninger <c...@koeninger.org>
Sent: Monday, September 21, 2015 10:19 PM
To: Ashish Soni
Cc: user
Subject: Re: Spark Streaming and Kafka MultiNode Setup - Data Locality
The direct stream already uses the Kafka leader for a given partition as
the preferred location.
I don't run Kafka on the same nodes as Spark, and I don't know anyone who
does, so that situation isn't particularly well tested.
On Mon, Sep 21, 2015 at 1:15 PM, Ashish Soni
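To make the preferred-location idea concrete, here is a small illustrative sketch (plain Python with hypothetical hosts, not the actual Spark/Kafka integration code): a partition's task can only be local if some executor actually runs on that partition's leader host.

```python
# Illustrative sketch (hypothetical hosts, not the Spark/Kafka API): the
# direct stream reports each partition's Kafka leader host as the task's
# preferred location; the task is only local if an executor lives there.
def preferred_locations(partition_leaders, executor_hosts):
    """Map each topic-partition to its leader host, or None if no
    executor runs on that host (so the task cannot be local)."""
    return {tp: (leader if leader in executor_hosts else None)
            for tp, leader in partition_leaders.items()}

leaders = {("logs", 0): "kafka-1", ("logs", 1): "kafka-2"}
# Executors co-located with kafka-1 only:
print(preferred_locations(leaders, {"kafka-1", "spark-9"}))
```

This also shows why the benefit only materializes when brokers and executors share hosts: with a dedicated Kafka cluster, every entry maps to None and all tasks read over the network.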
Hi Sunil,
Have you seen this fix in Spark 1.5 that may fix the locality issue?:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil <sdhe...@gmail.com> wrote:
Hello. I am seeing some unexpected issues with achieving HDFS data
locality. I expect the tasks to be executed only on the node which has the
data, but this is not happening (of course, unless the node is busy, in which
case I understand tasks can go to some other node). Could anyone help?
Hi Spark users and developers,
I have been trying to use spark-ec2. After I launched the Spark cluster
(1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job
where the data is stored in the ephemeral HDFS. No matter what I tried,
there was no data locality at all.
Hi guys,
I am running some SQL queries, but all my tasks are reported as either
NODE_LOCAL or PROCESS_LOCAL.
In the Hadoop world, the reduce tasks are RACK or NON_RACK LOCAL because
they have to aggregate data from multiple hosts. However, in Spark even the
aggregation stages are reported as NODE_LOCAL or PROCESS_LOCAL.
Response inline.
On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun <sean.bigdata...@gmail.com>
wrote:
Hi,
We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of
Tuple2. At the end of the day, a daily job is launched, which works on the
outputs of the hourly jobs.
For data locality and speed, we wish that when the daily job launches, it
finds all instances of a given key at a single executor rather than fetching
them from other executors during the shuffle. Is it possible?
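The general idea behind getting all instances of a key to one place is co-partitioning: have both jobs use the same deterministic partitioner and the same partition count, so a given key always maps to the same partition index. A sketch of just that mapping (illustrative Python, not the Spark API; in Spark you would use a HashPartitioner with the same numPartitions in both jobs):

```python
# Sketch of the idea behind co-partitioning (illustrative, not Spark
# code): if the hourly job and the daily job use the same deterministic
# partitioner, a given key always lands in the same partition index, so
# the daily job can find a key's data without an extra shuffle.
import zlib

NUM_PARTITIONS = 8

def partition_for(key, n=NUM_PARTITIONS):
    # crc32 is stable across processes and runs, unlike Python's
    # salted built-in hash().
    return zlib.crc32(key.encode()) % n

hourly = partition_for("user-42")
daily = partition_for("user-42")
assert hourly == daily  # same key, same partition, across jobs
```

Whether the daily job can actually skip the shuffle also depends on how the hourly output is stored and reloaded; the stable key-to-partition mapping is the necessary first ingredient.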
(resending...)
I was thinking the same setup… But the more I think of this problem,
the more interesting it becomes.
If we allocate 50% of total memory to Tachyon statically, then the Mesos
benefits of dynamically scheduling resources go away altogether.
Can Tachyon be resource-managed by Mesos?
Hi,
I am fairly new to the Spark ecosystem and I have been trying to set up
a Spark-on-Mesos deployment. I can't seem to figure out the best
practices around HDFS and Tachyon. The documentation's data-locality
section seems to suggest that each of my Mesos slave nodes
should also run an HDFS datanode. This seems fine, but I can't seem to
figure out how I would go about pushing
Hi, sparkers,
When I read the code about computing resource allocation for a newly
submitted application in the Master#schedule method, I got a question about
data locality:

// Pack each app into as few nodes as possible until we've assigned all its cores
for (worker <- workers if worker.coresFree > 0 && worker.state == WorkerState.ALIVE) {
  for (app <- waitingApps if app.coresLeft > 0) {
    if (canUse(app, worker)) {
      val coresToUse = math.min(worker.coresFree, app.coresLeft)
      ...
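For contrast, standalone mode also has a spread-out strategy (controlled by spark.deploy.spreadOut) that round-robins cores across workers instead of packing them onto as few as possible. A toy simulation of the two strategies (my own sketch, not Spark's code):

```python
# Toy simulation of the two standalone allocation strategies
# (illustrative, not Spark's code): "pack" fills each worker before
# moving on, while "spread" round-robins one core at a time.
def allocate(free_cores, cores_needed, spread):
    assign = [0] * len(free_cores)
    if spread:
        i = 0
        while cores_needed > 0 and any(f - a > 0 for f, a in zip(free_cores, assign)):
            if free_cores[i] - assign[i] > 0:
                assign[i] += 1
                cores_needed -= 1
            i = (i + 1) % len(free_cores)
    else:
        for i, free in enumerate(free_cores):
            take = min(free, cores_needed)
            assign[i] = take
            cores_needed -= take
    return assign

print(allocate([4, 4, 4], 6, spread=False))  # [4, 2, 0]: packed
print(allocate([4, 4, 4], 6, spread=True))   # [2, 2, 2]: spread out
```

Spreading gives the application executors on more nodes, which raises the chance that some executor sits on a node holding a given HDFS block; packing concentrates the cores and can hurt locality for the same reason.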
Hi,
We wrote a Spark Streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that I'm not utilizing data locality,
because any worker might open any file without giving precedence to data
locality.
I can't open the files
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Data-Locality-tp21000p21413.html
--Harihar
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Data-Locality-tp21000p21410.html
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/data-locality-in-logs-tp1276p21416.html
as with Mesos.
Looking at the logs again, it looks like the locality info between the
standalone and Mesos coarse-grained modes is very similar.
I must have been hallucinating earlier, thinking the data locality
information was somehow different.
So this whole thing might just simply be due to the fact
executors for every task? Of course, any perceived slowdown will
probably be very dependent on the workload. I just want to get a feel for
the possible overhead (e.g., a factor of 2 or 3 slowdown?).
If not a data locality issue, perhaps this overhead can be a factor in the
slowdown I observed, at least
, especially for coarse-grained mode, as the executors
supposedly do not go away until job completion.
Any ideas?
Thanks,
Mike
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Data-locality-running-Spark-on-Mesos-tp21041.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
You can also read about locality here in the docs:
http://spark.apache.org/docs/latest/tuning.html#data-locality
On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger <c...@koeninger.org> wrote:
No, not all RDDs have location information, and in any case tasks may be
scheduled on non-local nodes.
the data
is local, i.e. Node1 and Node2 (assuming Node1 and Node2 have enough resources
to execute the tasks)?
Gaurav
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Data-Locality-tp21000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com
I am seeing skewed execution times. As far as I can tell, they are
attributable to differences in data locality - tasks with locality
PROCESS_LOCAL run fast, NODE_LOCAL slower, and ANY slowest.
This seems entirely as it should be - the question is, why the different
locality levels?
I am seeing skewed caching, as I
Nathan Kronenfeld <nkronenf...@oculusinfo.com> wrote:
Can anyone point me to a good primer on how Spark decides where to send
what task, how it distributes them, and how it determines data locality?
I'm trying a pretty simple task - it's doing a foreach over cached data,
accumulating some (relatively complex) values.
So I see several inconsistencies I don't
Subject: problem with data locality api
Hi, everyone
I have come across a problem with data locality. I found this example code
in "Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf":
val locData = InputFormatInfo.computePreferredLocations(Seq(new
  InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
Thanks for your reply!
qinwei
From: Shao, Saisai
Sent: 2014-09-28 14:42
To: qinwei
Cc: user
Subject: RE: problem with data locality api
Hi,
The first conf is used by Hadoop to determine the locality distribution of
the HDFS file. The second conf is used by Spark; though they have the same
name, they are actually two different configurations.
… 2014 at 4:13 AM, Tsai Li Ming <mailingl...@ltsai.com> wrote:
Hi,
In standalone mode, how can we check that data locality is working as
expected when tasks are assigned?
Thanks!
On 23 Jul, 2014, at 12:49 am, Sandy Ryza <sandy.r...@cloudera.com> wrote:
On standalone there is still special
Thank you for your patience!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: 2014-07-22 9:47
To: user@spark.apache.org
Subject: Re: data locality
This currently only works for YARN. The standalone default is to place an
executor on every node for every application.
I have a standalone Spark cluster and an HDFS cluster which share some nodes.
When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
HDFS for the location of each file block in order to pick the right worker
node?
How about a Spark cluster on YARN?
Thank you very much!
any information about
where the input data for the jobs is located. If the executors occupy
significantly fewer nodes than exist in the cluster, it can be difficult
for Spark to achieve data locality. The workaround for this is an API that
allows passing in a set of preferred locations when
executors to use for this application?
Thanks again!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: Friday, July 18, 2014 3:44 PM
To: user@spark.apache.org
Subject: Re: data locality
Hi Haopu,
Spark will ask HDFS for file block locations.
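The mechanics behind that are roughly: fetch the replica hosts for each input block, then prefer scheduling work on the hosts that hold the most blocks. A small illustrative sketch (plain Python with hypothetical hosts, not the actual Spark/HDFS code path):

```python
# Illustrative sketch of locality-aware placement (hypothetical hosts,
# not the Spark/HDFS API): tally how many input blocks each host holds
# and prefer the most data-heavy hosts.
from collections import Counter

def executor_hosts(block_locations, num_executors):
    """block_locations: one list of replica hosts per block."""
    tally = Counter(h for replicas in block_locations for h in replicas)
    return [host for host, _ in tally.most_common(num_executors)]

blocks = [["a", "b"], ["a", "c"], ["a", "b"]]  # hypothetical replica sets
print(executor_hosts(blocks, 2))  # ['a', 'b']: hosts with the most blocks
```

This is also the intuition behind the preferred-locations workaround mentioned earlier in the thread: if executors are placed without regard to where the blocks live, locality can only happen by accident.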
HDFS on the same cluster as Spark, write
the data from the Actors to HDFS, and then use HDFS as the input source for
Spark Streaming. Does this result in better performance due to data locality
(with HDFS data replication turned on)? I think performance should be almost
the same as with actors, since
fault tolerance, and
the ability to checkpoint and recover even if the master fails.
Cheers,
Nilesh
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Performance-of-Akka-or-TCP-Socket-input-sources-vs-HDFS-Data-locality-in-Spark-Streaming-tp7317.html
Sent