Kinesis? What I would ultimately like to achieve is that the
flatMapGroupsWithState() call later in the pipeline should see the same
(partition) key internally for key lookups in the (RocksDB) state store, so
that data locality can be achieved.
Is this redundant, implicit, or not possible?
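The idea can be pictured with a small sketch in plain Python rather than Spark (the partitioner, key name, and partition count below are illustrative assumptions, not Spark internals): flatMapGroupsWithState shuffles rows by the grouping key, so as long as the upstream system and the state store hash the same key through the same partitioner, a given key's state always lives in one partition.

```python
def hash_key(key: str) -> int:
    # Stable string hash (Python's built-in hash() is salted per process,
    # so we roll a simple deterministic one for the sketch).
    h = 0
    for ch in key:
        h = (h * 31 + ord(ch)) & 0x7FFFFFFF
    return h

def partition_for(key: str, num_partitions: int) -> int:
    """Deterministic partitioner: the same key always maps to the same partition."""
    return hash_key(key) % num_partitions

# If the shuffle before the stateful operator and the state store both
# partition by the same key, state lookups for that key stay on one partition.
upstream_partition = partition_for("device-42", 8)
state_partition = partition_for("device-42", 8)
assert upstream_partition == state_partition
```

Whether the source (e.g. Kinesis shards) lines up with those partitions is a separate question; Spark will still shuffle by the grouping key regardless.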
Is there a way to figure out which IP or hostname of the data nodes is
returned from the name node to Spark? Or can you offer me a debugging approach?
On Farvardin 24, 1400 AP, at 17:45, Russell Spitzer <russell.spit...@gmail.com> wrote:
Data locality can only occur if the Spark executor's IP address string matches
the preferred location returned by the file system. So this job would only have
local tasks if the datanode replicas for the files in question had the same IP
address as the Spark executors you are using. If they don't match, the tasks
fall back to non-local scheduling.
https://stackoverflow.com/questions/66612906/problem-with-data-locality-when-running-spark-query-with-local-nature-on-apache
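The string matching described above can be sketched in plain Python (this is an illustration of the idea, not Spark's actual scheduler code; the host strings are made up): node-level locality is decided by comparing the executor's registered host against the block's preferred locations.

```python
def locality_for(preferred_hosts: list, executor_host: str) -> str:
    """Classify task locality at node level the way the thread describes:
    an exact string match between the executor's host and one of the block's
    preferred locations yields NODE_LOCAL; otherwise the task is ANY."""
    return "NODE_LOCAL" if executor_host in preferred_hosts else "ANY"

# Datanode replicas report 10.0.0.5/10.0.0.6; an executor registered under a
# Docker bridge IP (172.17.0.2) never matches, so its tasks are ANY.
print(locality_for(["10.0.0.5", "10.0.0.6"], "10.0.0.5"))    # NODE_LOCAL
print(locality_for(["10.0.0.5", "10.0.0.6"], "172.17.0.2"))  # ANY
```

This is why container or NAT setups often lose locality: the comparison is on the literal host string, so host networking (or matching hostnames) is the usual fix.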
Hi,
I am using Spark 2.3.2 and facing issues with data locality: even after
setting spark.locality.wait.rack=200, the locality level is always RACK_LOCAL.
Can someone help me with this?
Thank you
It might have to do with your container IPs; it depends on your networking
setup. You might want to try host networking so that the containers share the
IP with the host.
On Wed, Dec 28, 2016 at 1:46 AM, Karamba wrote:
Hi Sun Rui,
thanks for answering!
> Although the Spark task scheduler is aware of rack-level data locality, it
> seems that only YARN implements the support for it.
This explains why the script that I configured in core-site.xml's
topology.script.file.name is not called by the Spark container. But at the
time of reading from HDFS in a Spark program, the script is called in my HDFS
namenode container.
Although the Spark task scheduler is aware of rack-level data locality, it
seems that only YARN implements the support for it. However, node-level
locality can still work for Standalone.
It is not necessary to copy the Hadoop config files into the Spark conf
directory. Set HADOOP_CONF_DIR to point at them instead.
Hi,
I am running a couple of Docker hosts, each with an HDFS node and a Spark
worker in a Spark standalone cluster.
In order to get data locality awareness, I would like to configure racks
for each host, so that a Spark worker container knows from which HDFS
node container it should load its data.
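For reference, a Hadoop rack topology script can be any executable that takes node IPs or hostnames as arguments and prints one rack path per argument; it is wired up via the topology.script.file.name property in core-site.xml. Below is a minimal sketch in Python; the IPs and rack names are made-up examples.

```python
#!/usr/bin/env python3
"""Minimal Hadoop topology script sketch.

Hadoop invokes this with one or more datanode IPs/hostnames as arguments
and expects one rack path per argument on stdout, whitespace-separated.
Unknown nodes conventionally map to /default-rack."""
import sys

# Made-up example mapping; in practice this is generated from your inventory.
RACKS = {
    "10.0.1.11": "/rack-a",
    "10.0.1.12": "/rack-a",
    "10.0.2.21": "/rack-b",
}

def rack_of(node: str) -> str:
    return RACKS.get(node, "/default-rack")

if __name__ == "__main__":
    print(" ".join(rack_of(arg) for arg in sys.argv[1:]))
```

Note that, per the reply above, rack-level scheduling in Spark itself appears to be YARN-only; in standalone mode this script mainly influences HDFS replica placement.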
On 5 June 2016 at 10:50, Marco Capuccini wrote:
Dear all,
Does Spark use data locality information from HDFS when running in standalone
mode? Or is running on YARN mandatory for that purpose? I can't find this
information in the docs, and on Google I am only finding contrasting opinions
on that.
Regards
Marco Capuccini
We are using Spark in two ways:
1. YARN with Spark support, with Kafka running along with the data nodes.
2. Spark master and workers running co-located with some of the Kafka brokers.
Data locality is important.
Regards
Diwakar
Sent from Samsung Mobile.
Fanoos,
Where do you want the solution to be deployed? On premise or in the cloud?
Regards
Diwakar
Original message From: "Yuval.Itzchakov"
Date: 07/02/2016 19:38 (GMT+05:30)
To: user@spark.apache.org
Subject: Re: Apache Spark data locality when integrating with Kafka
two clusters so you can,
again, benefit from low IO latency and high throughput.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-data-locality-when-integrating-with-Kafka-tp26165p26170.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Spark can benefit from data locality and will try to launch tasks on the
node where the Kafka partition resides.
However, I think in production many organizations run a dedicated Kafka
cluster.
On Sat, Feb 6, 2016 at 11:27 PM, Diwakar Dhanuskodi <
diwakar.dhanusk...@gmail.com> wrote:
Yes, to reduce network latency.
Original message From: fanooos
Date: 07/02/2016 09:24 (GMT+05:30)
To: user@spark.apache.org
Subject: Apache Spark data locality when integrating with Kafka
Dears
If I will use Kafka as a streaming source
…as far as I am aware, this is the same for different cluster managers.
Thanks
Saisai
On Thu, Jan 28, 2016 at 10:50 AM, Todd wrote:
Hi,
I am kind of confused about how data locality is honored when Spark is
running on YARN (client or cluster mode). Can someone please elaborate on
this? Thanks!
What are the parameters on which locality depends?
On Sun, Nov 15, 2015 at 5:54 PM, Renu Yadav wrote:
Hi,
I am working on Spark 1.4, reading an ORC table using a DataFrame and
converting that DF to an RDD.
In the Spark UI I observe that 50% of the tasks are running at locality
level ANY and very few at LOCAL.
What would be the possible reason for this?
Please help. I have even changed the locality settings.
Thanks
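For background, spark.locality.wait (and its per-level variants such as spark.locality.wait.rack) implement delay scheduling: the scheduler holds a task back at its best locality level and only relaxes to a worse level after the wait expires. A rough sketch of the idea in plain Python (the 3000 ms default matches Spark's documented spark.locality.wait; everything else is a simplification, not the real scheduler):

```python
# Locality levels from best to worst, as Spark reports them in the UI.
LEVELS = ["PROCESS_LOCAL", "NODE_LOCAL", "RACK_LOCAL", "ANY"]

def allowed_level(ms_waited: int, wait_per_level_ms: int = 3000) -> str:
    """Return the worst locality level the scheduler will accept after
    having waited ms_waited without finding a slot at a better level.
    Each elapsed wait_per_level_ms relaxes the requirement by one level."""
    index = min(ms_waited // wait_per_level_ms, len(LEVELS) - 1)
    return LEVELS[index]

print(allowed_level(0))      # PROCESS_LOCAL
print(allowed_level(3500))   # NODE_LOCAL
print(allowed_level(99999))  # ANY
```

So a tiny wait like spark.locality.wait.rack=200 makes the scheduler give up on rack locality faster, not hold out for better placement; if tasks still show RACK_LOCAL, the executors most likely simply are not on the nodes holding the data.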
Hi Shane,
Tachyon provides an API to get the block locations of a file, which Spark
uses when scheduling tasks.
Hope this helps,
Calvin
On Fri, Oct 23, 2015 at 8:15 AM, Kinsella, Shane wrote:
Hi all,
I am looking into how Spark handles data locality with respect to Tachyon. My
main concern is how this is coordinated. Will it send a task based on a file
loaded from Tachyon to a node that it knows has that file locally, and how
does it know which nodes have what?
Kind regards,
Shane
…the spark.locality.wait default of 3 seconds.
-adrian
From: Cody Koeninger
Sent: Monday, September 21, 2015 10:19 PM
To: Ashish Soni
Cc: user
Subject: Re: Spark Streaming and Kafka MultiNode Setup - Data Locality
The direct stream already uses the Kafka leader for a given partition as
the preferred location.
I don't run Kafka on the same nodes as Spark, and I don't know anyone who
does, so that situation isn't particularly well tested.
On Mon, Sep 21, 2015 at 1:15 PM, Ashish Soni wrote:
Hi All,
Just wanted to find out if there are any benefits to installing Kafka
brokers and Spark nodes on the same machines.
Is it possible for Spark to pull data from Kafka locally, i.e. when the
broker or partition is on the same machine?
Thanks,
Ashish
Hi Sunil,
Have you seen this fix in Spark 1.5? It may address the locality issue:
https://issues.apache.org/jira/browse/SPARK-4352
On Thu, Aug 20, 2015 at 4:09 AM, Sunil wrote:
Hello. I am seeing some unexpected issues with achieving HDFS data locality.
I expect the tasks to be executed only on the node which has the data, but
this is not happening (of course, unless the node is busy, in which case I
understand tasks can go to some other node). Could anyone help?
Hi Spark users and developers,
I have been trying to use spark-ec2. After I launched the Spark cluster
(1.4.1) with ephemeral HDFS (using Hadoop 2.4.0), I tried to execute a job
where the data is stored in the ephemeral HDFS. No matter what I tried,
there is no data locality at all.
Hi guys,
I am running some SQL queries, and all my tasks are reported as either
NODE_LOCAL or PROCESS_LOCAL.
In the Hadoop world, the reduce tasks are RACK_LOCAL or NON_RACK_LOCAL
because they have to aggregate data from multiple hosts. In Spark, however,
even the aggregation stages are reported as local.
Hi,
We are running an hourly job using Spark 1.2 on YARN. It saves an RDD of
Tuple2. At the end of the day, a daily job is launched, which works on the
outputs of the hourly jobs.
For data locality and speed, we wish that when the daily job launches, it
finds all instances of a given key at a single executor rather than fetching
them from others during the job.
Response inline.
On Tue, Mar 31, 2015 at 10:41 PM, Sean Bigdatafun wrote:
(resending...)
I was thinking the same setup… But the more I think of this problem, the
more interesting it becomes.
If we allocate 50% of total memory to Tachyon statically, then the Mesos
benefits of dynamically scheduling resources go away altogether.
Can Tachyon be resource-managed by Mesos?
Hi,
I am fairly new to the Spark ecosystem and I have been trying to set up a
Spark-on-Mesos deployment. I can't seem to figure out the "best practices"
around HDFS and Tachyon. The documentation's data-locality section seems to
suggest that each of my Mesos slave nodes should also run an HDFS datanode.
This seems fine, but I can't seem to
Thanks.
bit1...@163.com
From: eric wong
Date: 2015-03-14 22:36
To: bit1...@163.com; user
Subject: Re: How does Spark honor data locality when allocating computing
resources for an application
You seem not to have noted the configuration variable "spreadOutApps" and
its accompanying comment.
It looks like the resource allocation policy here is that the Master will
assign as few workers as possible, so long as those few workers have enough
resources for the application.
My question is: Assume t
Hi, sparkers,
When I read the code for computing resource allocation for a newly
submitted application in the Master#schedule method, I got a question about
data locality:
// Pack each app into as few nodes as possible until we've assigned all its
cores
for (worker <- wo
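The two allocation policies being discussed can be sketched in plain Python (a simplification for illustration, not the actual Master.scala logic): with spreadOutApps enabled (the default) the Master spreads an app's cores round-robin across workers, and with it disabled it packs them onto as few workers as possible.

```python
def assign_cores(workers_free, cores_needed, spread_out=True):
    """Sketch of the standalone Master's two allocation policies.

    workers_free: list of free core counts per worker.
    spread_out=True  -> distribute cores round-robin across workers.
    spread_out=False -> pack cores onto as few workers as possible."""
    assigned = [0] * len(workers_free)
    remaining = cores_needed
    if spread_out:
        while remaining > 0:
            progress = False
            for i, free in enumerate(workers_free):
                if remaining == 0:
                    break
                if assigned[i] < free:
                    assigned[i] += 1
                    remaining -= 1
                    progress = True
            if not progress:
                break  # no free capacity left anywhere
        return assigned
    for i, free in enumerate(workers_free):
        take = min(free, remaining)
        assigned[i] = take
        remaining -= take
        if remaining == 0:
            break
    return assigned

print(assign_cores([4, 4, 4], 6, spread_out=True))   # [2, 2, 2]
print(assign_cores([4, 4, 4], 6, spread_out=False))  # [4, 2, 0]
```

Spreading tends to help data locality for HDFS reads (executors land on more nodes), while packing concentrates an app on fewer machines.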
Hi,
We wrote a Spark Streaming app that receives file names on HDFS from Kafka
and opens them using Hadoop's libraries.
The problem with this method is that I'm not utilizing data locality,
because any worker might open any file without giving precedence to data
locality.
I can't
…with Mesos.
Looking at the logs again, it looks like the locality info between the
standalone and Mesos coarse-grained modes is very similar.
I must have been hallucinating earlier, thinking the data locality
information was somehow different.
So this whole thing might just simply be due to the fact
With regards to Mesos in fine-grained mode, do you have a feel for the
overhead of launching executors for every task? Of course, any perceived
slowdown will probably be very dependent on the workload. I just want to
have a feel for the possible overhead (e.g., a factor of 2 or 3 slowdown?).
If not a data locality issue, perhaps this overhead can be a factor in the
slowdown I observed, at
…a data locality issue might be the reason for the slowdown, but I can't
figure out why, especially for coarse-grained mode, as the executors
supposedly do not go away until job completion.
Any ideas?
Thanks,
Mike
You can also read about locality here in the docs:
http://spark.apache.org/docs/latest/tuning.html#data-locality
On Tue, Jan 6, 2015 at 8:37 AM, Cody Koeninger wrote:
> No, not all RDDs have location information, and in any case tasks may be
> scheduled on non-local nodes if there i
…data is local, i.e. Node1 and Node2 (assuming Node1 and Node2 have enough
resources to execute the tasks)?
Gaurav
I am seeing skewed execution times. As far as I can tell, they are
attributable to differences in data locality: tasks with locality
PROCESS_LOCAL run fast; NODE_LOCAL, slower; and ANY, slowest.
This seems entirely as it should be. The question is: why the different
locality levels?
I am
Can anyone point me to a good primer on how Spark decides where to send
what task, how it distributes them, and how it determines data locality?
I'm trying a pretty simple task: it's doing a foreach over cached data,
accumulating some (relatively complex) values.
So I s
Thanks for your reply!
qinwei
From: Shao, Saisai
Sent: 2014-09-28 14:42
To: qinwei
Cc: user
Subject: RE: problem with data locality api
Hi,
The first conf is used for Hadoop to determine the locality distribution of
the HDFS file. The second conf is used for Spark. Though with the same name,
they are actually two different configurations.
Subject: problem with data locality api
Hi, everyone
I came across a problem with data locality. I found this example code in
《Spark-on-YARN-A-Deep-Dive-Sandy-Ryza.pdf》:
val locData = InputFormatInfo.computePreferredLocations(Seq(new
InputFormatInfo(conf, classOf[TextInputFormat], new Path("myfile.txt"))))
Just an update on this: I have benchmarked on a cluster built with
spark-ec2 and again found that reading from HDFS is not much faster than
from S3 (about 20%).
Does anyone know how to check that data locality is being used by Spark on
my cluster?
Is it surprising that access to HDFS on local
Hi all,
I'm consistently finding that reading from HDFS is not appreciably faster
than reading from S3 using PySpark. How can I tell whether data locality is
being respected?
In this example, reading from HDFS is only about 10% faster than reading
the same file from S3. The files were pulled
Hi,
In standalone mode, how can we check that data locality is working as
expected when tasks are assigned?
Thanks!
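One concrete way to check, besides eyeballing the "Locality Level" column in the Spark UI, is to read the application's event log (written when spark.eventLog.enabled is set) and tally task locality levels. A sketch, assuming the SparkListenerTaskEnd JSON layout with a "Task Info" → "Locality" field that Spark's event logging writes (the two-line log below is synthetic):

```python
import json
from collections import Counter

def locality_counts(event_log_lines):
    """Count task locality levels from a Spark event log, which stores one
    JSON object per line. Only SparkListenerTaskEnd events carry task info."""
    counts = Counter()
    for line in event_log_lines:
        event = json.loads(line)
        if event.get("Event") == "SparkListenerTaskEnd":
            counts[event["Task Info"]["Locality"]] += 1
    return counts

# Synthetic two-line log for illustration; a real log has many more fields.
log = [
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Locality": "NODE_LOCAL"}}',
    '{"Event": "SparkListenerTaskEnd", "Task Info": {"Locality": "ANY"}}',
]
print(locality_counts(log))  # tallies of NODE_LOCAL and ANY tasks
```

A job reading local HDFS data should show mostly NODE_LOCAL or PROCESS_LOCAL tasks; a large ANY share suggests locality is not being achieved.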
On 23 Jul, 2014, at 12:49 am, Sandy Ryza wrote:
> On standalone there is still special handling for assigning tasks within
> executors. There just isn't special h
Please correct me if I'm wrong. Thank you for your patience!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: July 22, 2014 9:47
To: user@spark.apache.org
Subject: Re: data locality
This currently only works for YARN. The standalone default is to place an
executor on every node.
…hosts, how does Spark decide how many total executors to use for this
application?
Thanks again!
From: Sandy Ryza [mailto:sandy.r...@cloudera.com]
Sent: Friday, July 18, 2014 3:44 PM
To: user@spark.apache.org
Subject: Re: data locality
Hi Haopu,
Spark will ask HDFS for file block locations and
…before it has any information about where the input data for the jobs is
located. If the executors occupy significantly fewer nodes than exist in the
cluster, it can be difficult for Spark to achieve data locality. The
workaround for this is an API that allows passing in a set of preferred
locations.
I have a standalone Spark cluster and an HDFS cluster which share some of
the nodes.
When reading an HDFS file, how does Spark assign tasks to nodes? Will it ask
HDFS for the location of each file block in order to pick the right worker
node?
How about a Spark cluster on YARN?
Thank you very much!
…fast, and some optimization like this is definitely done, I assume?
I suppose the only benefit with HDFS would be better fault tolerance, and
the ability to checkpoint and recover even if the master fails.
Cheers,
Nilesh
…set up HDFS on the same cluster as Spark, write the data from the actors to
HDFS, and then use HDFS as the input source for Spark Streaming. Does this
result in better performance due to data locality (with HDFS data
replication turned on)? I think performance should be almost the same with
actors,