RE: Tasks run only on one machine

2015-04-24 Thread Evo Eftimov
# of tasks = # of partitions, so you can pass the desired number of partitions 
to the textFile API, which should result in (a) a better spatial distribution 
of the RDD, and (b) each partition being operated on by a separate task.

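For example, a minimal sketch (assuming the two-argument 
sc.textFile(path, minPartitions) overload; the path and the figure 64 are 
illustrative):

    // Ask for at least 64 input partitions; Spark may create more,
    // but this spreads small inputs across more tasks and executors.
    val rdd = sc.textFile("hdfs:///path/to/part-files", 64)

    // Or shuffle an existing RDD into more partitions after the read:
    val spread = rdd.repartition(64)
    println(spread.partitions.length)

In practice you would size that number to the cluster's total cores.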

-Original Message-
From: Pat Ferrel [mailto:p...@occamsmachete.com] 
Sent: Thursday, April 23, 2015 5:51 PM
To: user@spark.apache.org
Subject: Tasks run only on one machine

Using Spark streaming to create a large volume of small nano-batch input files, 
~4k per file, thousands of ‘part-x’ files. When reading the nano-batch files 
and doing a distributed calculation, my tasks run only on the machine where the 
job was launched. I’m launching in “yarn-client” mode. The RDD is created using 
sc.textFile(“list of thousand files”).

What would cause the read to occur only on the machine that launched the 
driver?

Do I need to do something to the RDD after reading? Has some partition factor 
been applied to all derived RDDs?



Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Argh, I looked and there really isn’t that much data yet. There will be 
thousands of files, but I’m starting small.

I bet this is just a case of the total data size being too small to require 
all the workers. Sorry, never mind.
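A quick way to verify (a sketch, not taken from the actual job):

    val columns = sc.textFile(source)
    // A few small files can collapse into very few input splits, so all
    // tasks land on one executor. Check the split count first:
    println(s"partitions = ${columns.partitions.length}")
    // If it is lower than the cluster's total cores, force a shuffle:
    val spread = columns.repartition(sc.defaultParallelism)

If the partition count comes back tiny, that would explain the single-machine 
behavior.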





Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
They are in HDFS, so available on all workers.




Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Physically? Not sure. They were written using the nano-batch RDDs in a 
streaming job that runs in a separate driver; the job is a Kafka consumer.

Would that affect all derived RDDs? If so, is there something I can do to mix 
it up, or does Spark know best about execution speed here?





Re: Tasks run only on one machine

2015-04-23 Thread Sean Owen
Where are the file splits? Meaning, is it possible they were (only) available 
on one node, and that node was also your driver?
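One way to check where the blocks physically live (assuming shell access to 
the cluster; the path is illustrative):

    hdfs fsck /path/to/nano-batch-files -files -blocks -locations

That reports which datanodes hold each block of the input files.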




Re: Tasks run only on one machine

2015-04-23 Thread Pat Ferrel
Sure

var columns = mc.textFile(source).map { line => line.split(delimiter) }

Here “source” is a comma-delimited list of files or directories. Both the 
textFile and .map tasks happen only on the machine they were launched from.

Other distributed operations happen later, but I suspect that if I can figure 
out why the first line runs only on the client machine, the rest will clear up 
too. Here are some subsequent lines.

if (filterColumn != -1) {
  // keep only the rows whose filter column matches
  columns = columns.filter { tokens => tokens(filterColumn) == filterBy }
}

val interactions = columns.map { tokens =>
  // (rowID, columnID) pairs
  tokens(rowIDColumn) -> tokens(columnIDPosition)
}

interactions.cache()
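Note that map and filter are narrow transformations, so the RDDs derived here 
inherit textFile's partitioning, and with it the task placement. A quick check 
(sketched against the code above):

    // Narrow transformations keep the parent's partition count, so the
    // placement decided at read time carries through to derived RDDs.
    println(columns.partitions.length)      // N partitions from textFile
    println(interactions.partitions.length) // the same N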



Re: Tasks run only on one machine

2015-04-23 Thread Jeetendra Gangele
Will you be able to paste the code here?
