Are the gz files roughly equal in size? Do you know whether your partitions
are roughly balanced? Perhaps some cores get assigned tasks that finish very
quickly, while others get most of the work.
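
A quick way to check, assuming the RDD loaded with sc.textFile is bound to a
variable named rdd (the name is only for illustration):

counts = rdd.mapPartitions(lambda it: [sum(1 for _ in it)]).collect()
print(len(counts), min(counts), max(counts))  # partition count and size skew

If min and max differ wildly, the partitions are unbalanced.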

On Sat Jan 17 2015 at 2:02:49 AM Gautham Anil <gautham.a...@gmail.com>
wrote:

> Hi,
>
> Thanks for getting back to me. Sorry for the delay. I am still having
> this issue.
>
> @sun: To clarify, the machine actually has 16 usable threads and the
> job has more than 100 gzip files, so there are enough partitions to
> use all threads.
>
> @nicholas: The number of partitions matches the number of files (> 100).
>
> @Sebastian: I understand the lazy loading behavior. For this reason, I
> usually use a .count() to force the transformation (.first() is not
> enough). Still, during the transformation, only 4 cores are used
> for processing the input files.
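>
> For reference, roughly what I run (the path below is just a placeholder):
>
> rdd = sc.textFile("s3n://bucket/logs/*.gz")
> rdd.count()  # forces the full read; top still shows only ~4 busy threads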
>
> I don't know whether other people have noticed this issue. Can anyone
> reproduce it with v1.1?
>
>
> On Wed, Dec 17, 2014 at 2:14 AM, Nicholas Chammas
> <nicholas.cham...@gmail.com> wrote:
> > Rui is correct.
> >
> > Check how many partitions your RDD has after loading the gzipped files,
> > e.g. rdd.getNumPartitions().
> >
> > If that number is way less than the number of cores in your cluster (in
> > your case I suspect the number is 4), then explicitly repartition the RDD
> > to match the number of cores in your cluster, or some multiple thereof.
> >
> > For example:
> >
> > new_rdd = rdd.repartition(sc.defaultParallelism * 3)
> >
> > Operations on new_rdd should utilize all the cores in your cluster.
> >
> > Nick
> >
> >
> > On Wed Dec 17 2014 at 1:42:16 AM Sun, Rui <rui....@intel.com> wrote:
> >>
> >> Gautham,
> >>
> >> How many gz files do you have? Maybe the reason is that gz files are
> >> compressed in a format that can't be split for processing by MapReduce.
> >> A single gz file can only be processed by a single mapper, so the CPU
> >> threads can't be fully utilized.
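> >>
> >> For illustration (the path is just a placeholder): with gzip input the
> >> partition count roughly equals the number of files, since each .gz has
> >> to be decompressed by a single task.
> >>
> >> rdd = sc.textFile("hdfs:///data/*.gz")
> >> print(rdd.getNumPartitions())  # roughly the number of gz files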
> >>
> >> -----Original Message-----
> >> From: Gautham [mailto:gautham.a...@gmail.com]
> >> Sent: Wednesday, December 10, 2014 3:00 AM
> >> To: u...@spark.incubator.apache.org
> >> Subject: pyspark sc.textFile uses only 4 out of 32 threads per node
> >>
> >> I am having an issue with pyspark launched in ec2 (using spark-ec2) with
> >> 5 r3.4xlarge machines, where each has 32 threads and 240GB of RAM. When
> >> I use sc.textFile to load data from a number of gz files, it does not
> >> progress as fast as expected. When I log in to a child node and run top,
> >> I see only 4 threads at 100% CPU; the remaining 28 cores are idle. This
> >> is not an issue when processing the strings after loading, when all the
> >> cores are used to process the data.
> >>
> >> Please help me with this. What setting can be changed to get the CPU
> >> usage back up to full?
> >>
> >>
> >>
> >> --
> >> View this message in context:
> >> http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-sc-textFile-uses-only-4-out-of-32-threads-per-node-tp20595.html
> >> Sent from the Apache Spark User List mailing list archive at Nabble.com.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>
>
>
> --
> Gautham Anil
>
> "The first principle is that you must not fool yourself. And you are
> the easiest person to fool" - Richard P. Feynman
>
