coalesce() is a streaming operation when used without the second parameter;
it does not put all the data in RAM. If used with the second parameter
(shuffle=True), it performs a shuffle, but it still does not put all the
data in RAM.
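
For example, here is a minimal sketch of the two modes (my illustration, not
from the original thread; the input path is just a placeholder and sc is
assumed to be an existing SparkContext):

rdd = sc.textFile("s3://bucket/input/")

# Without shuffle: parent partitions are merged on the fly into 10,
# without materializing the whole dataset in memory.
fewer = rdd.coalesce(10)

# With shuffle=True: records are redistributed across the 10 partitions
# via a shuffle, which also streams rather than loading everything into RAM.
balanced = rdd.coalesce(10, shuffle=True)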

On Sat, Nov 1, 2014 at 12:09 PM, <jan.zi...@centrum.cz> wrote:

> Now I am running into problems using:
>
> distData = sc.textFile(sys.argv[2]).coalesce(10)
>
> The problem is that Spark seems to be trying to put all the data into RAM
> first and only then perform the coalesce. Do you know if there is something
> that would do the coalesce on the fly, with, for example, a fixed partition
> size? Do you think that something like this is possible? Unfortunately, I am
> not able to find anything like this in the Spark documentation.
>
> Thank you in advance for any advice or suggestions.
>
> Best regards,
> Jan
>
> ______________________________________________________________
>
>
> Thank you very much. The large number of very small JSON files was exactly
> the performance problem; with coalesce, my Spark program on a single node is
> only about twice as slow (even including Spark startup) as the single-node
> Python program, which is acceptable.
>
> Jan
> ______________________________________________________________
>
> Because of the overhead between the JVM and Python, a single task will be
> slower than your local Python script, but it's very easy to scale to
> many CPUs.
>
> Even on one CPU, it's not common for PySpark to be 100 times slower. You
> have many small files; each file will be processed by a task, which has
> about 100ms of overhead (being scheduled and executed), but the small
> file itself can be processed by your single-threaded Python script in less
> than 1ms.
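>
> For example (hypothetical numbers, just to illustrate the arithmetic):
> with 10,000 small files you get 10,000 tasks, and 10,000 x ~100ms is
> roughly 1,000 seconds of pure scheduling overhead, even if the actual
> parsing would only take a few seconds in total.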
>
> You could pack your JSON files into larger ones, or you could try to
> merge the small tasks into larger ones with coalesce(N), such as:
>
> distData = sc.textFile(sys.argv[2]).coalesce(10)  # will have 10 partitions (tasks)
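>
> As a rough sketch of the packing idea (the paths here are just
> placeholders), you could also read the small files once, coalesce, and
> write them back out, so that later jobs only see a handful of larger files:
>
> small = sc.textFile("s3://bucket/small-json/")
> # Re-save the same records as roughly 10 larger files, so later jobs pay
> # the per-task overhead 10 times instead of once per small file.
> small.coalesce(10).saveAsTextFile("s3://bucket/packed-json/")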
>
> Davies
>
> On Sat, Oct 18, 2014 at 12:07 PM,  <jan.zi...@centrum.cz> wrote:
> > Hi,
> >
> > I have a program for single-computer execution (in Python) and I have
> > also implemented the same thing in Spark. This program basically only
> > reads .json files, takes one field from them, and saves it back. Using
> > Spark, my program runs approximately 100 times slower on 1 master and
> > 1 slave. So I would like to ask where the problem might be?
> >
> > My Spark program looks like:
> >
> > sc = SparkContext(appName="Json data preprocessor")
> > distData = sc.textFile(sys.argv[2])
> > json_extractor = JsonExtractor(sys.argv[1])
> > cleanedData = distData.flatMap(json_extractor.extract_json)
> > cleanedData.saveAsTextFile(sys.argv[3])
> >
> > JsonExtractor only selects the data from the field given by
> > sys.argv[1].
> >
> >
> >
> > My data is basically many small JSON files, with one JSON object per
> > line.
> >
> > I have tried both reading and writing the data from/to Amazon S3 and
> > from/to local disk on all the machines.
> >
> > I would like to ask if there is something I am missing, or if Spark is
> > supposed to be this slow in comparison with a local, non-parallelized
> > single-node program.
> >
> >
> >
> > Thank you in advance for any suggestions or hints.
> >
> >
> >
