How many CPUs does the slave have? Because of the overhead between the JVM and Python, a single task will be slower than your local Python script, but it is very easy to scale out to many CPUs.

Even with one CPU, it is unusual for PySpark to be 100 times slower. You
have many small files, and each file is processed by its own task, which
carries about 100 ms of overhead (scheduling and execution), whereas your
single-threaded Python script can process such a small file in less than
1 ms.

You could pack your JSON files into larger ones, or you could merge the
small tasks into larger ones with coalesce(N), for example:

    distData = sc.textFile(sys.argv[2]).coalesce(10)  # gives 10 partitions (tasks)

Davies

On Sat, Oct 18, 2014 at 12:07 PM, <jan.zi...@centrum.cz> wrote:
> Hi,
>
> I have a program written for single-computer execution (in Python), and I
> have also implemented the same program for Spark. It basically just reads
> .json files, takes one field from them, and saves it back. With Spark, my
> program runs approximately 100 times slower on 1 master and 1 slave. So I
> would like to ask where the problem might be.
>
> My Spark program looks like:
>
>     sc = SparkContext(appName="Json data preprocessor")
>     distData = sc.textFile(sys.argv[2])
>     json_extractor = JsonExtractor(sys.argv[1])
>     cleanedData = distData.flatMap(json_extractor.extract_json)
>     cleanedData.saveAsTextFile(sys.argv[3])
>
> JsonExtractor only selects the data from the field given by sys.argv[1].
>
> My data are basically many small JSON files, with one JSON document per
> line.
>
> I have tried reading and writing the data both from/to Amazon S3 and
> from/to local disk on all the machines.
>
> I would like to ask whether there is something I am missing, or whether
> Spark is supposed to be this slow compared with a local, non-parallelized,
> single-node program.
>
> Thank you in advance for any suggestions or hints.
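
For the other option mentioned in the reply, packing the small JSON files into larger ones before Spark reads them, here is a minimal sketch in plain Python. It assumes the inputs are line-oriented text files (one JSON document per line, as described in the question). The function name pack_files, the files_per_pack parameter, and the packed-NNNNN.json output naming are all made up for illustration and are not part of Spark or of the original program.

```python
import os


def pack_files(input_paths, output_dir, files_per_pack=1000):
    """Concatenate many small line-oriented files into fewer, larger ones.

    Hypothetical helper for illustration only. Each non-empty input line is
    copied unchanged, so one-JSON-per-line input stays one-JSON-per-line.
    Returns the list of packed file paths that were written.
    """
    os.makedirs(output_dir, exist_ok=True)
    packed_paths = []
    for start in range(0, len(input_paths), files_per_pack):
        batch = input_paths[start:start + files_per_pack]
        # Output naming scheme is an assumption, chosen only for readability.
        out_path = os.path.join(output_dir, "packed-%05d.json" % len(packed_paths))
        with open(out_path, "w", encoding="utf-8") as out:
            for path in batch:
                with open(path, encoding="utf-8") as src:
                    for line in src:
                        line = line.rstrip("\n")
                        if line:  # skip blank lines; also handles a missing final newline
                            out.write(line + "\n")
        packed_paths.append(out_path)
    return packed_paths
```

The output directory could then be passed to sc.textFile in place of the original input path, so that Spark schedules a task per packed file rather than per small file. How much this helps depends on the number and size of the files, so it is worth measuring against the coalesce(N) variant.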
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org