Full disclosure: I am *brand* new to Spark. I am trying to use [Py]SparkSQL standalone to pre-process a bunch of *local* (non-HDFS) Parquet files. I have several thousand files and want to dispatch as many workers as my machine can handle to process the data in parallel, either per file, or per record (or batch of records) within a single file.
My question is: how can this be achieved in a standalone scenario? I have plenty of cores and RAM, yet when I do `sc = SparkContext("local[8]")` in my standalone script I see no speedup compared to, say, `local[1]`.

I've also tried something like `distData = sc.parallelize(data)` followed by `distData.foreach(myFunction)` after starting with `local[N]`, yet that seems to return immediately without producing the expected side effects from `myFunction` (file output).

I realize that parallelizing Python code on a single-node cluster is not what Spark was designed for, but it integrates Parquet and Python so well that it's my only option. :)

Thanks,
Kyle

--
View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Parallelize-foreach-in-PySpark-with-Spark-Standalone-tp22756.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.