In Spark, certain functions take an optional parameter that determines the number of partitions (distinct, textFile, etc.). You can also use the coalesce() or repartition() functions to change the number of partitions of an existing RDD. Thanks.
On Oct 28, 2014 9:58 AM, "shahab" <[email protected]> wrote:
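A minimal sketch of those options against the Spark 1.1 RDD API, assuming an existing SparkContext `sc` (e.g. in spark-shell); the input path and the partition counts are hypothetical:

```scala
// Assumes a running SparkContext `sc`; the path below is a placeholder.
val rdd = sc.textFile("hdfs:///logs/input.txt", 100) // minPartitions hint at load time

// distinct() is one of the functions that accepts an explicit partition count
val unique = rdd.distinct(50)

// coalesce() reduces the partition count without a full shuffle ...
val fewer = unique.coalesce(10)

// ... while repartition() can also increase it (always shuffles)
val more = unique.repartition(200)

println(more.partitions.size) // inspect the resulting partition count
```

Smaller, more numerous partitions (as discussed below) would mean picking a higher count in repartition() relative to the input size.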
> Thanks for the useful comment. But I guess this setting applies only when
> I use Spark SQL, right? Is there any similar setting for Spark?
>
> best,
> /Shahab
>
> On Tue, Oct 28, 2014 at 2:38 PM, Wanda Hawk <[email protected]>
> wrote:
>
>> Is this what you are looking for?
>>
>> In Shark, default reducer number is 1 and is controlled by the property
>> mapred.reduce.tasks. Spark SQL deprecates this property in favor of
>> spark.sql.shuffle.partitions, whose default value is 200. Users may
>> customize this property via SET:
>>
>> SET spark.sql.shuffle.partitions=10;
>> SELECT page, count(*) c
>> FROM logs_last_month_cached
>> GROUP BY page ORDER BY c DESC LIMIT 10;
>>
>> Spark SQL Programming Guide - Spark 1.1.0 Documentation
>> <http://spark.apache.org/docs/latest/sql-programming-guide.html>
>>
>> ------------------------------
>> *From:* shahab <[email protected]>
>> *To:* [email protected]
>> *Sent:* Tuesday, October 28, 2014 3:20 PM
>> *Subject:* How can number of partitions be set in "spark-env.sh"?
>>
>> I am running a standalone Spark cluster; each of its 2 workers has 2
>> cores. Apparently, I am loading and processing a relatively large chunk
>> of data, so I receive the task failure " ". As I read from some posts
>> and discussions on the mailing list, the failures could be related to
>> the large amount of data processed per partition, and if I have
>> understood correctly I should have smaller (but more numerous)
>> partitions?!
>>
>> Is there any way that I can set the number of partitions dynamically in
>> "spark-env.sh" or in the submitted Spark application?
>>
>> best,
>> /Shahab
