bq. *blocks being 64MB by default in HDFS*

*In Hadoop 2.1+, the default block size has been increased to 128MB.* See https://issues.apache.org/jira/browse/HDFS-4053

Cheers

On Thu, Feb 19, 2015 at 8:32 AM, Ted Yu <[email protected]> wrote:

> What file system are you using?
>
> If you use HDFS, the documentation you cited is pretty clear on how
> partitions are determined.
>
> bq. file X replicated on 4 machines
>
> I don't think the replication factor plays a role w.r.t. partitions.
>
> On Thu, Feb 19, 2015 at 8:05 AM, Alessandro Lulli <[email protected]>
> wrote:
>
>> Hi All,
>>
>> Could you please help me understand how Spark defines the number of
>> partitions of an RDD if it is not specified?
>>
>> I found the following in the documentation for files loaded from HDFS:
>> *The textFile method also takes an optional second argument for
>> controlling the number of partitions of the file. By default, Spark creates
>> one partition for each block of the file (blocks being 64MB by default in
>> HDFS), but you can also ask for a higher number of partitions by passing a
>> larger value. Note that you cannot have fewer partitions than blocks.*
>>
>> What is the rule for files loaded from other file systems?
>> For instance, I have a file X replicated on 4 machines. If I load
>> file X into an RDD, how many partitions are defined, and why?
>>
>> Thanks for your help on this,
>> Alessandro
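The one-partition-per-block rule the thread quotes can be sketched in plain Python. This is not Spark's actual code: `estimate_partitions` is a hypothetical helper that mirrors, in simplified form, how Hadoop's `FileInputFormat` computes input splits for a single file (it ignores the configurable minimum split size and the 10% slack factor Hadoop applies to the last split). It shows why passing a larger `minPartitions` to `textFile` can only raise the partition count, never lower it below the number of blocks.

```python
import math

def estimate_partitions(file_size, block_size=128 * 1024 * 1024, min_partitions=2):
    """Rough estimate of how many partitions sc.textFile() yields for one file.

    Simplified model of Hadoop FileInputFormat split sizing:
    the target split size is min(block_size, file_size / min_partitions),
    so you get at least one partition per block, and a larger
    min_partitions shrinks the splits (more partitions) but a smaller
    one cannot merge blocks into fewer partitions.
    """
    if file_size == 0:
        return 1  # an empty file still becomes a single (empty) partition
    # Goal size if we honored min_partitions exactly.
    goal_size = max(1, file_size // min_partitions)
    # Never let a split span more than one block.
    split_size = min(block_size, goal_size)
    return math.ceil(file_size / split_size)
```

For example, a 1 GB file with the 128 MB default block size yields 8 partitions whether `min_partitions` is 1 or 2, while a 10 MB file (a single block) is still split in two to satisfy the default `min_partitions=2`.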
