I just did a test: even on a single node (local deployment), Spark can
handle data whose size is much larger than the total memory.
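For anyone who wants to reproduce this, the shell can be started with an
explicitly small driver heap, e.g. (flags here are illustrative, not the
exact command used for the run below):

$ spark-shell --master "local[2]" --driver-memory 1g

Spark spills shuffle and aggregation data to local disk (spark.local.dir)
when it does not fit in memory, which is what lets a file this size go
through a 2 GB machine.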
My test VM (2 GB RAM, 2 cores):
$ free -m
               total        used        free      shared  buff/cache   available
Mem:            1992        1845          92          19          54          36
Swap:           1023         285         738
The data size:
$ du -h rate.csv
3.2G rate.csv
Loading this file into Spark and running an aggregation on it completes without error:
scala> val df = spark.read.format("csv").option("inferSchema", true).load("skydrive/rate.csv")
val df: org.apache.spark.sql.DataFrame = [_c0: string, _c1: string ... 2 more fields]
scala> df.printSchema
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
root
|-- _c0: string (nullable = true)
|-- _c1: string (nullable = true)
|-- _c2: double (nullable = true)
|-- _c3: integer (nullable = true)
scala> df.groupBy("_c1").agg(avg("_c2").alias("avg_rating")).orderBy(desc("avg_rating")).show
warning: 1 deprecation (since 2.13.3); for details, enable `:setting -deprecation` or `:replay -deprecation`
+----------+----------+
| _c1|avg_rating|
+----------+----------+
|0001360000| 5.0|
|0001711474| 5.0|
|0001360779| 5.0|
|0001006657| 5.0|
|0001361155| 5.0|
|0001018043| 5.0|
|000136118X| 5.0|
|0000202010| 5.0|
|0001371037| 5.0|
|0000401048| 5.0|
|0001371045| 5.0|
|0001203010| 5.0|
|0001381245| 5.0|
|0001048236| 5.0|
|0001436163| 5.0|
|000104897X| 5.0|
|0001437879| 5.0|
|0001056107| 5.0|
|0001468685| 5.0|
|0001061240| 5.0|
+----------+----------+
only showing top 20 rows
So as you can see, Spark handles a file larger than its memory just fine. :)
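As a side note, option("inferSchema", true) makes Spark scan the whole file
an extra time just to guess the column types. If the layout is known, passing
an explicit schema avoids that extra pass; a minimal sketch, keeping the
generated _c0.._c3 names and the types from printSchema above:

import org.apache.spark.sql.types._

// Declare the four columns up front instead of inferring them.
val schema = StructType(Seq(
  StructField("_c0", StringType),
  StructField("_c1", StringType),
  StructField("_c2", DoubleType),
  StructField("_c3", IntegerType)))

val df = spark.read.schema(schema).csv("skydrive/rate.csv")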
Thanks
rajat kumar wrote:
With autoscaling you can have any number of executors.