Scala ranges can't hold more than Int.MaxValue elements; NumericRange throws as soon as something asks for the element count:
https://github.com/scala/scala/blob/2.12.x/src/library/scala/collection/immutable/Range.scala?utf8=✓#L89
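You can reproduce the limit without Spark at all. A minimal sketch in plain Scala (object name is just for illustration): constructing the range succeeds because NumericRange is lazy, but computing its length throws the same IllegalArgumentException that shows up in the Spark stack trace below.

```scala
// Minimal sketch, no Spark required: Scala's NumericRange is built lazily,
// but computing its length throws IllegalArgumentException once the element
// count would exceed Int.MaxValue.
object NumericRangeLimit {
  def main(args: Array[String]): Unit = {
    val big = 1L to (Int.MaxValue.toLong * 2) // constructing the range is fine
    val threw =
      try { big.length; false }
      catch { case _: IllegalArgumentException => true }
    println(s"length threw IllegalArgumentException: $threw")
  }
}
```

This is why the failure surfaces only when Spark calls `length` while slicing the collection into partitions, not when the range is created.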
As a workaround, you can create two RDDs and union them (each element appears twice, which is fine if you only need a large count):
scala> val rdd = sc.parallelize(1L to Int.MaxValue.toLong).union(sc.parallelize(1L to Int.MaxValue.toLong))
rdd: org.apache.spark.rdd.RDD[Long] = UnionRDD[10] at union at <console>:24

scala> rdd.count
[Stage 0:> (0 + 4) / 8]
Alternatively, instead of materializing the whole range on the driver, you can generate the elements in parallel on the executors:
scala> :paste
// Entering paste mode (ctrl-D to finish)
val numberOfParts = 100
val numberOfElementsInEachPart = Int.MaxValue.toDouble / numberOfParts
val rdd = sc.parallelize(1 to numberOfParts).flatMap(partNum => {
  val begin = ((partNum - 1) * numberOfElementsInEachPart).toLong
  val end = (partNum * numberOfElementsInEachPart).toLong
  // begin is exclusive so consecutive parts don't both emit the boundary element
  (begin + 1) to end
})

// Exiting paste mode, now interpreting.

numberOfParts: Int = 100
numberOfElementsInEachPart: Double = 2.147483647E7
rdd: org.apache.spark.rdd.RDD[Long] = MapPartitionsRDD[15] at flatMap at <console>:31

scala> rdd.count
res10: Long = 2147483647
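One subtlety with splitting a range this way: an inclusive `begin to end` double-counts the boundary element shared by consecutive parts, inflating the count by one per boundary. With an exclusive begin the bounds telescope, so the chunk sizes always sum to exactly the requested total. A plain-Scala sanity check of that arithmetic, no cluster needed (object and method names are illustrative):

```scala
// Sketch, no Spark needed: split 1..total into `parts` chunks of the form
// (exclusive begin, inclusive end]. Because begin(p+1) == end(p), the sum
// of chunk sizes telescopes to end(parts) - begin(1) == total exactly.
object ChunkBounds {
  def bounds(total: Long, parts: Int): Seq[(Long, Long)] = {
    val per = total.toDouble / parts
    (1 to parts).map(p => (((p - 1) * per).toLong, (p * per).toLong))
  }

  def main(args: Array[String]): Unit = {
    val bs = bounds(Int.MaxValue.toLong, 100)
    val count = bs.map { case (b, e) => e - b }.sum
    println(count) // 2147483647, i.e. exactly Int.MaxValue
  }
}
```

The telescoping argument also explains the off-by-100 you'd see with an inclusive begin: 99 duplicated boundaries plus the extra element 0 from the first part.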
On Tue, Aug 8, 2017 at 1:26 PM, makoto wrote:
> Hello,
> I'd like to count more than Int.MaxValue. But I encountered the following
> error.
>
> scala> val rdd = sc.parallelize(1L to Int.MaxValue*2.toLong)
> rdd: org.apache.spark.rdd.RDD[Long] = ParallelCollectionRDD[28] at parallelize at <console>:24
>
> scala> rdd.count
> java.lang.IllegalArgumentException: More than Int.MaxValue elements.
>   at scala.collection.immutable.NumericRange$.check$1(NumericRange.scala:304)
>   at scala.collection.immutable.NumericRange$.count(NumericRange.scala:314)
>   at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:52)
>   at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:51)
>   at scala.collection.immutable.NumericRange.length(NumericRange.scala:54)
>   at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:145)
>   at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2087)
>   at org.apache.spark.rdd.RDD.count(RDD.scala:1158)
>   ... 48 elided
>
> How can I avoid the error?
> A similar problem is as follows:
> scala> rdd.reduce((a,b)=> (a + b))
> java.lang.IllegalArgumentException: More than Int.MaxValue elements.
>   at scala.collection.immutable.NumericRange$.check$1(NumericRange.scala:304)
>   at scala.collection.immutable.NumericRange$.count(NumericRange.scala:314)
>   at scala.collection.immutable.NumericRange.numRangeElements$lzycompute(NumericRange.scala:52)
>   at scala.collection.immutable.NumericRange.numRangeElements(NumericRange.scala:51)
>   at scala.collection.immutable.NumericRange.length(NumericRange.scala:54)
>   at org.apache.spark.rdd.ParallelCollectionRDD$.slice(ParallelCollectionRDD.scala:145)
>   at org.apache.spark.rdd.ParallelCollectionRDD.getPartitions(ParallelCollectionRDD.scala:97)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
>   at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:250)
>   at scala.Option.getOrElse(Option.scala:121)
>   at org.apache.spark.rdd.RDD.partitions(RDD.scala:250)
>   at org.apache.spark.SparkContext.runJob(SparkContext.scala:2119)
>   at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
>   at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
>   at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
>   at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
>   ... 48 elided