Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
Thanks for the suggestions and links. The problem arises when I use the DataFrame API to write, but it works fine when doing an insert overwrite into a Hive table.

# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select * from temp_table".format(table_name))

# Doesn't work; throws java.lang.OutOfMemoryError: Requested array size exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt','c_dt').parquet("/path/to/file/")

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar wrote:

> If you are running on 64-bit JVM with less than 32G heap, you might want
> to enable -XX:+UseCompressedOops[1]. And if your dataframe is somehow
> generating more than 2^31-1 number of arrays, you might have to rethink
> your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak wrote:
>
>> Hi,
>>
>> I am reading the parquet file around 50+ G which has 4013 partitions with
>> 240 columns. Below is my configuration
>>
>> driver : 20G memory with 4 cores
>> executors: 45 executors with 15G memory and 4 cores.
>>
>> I tried to read the data using both Dataframe read and using hive context
>> to read the data using hive SQL but for the both cases, it throws me below
>> error with no further description on error.
>>
>> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partion/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> # Executing /bin/sh -c "kill -9 16953"...
>>
>> What could be wrong over here since I think increasing memory only will
>> not help in this case since it reached the array size limit.
>>
>> Thanks,
>> Bijay
>
> --
> Cheers,
> Praj
Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
If you are running on a 64-bit JVM with less than a 32G heap, you might want to enable -XX:+UseCompressedOops [1]. And if your DataFrame is somehow generating arrays of more than 2^31-1 elements, you might have to rethink your options.

[1] https://spark.apache.org/docs/latest/tuning.html

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak wrote:

> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration
>
> driver : 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data using both Dataframe read and using hive context
> to read the data using hive SQL but for the both cases, it throws me below
> error with no further description on error.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> # Executing /bin/sh -c "kill -9 16953"...
>
> What could be wrong over here since I think increasing memory only will
> not help in this case since it reached the array size limit.
>
> Thanks,
> Bijay

--
Cheers,
Praj
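For readers unfamiliar with the 2^31-1 figure: a Java array length is a signed 32-bit int, so no single array can hold more elements than that, however large the heap is. The following is a minimal Python sketch of that arithmetic. The function names are hypothetical and exist only for illustration. The exact HotSpot cap is assumed here to be Integer.MAX_VALUE, although in practice it is a few elements lower.

```python
# Largest value a Java int can hold. HotSpot's real array cap is
# assumed to be this value here; in practice it is slightly lower.
JAVA_INT_MAX = 2**31 - 1


def fits_in_jvm_array(num_elements):
    """Return True if an array of this many elements has a valid int length."""
    return 0 <= num_elements <= JAVA_INT_MAX


def max_array_bytes(element_size_bytes):
    """Upper bound on the payload of one array with the given element width."""
    return JAVA_INT_MAX * element_size_bytes


GIB = 1024**3
print(fits_in_jvm_array(2 * GIB))   # is a 2 GiB byte buffer addressable?
print(max_array_bytes(1) / GIB)     # byte[] payload bound, in GiB
print(max_array_bytes(8) / GIB)     # double[] payload bound, in GiB
```

The point of the sketch is that the limit is per array rather than per heap: a single byte buffer cannot reach 2 GiB, which is why adding executor memory alone does not remove this error.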
Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
Have you seen this thread?

http://search-hadoop.com/m/q3RTtyXr2N13hf9O&subj=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit

On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak wrote:

> Hi,
>
> I am reading the parquet file around 50+ G which has 4013 partitions with
> 240 columns. Below is my configuration
>
> driver : 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data using both Dataframe read and using hive context
> to read the data using hive SQL but for the both cases, it throws me below
> error with no further description on error.
>
> hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partion/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> # Executing /bin/sh -c "kill -9 16953"...
>
> What could be wrong over here since I think increasing memory only will
> not help in this case since it reached the array size limit.
>
> Thanks,
> Bijay
SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error
Hi,

I am reading a Parquet file of around 50+ GB, which has 4013 partitions and 240 columns. Below is my configuration:

driver: 20G memory with 4 cores
executors: 45 executors with 15G memory and 4 cores

I tried to read the data using both the DataFrame reader and a Hive context with Hive SQL, but in both cases it throws the error below with no further description.

hive_context.sql("select * from test.base_table where date='{0}'".format(part_dt))
sqlcontext.read.parquet("/path/to/partion/")

#
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 16953"...

What could be wrong here? I think increasing memory alone will not help in this case, since it has hit the array size limit.

Thanks,
Bijay
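As a rough sanity check on the numbers in this post, the average partition size can be compared with the JVM array limit. This is only a sketch under assumptions: "50+ G" is taken as exactly 50 GiB, the 4013 partitions are assumed to split the data evenly, and the helper name is hypothetical.

```python
GIB = 1024**3
MIB = 1024**2
JAVA_INT_MAX = 2**31 - 1  # upper bound on the length of one JVM byte array


def average_partition_bytes(total_bytes, num_partitions):
    """Average bytes per partition, assuming an even split (no skew)."""
    return total_bytes / num_partitions


total = 50 * GIB      # "around 50+ G" from the post, assumed to be 50 GiB
partitions = 4013     # partition count quoted in the post

avg = average_partition_bytes(total, partitions)
print("average partition: %.1f MiB" % (avg / MIB))
print("fraction of the array limit: %.4f" % (avg / JAVA_INT_MAX))
```

Under these assumptions the average comes to about 12.8 MiB, far below the limit. An average hides skew, though, and the on-disk Parquet size is usually smaller than the decoded in-memory size, so a single oversized partition, row group, or buffer could still be responsible.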
Requested array size exceeds VM limit Error
Hi,

I have a 170GB tab-delimited data set which I am converting into the RDD[LabeledPoint] format. I am then taking a 60% sample of this data set to be used for training a GBT model.

I got the "Size exceeds Integer.MAX_VALUE" error, which I fixed by repartitioning the data set into 1000 partitions.

Now, the GBT code caches the data set, if it's not already cached, with this operation:

input.persist(StorageLevel.MEMORY_AND_DISK)

(https://github.com/apache/spark/blob/branch-1.2/mllib/src/main/scala/org/apache/spark/mllib/tree/GradientBoostedTrees.scala)

To pre-empt this caching so I can better control it, I am caching the RDD (after the repartition) with this command:

trainingData.persist(StorageLevel.MEMORY_AND_DISK_SER_2)

But now I get the following error on one executor, and the application fails after a retry. I am not sure how to fix this. Could someone help?

One possible reason could be that I submit my job with "--driver-memory 11G --executor-memory 11G" but am allotted only 5.7GB. I am not sure whether this could actually have an effect.

My runtime environment: 120 executors with 5.7 GB each; the driver has 5.3 GB.
My Spark Config:

set("spark.default.parallelism", "300")
.set("spark.akka.frameSize", "256")
.set("spark.akka.timeout", "1000")
.set("spark.core.connection.ack.wait.timeout", "200")
.set("spark.akka.threads", "10")
.set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.set("spark.kryoserializer.buffer.mb", "256")

java.lang.OutOfMemoryError: Requested array size exceeds VM limit
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:113)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:140)
    at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
    at java.io.BufferedOutputStream.write(BufferedOutputStream.java:126)
    at com.esotericsoftware.kryo.io.Output.flush(Output.java:155)
    at com.esotericsoftware.kryo.io.Output.require(Output.java:135)
    at com.esotericsoftware.kryo.io.Output.writeLong(Output.java:477)
    at com.esotericsoftware.kryo.io.Output.writeDouble(Output.java:596)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:212)
    at com.esotericsoftware.kryo.serializers.DefaultArraySerializers$DoubleArraySerializer.write(DefaultArraySerializers.java:200)
    at com.esotericsoftware.kryo.Kryo.writeObjectOrNull(Kryo.java:549)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:570)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
    at com.esotericsoftware.kryo.Kryo.writeObject(Kryo.java:501)
    at com.esotericsoftware.kryo.serializers.FieldSerializer$ObjectField.write(FieldSerializer.java:564)
    at com.esotericsoftware.kryo.serializers.FieldSerializer.write(FieldSerializer.java:213)
    at com.esotericsoftware.kryo.Kryo.writeClassAndObject(Kryo.java:568)
    at org.apache.spark.serializer.KryoSerializationStream.writeObject(KryoSerializer.scala:128)
    at org.apache.spark.serializer.SerializationStream.writeAll(Serializer.scala:110)
    at org.apache.spark.storage.BlockManager.dataSerializeStream(BlockManager.scala:1175)
    at org.apache.spark.storage.BlockManager.dataSerialize(BlockManager.scala:1184)
    at org.apache.spark.storage.MemoryStore.putArray(MemoryStore.scala:103)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:789)
    at org.apache.spark.storage.BlockManager.putArray(BlockManager.scala:669)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:167)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:70)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)

Thank You!
Vinay
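The top frames of this trace (Arrays.copyOf called from ByteArrayOutputStream.grow) suggest the failure happens while a growing in-memory byte buffer is being enlarged during serialization of a cached partition. Below is a simplified Python model of a buffer that doubles its capacity and is capped at the largest Java int. It is an illustration only, not the JDK source: the function names and the 32-byte starting capacity are assumptions made for the sketch.

```python
JAVA_INT_MAX = 2**31 - 1  # largest capacity a Java int can express


def grow_capacity(old_capacity, min_capacity):
    """Simplified model of a doubling byte buffer with int-sized capacities.

    Doubles the capacity, raises it to min_capacity if doubling is not
    enough, and clamps the result to JAVA_INT_MAX. Raises OverflowError
    when even min_capacity cannot be expressed as an array length.
    """
    if min_capacity > JAVA_INT_MAX:
        raise OverflowError("requested size exceeds the array limit")
    new_capacity = max(old_capacity * 2, min_capacity)
    return min(new_capacity, JAVA_INT_MAX)


def capacities_until_limit(initial_capacity=32):
    """List every capacity the buffer passes through until it hits the cap."""
    cap = initial_capacity
    steps = [cap]
    while cap < JAVA_INT_MAX:
        cap = grow_capacity(cap, cap + 1)  # one more byte forces a grow
        steps.append(cap)
    return steps


steps = capacities_until_limit()
print(len(steps) - 1)  # growth steps needed to reach the cap
print(steps[-2])       # last capacity reached by plain doubling
```

In this model the last capacity reached by plain doubling is 2^30 bytes (1 GiB), and the next growth jumps straight to the int maximum. If the real buffer behaves similarly, a serialized partition somewhat over 1 GiB could already trigger a request at or near the JVM's array limit, which would be consistent with the error appearing on a single executor that holds an unusually large partition.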