Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Thanks for the suggestions and links. The problem arises when I use the
DataFrame API to write, but it works fine when doing an insert overwrite
into the Hive table.

# Works fine
hive_context.sql("insert overwrite table {0} partition (e_dt, c_dt) select *
from temp_table".format(table_name))
# Doesn't work, throws java.lang.OutOfMemoryError: Requested array size
exceeds VM limit
df.write.mode('overwrite').partitionBy('e_dt','c_dt').parquet("/path/to/file/")
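A possible mitigation, sketched under the assumption that the write fails
because each task buffers output for many (e_dt, c_dt) partitions at once,
is to repartition on the partition columns before writing (needs Spark 1.6+
for column-based repartition; df and the column names are from the snippet
above, the path is a placeholder):

# Sketch only: cluster rows so each task writes to as few (e_dt, c_dt)
# partitions as possible, keeping fewer Parquet writers and buffers open.
df.repartition('e_dt', 'c_dt') \
    .write.mode('overwrite') \
    .partitionBy('e_dt', 'c_dt') \
    .parquet("/path/to/file/")

Whether this helps depends on how the data is distributed; it does not
change the JVM's single-array size limit itself.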

Thanks,
Bijay

On Wed, May 4, 2016 at 3:02 PM, Prajwal Tuladhar wrote:

> If you are running on a 64-bit JVM with less than a 32G heap, you might want
> to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
> generating arrays of more than 2^31-1 elements, you might have to rethink
> your options.
>
> [1] https://spark.apache.org/docs/latest/tuning.html
>
> On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak wrote:
>
>> Hi,
>>
>> I am reading a Parquet file of around 50+ GB which has 4013 partitions and
>> 240 columns. Below is my configuration:
>>
>> driver: 20G memory with 4 cores
>> executors: 45 executors with 15G memory and 4 cores.
>>
>> I tried to read the data both with the DataFrame read API and through the
>> hive context with Hive SQL, but in both cases it throws the error below
>> with no further description.
>>
>> hive_context.sql("select * from test.base_table where
>> date='{0}'".format(part_dt))
>> sqlcontext.read.parquet("/path/to/partition/")
>>
>> #
>> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
>> # -XX:OnOutOfMemoryError="kill -9 %p"
>> #   Executing /bin/sh -c "kill -9 16953"...
>>
>>
>> What could be wrong here? I think increasing memory alone will not help in
>> this case, since it has already reached the array size limit.
>>
>> Thanks,
>> Bijay
>>
>
>
>
> --
> --
> Cheers,
> Praj
>


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Prajwal Tuladhar
If you are running on a 64-bit JVM with less than a 32G heap, you might want
to enable -XX:+UseCompressedOops [1]. And if your dataframe is somehow
generating arrays of more than 2^31-1 elements, you might have to rethink
your options.

[1] https://spark.apache.org/docs/latest/tuning.html
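For reference, a rough sketch of how that flag could be passed on
spark-submit (the driver flag and the executor config key are standard Spark
options; the script name is a placeholder):

spark-submit \
  --driver-java-options "-XX:+UseCompressedOops" \
  --conf "spark.executor.extraJavaOptions=-XX:+UseCompressedOops" \
  your_job.py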

On Wed, May 4, 2016 at 9:44 PM, Bijay Kumar Pathak wrote:

> Hi,
>
> I am reading a Parquet file of around 50+ GB which has 4013 partitions and
> 240 columns. Below is my configuration:
>
> driver: 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data both with the DataFrame read API and through the
> hive context with Hive SQL, but in both cases it throws the error below
> with no further description.
>
> hive_context.sql("select * from test.base_table where
> date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partition/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong here? I think increasing memory alone will not help in
> this case, since it has already reached the array size limit.
>
> Thanks,
> Bijay
>



-- 
--
Cheers,
Praj


Re: SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Ted Yu
Have you seen this thread?

http://search-hadoop.com/m/q3RTtyXr2N13hf9O=java+lang+OutOfMemoryError+Requested+array+size+exceeds+VM+limit

On Wed, May 4, 2016 at 2:44 PM, Bijay Kumar Pathak wrote:

> Hi,
>
> I am reading a Parquet file of around 50+ GB which has 4013 partitions and
> 240 columns. Below is my configuration:
>
> driver: 20G memory with 4 cores
> executors: 45 executors with 15G memory and 4 cores.
>
> I tried to read the data both with the DataFrame read API and through the
> hive context with Hive SQL, but in both cases it throws the error below
> with no further description.
>
> hive_context.sql("select * from test.base_table where
> date='{0}'".format(part_dt))
> sqlcontext.read.parquet("/path/to/partition/")
>
> #
> # java.lang.OutOfMemoryError: Requested array size exceeds VM limit
> # -XX:OnOutOfMemoryError="kill -9 %p"
> #   Executing /bin/sh -c "kill -9 16953"...
>
>
> What could be wrong here? I think increasing memory alone will not help in
> this case, since it has already reached the array size limit.
>
> Thanks,
> Bijay
>


SqlContext parquet read OutOfMemoryError: Requested array size exceeds VM limit error

2016-05-04 Thread Bijay Kumar Pathak
Hi,

I am reading a Parquet file of around 50+ GB which has 4013 partitions and
240 columns. Below is my configuration:

driver: 20G memory with 4 cores
executors: 45 executors with 15G memory and 4 cores.
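
(As a sketch, the same configuration expressed as spark-submit flags,
assuming YARN; the script name is a placeholder:)

spark-submit \
  --driver-memory 20g --driver-cores 4 \
  --num-executors 45 --executor-memory 15g --executor-cores 4 \
  your_job.py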

I tried to read the data both with the DataFrame read API and through the
hive context with Hive SQL, but in both cases it throws the error below
with no further description.

hive_context.sql("select * from test.base_table where
date='{0}'".format(part_dt))
sqlcontext.read.parquet("/path/to/partition/")

#
# java.lang.OutOfMemoryError: Requested array size exceeds VM limit
# -XX:OnOutOfMemoryError="kill -9 %p"
#   Executing /bin/sh -c "kill -9 16953"...


What could be wrong here? I think increasing memory alone will not help in
this case, since it has already reached the array size limit.

Thanks,
Bijay