You can get some more insight by using the Spark history server
(http://spark.apache.org/docs/latest/monitoring.html); it can show you
which task is failing, along with other information that might help you
debug the issue.
On 05/10/2016 19:00, Babak Alipour wrote:
> The issue seems to lie in the RangePartitioner trying to create equal ranges. [1]
The issue seems to lie in the RangePartitioner trying to create equal
ranges. [1]
[1] https://spark.apache.org/docs/2.0.0/api/java/org/apache/spark/RangePartitioner.html
The *Double* values I'm trying to sort are mostly in the range [0,1] (~70%
of the data, which is roughly 1 billion records).
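To see how lopsided the column really is, something like this quick check (a minimal sketch; the SparkSession `spark`, table, and field names are stand-ins for the real ones) would show whether the RangePartitioner has any usable split points:

~~~python
# Minimal sketch: inspect the distribution of the sort column.
# Heavy clustering in [0, 1] shows up as nearly identical quantiles.
df = spark.table("MY_TABLE").select("my_field")

# Approximate quantiles with a 0.001 relative error.
quantiles = df.approxQuantile("my_field", [0.5, 0.75, 0.9, 0.99], 0.001)
print(quantiles)
~~~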
Thanks Vadim for sharing your experience, but I have tried a multi-JVM setup
(2 workers) and various sizes for spark.executor.memory (8g, 16g, 20g, 32g,
64g) and spark.executor.cores (2-4), with the same error all along.
As for the files, these are all .snappy.parquet files, resulting from
inserting some data […]
Oh, and try to run even smaller executors, i.e. with
`spark.executor.memory` <= 16GiB. I wonder what result you're going to get.
On Sun, Oct 2, 2016 at 1:24 AM, Vadim Semenov wrote:
> > Do you mean running a multi-JVM 'cluster' on the single machine?
> Yes, that's what I suggested.
> Do you mean running a multi-JVM 'cluster' on the single machine?
Yes, that's what I suggested.
You can get some information here:
http://blog.cloudera.com/blog/2015/03/how-to-tune-your-apache-spark-jobs-part-2/
To add one more note, I tried running more, smaller executors, each with
32-64g of memory and executor.cores set to 2-4 (with 2 workers as well),
and I'm still getting the same exception:
java.lang.IllegalArgumentException: Cannot allocate a page with more than
17179869176 bytes
    at org.apache.spark.mem[…]
Do you mean running a multi-JVM 'cluster' on the single machine? How would
that affect performance/memory consumption? If a multi-JVM setup can handle
such a large input, then why can't a single JVM break the job down into
smaller tasks?
I also found that SPARK-9411 mentions making the page_size configurable.
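For what it's worth, 17179869176 is exactly (2^31 - 1) * 8, which looks like the hard cap on a single memory page, so the sorter seems to be requesting more than one page can ever hold. A minimal sketch of experimenting with the knob SPARK-9411 describes (`spark.buffer.pageSize` is undocumented, and 64m is only an assumed value to test with):

~~~python
# Hedged sketch: SPARK-9411 made Tungsten page sizes configurable.
# `spark.buffer.pageSize` is undocumented; 64m is an assumption to
# experiment with, not a known-good value.
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("page-size-experiment")         # hypothetical app name
         .config("spark.buffer.pageSize", "64m")  # smaller pages per allocation
         .getOrCreate())
~~~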
Run more, smaller executors: change `spark.executor.memory` to 32g and
`spark.executor.cores` to 2-4, for example.
Changing the driver's memory won't help, because the driver doesn't
participate in execution.
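For example (a minimal sketch; the app name is made up, and the values are starting points to sweep rather than known-good settings):

~~~python
from pyspark.sql import SparkSession

# More, smaller executors: less memory and fewer cores per executor means
# less data per task, so no single allocation has to be enormous.
spark = (SparkSession.builder
         .appName("sort-large-column")            # hypothetical app name
         .config("spark.executor.memory", "32g")
         .config("spark.executor.cores", "2")
         .getOrCreate())
~~~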
On Fri, Sep 30, 2016 at 2:58 PM, Babak Alipour wrote:
> Thank you for your replies.
Thank you for your replies.
@Mich, using LIMIT 100 in the query prevents the exception, but given that
there's enough memory, I don't think this should happen even without LIMIT.
@Vadim, here's the full stack trace:
Caused by: java.lang.IllegalArgumentException: Cannot allocate a page with
more than 17179869176 bytes […]
Can you post the whole exception stack trace?
What are your executor memory settings?
Right now I assume that it happens in UnsafeExternalRowSorter ->
UnsafeExternalSorter.insertRecord.
Running more executors with lower `spark.executor.memory` should help.
On Fri, Sep 30, 2016 at 12:57 PM, Babak Alipour wrote:
What will happen if you LIMIT the result set to 100 rows only -- SELECT
<field> FROM <table> ORDER BY <field> DESC LIMIT 100? Will that work?
How about running the whole query WITHOUT order by?
HTH
Dr Mich Talebzadeh
LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
Greetings everyone,
I'm trying to read a single field of a Hive table stored as Parquet in
Spark (the entire table is ~140GB; this single field should be just a few
GB) and look at the sorted output using the following:
sql("SELECT " + field + " FROM MY_TABLE ORDER BY " + field + " DESC")
​But
Hi,
I need to sort a dataframe and retrieve the bounds of each partition.
dataframe.sort() uses range partitioning in the physical plan; I need to
retrieve those partition bounds.
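Is something like the sketch below the right approach? (Assuming a DataFrame `df` with a sortable column named `value`; both names are made up.) Since sort() range-partitions the data, the first and last rows of each partition should be its bounds:

~~~python
# Minimal sketch: after sort(), each partition holds a contiguous range,
# so its first and last rows are the partition's lower and upper bounds.
def partition_bounds(index, rows):
    first = last = None
    for row in rows:
        if first is None:
            first = row["value"]
        last = row["value"]
    if first is not None:           # skip empty partitions
        yield (index, first, last)  # (partition id, lower, upper)

sorted_df = df.sort("value")
bounds = sorted_df.rdd.mapPartitionsWithIndex(partition_bounds).collect()
~~~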
Many thanks for your help.
Thanks Davies, after I did a coalesce(1) to save it as a single parquet
file, I was able to get head() to return the correct order.
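In case it helps anyone else, the fix amounts to something like this (a minimal sketch; the column name and output path are made up). Note that coalesce(1) funnels the write through a single task, so it only makes sense when the output fits comfortably on one node:

~~~python
# Minimal sketch: one partition -> one parquet file, so the global sort
# order survives the write.
(df.sort(df["value"].desc())
   .coalesce(1)                     # single partition, single output file
   .write.mode("overwrite")
   .parquet("/tmp/sorted_single"))  # hypothetical output path
~~~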
On Sun, May 8, 2016 at 12:29 AM, Davies Liu wrote:
> When you have multiple parquet files, the order of all the rows in
> them is not defined.
When you have multiple parquet files, the order of all the rows in
them is not defined.
On Sat, May 7, 2016 at 11:48 PM, Buntu Dev wrote:
> I'm using the pyspark dataframe api to sort by a specific column and then
> save the dataframe as a parquet file. But the resulting parquet file
> doesn't seem to be sorted.
I'm using the pyspark dataframe api to sort by a specific column and then
save the dataframe as a parquet file. But the resulting parquet file
doesn't seem to be sorted.
Applying sort and doing head() on the results shows the correct rows,
sorted by the 'value' column in descending order.