End of Stream errors in shuffle

2018-01-15 Thread Fernando Pereira
Hi,

I'm facing a very strange error that occurs halfway through long-running Spark
SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
    at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
(...)

Since I get this in several jobs, I wonder if it might be a problem in the
communication layer.
Has anyone faced a similar problem?

It always happens in a job that shuffles 200 GB and then reads it back in
partitions of ~64 MB for a groupBy. The weird part is that it only fails after
more than 1000 partitions have been processed (16 cores on a single node).
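
For context, the shape of the job is roughly the following (column names, paths
and the exact partition count are illustrative; ~200 GB split into ~64 MB
shuffle partitions works out to roughly 3200 partitions):

    import org.apache.spark.sql.{SparkSession, functions => F}

    val spark = SparkSession.builder().appName("groupby-shuffle").getOrCreate()
    // ~200 GB / ~64 MB per shuffle partition ~= 3200 partitions
    spark.conf.set("spark.sql.shuffle.partitions", "3200")

    val inputDf = spark.read.parquet("/path/to/input")   // hypothetical path
    inputDf
      .groupBy("group_key")                              // triggers the ~200 GB shuffle
      .agg(F.collect_list("value"))                      // EOFException shows up mid-stage
      .write.parquet("/path/to/output")                  // hypothetical path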

I even tried changing the spark.shuffle.file.buffer config, but that only seems
to change the point at which the error occurs.
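
For reference, this is how I pass that setting, together with related knobs I am
considering next (only spark.shuffle.file.buffer is the one I actually changed;
values are illustrative, defaults in parentheses):

    import org.apache.spark.SparkConf

    // Passed to SparkSession.builder().config(conf) when building the session
    val conf = new SparkConf()
      .set("spark.shuffle.file.buffer", "1m")       // (32k)  per-writer shuffle file buffer
      .set("spark.reducer.maxSizeInFlight", "96m")  // (48m)  max in-flight fetch per reducer
      .set("spark.shuffle.io.maxRetries", "10")     // (3)    retries on shuffle fetch failures
      .set("spark.network.timeout", "300s")         // (120s) general network timeout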

I would really appreciate some hints - what it could be, what to try or test,
how to debug - as I feel pretty much blocked here.

Thanks in advance
Fernando


Re: Dynamic data ingestion into SparkSQL - Interesting question

2017-11-21 Thread Fernando Pereira
Did you consider doing string processing to build the SQL expression, which you
can then execute with spark.sql(...)?
Some examples:
https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables
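
Something along these lines might work (untested sketch; I am assuming your
definitions sit in a CSV with a header row and exactly the column names you
listed, and the path is hypothetical):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("dynamic-kpis").getOrCreate()

    // One row per KPI definition
    val kpiDefs = spark.read.option("header", "true").csv("/path/to/kpi_definitions.csv")

    kpiDefs.collect().zipWithIndex.foreach { case (row, i) =>
      // Build the SQL text from the column values of this row
      val query = s"""
        |SELECT ${row.getAs[String]("Expression")}
        |FROM ${row.getAs[String]("from_clause")}
        |WHERE ${row.getAs[String]("filter_condition")}
        |GROUP BY ${row.getAs[String]("group_by_columns")}
        |""".stripMargin
      spark.sql(query).createOrReplaceTempView(s"kpi_$i")  // one temp view per KPI row
    }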

Cheers

On 21 November 2017 at 03:27, Aakash Basu wrote:

> Hi all,
>
> Any help? PFB.
>
> Thanks,
> Aakash.
>
> On 20-Nov-2017 6:58 PM, "Aakash Basu"  wrote:
>
>> Hi all,
>>
>> I have a table which will have 4 columns -
>>
>> | Expression | filter_condition | from_clause | group_by_columns |
>>
>>
>> This file may have a variable number of rows depending on the number of KPIs
>> I need to calculate.
>>
>> I need to write a Spark SQL program that reads this file and, for each row,
>> fetches the column values, builds a SELECT query out of them, runs it as a
>> DataFrame, and saves the result as a temporary table.
>>
>> Has anyone done this kind of exercise? If so, could I get some help with it,
>> please?
>>
>> Thanks,
>> Aakash.
>>
>


Re: Multiple transformations without recalculating or caching

2017-11-17 Thread Fernando Pereira
Please note that I have 1+ TB of data. If I didn't mind things being slow, I
wouldn't be using Spark.

On 17 November 2017 at 11:06, Sebastian Piu <sebastian@gmail.com> wrote:

> If you don't want to recalculate, you need to hold the results somewhere. If
> you need to save it anyway, why don't you do that and then read it back to
> compute your stats?
>
> On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <ferdonl...@gmail.com> wrote:
>
>> Dear Spark users
>>
>> Is it possible to take the output of a transformation (RDD/Dataframe) and
>> feed it to two independent transformations without recalculating the first
>> transformation and without caching the whole dataset?
>>
>> Consider the case of a very large dataset (1+ TB) which has gone through
>> several transformations, and now we want to save it but also calculate some
>> statistics per group.
>> So the best way to process it would be: for each partition, do task A, then
>> do task B.
>>
>> I don't see a way of instructing Spark to proceed that way without caching
>> to disk, which seems unnecessarily heavy. And if we don't cache, Spark
>> recalculates every partition all the way from the beginning. In either case,
>> huge file reads happen.
>>
>> Any ideas on how to avoid it?
>>
>> Thanks
>>
>> Fernando
>>
>


Multiple transformations without recalculating or caching

2017-11-17 Thread Fernando Pereira
Dear Spark users

Is it possible to take the output of a transformation (RDD/Dataframe) and
feed it to two independent transformations without recalculating the first
transformation and without caching the whole dataset?

Consider the case of a very large dataset (1+ TB) which has gone through
several transformations, and now we want to save it but also calculate some
statistics per group.
So the best way to process it would be: for each partition, do task A, then do
task B.

I don't see a way of instructing Spark to proceed that way without caching to
disk, which seems unnecessarily heavy. And if we don't cache, Spark recalculates
every partition all the way from the beginning. In either case, huge file reads
happen.
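
To make it concrete, the pattern is roughly the following (paths and column
names are made up):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.expr

    val spark = SparkSession.builder().appName("two-outputs").getOrCreate()

    // Expensive chain of transformations over the 1+ TB dataset
    val transformed = spark.read.parquet("/path/to/input")
      .filter("quality > 0")
      .withColumn("group", expr("id % 1024"))

    transformed.write.parquet("/path/to/output")    // task A: save the result
    transformed.groupBy("group").count().show()     // task B: per-group statistics

    // Without persist()/cache(), task B recomputes `transformed` from the source;
    // with it, the whole 1+ TB dataset has to be materialized somewhere.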

Any ideas on how to avoid it?

Thanks
Fernando