End of Stream errors in shuffle
Hi,

I'm facing a very strange error that occurs halfway through long-running Spark SQL jobs:

18/01/12 22:14:30 ERROR Utils: Aborting task
java.io.EOFException: reached end of stream after reading 0 bytes; 96 bytes expected
    at org.spark_project.guava.io.ByteStreams.readFully(ByteStreams.java:735)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:127)
    at org.apache.spark.sql.execution.UnsafeRowSerializerInstance$$anon$2$$anon$3.next(UnsafeRowSerializer.scala:110)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
    at org.apache.spark.InterruptibleIterator.next(InterruptibleIterator.scala:40)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.sort_addToSorter$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    (...)

Since I get this in several jobs, I wonder if it might be a problem at the communication layer. Did anyone face a similar problem?

It always happens in a job which does a 200 GB shuffle, reading it back in partitions of ~64 MB for a groupBy. And it is weird that it only fails after it has processed over 1000 partitions (16 cores on one node). I even tried changing the spark.shuffle.file.buffer config, but that just seems to change the point at which the error occurs.

I would really appreciate some hints on what it could be, what to try or test, and how to debug it, as I feel pretty much blocked here.

Thanks in advance,
Fernando
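[Editor's note: one common line of attack for intermittent shuffle-fetch EOFs is to make the shuffle layer more tolerant of slow or briefly unavailable peers, and to shrink per-task shuffle blocks. The sketch below shows real Spark configuration keys, but the values and the `your-job.jar` name are illustrative guesses to tune for your cluster, not a known fix for this specific error.]

```shell
# Hypothetical tuning sketch for flaky shuffle reads (values are guesses):
spark-submit \
  --conf spark.shuffle.io.maxRetries=10 \
  --conf spark.shuffle.io.retryWait=30s \
  --conf spark.network.timeout=600s \
  --conf spark.sql.shuffle.partitions=2000 \
  your-job.jar
```

It is also worth checking the executor and NodeManager logs for OOM kills or disk-full errors around the failure time, since a writer dying mid-block can surface on the reader side as exactly this kind of truncated-stream EOFException.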
Re: Dynamic data ingestion into SparkSQL - Interesting question
Did you consider doing string processing to build the SQL expression, which you can then execute with spark.sql(...)? Some examples: https://spark.apache.org/docs/latest/sql-programming-guide.html#hive-tables

Cheers

On 21 November 2017 at 03:27, Aakash Basu wrote:
> Hi all,
>
> Any help? PFB.
>
> Thanks,
> Aakash.
>
> On 20-Nov-2017 6:58 PM, "Aakash Basu" wrote:
>
>> Hi all,
>>
>> I have a table which will have 4 columns:
>>
>> | expression | filter_condition | from_clause | group_by_columns |
>>
>> This file may have a variable number of rows depending on the number of
>> KPIs I need to calculate.
>>
>> I need to write a Spark SQL program which reads this file and runs each
>> line of queries dynamically: fetch each column value for a particular
>> row, build a SELECT query out of it, run it into a DataFrame, and later
>> save it as a temporary table.
>>
>> Has anyone done this kind of exercise? If yes, can I get some help on
>> it, please?
>>
>> Thanks,
>> Aakash.
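[Editor's note: the string-processing approach suggested above can be sketched in plain Python. The four column names come from the original question; the helper name `build_query` and the sample rows are made up for illustration.]

```python
# Hypothetical sketch: assemble one SELECT statement per row of the KPI
# spec table, then hand each statement to spark.sql(...).

def build_query(expression, filter_condition, from_clause, group_by_columns):
    """Build a SQL statement from the four spec columns; empty columns are skipped."""
    query = f"SELECT {expression} FROM {from_clause}"
    if filter_condition:
        query += f" WHERE {filter_condition}"
    if group_by_columns:
        query += f" GROUP BY {group_by_columns}"
    return query

# Made-up spec rows standing in for the file described in the question.
specs = [
    ("dept, avg(salary) AS avg_sal", "salary > 0", "employees", "dept"),
    ("count(*) AS n", "", "employees", ""),
]

for spec in specs:
    sql = build_query(*spec)
    print(sql)
    # In the real job: df = spark.sql(sql)
    # then df.createOrReplaceTempView("kpi_result") to save it as a temp table.
```

Running this prints `SELECT dept, avg(salary) AS avg_sal FROM employees WHERE salary > 0 GROUP BY dept` for the first row. Note that if the spec file comes from untrusted users, naive string concatenation like this is open to SQL injection, so validate the column values first.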
Re: Multiple transformations without recalculating or caching
Note that I have 1+ TB. If I didn't mind things being slow, I wouldn't be using Spark.

On 17 November 2017 at 11:06, Sebastian Piu <sebastian@gmail.com> wrote:
> If you don't want to recalculate, you need to hold the results somewhere,
> and if you need to save the output anyway, why don't you do that and then
> read it back to compute your stats?
>
> On Fri, 17 Nov 2017, 10:03 Fernando Pereira, <ferdonl...@gmail.com> wrote:
>
>> Dear Spark users,
>>
>> Is it possible to take the output of a transformation (RDD/DataFrame)
>> and feed it to two independent transformations without recalculating the
>> first transformation and without caching the whole dataset?
>>
>> Consider the case of a very large dataset (1+ TB) which has undergone
>> several transformations, and now we want to save it but also calculate
>> some statistics per group. So the best processing strategy would be: for
>> each partition, do task A, then do task B.
>>
>> I don't see a way of instructing Spark to proceed that way without
>> caching to disk, which seems unnecessarily heavy. And if we don't cache,
>> Spark recalculates every partition all the way from the beginning. In
>> either case huge file reads happen.
>>
>> Any ideas on how to avoid this?
>>
>> Thanks
>>
>> Fernando
Multiple transformations without recalculating or caching
Dear Spark users,

Is it possible to take the output of a transformation (RDD/DataFrame) and feed it to two independent transformations without recalculating the first transformation and without caching the whole dataset?

Consider the case of a very large dataset (1+ TB) which has undergone several transformations, and now we want to save it but also calculate some statistics per group. So the best processing strategy would be: for each partition, do task A, then do task B.

I don't see a way of instructing Spark to proceed that way without caching to disk, which seems unnecessarily heavy. And if we don't cache, Spark recalculates every partition all the way from the beginning. In either case huge file reads happen.

Any ideas on how to avoid this?

Thanks

Fernando
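[Editor's note: the "one pass over each partition, two tasks" idea can be sketched in plain Python, independent of Spark. The function and variable names below are illustrative only; in Spark itself the closest analogues are doing both tasks inside a single foreachPartition (with accumulators for the stats), or persisting with StorageLevel.DISK_ONLY so the lineage is computed once.]

```python
# Sketch of the single-pass idea (no Spark): iterate each partition exactly
# once, "saving" rows (task A) while accumulating per-group statistics
# (task B), so no partition is recomputed or read twice.
from collections import defaultdict

def process_partitions(partitions):
    saved = []                    # stands in for writing rows to storage (task A)
    stats = defaultdict(int)      # per-group sums (task B)
    for partition in partitions:  # each partition is traversed exactly once
        for group, value in partition:
            saved.append((group, value))
            stats[group] += value
    return saved, stats

partitions = [[("a", 1), ("b", 2)], [("a", 3)]]
saved, stats = process_partitions(partitions)
print(dict(stats))  # {'a': 4, 'b': 2}
```

The trade-off is that the statistics must be computable in a streaming or commutative fashion (counts, sums, maxes) so they can be merged across partitions, which is exactly the constraint Spark accumulators impose.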