This is not going to work without changes to Spark. InMemoryTableScanExec
supports columnar output, but not columnar input, so you would have to
write code to support that in Spark itself. The second problem is that
only a handful of operators support columnar output at all.
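
For reference, a minimal sketch of the asymmetry, assuming Spark 3.x: a
physical operator can advertise columnar output, but there is no matching
hook for declaring that it accepts columnar input.

    import org.apache.spark.sql.execution.SparkPlan

    // Spark 3.x lets a physical operator advertise columnar *output* via
    // supportsColumnar (and produce it via doExecuteColumnar). There is no
    // equivalent for columnar *input*, which is the gap described above.
    def producesColumnarBatches(plan: SparkPlan): Boolean =
      plan.supportsColumnar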
On the data path, Spark will write to a local disk when it runs out of
memory and needs to spill, or when doing a shuffle with the default shuffle
implementation. The spilling is a good thing because it lets you process
data that is too large to fit in memory. It is not great because the
processing slows way down whenever it happens.
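
If spills and shuffles are unavoidable, about the only lever you have is
where that scratch data lands. A minimal sketch (the path is hypothetical):

    import org.apache.spark.sql.SparkSession

    // Point Spark's spill/shuffle scratch space at the fastest local
    // storage available; spark.local.dir takes a comma-separated list.
    val spark = SparkSession.builder()
      .appName("spill-scratch-space")
      .config("spark.local.dir", "/mnt/fast-ssd/spark-tmp") // hypothetical path
      .getOrCreate()

Note that on cluster managers like YARN the node manager's configured local
directories override this setting.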
Sadly there isn't a lot you can do to fix this. All of the operations take
iterators of rows as input and produce iterators of rows as output. For
efficiency reasons, the timing is not done for each individual row; if we
did that, in many cases it would take longer to measure how long something
took than to actually do it.
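
To make that trade-off concrete, here is a hypothetical sketch of what
per-row timing would look like; the two clock reads per element are exactly
the overhead the metrics avoid by timing at a coarser granularity:

    // Hypothetical per-row timing wrapper. For cheap operators the two
    // System.nanoTime() calls per element can cost more than the work
    // being measured, which is why metrics are not collected this way.
    def timedIterator[T](iter: Iterator[T], record: Long => Unit): Iterator[T] =
      new Iterator[T] {
        override def hasNext: Boolean = iter.hasNext
        override def next(): T = {
          val start = System.nanoTime()
          val elem = iter.next()
          record(System.nanoTime() - start)
          elem
        }
      }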
So I should have done some back-of-the-napkin math before all of this. You
are writing out 800 files, each < 128 MB. If they were all 128 MB, that
would be about 100 GB of data being written. I'm not sure how much hardware
you have, but shuffling about 100 GB to a single thread and writing it out
means one thread is doing all of that work, so it is going to be slow no
matter what.
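
The arithmetic, spelled out (an upper bound, assuming every file hits the
128 MB cap):

    // Upper bound on the data volume from the paragraph above.
    val files     = 800
    val maxFileMB = 128L
    val totalGB   = files * maxFileMB / 1024.0 // 102400 MB / 1024 = 100.0 GB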
First, you need to be careful with coalesce. It will impact upstream
processing, so if you are doing a lot of computation in the last stage
before the repartition, then coalesce will make the problem worse, because
all of that computation will happen in a single thread instead of being
spread out.
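
A minimal sketch of the difference (dataset and paths are hypothetical):
coalesce(1) collapses the upstream stage into one task, while repartition(1)
adds a shuffle so the expensive work stays parallel and only the final
write is single-threaded.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("coalesce-vs-repartition").getOrCreate()

    // Hypothetical expensive computation spread over 800 partitions.
    val df = spark.range(0L, 100000000L, 1L, 800)
      .selectExpr("id", "id * 2 AS doubled")

    // coalesce(1): no shuffle, but the selectExpr above now runs in ONE task.
    df.coalesce(1).write.mode("overwrite").parquet("/tmp/out-coalesce")

    // repartition(1): inserts a shuffle, so the computation still runs in
    // 800 tasks; only the write after the shuffle is single-threaded.
    df.repartition(1).write.mode("overwrite").parquet("/tmp/out-repartition")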
"So if I am
going to use GPU in my job running on the spark , I still need to code the
map and reduce function in cuda or in c++ and then invoke them throught jni
or something like GPUEnabler , is that right ?"
Sort of. You could go through all of that work yourself, or you could use
the plugin t
Charles,

I am sorry that you got the idea that Apache Spark is GPU accelerated out
of the box. Where did you get that information, so we can try to make it
clearer? Apache Spark 3.0 opens up a set of plugin APIs that allow a
plugin to provide GPU acceleration. You can look at SPARK-27396 for the
details.
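
As one illustration of what that looks like in practice, a plugin built on
these APIs is enabled through configuration rather than code changes. The
class name below is the entry point of one such plugin (the RAPIDS
Accelerator), shown purely as an example; its jar has to be on the
classpath.

    import org.apache.spark.sql.SparkSession

    // Enabling a GPU-acceleration plugin through the Spark 3.0 plugin API.
    val spark = SparkSession.builder()
      .appName("gpu-plugin-example")
      .config("spark.plugins", "com.nvidia.spark.SQLPlugin")
      .getOrCreate()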
You are missing a lot of the stack trace that could explain the exception.
All it shows is that an exception happened while writing out the ORC file,
not what the underlying exception is. There should be at least one more
"Caused by" section below the one you included.
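
A toy illustration of why those sections matter; the actionable detail is
usually in the innermost "Caused by", not the outer wrapper:

    // The outer exception only says the write failed; the root cause (a
    // made-up IOException here) shows up as a "Caused by:" entry.
    try {
      try throw new java.io.IOException("No space left on device")
      catch {
        case e: Exception =>
          throw new RuntimeException("Failed to write ORC file", e)
      }
    } catch {
      case e: Exception => e.printStackTrace()
      // Prints: java.lang.RuntimeException: Failed to write ORC file
      //         Caused by: java.io.IOException: No space left on device
    }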
Thanks,
Bobby
Let's do a few quick rules of thumb to get an idea of what kind of
processing power you will need in general to do what you want.

You need 3,000,000 ints by 50,000 rows. Each int is 4 bytes, so that ends
up being about 560 GB that you need to fully process in 5 seconds. If you
are reading that from storage, it works out to over 110 GB/s of sustained
throughput just to scan the data once.
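
The arithmetic behind those numbers, written out:

    // 3,000,000 ints per row x 50,000 rows, 4 bytes per int.
    val bytes   = 3000000L * 50000L * 4L   // 6.0e11 bytes
    val gib     = bytes / math.pow(2, 30)  // ~558.8 GiB, i.e. "about 560 GB"
    val gibPerS = gib / 5.0                // ~111.8 GiB/s to finish in 5 s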
Map and flatMap are RDD operations; a UDF is a dataframe operation. The
big difference from a performance perspective is in the query optimizer. A
UDF declares the set of input fields it needs and the set of output fields
it will produce, while map operates on the entire row at a time. This means
the optimizer can prune columns and move processing around a UDF, but has
to treat a map as a black box that needs the whole row.
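
A minimal sketch of the contrast (toy data, hypothetical column names):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.udf

    val spark = SparkSession.builder().appName("udf-vs-map").getOrCreate()
    import spark.implicits._

    val df = Seq((1, "a"), (2, "b")).toDF("id", "label")

    // UDF: the plan shows only "id" is read, so "label" can be pruned.
    val doubled = udf((x: Int) => x * 2)
    df.select(doubled($"id").as("doubled")).explain()

    // RDD map: consumes the whole Row, opaque to the optimizer.
    df.rdd.map(row => row.getInt(0) * 2).count()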