Re: Creating InMemory relations with data in ColumnarBatches

2023-04-04 Thread Bobby Evans
This is not going to work without changes to Spark. InMemoryTableScanExec supports columnar output, but not columnar input. You would have to write code to support that in Spark itself. The second part is that there are only a handful of operators that support columnar output. Really it is just …
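
A minimal sketch of how to see this for yourself, assuming Spark 3.x (where SparkPlan exposes supportsColumnar); the local session and column names below are only illustrative:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("columnar-probe").getOrCreate()
    import spark.implicits._

    // Cache a small numeric DataFrame and materialize the InMemoryRelation.
    val df = Seq((1, 1.0), (2, 2.0)).toDF("id", "v").cache()
    df.count()

    // Walk the physical plan and report which operators claim columnar output.
    // InMemoryTableScanExec typically reports true for simple numeric schemas,
    // while most row-based operators report false.
    df.queryExecution.executedPlan.foreach { plan =>
      println(s"${plan.nodeName}: supportsColumnar=${plan.supportsColumnar}")
    }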

Re: Is memory-only no-disk Spark possible?

2021-08-20 Thread Bobby Evans
On the data path, Spark will write to a local disk when it runs out of memory and needs to spill or when doing a shuffle with the default shuffle implementation. The spilling is a good thing because it lets you process data that is too large to fit in memory. It is not great because the processing …
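
One hedged workaround sketch, if physical disk I/O is the concern: point spark.local.dir (where spills and shuffle files land) at a RAM-backed filesystem. The /mnt/ramdisk mount below is an assumption you would have to create yourself, it needs enough capacity, and cluster managers like YARN may override this setting; spilling still happens, it just goes to RAM-backed storage:

    import org.apache.spark.sql.SparkSession

    // Scratch space for spills and shuffle files; /mnt/ramdisk is assumed to be
    // a tmpfs mount that already exists.
    val spark = SparkSession.builder()
      .appName("ram-backed-scratch")
      .config("spark.local.dir", "/mnt/ramdisk")
      .getOrCreate()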

Re: Strange WholeStageCodegen UI values

2020-07-09 Thread Bobby Evans
Sadly there isn't a lot you can do to fix this. All of the operations take iterators of rows as input and produce iterators of rows as output. For efficiency reasons, the timing is not done for each individual row. If we did that, in many cases it would take longer to measure how long something took …
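
If you want to see exactly which operators were fused into a single WholeStageCodegen stage (and therefore share one reported timing), a small sketch using Spark's developer debug API (org.apache.spark.sql.execution.debug); the query itself is only illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.execution.debug._

    val spark = SparkSession.builder().master("local[*]").appName("codegen-inspect").getOrCreate()

    val df = spark.range(0, 1000000L)
      .selectExpr("id % 10 AS k", "id AS v")
      .groupBy("k")
      .count()

    // Prints each WholeStageCodegen subtree together with the Java code it
    // generated, showing which operators were fused into a single loop.
    df.debugCodegen()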

Re: Spark Small file issue

2020-06-29 Thread Bobby Evans
So I should have done some back-of-the-napkin math before all of this. You are writing out 800 files, each < 128 MB. If they were all 128 MB, that would be about 100 GB of data being written. I'm not sure how much hardware you have, but the fact that you can shuffle about 100 GB to a single thread and write …
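
The napkin math itself, spelled out as a quick sketch (the numbers come from the thread above):

    // 800 output files, each bounded by the 128 MB target size.
    val files      = 800
    val maxFileMiB = 128L
    val totalGiB   = files * maxFileMiB / 1024.0
    println(f"upper bound on data written: $totalGiB%.0f GiB")   // 800 * 128 MiB = 100 GiB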

Re: Spark Small file issue

2020-06-24 Thread Bobby Evans
First, you need to be careful with coalesce. It will impact upstream processing, so if you are doing a lot of computation in the last stage before the repartition, then coalesce will make the problem worse because all of that computation will happen in a single thread instead of being spread out. …
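
A sketch of the trade-off described above; the output paths and the stand-in computation are made up for illustration:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().master("local[*]").appName("coalesce-vs-repartition").getOrCreate()

    // Stand-in for expensive per-row work; a real job would do far more here.
    val expensiveDf = spark.range(0, 10000000L)
      .selectExpr("id", "sha2(cast(id AS string), 256) AS h")

    // coalesce(1): no shuffle is added, so the expensive work above collapses
    // into the single output task.
    expensiveDf.coalesce(1).write.mode("overwrite").parquet("/tmp/out-coalesce")

    // repartition(1): a shuffle boundary is inserted, the expensive work keeps
    // its parallelism, and only the final write runs in one task.
    expensiveDf.repartition(1).write.mode("overwrite").parquet("/tmp/out-repartition")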

Re: GPU Acceleration for spark-3.0.0

2020-06-18 Thread Bobby Evans
"So if I am going to use GPU in my job running on the spark , I still need to code the map and reduce function in cuda or in c++ and then invoke them throught jni or something like GPUEnabler , is that right ?" Sort of. You could go through all of that work yourself, or you could use the plugin t

Re: GPU Acceleration for spark-3.0.0

2020-06-15 Thread Bobby Evans
Charles, I am sorry that you got the idea that Apache Spark is GPU accelerated out of the box. Where did you get that information so we can try to make it more clear? Apache Spark 3.0 opens up a set of plugin APIs that allow for a plugin to provide GPU acceleration. You can look at SPARK-27396
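
For a feel of the shape of those APIs, here is a minimal no-op sketch, assuming Spark 3.x, of injecting a ColumnarRule through SparkSessionExtensions; the class names are hypothetical, and a real accelerator would return a rewritten plan whose operators work on ColumnarBatch:

    import org.apache.spark.sql.SparkSessionExtensions
    import org.apache.spark.sql.catalyst.rules.Rule
    import org.apache.spark.sql.execution.{ColumnarRule, SparkPlan}

    // No-op placeholder rule: a real plugin would replace supported operators
    // with columnar (e.g. GPU) implementations here.
    case class MyColumnarOverrides() extends ColumnarRule {
      override def preColumnarTransitions: Rule[SparkPlan] = new Rule[SparkPlan] {
        override def apply(plan: SparkPlan): SparkPlan = plan
      }
    }

    // Registered via the existing extensions mechanism, e.g.
    //   spark.sql.extensions=com.example.MyExtensions
    class MyExtensions extends (SparkSessionExtensions => Unit) {
      override def apply(ext: SparkSessionExtensions): Unit =
        ext.injectColumnar(_ => MyColumnarOverrides())
    }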

Re: Spark 2.3 DataFrame GroupBy operation throws IllegalArgumentException on large dataset

2019-07-22 Thread Bobby Evans
You are missing a lot of the stack trace that could explain the exception. All it shows is that an exception happened while writing out the ORC file, not what the underlying exception is; there should be at least one more "Caused by" entry under the one you included. Thanks, Bobby

Re: [Beginner] Run compute on large matrices and return the result in seconds?

2019-07-17 Thread Bobby Evans
Let's do a few quick rules of thumb to get an idea of what kind of processing power you will need in general to do what you want. You need 3,000,000 ints by 50,000 rows. Each int is 4 bytes, so that ends up being about 560 GB that you need to fully process in 5 seconds. If you are reading this from …
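
The arithmetic behind that estimate, as a quick sketch:

    // 3,000,000 ints per row * 50,000 rows * 4 bytes per int.
    val cols        = 3000000L
    val rows        = 50000L
    val bytesPerInt = 4L
    val totalBytes  = cols * rows * bytesPerInt              // 6.0e11 bytes
    val totalGiB    = totalBytes / math.pow(1024, 3)         // ~559 GiB, i.e. "about 560 GB"
    val gibPerSec   = totalGiB / 5.0                         // throughput needed for a 5 second answer
    println(f"total: $totalGiB%.0f GiB, required throughput: $gibPerSec%.0f GiB/s")   // ~112 GiB/s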

Re: what is the difference between udf execution and map(someLambda)?

2019-03-18 Thread Bobby Evans
Map and flatMap are RDD operations; a UDF is a DataFrame operation. The big difference from a performance perspective is in the query optimizer. A UDF defines the set of input fields it needs and the set of output fields it will produce, while map operates on the entire row at a time. This means the optimizer …
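
A small sketch of the contrast, assuming Spark 3.x; the Record case class and column names are only illustrative:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, udf}

    val spark = SparkSession.builder().master("local[*]").appName("udf-vs-map").getOrCreate()
    import spark.implicits._

    case class Record(a: Int, b: String, c: Double)
    val ds = Seq(Record(1, "x", 1.0), Record(2, "y", 2.0)).toDS()

    // UDF: the optimizer can see that only column "a" is needed, so the other
    // columns can be pruned before the UDF runs.
    val plusOne = udf((x: Int) => x + 1)
    val viaUdf  = ds.select(plusOne(col("a")).as("a_plus_one"))

    // Typed map: the lambda is opaque, so the whole Record is deserialized for
    // every row even though only `a` is used.
    val viaMap = ds.map(r => r.a + 1)

    viaUdf.explain()
    viaMap.explain()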