Re: Does spark support something like the bind function in R?

2022-02-08 Thread ayan guha
Hi

In Python, or in Spark in general, you can just read the files and select
the column. I am assuming you are currently reading each file into a
separate dataframe and joining them. Instead, you can read all the files
into a single dataframe and select the one column.
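
For example, a minimal PySpark sketch (the input path, file format, and
column name below are hypothetical):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Passing a glob (or a list of paths) loads every matching file into one
# DataFrame; the rows from all files are stacked, no join is involved.
df = spark.read.csv("path/to/counts/*.csv", header=True, inferSchema=True)

# Keep only the single column of interest.
counts = df.select("count")
```

If you need to know which file each row came from, an extra column can be
added with pyspark.sql.functions.input_file_name().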

On Wed, Feb 9, 2022 at 2:55 AM Andrew Davidson wrote:

> I need to create a single table by selecting one column from thousands of
> files. The columns are all of the same type and have the same number of rows
> and row names. I am currently using join. I get OOM on a mega-mem cluster
> with 2.8 TB.
>
>
>
> Does spark have something like cbind()? “Take a sequence of vector, matrix
> or data-frame arguments and combine by columns or rows,
> respectively.”
>
>
>
> https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind
>
>
>
> Digging through the Spark documentation, I found a UDF example:
>
> https://spark.apache.org/docs/latest/sparkr.html#dapply
>
>
>
> ```
> # Convert waiting time from hours to seconds.
> # Note that we can apply UDF to DataFrame.
> schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
>                      structField("waiting_secs", "double"))
> df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
> head(collect(df1))
> ##  eruptions waiting waiting_secs
> ##1     3.600      79         4740
> ##2     1.800      54         3240
> ##3     3.333      74         4440
> ##4     2.283      62         3720
> ##5     4.533      85         5100
> ##6     2.883      55         3300
> ```
>
>
>
> I wonder if this is just a wrapper around join? If so, it is probably not
> going to help me out.
>
>
>
> Also, I would prefer to work in Python.
>
>
>
> Any thoughts?
>
>
>
> Kind regards
>
>
>
> Andy
>
>
>
>
>


-- 
Best Regards,
Ayan Guha


Does spark support something like the bind function in R?

2022-02-08 Thread Andrew Davidson
I need to create a single table by selecting one column from thousands of
files. The columns are all of the same type and have the same number of rows
and row names. I am currently using join. I get OOM on a mega-mem cluster with
2.8 TB.

Does spark have something like cbind()? “Take a sequence of vector, matrix or
data-frame arguments and combine by columns or rows, respectively.”

https://www.rdocumentation.org/packages/base/versions/3.6.2/topics/cbind

Digging through the Spark documentation, I found a UDF example:
https://spark.apache.org/docs/latest/sparkr.html#dapply

```
# Convert waiting time from hours to seconds.
# Note that we can apply UDF to DataFrame.
schema <- structType(structField("eruptions", "double"), structField("waiting", "double"),
                     structField("waiting_secs", "double"))
df1 <- dapply(df, function(x) { x <- cbind(x, x$waiting * 60) }, schema)
head(collect(df1))
##  eruptions waiting waiting_secs
##1     3.600      79         4740
##2     1.800      54         3240
##3     3.333      74         4440
##4     2.283      62         3720
##5     4.533      85         5100
##6     2.883      55         3300
```

I wonder if this is just a wrapper around join? If so, it is probably not going
to help me out.

Also, I would prefer to work in Python.
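
For reference, the dapply example above could be written in PySpark roughly as
follows. This is only a sketch, assuming the same df as in the SparkR snippet;
mapInPandas (available since Spark 3.0, and requiring pandas and pyarrow on the
workers) is the closest Python analogue to dapply:

```python
def add_waiting_secs(batches):
    # batches is an iterator of pandas DataFrames, one per batch of rows
    # within a partition.
    for pdf in batches:
        pdf["waiting_secs"] = pdf["waiting"] * 60
        yield pdf

df1 = df.mapInPandas(
    add_waiting_secs,
    schema="eruptions double, waiting double, waiting_secs double",
)
df1.show(6)
```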

Any thoughts?

Kind regards

Andy