Spark Dataframe Writer _temporary directory

2018-01-28 Thread Richard Primera
Consider a situation where multiple workflows write different partitions of the
same table.

Example:

Ten different processes are writing parquet or ORC files for different
partitions of the same table foo, at
/staging/tables/foo/partition_field=1, /staging/tables/foo/partition_field=2, /staging/tables/foo/partition_field=3...

It appears to me that this cannot currently be done concurrently against the
same directory in a reliably stable way: every DataFrame writer stages its
temporary files under the shared /staging/tables/foo/_temporary directory and
deletes that directory when it finishes writing. As a result, whichever
writer finishes first deletes the temporary files of all the other writers
that are still running.

I believe this could be avoided by having each writer stage its files under
its own /staging/tables/foo/_temporary_someHash directory instead.
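
Another workaround I've been considering, which wouldn't require touching
Spark's source, is to have each process write directly into its own partition
directory, so that each writer's _temporary directory lives under its own
partition path rather than under the table root. A rough sketch of what I
mean (the paths, partition value and dataframe name are only illustrative):

# Each process handles exactly one partition value and writes straight
# into that partition's directory, dropping the partition column first.
partition_value = 1  # this process's partition
(dataframe
    .filter(dataframe["partition_field"] == partition_value)
    .drop("partition_field")
    .write
    .format("parquet")
    .mode("overwrite")
    .save("/staging/tables/foo/partition_field=%d" % partition_value))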

Is there currently a way to achieve this without having to edit the source
code?






Partition Dataframe Using UDF On Partition Column

2017-12-27 Thread Richard Primera
Greetings,


In Spark 1.6.0, is it possible to write a partitioned dataframe to parquet
using a UDF on the partition column? I'm using PySpark.

Let's say I have a dataframe with a column `date`, of type string or int, which
contains values such as `20170825`. Is it possible to define a UDF called
`by_month` or `by_year`, which could then be used to write the table as
parquet, ideally in this way:

dataframe.write.format("parquet").partitionBy(by_month(dataframe["date"])).save("/some/parquet")

I haven't tried this yet, so I don't know whether it's possible. If it is,
how can it be done? Ideally without having to resort to adding an extra
column such as `part_id` to the dataframe, holding the result of
`by_month(date)`, and partitioning by that column instead.
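
For reference, this is the extra-column workaround I'm hoping to avoid,
sketched with a hypothetical `by_month` UDF (the column names and the month
logic are only illustrative):

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Hypothetical UDF: derive a month partition value from a yyyymmdd value,
# e.g. 20170825 -> "201708".
by_month = udf(lambda d: str(d)[:6], StringType())

# Materialize the UDF result as an extra column, then partition by that
# column name, since partitionBy takes column names rather than expressions.
(dataframe
    .withColumn("part_id", by_month(dataframe["date"]))
    .write
    .format("parquet")
    .partitionBy("part_id")
    .save("/some/parquet"))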


Thanks in advance.


