Re: how to add new column using regular expression within pyspark dataframe

Павел Mon, 17 Apr 2017 06:29:55 -0700

On Mon, Apr 17, 2017 at 3:25 PM, Zeming Yu <zemin...@gmail.com> wrote:
> I've got a dataframe with a column looking like this:
>
> display(flight.select("duration").show())
>
> +--------+
> |duration|
> +--------+
> |  15h10m|
> |   17h0m|
> |  21h25m|
> |  14h25m|
> |  14h30m|
> +--------+
> only showing top 20 rows
>
>
>
> I need to extract the hour as a number and store it as an additional column
> within the same dataframe. What's the best way to do that?


You don't need to drop down to the RDD API or use Python regexps
here; both are slow because every row has to pass through the Python
interpreter. I'd suggest trying the "split" DataFrame SQL function
together with the "getItem" column method. Bear in mind the boundary
case where the duration is less than 1 hour: it might appear as
either 30m or 0h30m.

--
Pavel Knoblokh

