On Mon, Apr 17, 2017 at 3:25 PM, Zeming Yu <zemin...@gmail.com> wrote:
> I've got a dataframe with a column looking like this:
>
> display(flight.select("duration").show())
>
> +--------+
> |duration|
> +--------+
> |  15h10m|
> |   17h0m|
> |  21h25m|
> |  14h25m|
> |  14h30m|
> +--------+
> only showing top 20 rows
>
>
>
> I need to extract the hour as a number and store it as an additional column
> within the same dataframe. What's the best way to do that?

You don't actually need to switch to RDD context or use Python
regexps here, which are slow. I'd suggest trying the "split" DataFrame
SQL function together with the "getItem" column method. Bear in mind
the boundary case where the duration is less than 1 hour: it might
appear as either 30m or 0h30m.
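
A minimal sketch of what I mean, assuming the column is a string named
"duration" as in your output. The names with_hours, hours_col and
hours_reference are made up for illustration, and treating a value
without an "h" (such as 30m) as 0 hours is my assumption about what you
want:

```python
def with_hours(df, duration_col="duration", hours_col="hours"):
    """Return df with an integer hour column parsed from strings like '15h10m'."""
    # Imported lazily so the plain-Python helper below works without Spark installed.
    from pyspark.sql import functions as F

    duration = F.col(duration_col)
    # split("15h10m", "h") gives ["15", "10m"]; getItem(0) picks the hour part.
    hour_part = F.split(duration, "h").getItem(0).cast("int")
    # "30m" contains no "h", so split would give ["30m"]; treat that case as 0 hours.
    # A null duration also ends up in the otherwise branch.
    has_hours = F.instr(duration, "h") > 0
    return df.withColumn(hours_col, F.when(has_hours, hour_part).otherwise(F.lit(0)))


def hours_reference(duration):
    """Plain-Python mirror of the column expression, for checking edge cases."""
    return int(duration.split("h")[0]) if "h" in duration else 0
```

With your dataframe that would be used as flight = with_hours(flight).
The plain-Python helper is only there to sanity-check the boundary
cases without starting Spark; the dataframe version stays entirely in
column expressions, so no Python UDF is involved.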

--
Pavel Knoblokh
