On Mon, Apr 17, 2017 at 3:25 PM, Zeming Yu <zemin...@gmail.com> wrote:
> I've got a dataframe with a column looking like this:
>
> display(flight.select("duration").show())
>
> +--------+
> |duration|
> +--------+
> |  15h10m|
> |   17h0m|
> |  21h25m|
> |  14h25m|
> |  14h30m|
> +--------+
> only showing top 20 rows
>
> I need to extract the hour as a number and store it as an additional column
> within the same dataframe. What's the best way to do that?

You don't need to switch to the RDD API or use Python regexps here, both of which are slow. I'd suggest trying the "split" DataFrame SQL function together with the "getItem" column method.

Bear in mind the boundary case where the duration is less than one hour: it might appear as either 30m or 0h30m, and a value without an "h" will not split into an hour part and a minute part.

--
Pavel Knoblokh