Hi,

For this particular case I'd use Column.substr (
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.Column),
e.g.

scala> val ns = Seq(("hello world", 1, 5)).toDF("w", "b", "e")
scala> ns.select($"w".substr($"b", $"e" - $"b" + 1) as "demo").show
+-----+
| demo|
+-----+
|hello|
+-----+
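
To connect this to the data in the question, here is a minimal sketch that runs both approaches side by side on the URL/START/END rows. It is an illustration under assumptions, not a benchmark: the local SparkSession setup and the names urls, withNative, subStringUdf and withUdf are mine, and the UDF body is copied from the question.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumption: a local session for experimenting; in spark-shell `spark` already exists.
val spark = SparkSession.builder().master("local[*]").appName("substr-demo").getOrCreate()
import spark.implicits._

// Sample rows taken from the question
val urls = Seq(
  ("www.amazon.com", 4, 14),
  ("www.yahoo.com", 4, 13),
  ("www.google.com", 4, 14)
).toDF("URL", "START", "END")

// Native Column expression: substr is 1-based and takes (start position, length)
val withNative = urls.select($"URL".substr($"START", $"END" - $"START" + 1) as "native")

// UDF variant from the question: String.substring is 0-based and end-exclusive
val subStringUdf = udf((input: String, start: Int, end: Int) => input.substring(start, end))
val withUdf = urls.select(subStringUdf($"URL", $"START", $"END") as "udf")

// Print the logical and physical plans of both queries for comparison
withNative.explain(true)
withUdf.explain(true)
```

In the analyzed plan the first query should appear as a built-in substring expression, which the optimizer can inspect, while the second appears as a UDF call that Spark has to treat as a black box. Note also that the two variants do not return the same strings for the same START and END, because Column.substr counts from 1 and String.substring counts from 0.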

Best regards,
Jacek Laskowski
----
https://about.me/JacekLaskowski


On Tue, May 14, 2019 at 5:08 PM Qian He <hq.ja...@gmail.com> wrote:

> For example, I have a DataFrame with 3 columns: URL, START, END. For each
> URL in the URL column, I want to extract the substring that starts at START
> and ends at END.
> +--------------+-----+---+
> |URL           |START|END|
> +--------------+-----+---+
> |www.amazon.com|4    |14 |
> |www.yahoo.com |4    |13 |
> |www.amazon.com|4    |14 |
> |www.google.com|4    |14 |
> +--------------+-----+---+
>
> I have UDF1:
>
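> import org.apache.spark.sql.functions.udf
>
> // Plain Scala function wrapped as a UDF; String.substring is 0-based, end-exclusive
> val getSubString = (input: String, start: Int, end: Int) => {
>    input.substring(start, end)
> }
> val udf1 = udf(getSubString)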
>
> and another UDF2:
>
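> import org.apache.spark.sql.Column
>
> // Builds a native Column expression from (start, length); no UDF is registered
> def getColSubString(c1: Column, c2: Column, c3: Column): Column = {
>    c1.substr(c2, c3 - c2)
> }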
>
> Let's assume they can both generate the result I want. But, from a performance
> perspective, is there any difference between those two UDFs?
>
>
>
