Ayan,
no need for UDFs, the SQL API provides all you need (sha1, substring, conv):
https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html
>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
+----------+
|  sha2long|
+----------+
| 478797741|
|2520346415|
+----------+
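As a cross-check outside Spark, the same three steps can be sketched in plain Python. This is only an illustration: sha1_tail_to_int is a made-up helper name, not part of any API. Note that substring() is 1-based, so position 33 corresponds to the Python slice [32:].

```python
from hashlib import sha1


def sha1_tail_to_int(value: str) -> int:
    """Hash value with SHA-1 and read the last 8 hex digits as a base-16 integer."""
    hex_digest = sha1(value.encode("utf-8")).hexdigest()  # always 40 hex characters
    tail = hex_digest[32:]  # characters 33..40 (1-based), i.e. substring(..., 33, 8)
    return int(tail, 16)    # same role as conv(..., 16, 10) followed by a cast


for v in ("4104003141", "4102859263"):
    print(v, sha1_tail_to_int(v))
```

For the two inputs this should reproduce the values printed by show() above.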
This creates a lean query plan:
>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).explain()
== Physical Plan ==
Union
:- *(1) Project [478797741 AS sha2long#74L]
:  +- Scan OneRowRelation[]
+- *(2) Project [2520346415 AS sha2long#76L]
   +- Scan OneRowRelation[]
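On why your two UDF columns differ: this looks like a 32-bit overflow rather than a hashing problem. Eight hex digits can encode values up to 4294967295, but IntegerType is a signed 32-bit type with a maximum of 2147483647. The sketch below (wrap_int32 is a hypothetical helper, plain Python, no Spark needed) shows that wrapping 2520346415 into a signed 32-bit value gives exactly the number in your column '2'.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed 32-bit int can hold


def wrap_int32(n: int) -> int:
    """Interpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT32_MAX else n


print(wrap_int32(478797741))   # 478797741 (fits, so it is unchanged)
print(wrap_int32(2520346415))  # -1774620881 (exceeds INT32_MAX, so it wraps)
```

If that is the cause, declaring the UDF with LongType() instead of IntegerType() would presumably fix it, which is also why the query above casts to long.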
Enrico
On 23.03.20 at 06:13, ayan guha wrote:
Hi
I am trying to implement simple hashing/checksum logic. The key logic is:
1. Generate sha1 hash
2. Extract last 8 chars
3. Convert 8 chars to Int (using base 16)
Here is the cut down version of the code:
---------------------------------------------------------------------------------------
from pyspark.sql.functions import *
from pyspark.sql.types import *
from hashlib import sha1 as local_sha1
df = spark.sql("select '4104003141' value_to_hash union all select '4102859263'")
f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16))
f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16)
sha2Int1 = udf( f1 , StringType())
sha2Int2 = udf( f2 , IntegerType())
print(f2('4102859263'))
dfr = df.select(df.value_to_hash,
                sha2Int1(df.value_to_hash).alias('1'),
                sha2Int2(df.value_to_hash).alias('2'))
dfr.show(truncate=False)
---------------------------------------------------------------------------------------------
I was expecting both columns to contain exactly the same values, but that is not always the case:

2520346415
+-------------+----------+-----------+
|value_to_hash|1         |2          |
+-------------+----------+-----------+
|4104003141   |478797741 |478797741  |
|4102859263   |2520346415|-1774620881|
+-------------+----------+-----------+

The function works fine, as the print statement shows. However, the
values in the two columns do not match and differ widely.
Any pointers?
--
Best Regards,
Ayan Guha