Ayan,

There is no need for UDFs; the SQL API provides everything you need (sha1, substring, conv):
https://spark.apache.org/docs/2.4.5/api/python/pyspark.sql.html

>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).show()
+----------+
|  sha2long|
+----------+
| 478797741|
|2520346415|
+----------+
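
As a cross-check, here is a minimal plain-Python sketch of what that expression computes. The helper name sha1_tail_as_int is only for illustration and is not part of any API. Note that substring() in SQL is 1-based, so substring(..., 33, 8) picks the same eight characters as the 0-based slice [32:] in your lambdas:

```python
from hashlib import sha1


def sha1_tail_as_int(value: str) -> int:
    """Last 8 hex characters of the SHA-1 digest, read as a base-16 integer."""
    digest = sha1(value.encode("UTF-8")).hexdigest()  # always 40 hex characters
    tail = digest[32:]  # 0-based slice, same characters as substring(..., 33, 8)
    return int(tail, 16)  # unbounded Python int, no fixed width


for value in ("4104003141", "4102859263"):
    print(value, sha1_tail_as_int(value))
```

For your two sample values this should agree with the show() output above, since it is the same logic as your f2.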

This creates a lean query plan:

>>> df.select(conv(substring(sha1(col("value_to_hash")), 33, 8), 16, 10).cast("long").alias("sha2long")).explain()
== Physical Plan ==
Union
:- *(1) Project [478797741 AS sha2long#74L]
:  +- Scan OneRowRelation[]
+- *(2) Project [2520346415 AS sha2long#76L]
   +- Scan OneRowRelation[]
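
As for why your two UDF columns differ: my reading is that IntegerType is a 32-bit signed integer (maximum 2147483647), while eight hex characters can encode values up to 4294967295. The second hash, 2520346415, does not fit, and the value you see in column "2" is exactly what a 32-bit wrap-around would give. A small plain-Python sketch of that arithmetic (wrap_to_int32 is a hypothetical helper for illustration, not something Spark exposes):

```python
INT32_MAX = 2**31 - 1  # largest value a 32-bit signed integer can hold


def wrap_to_int32(n: int) -> int:
    """Reinterpret the low 32 bits of n as a signed two's-complement integer."""
    n &= 0xFFFFFFFF
    return n - 2**32 if n > INT32_MAX else n


print(wrap_to_int32(478797741))   # 478797741, fits and is unchanged
print(wrap_to_int32(2520346415))  # -1774620881, the value in your column "2"
```

If you want to keep the UDF, declaring its return type as LongType() instead of IntegerType() should avoid this, which is also why the expression above casts to "long".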


Enrico


On 23.03.20 at 06:13, ayan guha wrote:
Hi

I am trying to implement simple hashing/checksum logic. The key logic is -

1. Generate sha1 hash
2. Extract last 8 chars
3. Convert 8 chars to Int (using base 16)

Here is the cut down version of the code:

---------------------------------------------------------------------------------------
from pyspark.sql.functions import *
from pyspark.sql.types import *
from hashlib import sha1 as local_sha1
df = spark.sql("select '4104003141' value_to_hash union all  select '4102859263'")
f1 = lambda x: str(int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16))
f2 = lambda x: int(local_sha1(x.encode('UTF-8')).hexdigest()[32:],16)
sha2Int1 = udf( f1 , StringType())
sha2Int2 = udf( f2 , IntegerType())
print(f2('4102859263'))
dfr = df.select(df.value_to_hash, sha2Int1(df.value_to_hash).alias('1'), sha2Int2(df.value_to_hash).alias('2'))
dfr.show(truncate=False)
---------------------------------------------------------------------------------------------

I was expecting both columns to produce exactly the same values; however, that is not "always" the case:

2520346415
+-------------+----------+-----------+
|value_to_hash|1         |2          |
+-------------+----------+-----------+
|4104003141   |478797741 |478797741  |
|4102859263   |2520346415|-1774620881|
+-------------+----------+-----------+

The function works fine, as shown by the print statement. However, the values in the two columns do not match and differ widely.

Any pointers?

--
Best Regards,
Ayan Guha

