Re: pyspark dataframe join with two different data type

2024-05-10 Thread Damien Hawes
Right now, with the structure of your data, it isn't possible.

The rows aren't duplicates of each other. "a" and "b" both exist in the
array. So Spark is correctly performing the join. It looks like you need to
find another way to model this data to get what you want to achieve.

Are the values of "a" and "b" related to each other in any way?

- Damien

Op vr 10 mei 2024 18:08 schreef Karthick Nk :

> Hi Mich,
>
> Thanks for the solution, But I am getting duplicate result by using
> array_contains. I have explained the scenario below, could you help me on
> that, how we can achieve i have tried different way bu i could able to
> achieve.
>
> For example
>
> data = [
> ["a"],
> ["b"],
> ["d"],
> ]
> column_names = ["column1"]
> df = spark.createDataFrame(data, column_names)
> df.display()
>
> [image: image.png]
>
> df_1 = spark.createDataFrame([(["a", "b", "c"],), ([],)], ['data'])
> df_1.display()
> [image: image.png]
>
>
> final_df = df_1.join(df, expr("array_contains(data, column1)"))
> final_df.display()
>
> Resul:
> [image: image.png]
>
> But i need the result like below:
>
> [image: image.png]
>
> Why because
>
> In the df_1 i have only two records, in that first records onlly i have
> matching value.
> But both records from the df i.e *a, b* are present in the first records
> itself, it is returning two records as resultant, but my expectation is to
> return only one records means if any of the records from the df is present
> in the df_1 it should return only one records from the df_1.
>
> Note:
> 1. Here we are able to filter the duplicate records by using distinct of
> ID field in the resultant df, bu I am thinking that shouldn't be effective
> way, rather i am thinking of updating in array_contains steps itself.
>
> Thanks.
>
>
> On Fri, Mar 1, 2024 at 4:11 AM Mich Talebzadeh 
> wrote:
>
>>
>> This is what you want, how to join two DFs with a string column in one
>> and an array of strings in the other, keeping only rows where the string
>> is present in the array.
>>
>> from pyspark.sql import SparkSession
>> from pyspark.sql import Row
>> from pyspark.sql.functions import expr
>>
>> spark = SparkSession.builder.appName("joins").getOrCreate()
>>
>> data1 = [Row(combined_id=[1, 2, 3])  # this one has a column combined_id
>> as an array of integers
>> data2 = [Row(mr_id=2), Row(mr_id=5)] # this one has column mr_id with
>> single integers
>>
>> df1 = spark.createDataFrame(data1)
>> df2 = spark.createDataFrame(data2)
>>
>> df1.printSchema()
>> df2.printSchema()
>>
>> # Perform the join with array_contains. It takes two arguments: an array
>> and a value. It returns True if the value exists as an element within
>> the array, otherwise False.
>> joined_df = df1.join(df2, expr("array_contains(combined_id, mr_id)"))
>>
>> # Show the result
>> joined_df.show()
>>
>> root
>>  |-- combined_id: array (nullable = true)
>>  ||-- element: long (containsNull = true)
>>
>> root
>>  |-- mr_id: long (nullable = true)
>>
>> +---+-+
>> |combined_id|mr_id|
>> +---+-+
>> |  [1, 2, 3]|2|
>> |  [4, 5, 6]|5|
>> +---+-+
>>
>> HTH
>>
>> Mich Talebzadeh,
>> Dad | Technologist | Solutions Architect | Engineer
>> London
>> United Kingdom
>>
>>
>>view my Linkedin profile
>> 
>>
>>
>>  https://en.everybodywiki.com/Mich_Talebzadeh
>>
>>
>>
>> *Disclaimer:* The information provided is correct to the best of my
>> knowledge but of course cannot be guaranteed . It is essential to note
>> that, as with any advice, quote "one test result is worth one-thousand
>> expert opinions (Werner
>> Von Braun
>> )".
>>
>>
>> On Thu, 29 Feb 2024 at 20:50, Karthick Nk  wrote:
>>
>>> Hi All,
>>>
>>> I have two dataframe with below structure, i have to join these two
>>> dataframe - the scenario is one column is string in one dataframe and in
>>> other df join column is array of string, so we have to inner join two df
>>> and get the data if string value is present in any of the array of string
>>> value in another dataframe,
>>>
>>>
>>> df1 = spark.sql("""
>>> SELECT
>>> mr.id as mr_id,
>>> pv.id as pv_id,
>>> array(mr.id, pv.id) as combined_id
>>> FROM
>>> table1 mr
>>> INNER JOIN table2 pv ON pv.id = Mr.recordid
>>>where
>>> pv.id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>> or pv.id = 'a5f03625-6cc5-49df-95eb-df741fe9139b'
>>> """)
>>>
>>> # df1.display()
>>>
>>> # Your second query
>>> df2 = spark.sql("""
>>> SELECT
>>> id
>>> FROM
>>> table2
>>> WHERE
>>> id = '35122806-4cd2-4916-a149-24ea55c2dc36'
>>>
>>> """)
>>>
>>>
>>>
>>> Result data:
>>> 35122806-4cd2-4916-a149-24ea55c2dc36 only, because this records alone is
>>> common between string and array of string value.
>>>
>>> Can you share the sample snippet, how we 

[SparkListener] Accessing classes loaded via the '--packages' option

2024-04-26 Thread Damien Hawes
Hi folks,

I'm contributing to the OpenLineage project, specifically the Apache Spark
integration. My current focus is on extending the project to support data
lineage extraction for Spark Streaming, beginning with Apache Kafka sources
and sinks.

I've encountered an obstacle when attempting to access information
essential for lineage extraction from Apache Kafka-related classes within
the OpenLineage Spark code base. Specifically, I need to access details
like Kafka topic names and bootstrap servers from objects like
StreamingDataSourceV2Relation.

While I can successfully access these details if the Kafka JARs are placed
directly in the 'spark/jars' directory, I'm unable to do so when using the
`--packages` option for dependency management. This creates a significant
obstacle for users who rely on `--packages` for their Spark applications.

I've taken initial steps to investigate (viewable in this GitHub PR
, the class in
question is *StreamingDataSourceV2RelationVisitor*), but I'd greatly
appreciate any insights or guidance on the following:

*1. Understanding the Issue:* Are there known reasons within Spark that
could explain this difference in behavior when loading dependencies via
`--packages` versus placing JARs directly?
*2. Alternative Approaches:*  Are there recommended techniques or patterns
to access the necessary Kafka class information within a SparkListener
extension, especially when dependencies are managed via `--packages`?

I'm eager to find a solution that avoids heavy reliance on reflection.

Thank you for your time and assistance!

Kind regards,
Damien