Hi folks,

I'm contributing to the OpenLineage project, specifically the Apache Spark
integration. My current focus is on extending the project to support data
lineage extraction for Spark Streaming, beginning with Apache Kafka sources
and sinks.

I've run into a problem accessing Apache Kafka-related classes from within
the OpenLineage Spark code base. These classes hold information that is
essential for lineage extraction: specifically, I need details such as
Kafka topic names and bootstrap servers from objects like
StreamingDataSourceV2Relation.

I can access these details when the Kafka JARs are placed directly in the
`spark/jars` directory, but not when the same dependencies are supplied
through the `--packages` option. This is a significant limitation for
users who rely on `--packages` for their Spark applications.
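
One way to observe the difference is a check like the one below. It is only
a diagnostic sketch and assumes the cause is classloader visibility, which I
have not confirmed. `ClassVisibilityCheck` and `isVisible` are hypothetical
names used for illustration; `org.apache.spark.sql.kafka010.KafkaSourceProvider`
is a class from the Spark Kafka connector.

```java
public class ClassVisibilityCheck {

    /**
     * Returns true if the given loader (or one of its parents) can resolve
     * className. A null loader means the bootstrap class loader.
     */
    public static boolean isVisible(String className, ClassLoader loader) {
        try {
            // initialize=false: resolve the class without running static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String kafkaClass = "org.apache.spark.sql.kafka010.KafkaSourceProvider";

        // Loader that loaded this class (stands in for the listener's own loader)
        ClassLoader own = ClassVisibilityCheck.class.getClassLoader();
        // Loader attached to the current thread, which may differ from the above
        ClassLoader context = Thread.currentThread().getContextClassLoader();

        System.out.println("own loader:     " + own + " -> " + isVisible(kafkaClass, own));
        System.out.println("context loader: " + context + " -> " + isVisible(kafkaClass, context));
    }
}
```

Running the same check from inside the listener under both setups would
show whether the two loaders disagree about the Kafka class.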

I've taken initial steps to investigate (see this GitHub PR
<https://github.com/OpenLineage/OpenLineage/pull/2647>; the class in
question is *StreamingDataSourceV2RelationVisitor*), but I'd greatly
appreciate any insights or guidance on the following:

*1. Understanding the Issue:* Are there known reasons within Spark that
could explain this difference in behavior when loading dependencies via
`--packages` versus placing JARs directly?
*2. Alternative Approaches:* Are there recommended techniques or patterns
to access the necessary Kafka class information within a SparkListener
extension, especially when dependencies are managed via `--packages`?

I'm eager to find a solution that avoids heavy reliance on reflection.

Thank you for your time and assistance!

Kind regards,
Damien
