+user for visibility

Begin forwarded message:

> From: Rion Williams <[email protected]>
> Date: October 27, 2020 at 12:28:37 PM CDT
> To: [email protected]
> Subject: Transform Logging Issues with Spark/Dataproc in GCP
> 
> Hi all,
> 
> Recently, I deployed a very simple Apache Beam pipeline to get some insight 
> into how it behaves when executed on Dataproc as opposed to on my local machine. 
> I quickly realized after running it that the DoFn- and transform-level 
> logging didn't appear within the job logs in the Google Cloud Console as 
> I would have expected, and I'm not entirely sure what might be missing.
> 
> All of the high level logging messages are emitted as expected:
> 
> // This works
> log.info("Testing logging operations")
> 
> pipeline
>     .apply(Create.of(...))
>     .apply(ParDo.of(LoggingDoFn))
> 
> The LoggingDoFn object here is a very basic transform that logs each of the 
> values that it encounters, as seen below:
> 
> import org.apache.beam.sdk.transforms.DoFn
> import org.apache.beam.sdk.transforms.DoFn.ProcessElement
> import org.slf4j.LoggerFactory
> 
> object LoggingDoFn : DoFn<String, ...>() {
>     private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)
>     
>     @ProcessElement
>     fun processElement(c: ProcessContext) {
>         // This is never emitted within the logs
>         log.info("Attempting to parse ${c.element()}")
>     }
> }
> 
> As detailed in the comments, I can see the logging messages emitted outside of 
> the processElement() calls, presumably because those calls are handed off to 
> the Spark runner itself, but is there a way to easily surface the messages 
> from within the inner transform as well?
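> 
> As a side note, one way to at least confirm that elements are being processed, 
> independent of where the executor logs end up, might be Beam's Metrics API, 
> since the runner aggregates metric values and reports them back alongside the 
> job. A rough sketch (the counter namespace and name are purely illustrative):
> 
> import org.apache.beam.sdk.metrics.Metrics
> import org.apache.beam.sdk.transforms.DoFn
> import org.apache.beam.sdk.transforms.DoFn.ProcessElement
> import org.slf4j.LoggerFactory
> 
> object CountingLoggingDoFn : DoFn<String, String>() {
>     private val log = LoggerFactory.getLogger(CountingLoggingDoFn::class.java)
> 
>     // Counter values are aggregated by the runner and reported back to the
>     // driver, so they stay visible even if executor log lines never surface.
>     private val elementsSeen = Metrics.counter("example", "elements-seen")
> 
>     @ProcessElement
>     fun processElement(c: ProcessContext) {
>         elementsSeen.inc()
>         log.info("Attempting to parse ${c.element()}") // still subject to executor logging config
>         c.output(c.element()) // pass the element through unchanged
>     }
> }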
> 
> When viewing the logs related to this job, we can see the higher-level 
> logging present, but no mention of an "Attempting to parse ..." message from 
> the DoFn:
> 
> [screenshot: the Dataproc job logs in the Cloud Console, showing only the higher-level messages]
> 
> The job itself is being executed by the following gcloud command, which has 
> the driver log levels explicitly defined, but perhaps there's another level 
> of logging or configuration that needs to be added:
> 
> gcloud dataproc jobs submit spark \
>     --jar=gs://some_bucket/deployment/example.jar \
>     --project example-project \
>     --cluster example-cluster \
>     --region us-example \
>     --driver-log-levels com.example=DEBUG \
>     -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
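> 
> One avenue I've been wondering about is whether the executors need their own 
> log4j configuration, since --driver-log-levels presumably only applies to the 
> driver process. A rough sketch of what that might look like, with the 
> log4j.properties object and path being purely illustrative:
> 
> gcloud dataproc jobs submit spark \
>     --jar=gs://some_bucket/deployment/example.jar \
>     --project example-project \
>     --cluster example-cluster \
>     --region us-example \
>     --files=gs://some_bucket/deployment/log4j.properties \
>     --properties=spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j.properties \
>     -- --runner=SparkRunner --output=gs://some_bucket/deployment/out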
> 
> To summarize, log messages are not being emitted to the Google Cloud Console 
> for tasks that would generally be assigned to the Spark runner itself (e.g. 
> processElement()). I'm unsure if it's a configuration-related issue or 
> something else entirely.
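> 
> Relatedly, I came across Dataproc's cluster-level properties for routing YARN 
> container logs (where I'd expect the executor output to land) into Cloud 
> Logging; if I'm reading the documentation correctly, that would be enabled at 
> cluster creation along these lines:
> 
> gcloud dataproc clusters create example-cluster \
>     --region us-example \
>     --properties dataproc:dataproc.logging.stackdriver.job.yarn.container.enable=true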
> 
> Any advice would be appreciated, and I'd be glad to provide additional 
> details however I can.
> 
> Thanks,
> 
> Rion
