+user for visibility

Begin forwarded message:
> From: Rion Williams <[email protected]>
> Date: October 27, 2020 at 12:28:37 PM CDT
> To: [email protected]
> Subject: Transform Logging Issues with Spark/Dataproc in GCP
>
> Hi all,
>
> Recently, I deployed a very simple Apache Beam pipeline to get some insight
> into how it behaves executing in Dataproc as opposed to on my local machine.
> I quickly realized after running it that none of the DoFn- or transform-level
> logging appeared within the job logs in the Google Cloud Console as I would
> have expected, and I'm not entirely sure what might be missing.
>
> All of the high-level logging messages are emitted as expected:
>
>     // This works
>     log.info("Testing logging operations")
>
>     pipeline
>         .apply(Create.of(...))
>         .apply(ParDo.of(LoggingDoFn))
>
> The LoggingDoFn class here is a very basic transform that logs each of the
> values it encounters, as seen below:
>
>     object LoggingDoFn : DoFn<String, ...>() {
>         private val log = LoggerFactory.getLogger(LoggingDoFn::class.java)
>
>         @ProcessElement
>         fun processElement(c: ProcessContext) {
>             // This is never emitted within the logs
>             log.info("Attempting to parse ${c.element()}")
>         }
>     }
>
> As detailed in the comments, I can see logging messages outside of the
> processElement() calls (presumably because those are being executed by the
> Spark runner), but is there a way to easily expose those within the inner
> transform as well?
>
> When viewing the logs related to this job, we can see the higher-level
> logging present, but no mention of an "Attempting to parse ..." message from
> the DoFn:
>
> [screenshot of the Cloud Console job logs omitted]
>
> The job itself is being executed by the following gcloud command, which has
> the driver log levels explicitly defined, but perhaps there's another level
> of logging or configuration that needs to be added:
>
>     gcloud dataproc jobs submit spark \
>         --jar=gs://some_bucket/deployment/example.jar \
>         --project example-project \
>         --cluster example-cluster \
>         --region us-example \
>         --driver-log-levels com.example=DEBUG \
>         -- --runner=SparkRunner --output=gs://some_bucket/deployment/out
>
> To summarize, log messages are not being emitted to the Google Cloud Console
> for work that would generally be handled by the Spark workers themselves
> (e.g. processElement()). I'm unsure whether it's a configuration-related
> issue or something else entirely.
>
> Any advice would be appreciated, and I'd be glad to provide additional
> details however I can.
>
> Thanks,
>
> Rion
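For context when someone picks this up: as I read it, --driver-log-levels only
configures log4j on the Spark driver, while processElement() runs on the
executors, so its output never reaches the driver log shown on the job page.
One direction that may be worth trying (a sketch only, not verified against
this job; the log4j-executor.properties file name and its GCS path are
placeholders I've introduced, not part of Rion's setup) is to stage a log4j
configuration alongside the job and point the executor JVMs at it via Spark
properties:

    # Sketch: ship an executor-side log4j config with the job (--files places
    # it in each executor's working directory) and tell the executor JVMs to
    # load it. File name and bucket path below are placeholders.
    gcloud dataproc jobs submit spark \
        --jar=gs://some_bucket/deployment/example.jar \
        --project example-project \
        --cluster example-cluster \
        --region us-example \
        --driver-log-levels com.example=DEBUG \
        --files=gs://some_bucket/deployment/log4j-executor.properties \
        --properties=spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j-executor.properties \
        -- --runner=SparkRunner --output=gs://some_bucket/deployment/out

Even with that in place, executor output typically lands in the YARN container
logs rather than in the job's driver output, so it may also be worth checking
the worker entries in Cloud Logging rather than the job page alone.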
