Hi all, I am trying to make our application check the Spark version before attempting to submit a job, to ensure the user is on a new enough version (in our case, 2.3.0 or later). I realize there is a --version argument to spark-shell, but that prints the version next to some ASCII art, so a bit of parsing would be needed (a rough sketch of that parsing approach is in the P.S. at the bottom of this message). So my initial idea was to do something like the following:
echo 'System.out.println(sc.version)' | spark-shell 2>/dev/null | grep -A2 'System.out.println' | grep -v 'System.out.println'

This works fine: if you run it in a shell (assuming spark-shell is on your PATH), you get the version printed on the first line of stdout. But I'm noticing some strange behavior when I try to invoke this from either a Java or Scala application (via their respective ProcessBuilder facilities). To be specific, if that Java/Scala application is run as a background job, then when the spark-shell invocation happens (after being forked by a ProcessBuilder), the job goes into a suspended state due to tty output. This happens even if the Java/Scala program has had its stdout and stderr redirected to files.

To demonstrate, compile this Java class, which is really just a simple wrapper around ProcessBuilder: it passes its main arguments directly to ProcessBuilder, creates threads to consume the child's stdout/stderr (printing them to its own stdout/stderr), and exits when the forked process dies (for convenience, a rough sketch of what it does is included at the end of this message):

https://gist.github.com/jeff303/e5b44e220db20800752c932cbfbf7ed1

My environment is OS X 10.14.3, Spark 2.4.0 (installed via Homebrew), Scala 2.12.8 (also via Homebrew), and Oracle HotSpot JDK 1.8.0_181. The behavior outlined below happens in both zsh 5.6.2 and bash 4.4.23(1).

Consider the following terminal session:

# BEGIN SHELL SESSION

# compile the ProcessBuilderRunner class
javac ProcessBuilderRunner.java

# sanity check; just invoke an echo command
java ProcessBuilderRunner bash -c 'echo hello world'
About to run: bash -c echo hello world
stdout line: hello world
exit value from process: 0
stderr from process:
stdout from process: hello world

# try running the "version check" sequence outlined above in the foreground
java ProcessBuilderRunner bash -c "echo 'System.out.println(sc.version)' | spark-shell 2>/dev/null | grep -A2 'System.out.println' | grep -v 'System.out.println'"
About to run: bash -c echo 'System.out.println(sc.version)' | spark-shell 2>/dev/null | grep -A2 'System.out.println' | grep -v 'System.out.println'
stdout line: 2.4.0
stdout line:
exit value from process: 0
stderr from process:
stdout from process: 2.4.0

# run the same thing, but in the background, redirecting outputs to files
java ProcessBuilderRunner bash -c "echo 'System.out.println(sc.version)' | spark-shell 2>/dev/null | grep -A2 'System.out.println' | grep -v 'System.out.println'" >/tmp/spark-check.out 2>/tmp/spark-check.err &

# after a few seconds, the job is suspended due to tty output
[1]  + 8964 suspended (tty output)  java ProcessBuilderRunner bash -c  > /tmp/spark-check.out 2>

# foreground the job; it will complete shortly thereafter
fg

# confirm the stdout is correct
cat /tmp/spark-check.out
About to run: bash -c echo 'System.out.println(sc.version)' | spark-shell 2>/dev/null | grep -A2 'System.out.println' | grep -v 'System.out.println'
stdout line: 2.4.0
stdout line:
exit value from process: 0
stdout from process: 2.4.0

# END SHELL SESSION

Why is the backgrounded Java process getting suspended when it tries to invoke spark-shell here? In theory, all of the programs involved should have a well-defined sink for their stdout, and in the foreground everything works correctly. Also of note: the exact same thing run from a Scala class* results in the same behavior. I am not that knowledgeable about the finer points of tty handling, so hopefully someone can point me in the right direction. Thanks.
* Scala version: https://gist.github.com/jeff303/2c2c3daa49a9cb588a0de6f1a73255b2
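For reference, here is the rough shape of the wrapper class, for anyone who doesn't want to click through to the gist. This is a sketch of what it does, not the gist verbatim: forward main() args straight to ProcessBuilder, pump the child's stdout/stderr on separate threads, and exit once the forked process dies.

import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

public class ProcessBuilderRunner {
    public static void main(String[] args) throws Exception {
        System.out.println("About to run: " + String.join(" ", args));
        Process process = new ProcessBuilder(args).start();

        // accumulate output for the summary lines printed after exit
        StringBuilder out = new StringBuilder();
        StringBuilder err = new StringBuilder();
        Thread outPump = pump(process.getInputStream(), "stdout", out);
        Thread errPump = pump(process.getErrorStream(), "stderr", err);
        outPump.start();
        errPump.start();

        int exitValue = process.waitFor();
        outPump.join();
        errPump.join();

        System.out.println("exit value from process: " + exitValue);
        System.err.println("stderr from process: " + err);
        System.out.println("stdout from process: " + out);
    }

    // echo each line from the child's stream as it arrives, and also
    // accumulate it for the summary printed after the child exits
    private static Thread pump(InputStream in, String name, StringBuilder sink) {
        return new Thread(() -> {
            try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(in))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(name + " line: " + line);
                    sink.append(line).append(System.lineSeparator());
                }
            } catch (java.io.IOException ignored) {
                // stream closes when the child exits
            }
        });
    }
}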
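P.S. For completeness, here is a rough sketch of the "--version plus parsing" approach I mentioned at the top, in case anyone wants to suggest improvements to it instead. It assumes the banner contains the literal word "version" followed by a dotted version number (which it does on my machine), and merges stderr into stdout since the banner does not appear to go to stdout. The class name, the MIN_VERSION constant, and the regex are just illustrative choices, not anything from Spark itself.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SparkVersionCheck {

    // minimum version our application supports
    private static final int[] MIN_VERSION = {2, 3, 0};

    public static void main(String[] args) throws Exception {
        // merge stderr into stdout, since the banner seems to land on stderr
        ProcessBuilder pb = new ProcessBuilder("spark-shell", "--version");
        pb.redirectErrorStream(true);
        Process p = pb.start();

        // assumes "version X.Y.Z" appears somewhere in the ASCII-art banner
        Pattern pattern = Pattern.compile("version\\s+(\\d+)\\.(\\d+)\\.(\\d+)");
        int[] found = null;
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(p.getInputStream()))) {
            String line;
            while ((line = r.readLine()) != null) {
                Matcher m = pattern.matcher(line);
                if (found == null && m.find()) {
                    found = new int[] {
                        Integer.parseInt(m.group(1)),
                        Integer.parseInt(m.group(2)),
                        Integer.parseInt(m.group(3))
                    };
                }
            }
        }
        p.waitFor();

        if (found == null) {
            System.err.println("could not detect a Spark version");
        } else {
            System.out.printf("detected %d.%d.%d; new enough: %b%n",
                found[0], found[1], found[2], isAtLeastMin(found));
        }
    }

    // lexicographic comparison of (major, minor, patch) against MIN_VERSION
    private static boolean isAtLeastMin(int[] v) {
        for (int i = 0; i < 3; i++) {
            if (v[i] != MIN_VERSION[i]) {
                return v[i] > MIN_VERSION[i];
            }
        }
        return true;
    }
}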