Hi,

I think that reading Matei Zaharia's book "Spark: The Definitive Guide"
would be a good starting point.

Regards,
Gourav Sengupta

On Wed, Jun 30, 2021 at 3:47 PM Kartik Ohri <kartikohr...@gmail.com> wrote:

> Hi all!
>
> I am working on a PySpark application and would like suggestions on how it
> should be structured.
>
> We have a number of possible jobs, organized into modules. There is also a
> "RequestConsumer"
> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py>
> class which consumes from a messaging queue. Each message contains the name
> of the job to invoke and the arguments to pass to it. Messages are put into
> the queue by cronjobs, manually, and so on.
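>
> For concreteness, the dispatch loop looks roughly like this (a
> simplified sketch, not the exact code: the broker client, queue name,
> and job registry below are stand-ins):
>
>     import json
>
>     import pika  # assumes a RabbitMQ broker; any client with acks works
>
>     def generate_stats(**params):
>         ...  # stand-in for a real job function living in its own module
>
>     # Maps the job name carried in each message to the callable to run.
>     JOBS = {"stats.generate": generate_stats}
>
>     def on_request(channel, method, properties, body):
>         message = json.loads(body)
>         job = JOBS[message["query"]]      # name of the job to invoke
>         job(**message.get("params", {}))  # arguments from the message
>         channel.basic_ack(delivery_tag=method.delivery_tag)
>
>     connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
>     channel = connection.channel()
>     channel.queue_declare(queue="spark_requests")  # placeholder name
>     channel.basic_consume(queue="spark_requests",
>                           on_message_callback=on_request)
>     channel.start_consuming()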
>
> We submit a zip file containing all of the Python files to a Spark cluster
> running on YARN and ask it to run the RequestConsumer. This
> <https://github.com/metabrainz/listenbrainz-server/blob/master/docker/start-spark-request-consumer.sh#L23-L34>
> is the exact spark-submit command, for those interested. The results of the
> jobs are collected
> <https://github.com/metabrainz/listenbrainz-server/blob/master/listenbrainz_spark/request_consumer/request_consumer.py#L120-L122>
> by the request consumer and pushed into another queue.
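>
> The collection step amounts to something like the following (again
> simplified; the result queue name is a placeholder):
>
>     import json
>
>     import pika  # same assumption as above
>
>     def push_results(channel, results):
>         # Publish each result message produced by a job to the result
>         # queue.
>         for result in results:
>             channel.basic_publish(exchange="",
>                                   routing_key="spark_results",
>                                   body=json.dumps(result))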
>
> My question is whether this type of structure makes sense. Should the
> RequestConsumer instead run independently of Spark and invoke spark-submit
> when it needs to trigger a job? Or is there another recommendation?
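>
> For example, the consumer could live outside the cluster and launch one
> spark-submit per request, along these lines (the entry-point script and
> flags below are placeholders, not files in our repo):
>
>     import json
>     import subprocess
>
>     def submit(job_name, params):
>         # Launch a fresh Spark application per request instead of
>         # keeping one long-lived driver that hosts the consumer.
>         subprocess.run(
>             ["spark-submit",
>              "--master", "yarn",
>              "--py-files", "listenbrainz_spark.zip",
>              "run_job.py", job_name, json.dumps(params)],
>             check=True,
>         )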
>
> Thank you all in advance for taking the time to read this email and for
> your help.
>
> Regards,
> Kartik.
