That's exactly what my question was: whether Spark can do a parallel read, not DataFrame-driven parallel query or processing, because our ML query is very simple but the data ingestion part seems to be the bottleneck.  Can someone confirm that Spark just can't do parallel reads?  If not, what would be the alternative?  Writing our own customized scheduler or listener?
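
For reference, a minimal PySpark sketch of how one could check how many partitions (and therefore how many parallel tasks) each micro-batch actually gets; the directory, schema, and maxFilesPerTrigger value below are placeholders, not our actual job:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("batch-partition-probe").getOrCreate()

stream = (spark.readStream
          .format("csv")
          .option("header", "true")
          .schema("id INT, value DOUBLE")      # file streams need an explicit schema
          .option("maxFilesPerTrigger", 16)    # allow several files per micro-batch
          .load("/data/incoming"))             # placeholder source directory

def probe(batch_df, batch_id):
    # If this always prints 1, the whole micro-batch is a single partition,
    # i.e. a single task running on a single core regardless of cluster size.
    print(f"batch {batch_id}: {batch_df.rdd.getNumPartitions()} partition(s), "
          f"{batch_df.count()} rows")

query = stream.writeStream.foreachBatch(probe).start()
query.awaitTermination()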

Thanks!

On 10/16/20 4:25 PM, Lalwani, Jayesh wrote:

Once you are talking about ML, you aren’t talking about “simple” transformations. Spark is a good platform to do ML on. You can easily configure Spark to read your data on one node and then run the ML transformations in parallel.
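
For illustration, a rough PySpark sketch of that read-on-one-node-then-parallelize pattern; the model path, schema, partition count, and directories below are made up, and the pre-fitted pipeline is hypothetical:

from pyspark.sql import SparkSession
from pyspark.ml import PipelineModel

spark = SparkSession.builder.appName("read-then-parallel-ml").getOrCreate()

# Reading one small CSV file is effectively a single-task, single-node operation...
df = (spark.read
      .option("header", "true")
      .csv("/data/incoming/one_file.csv"))         # placeholder input

# ...but the ML work can be spread across the cluster by repartitioning first.
model = PipelineModel.load("/models/my_pipeline")  # hypothetical pre-fitted pipeline
scored = model.transform(df.repartition(16))       # up to 16 parallel tasks downstream

scored.write.mode("append").parquet("/data/scored")  # placeholder output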

*From: *Artemis User <arte...@dtechspace.com>
*Date: *Friday, October 16, 2020 at 3:52 PM
*To: *"user@spark.apache.org" <user@spark.apache.org>
*Subject: *RE: [EXTERNAL] How to Scale Streaming Application to Multiple Workers


We can't use AWS since the target production environment has to be on-prem.  The reason we chose Spark is its ML libraries.  Lambda would be a great model for stream processing from a functional programming perspective, but I'm not sure how well it can be integrated with Spark ML or other ML libraries.  Any suggestions would be highly appreciated.

ND

On 10/16/20 2:49 PM, Lalwani, Jayesh wrote:

    With a file-based source, Spark is going to make maximum use of
    memory before it tries scaling to more nodes. Parallelization
    adds overhead. This overhead is negligible if your data is
    several gigs or more. If your entire data set can fit into the
    memory of one node, then it’s better to process everything in
    one node; forcing Spark to parallelize processing that can be
    done on a single node will reduce throughput.

    You are right, though. Spark is overkill for a simple
    transformation on a 300KB file. A lot of people implement simple
    transformations using serverless AWS Lambda. Spark’s power comes
    in when you are joining streaming sources and/or joining
    streaming sources with batch sources. It’s not that Spark can’t
    do simple transformations; it’s perfectly capable of doing them.
    It makes sense to implement simple transformations in Spark if
    you have a data pipeline that is implemented in Spark and this
    ingestion is one of many other things that you do with Spark.
    But if your entire pipeline consists of ingesting small files,
    then you might be better off with simpler solutions.

    *From: *Artemis User <arte...@dtechspace.com>
    <mailto:arte...@dtechspace.com>
    *Date: *Friday, October 16, 2020 at 2:19 PM
    *Cc: *user <user@spark.apache.org> <mailto:user@spark.apache.org>
    *Subject: *RE: [EXTERNAL] How to Scale Streaming Application to
    Multiple Workers


    Thank you all for the responses.  Basically we were dealing with
    a file source (not Kafka, therefore no topics involved) and
    dumping CSV files (about 1000 lines, 300KB per file) at a pretty
    high speed (10-15 files/second), one at a time, into the stream
    source directory.  We have a Spark 3.0.1 cluster configured with
    4 workers, each allocated 4 cores.  We tried numerous options,
    including setting the spark.streaming.dynamicAllocation.enabled
    parameter to true and setting maxFilesPerTrigger to 1, but were
    unable to scale beyond 4 cores (#cores * #workers > 4).
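
    For concreteness, this is roughly how those settings were
    expressed (the schema and paths below are placeholders). If my
    understanding is correct, maxFilesPerTrigger=1 means each
    micro-batch holds exactly one ~300KB file, which typically ends
    up as a single partition and therefore a single task:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("csv-stream")
         # Note: to the best of my knowledge this setting only affects the
         # legacy DStream API, not Structured Streaming.
         .config("spark.streaming.dynamicAllocation.enabled", "true")
         .getOrCreate())

stream = (spark.readStream
          .schema("col1 STRING, col2 DOUBLE")   # placeholder schema
          .option("header", "true")
          # One file per micro-batch -> usually one partition -> one task/core.
          .option("maxFilesPerTrigger", 1)
          .csv("/path/to/stream/source"))       # placeholder directory
# (writeStream / foreachBatch and .start() follow as in the rest of the job)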

    What I am trying to understand is what makes Spark allocate jobs
    to more workers.  Is it based on the size of the data frame,
    batch sizes, or trigger intervals?  It looks like the Spark
    master scheduler doesn't consider the number of input files
    waiting to be processed, only the size of the data (i.e. the
    size of the data frames) that has already been read or imported,
    before allocating new workers.  If that is the case, then Spark
    really missed the point and wasn't really designed for real-time
    streaming applications.  I could write my own stream processor
    that would distribute the load based on the number of input
    files, given that each batch query is atomic/independent of the
    others.

    Thanks in advance for your comment/input.

    ND

    On 10/15/20 7:13 PM, muru wrote:

        For file streaming in Structured Streaming, you can try
        setting "maxFilesPerTrigger" to control the files per batch.
        foreachBatch is an action; the output is written to various
        sinks.  Are you doing any post-transformation in
        foreachBatch?
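
        A small sketch of what that looks like, with placeholder
        schema, paths, and a trivial stand-in for the
        post-transformation; raising maxFilesPerTrigger lets several
        files land in one micro-batch:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("file-stream-foreachbatch").getOrCreate()

stream = (spark.readStream
          .schema("id INT, value DOUBLE")       # placeholder schema
          .option("header", "true")
          .option("maxFilesPerTrigger", 10)     # pull up to 10 files into each micro-batch
          .csv("/data/incoming"))               # placeholder source directory

def write_batch(batch_df, batch_id):
    # Stand-in post-transformation, then write the batch to a sink.
    (batch_df.withColumn("value_x2", col("value") * 2)
             .write.mode("append")
             .parquet("/data/out"))             # placeholder sink

(stream.writeStream
       .foreachBatch(write_batch)
       .option("checkpointLocation", "/data/checkpoints/file-stream")  # lets the query resume after restart
       .start()
       .awaitTermination())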

        On Thu, Oct 15, 2020 at 1:24 PM Mich Talebzadeh
        <mich.talebza...@gmail.com <mailto:mich.talebza...@gmail.com>>
        wrote:

            Hi,

            This in general depends on how many topics you want to
            process at the same time and whether this is done
            on-premise running Spark in cluster mode.

            Have you looked at Spark GUI to see if one worker (one
            JVM) is adequate for the task?

            Also, how are these small files read and processed?  Is
            it the same data micro-batched?  Spark streaming does
            not process one event at a time, which is in general I
            think what people call "streaming."  It instead
            processes groups of events; each group is a
            "micro-batch" that gets processed at the same time.


            What parameters (BatchInterval, WindowsLength,
            SlidingInterval) are you using?
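
            (Those names come from the DStream API; in Structured
            Streaming the rough equivalents are the trigger interval
            and a time window on the event-time column. A generic
            illustration, with made-up schema, paths, and durations:)

from pyspark.sql import SparkSession
from pyspark.sql.functions import window, col

spark = SparkSession.builder.appName("trigger-and-window").getOrCreate()

events = (spark.readStream
          .schema("ts TIMESTAMP, value DOUBLE")   # placeholder schema
          .option("header", "true")
          .csv("/data/incoming"))                 # placeholder directory

# 10-minute windows sliding every 5 minutes (roughly WindowsLength / SlidingInterval).
windowed = events.groupBy(window(col("ts"), "10 minutes", "5 minutes")).count()

query = (windowed.writeStream
         .outputMode("update")
         .format("console")
         .trigger(processingTime="30 seconds")    # roughly the DStream batch interval
         .start())
query.awaitTermination()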

            Parallelism helps when you have reasonably large data
            and your cores are running on different sections of the
            data in parallel.  Roughly how much data is in each CSV
            file?

            HTH,

            Mich

            LinkedIn
            https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw

            *Disclaimer:* Use it at your own risk. Any and all
            responsibility for any loss, damage or destruction of data
            or any other property which may arise from relying on this
            email's technical content is explicitly disclaimed. The
            author will in no case be liable for any monetary damages
            arising from such loss, damage or destruction.

            On Thu, 15 Oct 2020 at 20:02, Artemis User
            <arte...@dtechspace.com <mailto:arte...@dtechspace.com>>
            wrote:

                Thanks for the input.  What I am interested in is
                how to have multiple workers read and process the
                small files in parallel, and certainly one file per
                worker at a time.  Partitioning the data frame
                doesn't make sense since the data frame is already
                small.

                On 10/15/20 9:14 AM, Lalwani, Jayesh wrote:
                > Parallelism of streaming depends on the input
                source. If you are getting one small file per
                microbatch, then Spark will read it in one worker. You
                can always repartition your data frame after reading
                it to increase the parallelism.
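
                A minimal sketch of that repartition-after-read
                suggestion, with placeholder schema, paths, and
                partition count:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("repartition-per-batch").getOrCreate()

stream = (spark.readStream
          .schema("id INT, value DOUBLE")   # placeholder schema
          .option("header", "true")
          .csv("/data/incoming"))           # placeholder source directory

def process(batch_df, batch_id):
    # One small file usually arrives as a single partition; repartitioning
    # spreads the subsequent work over up to 16 tasks/cores.
    batch_df.repartition(16).write.mode("append").parquet("/data/out")

stream.writeStream.foreachBatch(process).start().awaitTermination()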
                >
                > On 10/14/20, 11:26 PM, "Artemis User"
                <arte...@dtechspace.com
                <mailto:arte...@dtechspace.com>> wrote:
                >
                 >
                >      Hi,
                >
                 >      We have a streaming application that reads
                 >      micro-batch CSV files and involves the
                 >      foreachBatch call.  Each micro-batch can be
                 >      processed independently.  I noticed that
                 >      only one worker node is being utilized.  Is
                 >      there any way or any explicit method to
                 >      distribute the batch workload to multiple
                 >      workers?  I would think Spark would execute
                 >      the foreachBatch method on different workers
                 >      since each batch can be treated as atomic.
                >
                >      Thanks!
                >
                >      ND