Re: Sorting on a streaming dataframe

2018-04-24 Thread Chayapan Khannabha
Perhaps your use case fits Apache Kafka better.

More info at:
https://kafka.apache.org/documentation/streams/ 


Everything really comes down to the architecture design and algorithm spec. 
However, from my experience with Spark, there are many good reasons why this 
requirement is not supported ;)

Best,

Chayapan (A)


> On Apr 24, 2018, at 2:18 PM, Hemant Bhanawat wrote:
> 
> Thanks Chris. There are many ways in which I can solve this problem, but they 
> are cumbersome. The easiest way would have been to sort the streaming 
> dataframe. The reason I asked this question is that I could not find a 
> reason why sorting on a streaming dataframe is disallowed. 
> 
> Hemant
> 
> On Mon, Apr 16, 2018 at 6:09 PM, Bowden, Chris wrote:
> You can happily sort the underlying RDD of InternalRow(s) inside a sink, 
> assuming you are willing to implement and maintain your own sink(s). That is, 
> just grabbing the parquet sink, etc. isn’t going to work out of the box. 
> Alternatively, map/flatMapGroupsWithState is probably sufficient and requires 
> less working knowledge to make effective reuse of internals. Just group by 
> foo and then sort accordingly and assign ids. The id counter can be stateful 
> per group. Sometimes this problem may not need to be solved at all. For 
> example, if you are using Kafka, a proper partitioning scheme and message 
> offsets may be “good enough”. 
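
For what it's worth, a minimal sketch of the flatMapGroupsWithState approach
described above: group by a key, sort each group's records, and keep the id
counter in per-group state. The grouping key "foo", the rate source, and the
Event/Numbered case classes are illustrative assumptions, not anyone's actual
code:

import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions._
import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode, Trigger}

// Hypothetical record shapes for the sketch.
case class Event(foo: String, ts: Long, payload: String)
case class Numbered(foo: String, id: Long, ts: Long, payload: String)

object StatefulIdsExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("stateful-ids").master("local[2]").getOrCreate()
    import spark.implicits._

    // Toy stream: the rate source emits (timestamp, value); derive a key and a payload from it.
    val events: Dataset[Event] = spark.readStream.format("rate").load()
      .select(
        (col("value") % 3).cast("string").as("foo"),
        col("timestamp").cast("long").as("ts"),
        concat(lit("rec-"), col("value").cast("string")).as("payload"))
      .as[Event]

    // Per group: sort the micro-batch's records, then assign ids from a counter kept in state.
    val assignIds: (String, Iterator[Event], GroupState[Long]) => Iterator[Numbered] =
      (key, rows, state) => {
        var next = state.getOption.getOrElse(0L)
        val numbered = rows.toVector.sortBy(_.ts).map { e =>
          next += 1
          Numbered(e.foo, next, e.ts, e.payload)
        }
        state.update(next)
        numbered.iterator
      }

    val withIds = events
      .groupByKey(_.foo)
      .flatMapGroupsWithState(OutputMode.Append, GroupStateTimeout.NoTimeout)(assignIds)

    withIds.writeStream
      .format("console")
      .outputMode("append")
      .trigger(Trigger.ProcessingTime("5 seconds"))
      .start()
      .awaitTermination()
  }
}
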
> From: Hemant Bhanawat
> Sent: Thursday, April 12, 2018 11:42:59 PM
> To: Reynold Xin
> Cc: dev
> Subject: Re: Sorting on a streaming dataframe
>  
> Well, we want to assign snapshot ids (incrementing counters) to the incoming 
> records. For that, we are zipping the streaming RDDs with that counter using 
> a modified version of ZippedWithIndexRDD. We are ok if the records in the 
> streaming dataframe get counters in random order, but the counter should 
> always be incrementing. 
> 
> This is working fine until we have a failure. When we have a failure, we 
> re-assign the records to snapshot ids, and this time the same snapshot id can 
> get assigned to a different record. This is a problem because the primary key 
> in our storage engine is . So we want to sort the 
> dataframe so that the records always get the same snapshot id. 
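
For context, a rough sketch of the zip-with-a-counter idea described above (the
actual code uses a modified ZippedWithIndexRDD; the offset bookkeeping and the
sort key below are illustrative assumptions). Sorting each batch by a stable key
before zipping is what would make the id assignment repeatable when a batch is
recomputed after a failure:

import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

object SnapshotIds {
  // Counter carried across micro-batches. In a real pipeline this would need to be
  // persisted durably so it survives driver restarts.
  @volatile private var offset: Long = 0L

  // Naive version: ids keep incrementing, but the (id -> record) mapping depends on
  // partition and record order, so a replayed batch can hand the same id to a
  // different record.
  def assign[T: ClassTag](batch: RDD[T]): RDD[(Long, T)] = {
    val start = offset
    offset = start + batch.count()
    batch.zipWithIndex().map { case (rec, i) => (start + i, rec) }
  }

  // Deterministic version: sort the batch by a stable key first, so the same record
  // gets the same id when the batch is recomputed.
  def assignDeterministic[T: ClassTag, K: ClassTag: Ordering](batch: RDD[T])(key: T => K): RDD[(Long, T)] = {
    val start = offset
    offset = start + batch.count()
    batch.sortBy(key).zipWithIndex().map { case (rec, i) => (start + i, rec) }
  }
}
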
> 
> 
> 
> On Fri, Apr 13, 2018 at 11:43 AM, Reynold Xin wrote:
> Can you describe your use case more?
> 
> On Thu, Apr 12, 2018 at 11:12 PM Hemant Bhanawat wrote:
> Hi Guys, 
> 
> Why is sorting on streaming dataframes not supported (unless it is in complete 
> mode)? My downstream needs me to sort the streaming dataframe.
> 
> Hemant 
> 
> 



Re: spark job scheduling

2016-01-27 Thread Chayapan Khannabha
I would start with this documentation page:
https://spark.apache.org/docs/1.2.0/job-scheduling.html

Although I'm sure this depends a lot on your cluster environment and the
deployed Spark version.

IMHO
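
To make that concrete, here is a minimal sketch of enabling FAIR scheduling and
pinning one thread's jobs to a named pool (the pool name, weight, and file path
below are made-up examples, not required values):

conf/fairscheduler.xml:

  <allocations>
    <pool name="interactive">
      <schedulingMode>FAIR</schedulingMode>
      <weight>2</weight>
      <minShare>1</minShare>
    </pool>
  </allocations>

Driver side:

  import org.apache.spark.{SparkConf, SparkContext}

  val conf = new SparkConf()
    .setAppName("fair-scheduling-demo")
    .set("spark.scheduler.mode", "FAIR")
    .set("spark.scheduler.allocation.file", "conf/fairscheduler.xml")
  val sc = new SparkContext(conf)

  // Jobs submitted from this thread go to the "interactive" pool; threads that never
  // call setLocalProperty submit to the "default" pool instead.
  sc.setLocalProperty("spark.scheduler.pool", "interactive")
  sc.parallelize(1 to 1000000).count()
  sc.setLocalProperty("spark.scheduler.pool", null)  // revert to the default pool
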

On Thu, Jan 28, 2016 at 10:27 AM, Niranda Perera <niranda.per...@gmail.com>
wrote:

> Sorry, I have made typos. Let me rephrase:
>
> 1. As I understand, the smallest unit of work an executor can perform is
> a 'task'. In the 'FAIR' scheduler mode, let's say a job is submitted to the
> spark ctx which has a considerable amount of work to do in a single task.
> While such a 'big' task is running, can we still submit another smaller job
> (from a separate thread) and get it done? Or does that smaller job have to
> wait till the bigger task finishes and the resources are freed from the
> executor?
> (essentially, what I'm asking is: in the FAIR scheduler mode, jobs are
> scheduled fairly, but at the task granularity they are still FIFO?)
>
> 2. When a job is submitted without setting a scheduler pool, the 'default'
> scheduler pool is assigned to it, which employs FIFO scheduling. But what
> happens when we have spark.scheduler.mode set to FAIR and I submit jobs
> without specifying a scheduler pool (which has FAIR scheduling)? Would the
> jobs still run in FIFO mode with the default pool?
> Essentially, for us to really get FAIR scheduling, do we have to assign a
> FAIR scheduler pool to the job as well?
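
As far as I understand the scheduling docs: with spark.scheduler.mode=FAIR and no
pool specified, every job lands in the "default" pool, and jobs within a single
pool still run FIFO unless that pool's schedulingMode is set to FAIR in
fairscheduler.xml. Also, running tasks are never preempted, so a big task keeps
its core until it finishes; fairness only applies when free task slots are handed
out. A small sketch of what submitting a smaller job from a separate thread could
look like (job bodies and sizes are placeholders):

  import org.apache.spark.{SparkConf, SparkContext}

  val sc = new SparkContext(
    new SparkConf()
      .setAppName("fair-demo")
      .setMaster("local[4]")
      .set("spark.scheduler.mode", "FAIR"))

  // A "big" job kicked off from one thread...
  new Thread(new Runnable {
    def run(): Unit = {
      sc.parallelize(1 to 100000000, 8).map(x => math.sqrt(x)).count()
    }
  }).start()

  // ...and a smaller job submitted concurrently from another thread. Under FAIR mode
  // it is offered task slots as cores free up, rather than queueing behind the big job.
  new Thread(new Runnable {
    def run(): Unit = {
      println(sc.parallelize(1 to 1000).sum())
    }
  }).start()

And to make jobs inside the default pool fair among themselves as well (instead
of FIFO), the default pool itself can be redefined in conf/fairscheduler.xml:

  <allocations>
    <pool name="default">
      <schedulingMode>FAIR</schedulingMode>
    </pool>
  </allocations>
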
>
> On Thu, Jan 28, 2016 at 8:47 AM, Chayapan Khannabha <chaya...@gmail.com>
> wrote:
>
>> I think the smallest unit of work is a "Task", and an "Executor" is
>> responsible for getting the work done? I would like to understand more about
>> the scheduling system too. Scheduling strategies like FAIR and FIFO do have a
>> significant impact on Spark cluster architecture design decisions.
>>
>> Best,
>>
>> Chayapan (A)
>>
>> On Thu, Jan 28, 2016 at 10:07 AM, Niranda Perera <
>> niranda.per...@gmail.com> wrote:
>>
>>> hi all,
>>>
>>> I have a few questions on spark job scheduling.
>>>
>>> 1. As I understand, the smallest unit of work an executor can perform.
>>> In the 'fair' scheduler mode, let's say  a job is submitted to the spark
>>> ctx which has a considerable amount of work to do in a task. While such a
>>> 'big' task is running, can we still submit another smaller job (from a
>>> separate thread) and get it done? or does that smaller job has to wait till
>>> the bigger task finishes and the resources are freed from the executor?
>>>
>>> 2. When a job is submitted without setting a scheduler pool, the default
>>> scheduler pool is assigned to it, which employs FIFO scheduling. but what
>>> happens when we have the spark.scheduler.mode as FAIR, and if I submit jobs
>>> without specifying a scheduler pool (which has FAIR scheduling)? would the
>>> jobs still run in FIFO mode with the default pool?
>>> essentially, for us to really set FAIR scheduling, do we have to assign
>>> a FAIR scheduler pool?
>>>
>>> best
>>>
>>> --
>>> Niranda
>>> @n1r44 <https://twitter.com/N1R44>
>>> +94-71-554-8430
>>> https://pythagoreanscript.wordpress.com/
>>>
>>
>>
>
>
> --
> Niranda
> @n1r44 <https://twitter.com/N1R44>
> +94-71-554-8430
> https://pythagoreanscript.wordpress.com/
>