What driver are you using to connect to MSSQL? And what are the properties of the database (sp_helpdb dbname)? It seems rather strange that a box that handles "over 20k a second" can't stand the pressure of an additional 20 (without the "k").
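If it helps, both checks can be run straight from a web2py shell. A minimal sketch ('yourapp' and 'yourdb' are placeholders; it assumes the pyodbc-based mssql adapter, so only the first result set of sp_helpdb comes back):

    # start a shell with: python web2py.py -S yourapp -M
    rows = db.executesql("EXEC sp_helpdb 'yourdb'", as_dict=True)
    for row in rows:
        print(row)  # name, db_size, owner, created, status, compatibility_level

    # the driver/adapter in use is visible in the connection URI
    print(db._uri)  # e.g. mssql4://user:pass@host/yourdb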
On Tuesday, September 6, 2016 at 5:19:30 PM UTC+2, Jason Solack wrote:
> we're handling over 20k a second. The odd thing is that if I move the scheduler to MySQL, the deadlocks stop. The MySQL box has much lower specs and we only use it for internal stuff, so I don't want that to be the final solution. As far as load balancing, I'm referring to our actual webserver. We load balance across 3 machines, and each machine has 3 worker processes running on it. It seems that those web2py processes lock the table and cause deadlocks.
>
> On Tuesday, August 30, 2016 at 3:37:41 PM UTC-4, Niphlod wrote:
>> if 24 cores and 200+ GB RAM is the backend, how many transactions per second is that thing handling? I've seen lots of ultra-beefy servers that were poorly configured and hence performed poorly, but it'd be criminal to blame the product in that case (rather than the person who configured it).
>>
>> I run 10 workers on 3 frontends backed by a single 2-CPU, 4 GB RAM MSSQL backend and have no issues at all, so, network connectivity hiccups aside, sizing shouldn't be a problem. Since we're talking my territory here (I'm a DBA in real life), my backend doesn't sweat at 1k batch requests/sec. To put theory into real data: 10 idle workers consume roughly 18 batch requests/sec with the default heartbeat, and from 5 to 10 transactions/sec. That's less than 1% of "pressure".
>>
>> You're referring here and there to "when we are load balancing the server"... are you talking about the server where the workers live, or the server that holds the database?
>>
>> On Tuesday, August 30, 2016 at 6:03:56 PM UTC+2, Jason Solack wrote:
>>> the machine is plenty big (24 cores and over 200 GB of RAM)... another note: when we use MySQL on a weaker machine, the deadlocks go away, so I feel this must be something related to MSSQL. Also, it only happens when we are load balancing the server.
>>>
>>> we have it set up so each of the 3 machines is running 4 workers. They all have the same group name; is that the proper way to configure a load-balanced setup?
>>>
>>> On Tuesday, August 30, 2016 at 11:48:42 AM UTC-4, Niphlod wrote:
>>>> when the backend has horrible performance :D
>>>> 12 workers with the default heartbeat are easily handled by a dual-core, 4 GB RAM backend (without anything beefy on top of that).
>>>>
>>>> On Tuesday, August 30, 2016 at 5:41:01 PM UTC+2, Jason Solack wrote:
>>>>> So after more investigation, we are seeing that our load-balanced setup, with workers running on all three machines, is causing a lot of deadlocks in MSSQL. Have you seen that before?
>>>>>
>>>>> On Friday, August 19, 2016 at 2:40:35 AM UTC-4, Niphlod wrote:
>>>>>> yep. your worker setup clearly can't hold a stable connection to your backend.
>>>>>>
>>>>>> On Thursday, August 18, 2016 at 7:41:38 PM UTC+2, Jason Solack wrote:
>>>>>>> so after some digging, what I'm seeing is that the sw.insert(...) is not committing and mybackedstatus is None; this happens 5 times, and then the worker appears and almost instantly disappears. There are no errors. I tried manually doing a db.executesql, but I'm having trouble getting self.w_stats converted to something I can insert via SQL.
>>>>>>>
>>>>>>> another thing I'm noticing is that my "distribution" in w_stats is None...
>>>>>>>
>>>>>>> Any ideas as to why this is happening?
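On the "can't convert self.w_stats" point above: w_stats is a gluon.storage.Storage, i.e. a dict subclass, so json.dumps() turns it into a string you can bind as a parameter. A rough sketch, assuming the pyodbc '?' paramstyle and with the column list trimmed to the essentials (a real insert would also want the heartbeat columns):

    import json
    from gluon.storage import Storage

    # stand-in for self.w_stats
    stats = Storage(status='ACTIVE', sleep=3, distribution=None)
    db.executesql(
        "INSERT INTO scheduler_worker (worker_name, status, worker_stats) "
        "VALUES (?, ?, ?)",
        placeholders=('w1#1234', 'ACTIVE', json.dumps(stats)))
    db.commit()  # manual SQL still needs an explicit commit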
>>>>>>> On Thursday, August 18, 2016 at 12:21:26 PM UTC-4, Jason Solack wrote:
>>>>>>>> doing that now. what I'm seeing is some problems here:
>>>>>>>>
>>>>>>>> # record heartbeat
>>>>>>>> mybackedstatus = db(sw.worker_name == self.worker_name).select().first()
>>>>>>>> if not mybackedstatus:
>>>>>>>>     sw.insert(status=ACTIVE, worker_name=self.worker_name,
>>>>>>>>               first_heartbeat=now, last_heartbeat=now,
>>>>>>>>               group_names=self.group_names,
>>>>>>>>               worker_stats=self.w_stats)
>>>>>>>>     self.w_stats.status = ACTIVE
>>>>>>>>     self.w_stats.sleep = self.heartbeat
>>>>>>>>     mybackedstatus = ACTIVE
>>>>>>>>
>>>>>>>> mybackedstatus is consistently coming back as None. I'm guessing there is an error somewhere in that try block and the db commit is being rolled back.
>>>>>>>>
>>>>>>>> I'm using MSSQL and nginx... currently upgrading web2py to see if it continues.
>>>>>>>>
>>>>>>>> On Thursday, August 18, 2016 at 10:44:28 AM UTC-4, Niphlod wrote:
>>>>>>>>> turn on the workers' debugging level and grep for errors.
>>>>>>>>>
>>>>>>>>> On Thursday, August 18, 2016 at 4:38:31 PM UTC+2, Jason Solack wrote:
>>>>>>>>>> I think we have this scenario happening:
>>>>>>>>>> https://groups.google.com/forum/#%21searchin/web2py/task_id%7csort:relevance/web2py/AYH5IzCIEMo/hY6aNplbGX8J
>>>>>>>>>>
>>>>>>>>>> our workers seem to be restarting quickly, and we're trying to figure out why.
>>>>>>>>>>
>>>>>>>>>> On Thursday, August 18, 2016 at 3:55:55 AM UTC-4, Niphlod wrote:
>>>>>>>>>>> small recap... a single worker is tasked with assigning tasks (the one with is_ticker=True), and each task is then picked up only by its assigned worker (you can see it in the scheduler_task.assigned_worker_name column of the task). There's no way the same task (i.e. a scheduler_task "row") is executed while it is RUNNING (i.e. being processed by some worker). The process running the task is also stored in scheduler_run.worker_name.
>>>>>>>>>>>
>>>>>>>>>>> <tl;dr> you shouldn't EVER have scheduler_run records with the same task_id and 12 different worker_names, all in the RUNNING status.
>>>>>>>>>>>
>>>>>>>>>>> For a single task to be processed by ALL 12 workers at the same time is quite impossible, if everything is running smoothly. And frankly, I can't fathom any scenario in which it is possible.
>>>>>>>>>>>
>>>>>>>>>>> On Wednesday, August 17, 2016 at 6:25:41 PM UTC+2, Jason Solack wrote:
>>>>>>>>>>>> I only see the task_id in the scheduler_run table; it seems to be added as many times as it can while the run is going... a short run will add just 2 of the workers and stops adding them once the initial run is completed.
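By the way, it's easy to check for exactly the symptom described in that recap (the same task RUNNING under several workers at once) with a DAL query against the standard scheduler tables; a quick sketch:

    sr = db.scheduler_run
    n = sr.id.count()
    # any row returned means one task_id has more than one RUNNING record
    for row in db(sr.status == 'RUNNING').select(
            sr.task_id, n, groupby=sr.task_id, having=n > 1):
        print('task %s: %s RUNNING rows' % (row.scheduler_run.task_id, row[n]))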
>>>>>>>>>>>> On Wednesday, August 17, 2016 at 11:15:52 AM UTC-4, Niphlod wrote:
>>>>>>>>>>>>> task assignment is quite "beefy" (sadly, or fortunately in your case, it favours consistency over speed): I don't see any reason why a single task would get picked up by ALL 12 workers at the same time if the backend isn't lying (i.e. slaves not replicating master data)... if your MSSQL is a single instance, there absolutely shouldn't be those kinds of problems...
>>>>>>>>>>>>>
>>>>>>>>>>>>> Are you sure they are all crunching the same exact task (i.e. same task id and uuid)?
>>>>>>>>>>>>>
>>>>>>>>>>>>> On Wednesday, August 17, 2016 at 2:47:11 PM UTC+2, Jason Solack wrote:
>>>>>>>>>>>>>> I'm using nginx, and MSSQL for the db.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Wednesday, August 17, 2016 at 3:11:11 AM UTC-4, Niphlod wrote:
>>>>>>>>>>>>>>> nothing in particular. what backend are you using?
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Tuesday, August 16, 2016 at 8:35:17 PM UTC+2, Jason Solack wrote:
>>>>>>>>>>>>>>>> task = scheduler.queue_task(
>>>>>>>>>>>>>>>>     tab_run,
>>>>>>>>>>>>>>>>     pvars=dict(tab_file_name=tab_file_name,
>>>>>>>>>>>>>>>>                the_form_file=the_form_file),
>>>>>>>>>>>>>>>>     timeout=60 * 60 * 24, sync_output=2, immediate=False,
>>>>>>>>>>>>>>>>     group_name=scheduler_group_name)
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> anything look amiss here?
>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On Tuesday, August 16, 2016 at 2:14:38 PM UTC-4, Dave S wrote:
>>>>>>>>>>>>>>>>> On Tuesday, August 16, 2016 at 9:38:09 AM UTC-7, Jason Solack wrote:
>>>>>>>>>>>>>>>>>> Hello all, I am having a situation where my scheduled jobs are being picked up by multiple workers. My last task was picked up by all 12 workers and is crushing the machines. This is a load-balanced setup with 3 machines and 4 workers on each machine. Has anyone experienced something like this?
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> Thanks for your help in advance!
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>> jason
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> What does your queue_task() code look like?
>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>> /dps
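As for the queue_task() call quoted above: nothing looks amiss in itself. One thing worth trying while chasing double pickups is passing an explicit uuid; scheduler_task.uuid is unique, so re-submitting the same logical job fails validation instead of queuing a twin. A sketch reusing the names from that call (the uuid scheme is just an assumption):

    task = scheduler.queue_task(
        tab_run,
        pvars=dict(tab_file_name=tab_file_name,
                   the_form_file=the_form_file),
        uuid='tab_run/%s' % tab_file_name,  # assumed: one run per file
        timeout=60 * 60 * 24, sync_output=2, immediate=False,
        group_name=scheduler_group_name)
    if task.errors:
        print(task.errors)  # a duplicate uuid shows up here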