Re: Multiple issues with Pulse (1.0.0.Incubating)

Kirk Lund Mon, 21 Nov 2016 10:21:50 -0800

[cc'ing user@geode]

Hi Dharam,


[Regarding #3 and leaving the Locator process running]

Turns out the Locator process is still running because it has gone into
reconnect mode. Killing the Server triggered network partition detection in
the Locator causing it to force disconnect and go into reconnect mode [1].

There's some more info about reconnect on the wiki [2] and [3]. Network
partition detection was previously disabled by default, but the default was
changed to enabled, possibly as part of the work for GEODE-77 [4].

I filed GEODE-2125 [5] to make changes to GFSH so that it can communicate
with a Locator in reconnect mode. This way, start/status/stop would work.
Start would say there's a Locator already running (in that dir, on that
port, etc). Status would say the Locator is disconnected. What you're
seeing now is "expected" until we can improve GFSH usability with reconnect
(probably in GEODE 1.1.0).

PS: if you restart the server you killed, the reconnecting Locator should
come back online. Disabling network partition detection as mentioned before
will prevent this behavior.
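
A minimal sketch of that setting for a Java-configured member, such as the
Spring Boot server in the original report. The property names
enable-network-partition-detection and locators come from this thread; the
class name, the method name, and the "localhost" host are assumptions made
for illustration only. The setting generally has to be the same on every
member, so the Locator would need it as well.

```java
import java.util.Properties;

public class PartitionDetectionConfig {

    // Builds the Geode properties for a small test member.
    // "localhost" is an assumed host; the thread only gives port 10334.
    static Properties geodeProperties() {
        Properties props = new Properties();
        props.setProperty("locators", "localhost[10334]");
        // Turn off network partition detection, as suggested for small
        // test clusters where losing one server triggers a disconnect.
        props.setProperty("enable-network-partition-detection", "false");
        return props;
    }

    public static void main(String[] args) {
        // Print the properties; in a real member they would be handed to
        // Geode, for example through new CacheFactory(props).
        geodeProperties().list(System.out);
    }
}
```

This only assembles and prints the properties. How they reach Geode (a
gemfire.properties file, the API, or Spring configuration) depends on how
each member is started.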

[1] http://geode.apache.org/docs/guide/managing/autoreconnect/member-reconnect.html
[2] https://cwiki.apache.org/confluence/display/GEODE/MembershipManager+Functional+Specification
[3] https://cwiki.apache.org/confluence/display/GEODE/Auto+Reconnect+Sequence+Diagram
[4] https://issues.apache.org/jira/browse/GEODE-77
[5] https://issues.apache.org/jira/browse/GEODE-2125

Thanks,
Kirk

On Wed, Nov 16, 2016 at 7:36 PM, Dharam Thacker <[email protected]>
wrote:

> Hi Kirk,
>
> Thank you for the updates!
>
> "I'm not sure what's going on with the PID in #3. When the locator shuts
> down due to partition detection, does it seem to leave that Java process
> running? If it does that's another issue that should be fixed."
>
> Yes, it leaves java process running and PID file holds a lock. I have to
> manually kill the java process to start locator on same port later on.
>
> I would re-verify for Issue#1.
>
> Thanks & Regards,
> Dharam
>
> - Dharam Thacker
>
> On Thu, Nov 17, 2016 at 3:27 AM, Kirk Lund <[email protected]> wrote:
>
>> #1 is the old URL as Jinmei mentioned. Using the new URL should clear
>> that up for you.
>>
>> #2 is a newly identified bug filed as GEODE-2117 "Pulse fails to handle
>> float type mbean attributes" which was introduced by commits for GEODE-907.
>> We're working on a fix for GEODE-2117.
>>
>> #3 is being caused by a change made in GEODE to have network partition
>> detection enabled by default.  In a small test system this can cause this
>> kind of behavior.  You should set enable-network-partition-detection=false
>> to avoid this behavior. I filed GEODE-2118 to hopefully make additional
>> changes so that partition detection doesn't trigger shutdown by losing one
>> server in such a small cluster.
>>
>> I'm not sure what's going on with the PID in #3. When the locator shuts
>> down due to partition detection, does it seem to leave that Java process
>> running? If it does that's another issue that should be fixed.
>>
>> Thanks,
>> Kirk
>>
>>
>> On Wed, Nov 16, 2016 at 10:49 AM, Dharam Thacker <
>> [email protected]> wrote:
>>
>>> Thanks Kirk! I have sent logs in private email to you.
>>>
>>> Regards,
>>> Dharam
>>>
>>> - Dharam Thacker
>>>
>>> On Wed, Nov 16, 2016 at 11:48 PM, Kirk Lund <[email protected]> wrote:
>>>
>>>> Hi Dharam,
>>>>
>>>> Can you send us the locator1 log file from #3? We're unable to
>>>> reproduce what you're seeing.
>>>>
>>>> Thanks,
>>>> Kirk
>>>>
>>>>
>>>> On Tue, Nov 15, 2016 at 10:17 PM, Dharam Thacker <
>>>> [email protected]> wrote:
>>>>
>>>>> Hi Team,
>>>>>
>>>>> We are seeing multiple issues with pulse in latest version of Apache
>>>>> geode. (It was fine till M3)
>>>>>
>>>>> 1) Unable to log in from Firefox, only works with IE
>>>>>
>>>>> Error:
>>>>>
>>>>> HTTP ERROR 404
>>>>>
>>>>> Problem accessing /pulse/j_spring_security_check
>>>>>
>>>>> Reason:
>>>>>     Not Found
>>>>>
>>>>> Even within IE, it does not open pulse consistently and we get error
>>>>> 503 sometimes due to issue explained in point 3)
>>>>>
>>>>> 2) Bugs in pulse statistics as per pulse logs,
>>>>>
>>>>> [INFO 2016/11/10 21:20:40.594 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18882) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: AverageWrites
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.594 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18883) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: DiskWritesRate
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.595 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18884) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: AverageWrites
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.595 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18885) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: AverageReads
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.595 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18886) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute:
>>>>> QueryRequestRate Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.595 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18887) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: DiskReadsRate
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.596 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18888) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: DiskWritesRate
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.596 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18889) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: DiskReadsRate
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.596 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18890) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: CpuUsage
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>> [INFO 2016/11/10 21:20:45.596 IST 
>>>>> PULSE-dharam-ThinkPad-Edge-E431:1099:null
>>>>> tid=0x6d] (msgTID=109 msgSN=18891) [PULSE]
>>>>> [org.apache.geode.tools.pulse.internal.log.PulseLogWriter]
>>>>> ************************Unexpected type for attribute: AverageReads
>>>>> Expected type: java.lang.Double Received type:
>>>>> java.lang.Float************************
>>>>>
>>>>>
>>>>> 3) Related to locator : gfsh disconnects from cluster abruptly
>>>>>
>>>>> Steps to reproduce:
>>>>>
>>>>> 1) Start simple locator from gfsh
>>>>>
>>>>> start locator --name=locator1
>>>>>
>>>>> 2) Start server *from spring boot* with minimal configuration
>>>>> (locators[10334]) and at least 1 region filled with data
>>>>>
>>>>> 3) Kill the server
>>>>>
>>>>> 4) You will see gfsh gets disconnected from cluster
>>>>>
>>>>> 5) And even most of the time locator gets killed
>>>>>
>>>>> 6) Port 7070 for HTTP goes into TIME_WAIT state and you will have to
>>>>> manually kill PID of process.
>>>>>
>>>>>
>>>>> Thanks & Regards,
>>>>> Dharam
>>>>>
>>>>>
>>>>
>>>
>>
>
