The Cluster Guy

Highly Available Ramblings

Release Candidate: 1.1.10-rc5

Let's try this again… Announcing the fourth and a half release candidate for Pacemaker 1.1.10

I previously tagged rc4 but ended up making several changes shortly afterwards, so it was pointless to announce it.

This RC is a result of cleanup work in several ancient areas of the codebase:

  • A number of internal membership caches have been combined
  • The three separate CPG code paths have been combined

As well as:

  • Moving clones is now both possible and sane
  • Improved behavior on systemd based nodes
  • and other assorted bugfixes (see below)

Please keep the bug reports coming in!

Help is specifically requested for testing plugin-based clusters, ACLs, the new --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

Also any light that can be shed on possible memory leaks would be much appreciated.

If everything looks good in a week from now, I will re-tag rc5 as final.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc5

Changesets  168
Diff 96 files changed, 4983 insertions(+), 3097 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc5

  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • Turn off auto-respawning of systemd services when the cluster starts them

Changes since Pacemaker-1.1.10-rc3

  • Bug pengine: cl#5155 - Block the stop of resources if any depending resource is unmanaged
  • Convert all exit codes to positive errno values
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Nodes that can persist in sending CPG messages must be alive after all
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Everyone who gets a fencing notification should mark the node as down
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Update the status section with details of nodes for which we only know the nodeid
  • crm_report: Find logs in compressed files
  • logging: If SIGTRAP is sent before tracing is turned on, turn it on
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd

Release Candidate: 1.1.10-rc3

Announcing the third release candidate for Pacemaker 1.1.10

This RC is a result of work in several problem areas reported by users, some of which date back to 1.1.8:

  • manual fencing confirmations
  • potential problems reported by Coverity
  • the way anonymous clones are displayed
  • handling of resource output that includes non-printing characters
  • handling of on-fail=block

Please keep the bug reports coming in. There is a good chance that this will be the final release candidate and 1.1.10 will be tagged on May 30th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 0 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc3

Changesets  116
Diff 59 files changed, 707 insertions(+), 408 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc3

  • PE: Display a list of nodes on which stopped anonymous clones are not active instead of meaningless clone IDs
  • PE: Suppress meaningless IDs when displaying anonymous clone status

Changes since Pacemaker-1.1.10-rc2

  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • cib: CID#1023858 - Explicit null dereferenced
  • cib: CID#1023862 - Improper use of negative value
  • cib: CID#739562 - Improper use of negative value
  • cman: Our daemons have no need to connect to pacemakerd in a cman based cluster
  • crmd: Do not record pending delete operations in the CIB
  • crmd: Ensure pending and lost actions have values for last-run and last-rc-change
  • crmd: Insert async failures so that they appear in the correct order
  • crmd: Store last-run and last-rc-change for fail operations
  • Detect child processes that terminate before our SIGCHLD handler is installed
  • fencing: CID#739461 - Double close
  • fencing: Correctly broadcast manual fencing ACKs
  • fencing: Correctly mark manual confirmations as complete
  • fencing: Do not send duplicate replies for manual confirmation operations
  • fencing: Restore the ability to manually confirm that fencing completed
  • lrmd: CID#1023851 - Truncated stdio return value
  • lrmd: Don’t complain when heartbeat invokes us with -r
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • xml: Restore the ability to embed comments in the cib

Pacemaker Logging

Normal operation

Pacemaker inherits most of its logging settings from either CMAN or Corosync, depending on what it's running on top of.

In order to avoid spamming syslog, Pacemaker only logs a summary of its actions (NOTICE and above) to syslog.

If the level of detail in syslog is insufficient, you should enable a cluster log file. Normally one is configured by default and it contains everything except debug and trace messages.

To find the location of this file, either examine your CMAN (cluster.conf) or Corosync (corosync.conf) configuration file or look for syslog entries such as:

pacemakerd[1823]:   notice: crm_add_logfile: Additional logging available in /var/log/cluster/corosync.log

If you do not see a line like this, either update the cluster configuration or set PCMK_debugfile in /etc/sysconfig/pacemaker.

crm_report also knows how to find all the Pacemaker related logs and blackbox files.
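For a quick check without crm_report, the syslog entry shown above can be picked out with a small filter. This is only a sketch: find_cluster_logfile is a hypothetical helper name, and it assumes the message text matches the crm_add_logfile example exactly.

```shell
# Print the cluster log file path(s) that Pacemaker announced in syslog.
# Reads syslog text on stdin. find_cluster_logfile is a hypothetical name,
# and the pattern assumes the crm_add_logfile message format shown above.
find_cluster_logfile() {
    sed -n 's/.*crm_add_logfile: Additional logging available in \(.*\)$/\1/p' |
        sort -u
}
```

Run it as find_cluster_logfile < /var/log/messages. If it prints nothing, no log file was announced and PCMK_debugfile is the next thing to look at.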

If the level of detail in the cluster log file is still insufficient, or you simply wish to go blind, you can turn on debugging in Corosync/CMAN, or set PCMK_debug in /etc/sysconfig/pacemaker.

A minor advantage of setting PCMK_debug is that the value can be a comma-separated list of processes which should produce debug logging instead of a global yes/no.
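To make the two forms of the value concrete, here is a rough sketch of how such a setting could be interpreted. It is not Pacemaker's own parsing code: debug_enabled_for is a hypothetical helper, and it assumes the only global values are a literal yes or no, with anything else treated as a comma-separated list of process names.

```shell
# Succeed (exit 0) if a PCMK_debug value would cover the named process.
# Usage: debug_enabled_for VALUE PROCESS
# Hypothetical helper; assumes "yes"/"no" are the only global values and
# that any other value is a comma-separated list of process names.
debug_enabled_for() {
    value=$1
    process=$2
    case "$value" in
        yes) return 0 ;;
        no|"") return 1 ;;
    esac
    # Wrap both sides in commas so that "crmd" does not match "xcrmd".
    case ",$value," in
        *",$process,"*) return 0 ;;
        *) return 1 ;;
    esac
}
```

With a value of crmd,pengine the helper succeeds for crmd and pengine and fails for lrmd, which is the selective behaviour a global yes/no cannot express.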

When an ERROR occurs

Pacemaker includes support for a blackbox.

When enabled, the blackbox contains a rolling buffer of all logs (not just those sent to syslog or a file) and is written to disk after a crash or assertion failure.

The blackbox recorder can be enabled by setting PCMK_blackbox in /etc/sysconfig/pacemaker or at runtime by sending SIGUSR1. Eg.

killall -USR1 crmd

When enabled you’ll see a log such as:

crmd[1811]:   notice: crm_enable_blackbox: Initiated blackbox recorder: /var/lib/pacemaker/blackbox/crmd-1811

If a crash occurs, the blackbox will be available at that location. To extract the contents, pass it to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811

Which produces output like:

Dumping the contents of /var/lib/pacemaker/blackbox/crmd-1811
[debug] shm size:5242880; real_size:5242880; rb->word_size:1310720
[debug] read total of: 5242892
Ringbuffer:
 ->NORMAL
 ->write_pt [5588]
 ->read_pt [0]
 ->size [1310720 words]
 =>free [5220524 bytes]
 =>used [22352 bytes]
...
trace   May 19 23:20:55 gio_read_socket(368):0: 0x11ab920.5 1 (ref=1)
trace   May 19 23:20:55 pcmk_ipc_accept(458):0: Connection 0x11aee00
info    May 19 23:20:55 crm_client_new(302):0: Connecting 0x11aee00 for uid=0 gid=0 pid=24425 id=0e943a2a-dd64-49bc-b9d5-10fa6c6cb1bd
debug   May 19 23:20:55 handle_new_connection(465):2147483648: IPC credentials authenticated (24414-24425-14)
...
[debug] Free'ing ringbuffer: /dev/shm/qb-create_from_file-header

When an ERROR occurs you’ll also see the function and line number that produced it such as:

crmd[1811]: Problem detected at child_death_dispatch:872 (mainloop.c), please see /var/lib/pacemaker/blackbox/crmd-1811.1 for additional details
crmd[1811]: Problem detected at main:94 (crmd.c), please see /var/lib/pacemaker/blackbox/crmd-1811.2 for additional details

Again, simply pass the files to qb-blackbox to extract and query the contents.

Note that a counter is added to the end so as to avoid name collisions.

Diving into files and functions

In case you have not already guessed, all logs include the name of the function that generated them. So:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

came from the function crm_update_peer_state().

To obtain more detail from that or any other function, you can set PCMK_trace_functions in /etc/sysconfig/pacemaker to a comma separated list of function names. Eg.

PCMK_trace_functions=crm_update_peer_state,run_graph

For a bigger stick, you may also activate trace logging for all the functions in a particular source file or files by setting PCMK_trace_files as well.

PCMK_trace_files=cluster.c,election.c

These additional logs are sent to the cluster log file. Note that enabling tracing options also alters the output format.

Instead of:

crmd:  notice: crm_cluster_connect:     Connecting to cluster infrastructure: cman

the output includes file and line information:

crmd: (   cluster.c:215   )  notice: crm_cluster_connect:   Connecting to cluster infrastructure: cman

But wait there’s still more

Still need more detail? You’re in luck! The blackbox can be dumped at any time, not just when an error occurs.

First, make sure the blackbox is active (we’ll assume it’s the crmd that needs to be debugged):

killall -USR1 crmd

Next, discard any previous contents by dumping them to disk

killall -TRAP crmd

Now cause whatever condition you’re trying to debug, and send -TRAP when you’re ready to see the result.

killall -TRAP crmd

You can now look for the result in syslog:

grep -e crm_write_blackbox: /var/log/messages

This will include a filename containing the trace logging:

crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.1 for contents
crmd[1811]:   notice: crm_write_blackbox: Blackbox dump requested, please see /var/lib/pacemaker/blackbox/crmd-1811.2 for contents

To extract the trace logging for our test, pass the most recent file to qb-blackbox:

qb-blackbox /var/lib/pacemaker/blackbox/crmd-1811.2

At this point you’ll probably want to use grep :)
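Picking the most recent dump out of syslog can be scripted. The following is a minimal sketch that assumes the crm_write_blackbox message format shown above. latest_blackbox_dump is a hypothetical helper name.

```shell
# Print the most recent blackbox dump file reported for a daemon.
# Usage: latest_blackbox_dump DAEMON < syslog
# Hypothetical helper; assumes the crm_write_blackbox message format
# shown above and that syslog entries are in chronological order.
latest_blackbox_dump() {
    daemon=$1
    grep -e "${daemon}\[[0-9]*]:.*crm_write_blackbox: Blackbox dump requested" |
        sed -n 's/.*please see \([^ ]*\) for contents.*/\1/p' |
        tail -n 1
}
```

The result can be handed straight to qb-blackbox, for example qb-blackbox "$(latest_blackbox_dump crmd < /var/log/messages)".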

Debugging the Policy Engine

Finding the right node

The Policy Engine is the component that takes the cluster’s current state, decides on the optimal next state and produces an ordered list of actions to achieve that state.

You can get a summary of what the cluster did in response to resource failures and nodes joining/leaving the cluster by looking at the logs from pengine:

grep -e pengine\\[ -e pengine: /var/log/messages

Although the pengine process is active on all cluster nodes, it is only doing work on one of them. The “active” instance is chosen through the crmd’s DC election process and may move around as nodes leave/join the cluster.

If you do not see anything from pengine at the time the problem occurs, continue to the next machine.

If you do not see anything from pengine on any node, check whether logging to syslog is enabled in your cluster configuration, and check the syslog configuration to see where it is being sent. If in doubt, refer to Pacemaker Logging.

Once you have located the correct node to investigate, the first thing to do is look for the terms ERROR and WARN, eg.

grep -e pengine\\[ -e pengine: /var/log/messages | grep -e ERROR -e WARN

This will highlight any problems the software encountered.

Next expand the query to all pengine logs:

grep -e pengine\\[ -e pengine: /var/log/messages

The output will look a little like:

pengine[6132]:   notice: LogActions: Move    mysql  (Started corosync-host-1 -> corosync-host-4)
pengine[6132]:   notice: LogActions: Start   www    (corosync-host-6)
pengine[6132]:   notice: process_pe_message: Calculated Transition 7: /var/lib/pacemaker/pengine/pe-input-4424.bz2
pengine[6132]:   notice: process_pe_message: Calculated Transition 8: /var/lib/pacemaker/pengine/pe-input-4425.bz2

In the above logs, transition 7 resulted in mysql being moved and www being started. Later, transition 8 occurred but everything was where it should be and no action was required.
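Reading logs this way gets tedious on a busy cluster, so here is a rough sketch that groups each set of LogActions entries under the transition that follows them. summarize_transitions is a hypothetical helper name, and the sketch assumes that actions are always logged before the process_pe_message line of the transition they belong to, as in the example above.

```shell
# Group pengine LogActions entries under the transition they belong to.
# Reads syslog text on stdin. summarize_transitions is a hypothetical name.
# Assumes actions are logged before their "Calculated Transition" line.
summarize_transitions() {
    awk '
        /LogActions: / {
            # Keep only the action text and remember it for later.
            sub(/.*LogActions: /, "")
            pending = pending "    " $0 "\n"
            next
        }
        /process_pe_message: Calculated Transition / {
            sub(/.*Calculated Transition /, "")
            print "Transition " $0
            if (pending == "") {
                print "    (no actions)"
            } else {
                printf "%s", pending
            }
            pending = ""
        }
    '
}
```

Feed it the pengine logs, for example grep -e pengine\\[ -e pengine: /var/log/messages | summarize_transitions. A transition that required nothing is listed with "(no actions)".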

Other notable entries include:

pengine[6132]:  warning: cluster_status: We do not have quorum - fencing and resource management disabled
pengine[6132]:   notice: stage6: Scheduling Node corosync-host-1 for shutdown
pengine[6132]:  warning: stage6: Scheduling Node corosync-host-8 for STONITH

as well as

pengine[6132]:   notice: LogActions: Start   Fencing      (corosync-host-1 - blocked)

which indicates that the cluster would like to start the Fencing resource, but some dependency is not satisfied.

pengine[6132]:  warning: determine_online_status: Node corosync-host-8 is unclean

which indicates that either corosync-host-8 has failed, or a resource on it has failed to stop when requested.

pengine[6132]:  warning: unpack_rsc_op: Processing failed op monitor for www on corosync-host-4: unknown error (1)

which indicates a health check for the www resource failed with a return code of 1 (aka. OCF_ERR_GENERIC). See Pacemaker Explained for more details on OCF return codes.
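To save a trip to the documentation, the numeric codes can be translated while scanning the logs. The sketch below uses the standard OCF return code names of this era. ocf_rc_name and failed_ops are hypothetical helper names, and the pattern assumes the unpack_rsc_op message format shown above.

```shell
# Translate a numeric OCF return code into its symbolic name.
# ocf_rc_name is a hypothetical helper name.
ocf_rc_name() {
    case "$1" in
        0) echo OCF_SUCCESS ;;
        1) echo OCF_ERR_GENERIC ;;
        2) echo OCF_ERR_ARGS ;;
        3) echo OCF_ERR_UNIMPLEMENTED ;;
        4) echo OCF_ERR_PERM ;;
        5) echo OCF_ERR_INSTALLED ;;
        6) echo OCF_ERR_CONFIGURED ;;
        7) echo OCF_NOT_RUNNING ;;
        8) echo OCF_RUNNING_MASTER ;;
        9) echo OCF_FAILED_MASTER ;;
        *) echo "unknown ($1)" ;;
    esac
}

# Print "resource operation node rc name" for each failed op in the logs.
# Reads syslog text on stdin. Hypothetical helper; assumes the
# unpack_rsc_op message format shown above.
failed_ops() {
    sed -n 's/.*unpack_rsc_op: Processing failed op \([^ ]*\) for \([^ ]*\) on \([^ ]*\): .*(\([0-9][0-9]*\))$/\2 \1 \3 \4/p' |
        while read -r rsc op node rc; do
            echo "$rsc $op $node $rc $(ocf_rc_name "$rc")"
        done
}
```

For the log line above, failed_ops reports the www monitor on corosync-host-4 with code 1 and the name OCF_ERR_GENERIC.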

  • Is there anything from the Policy Engine at about the time of the problem?
    If not, go back to the crmd logs and see why no recovery was attempted.

  • Did pengine log why something happened? Does that sound correct?
    Excellent, thanks for playing.

Getting more detail from the Policy Engine

The job performed by the Policy Engine is a very complex and frequent task, so to avoid filling up the disk with logs, it only indicates what it is doing and rarely the reason why. Normally the why can be found in the crmd logs, but it also saves the current state (the cluster configuration and the state of all resources) to disk for situations when it can’t.

These files can later be replayed using crm_simulate with a higher level of verbosity to diagnose issues and, as part of our regression suite, to make sure they stay fixed afterwards.

Finding these state files is a matter of looking for logs such as

crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
pengine[1810]:   notice: process_pe_message: Calculated Transition 0: /var/lib/pacemaker/pengine/pe-input-473.bz2

The “correct” entry will depend on the context of your query.

Please note, sometimes events occur while the pengine is performing its calculation. In this situation, the calculation logged by process_pe_message() is discarded and a new one performed. As a result, not all transitions/files listed by the pengine process are executed by the crmd.

After obtaining the file named by run_graph() or process_pe_message(), either directly or from a crm_report archive, pass it to crm_simulate which will display its view of the cluster at that time:

crm_simulate --xml-file ./pe-input-473.bz2

  • Does the cluster state look correct?

    If not, file a bug. It is possible we have misparsed the state of the resources, in which case any calculation we make based on it would also be wrong.

Next, see what recovery actions the cluster thinks need to be performed:

crm_simulate --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --run

In addition to the normal output, this command creates:

  • problem.graph, the ordered graph of actions, their parameters and prerequisites
  • problem.dot, a more human readable version of the same graph focussed on the action ordering.

Open problem.dot in dotty or graphviz to obtain a graphical representation:

  • Arrows indicate ordering dependencies
  • Dashed-arrows indicate dependencies that are not present in the transition graph
  • Actions with a dashed border of any color do not form part of the transition graph
  • Actions with a green border form part of the transition graph
  • Actions with a red border are ones the cluster would like to execute but cannot run
  • Actions with a blue border are ones the cluster does not feel need to be executed
  • Actions with orange text are pseudo/pretend actions that the cluster uses to simplify the graph
  • Actions with black text are sent to the lrmd
  • Resource actions have text of the form ${rsc}_${action}_${interval} ${node}
  • Actions of the form ${rsc}_monitor_0 ${node} are the cluster’s way of finding out the resource’s status before we try and start it anywhere
  • Any action depending on an action with a red border will not be able to execute.
  • Loops are really bad. Please report them to the development team.

Check the relative ordering of actions:

  • Are there any extra ones?
    Do they need to be removed from the configuration?
    Are they implied by the group construct?
  • Are there any missing?
    Are they specified in the configuration?

You can obtain excruciating levels of detail by adding additional -V options to the crm_simulate command line.

Now see what the cluster thinks the “next state” will look like:

crm_simulate --xml-file ./pe-input-473.bz2 --save-graph problem.graph --save-dotfile problem.dot --simulate

  • Does the new cluster state look correct based on the input and actions performed?
    If not, file a bug.

Debugging Pacemaker

Where to start

The first thing to do is look in syslog for the terms ERROR and WARN, eg.

grep -e ERROR -e WARN /var/log/messages

If nothing looks appropriate, find the logs from crmd

grep -e crmd\\[ -e crmd: /var/log/messages

If you do not see anything from crmd, check whether logging to syslog is enabled in your cluster configuration, and check the syslog configuration to see where it is being sent. If in doubt, refer to Pacemaker Logging for how to obtain more detail.

Although the crmd process is active on all cluster nodes, decisions are only occurring on one of them. The “DC” is chosen through the crmd’s election process and may move around as nodes leave/join the cluster.

For node failures, you’ll always want the logs from the DC (or the node that becomes the DC).
For resource failures, you’ll want the logs from the DC and the node on which the resource failed.

Log entries like:

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now lost (was member)

indicate a node is no longer part of the cluster (either because it failed or was shut down)

crmd[1811]:   notice: crm_update_peer_state: cman_event_callback: Node corosync-host-1[1] - state is now member (was lost)

indicates a node has (re)joined the cluster

crmd[1811]:   notice: do_state_transition: State transition S_IDLE -> S_POLICY_ENGINE ...
crmd[1811]:   notice: run_graph: Transition 2 (... Source=/var/lib/pacemaker/pengine/pe-input-473.bz2): Complete
crmd[1811]:   notice: do_state_transition: State transition S_TRANSITION_ENGINE -> S_IDLE ...

indicates recovery was attempted

crmd[1811]:   notice: te_rsc_command: Initiating action 36: monitor www_monitor_0 on corosync-host-5
crmd[1811]:   notice: te_rsc_command: Initiating action 54: monitor mysql_monitor_10000 on corosync-host-4

indicates we performed a resource action. In this case we are checking the status of the www resource on corosync-host-5 and starting a recurring health check for mysql on corosync-host-4.

crmd[1811]:   notice: te_fence_node: Executing reboot fencing operation (83) on corosync-host-8 (timeout=60000)

indicates that we are attempting to fence corosync-host-8.

crmd[1811]:   notice: tengine_stonith_notify: Peer corosync-host-8 was terminated (st_notify_fence) by corosync-host-1 for corosync-host-1: OK

indicates that corosync-host-1 successfully fenced corosync-host-8.
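When reconstructing a timeline, it helps to reduce the membership entries to one line per change. This is a minimal sketch that assumes the crm_update_peer_state message format shown above. membership_changes is a hypothetical helper name.

```shell
# Print "node: old-state -> new-state" for each membership change.
# Reads syslog text on stdin. membership_changes is a hypothetical name,
# and the pattern assumes the crm_update_peer_state format shown above.
membership_changes() {
    sed -n 's/.*crm_update_peer_state: .*Node \([^ []*\)\[[0-9]*] - state is now \([a-z]*\) (was \([a-z]*\))$/\1: \3 -> \2/p'
}
```

Run it over the crmd logs from the DC, for example grep -e crmd\\[ -e crmd: /var/log/messages | membership_changes.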

Node-level failures

  • Did the crmd fail to notice the failure?

    If you do not see any entries from crm_update_peer_state(), check the corosync logs to see if membership was correct/timely

  • Did the crmd fail to initiate recovery?

    If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.

  • Did the crmd fail to perform recovery?

    If you DO see entries from do_state_transition() but the run_graph() entry(ies) include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything.

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Was fencing attempted?

    Check if the stonith-enabled property is set to true/1/yes. If so, obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Did fencing complete?

    Check the configuration of the fencing resources; if it looks correct, proceed to Debugging Stonith.
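The "did not think it needed to do anything" check can be automated. The sketch below lists the input files of transitions whose counters are all zero. noop_transitions is a hypothetical helper name, and it assumes the counters and the Source= field sit together inside the parentheses of the run_graph entry (the part elided as "..." in the examples).

```shell
# Print the input file of every transition in which nothing was done.
# Reads syslog text on stdin. noop_transitions is a hypothetical name.
# ASSUMPTION: the counters and Source= appear on the same run_graph line,
# inside the parentheses that the examples above abbreviate as "...".
noop_transitions() {
    grep -F 'run_graph: ' |
        grep -F 'Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0' |
        sed -n 's/.*Source=\([^)]*\)).*/\1/p'
}
```

Each file it prints is a candidate for crm_simulate, as described under debugging the Policy Engine.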

Resource-level failures

  • Did the resource actually fail?

    If not, check for logs matching the resource name to see why the resource agent thought a failure occurred.

    Check the resource agent source to see what code paths could have produced those logs (or the lack of them)

  • Did crmd notice the resource failure?

    If not, check for logs matching the resource name to see if the resource agent noticed.

    Check a recurring monitor was configured.

  • Did the crmd fail to initiate recovery?

    If you do not see entries from do_state_transition() and run_graph(), then the cluster failed to react at all. Refer to Pacemaker Logging for how to obtain more detail about why the crmd ignored the failure.

  • Did the crmd fail to perform recovery?

    If you DO see entries from do_state_transition() but the run_graph() entry(ies) include the text Complete=0, Pending=0, Fired=0, Skipped=0, Incomplete=0, then the cluster did not think it needed to do anything.

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

  • Did resources stop/start/move unexpectedly or fail to stop/start/move when expected?

    Obtain the file named by run_graph() (eg. /var/lib/pacemaker/pengine/pe-input-473.bz2) either directly or from a crm_report archive and continue to debugging the Policy Engine.

Pacemaker on RHEL6.4

Over the last couple of years, we have been evolving the stack in two ways of particular relevance to RHEL customers:

  • minimizing the differences to the default RHEL-6 stack to reduce the implication of supporting Pacemaker there (making it more likely to happen)

  • adapting to changes to Corosync’s direction (ie. the removal of plugins and the addition of a quorum API) for the future

As a general rule, Red Hat does not ship packages it doesn’t at least plan to support. So part of readying Pacemaker for “supported” status is removing or deprecating the parts of Red Hat’s packages that they have no interest and/or capacity to support.

Removal of crmsh

For reasons that you may or may not agree with, Red Hat has decided to rely on pcs for command line and GUI cluster management in RHEL-7.

As a result there is no future, in RHEL, for the original cluster shell crmsh.

Normally it would have been deprecated. However since crmsh is now a stand-alone project, its removal from the Pacemaker codebase also resulted in its removal from RHEL-6 once the packages were refreshed.

To fill the void and help prepare people for RHEL-7, pcs is now also available on RHEL-6.

Status of the Plugin

Anyone taking the recommended approach of using Pacemaker with CMAN (ie. cluster.conf) on RHEL-6 or any of its derivatives can stop reading for now (we’ll need to talk again when RHEL 7 comes out with corosync-2, but that’s another conversation).

Anyone using corosync.conf on RHEL 6 should keep reading…

One of the differences between the Pacemaker and rgmanager stacks is where membership and quorum come from.

Pacemaker has traditionally obtained it from a custom plugin, whereas rgmanager used CMAN. Neither source is “better” than the other, the only thing that matters is that everyone obtains it from the same place.

Since the rest of the components in a RHEL-6 cluster use CMAN, support for it was added to Pacemaker which also helps minimize the support load. Additionally, in RHEL-7, Corosync’s support for plugins such as Pacemaker’s (and CMAN’s) goes away.

Without any chance of being supported in the short or long-term, configuring plugin-based clusters (ie. via corosync.conf ) is now officially deprecated in RHEL. As some of you may have already noticed, starting corosync in 6.4 produces the following entries in the logs:

Apr 23 17:35:36 corosync [pcmk  ] ERROR: process_ais_conf: You have configured a cluster using the Pacemaker plugin for Corosync. The plugin is not supported in this environment and will be removed very soon.
Apr 23 17:35:36 corosync [pcmk  ] ERROR: process_ais_conf:  Please see Chapter 8 of 'Clusters from Scratch' (http://clusterlabs.org/doc) for details on using Pacemaker with CMAN

Everyone is highly encouraged to switch to CMAN-based Pacemaker clusters

While the plugin will still be around, running Pacemaker in configurations that are not well tested by Red Hat (or, for the most part, by upstream either) contains an element of risk.

For example, the messages above were originally added for 6.3, however since logging from the plugin was broken for over a year, no-one noticed. It only got fixed when I was trying to figure out why no-one had complained about them yet!

A lack of logging is annoying but not usually problematic, unfortunately there is also something far worse…

Fencing Failures when using the Pacemaker plugin

It has come to light that fencing for plugin-based clusters is critically broken.

The cause was a single stray ‘n’-character, probably from a copy+paste, that prevents the crmd from correctly reacting to membership-level failures (ie. killall -9 corosync) of its peers.

The problem did not show up in any of Red Hat’s testing because of the way Pacemaker processes talk to their peers on other nodes when CMAN (or Corosync 2.0) is in use.

For CMAN and Corosync 2.0 we use Corosync’s CPG API which provides notifications when peer processes (the crmd in this case) join or leave the messaging group. These additional notifications from CPG follow a different code path and are unaffected by the bug… allowing the cluster to function as intended.

Unfortunately, despite the size and obviousness of the fix, a z-stream update for a bug affecting a deprecated use-case of an as-yet-unsupported package is a total non-starter.

People wanting to stick with plugin-based clusters should obtain 1.1.9 or later, which includes the fix, from the ClusterLabs repos.

You can read more about the bug and the fix on the Red Hat bugzilla

For details on converting to a CMAN-based stack, please see Clusters from Scratch.

Switching to CMAN is really far less painful than it sounds

There is also a quickstart guide for easily generating cluster.conf, just substitute the names of your cluster nodes.

Release Candidate: 1.1.10-rc2

Announcing the second release candidate for Pacemaker 1.1.10

No major changes have been introduced, just some fixes for a few niggles that were discovered since RC1.

Unless blocker bugs are found, this will be the final release candidate and 1.1.10 will be tagged on May 10th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    On Fedora/RHEL and its derivatives, you can do this by running:

    # yum install -y yum-utils
    # make yumdep
    

    Otherwise you will need to investigate the spec file and/or wait for rpmbuild to report missing packages.

  3. Build Pacemaker

    # make rpm
    
  4. Copy and deploy as needed

Details - 1.1.10-rc2

Changesets  31
Diff 30 files changed, 687 insertions(+), 138 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc2

N/A

Changes since Pacemaker-1.1.10-rc1

  • Bug cl#5152 - Correctly clean up fenced nodes during membership changes
  • Bug cl#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • cman: Skip cman_pre_stop in the init script if fenced is not running
  • Core: Ensure the last field in transition keys is 36 characters
  • crm_mon: Check if a process can be daemonized before forking so the parent can report an error
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • crm_resource: Allow --cleanup without a resource name
  • init: Unless specified otherwise, assume cman is in use if cluster.conf exists
  • mcp: inhibit error messages without cman
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time

Mixing Pacemaker Versions

When mixing Pacemaker versions, there are two factors that need to be considered. The first is obviously the package version - if that is the same, then there is no problem.

If not, then the Pacemaker feature set needs to be checked. This feature set increases far less regularly than the normal package version. Newer versions of Pacemaker expose this value in the output of pacemakerd --features:

$ pacemakerd --features
Pacemaker 1.1.9 (Build: 9048b7b)
Supporting v3.0.7: generated-manpages agent-manpages ascii-docs publican-docs ncurses gcov libqb-logging libqb-ipc lha-fencing upstart systemd nagios heartbeat corosync-native snmp

In this case, the feature set is 3.0.7 (major 3, minor 0, revision 7).
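If you need the value in a script, it can be cut out of that output. A minimal sketch, assuming the "Supporting v…" line looks like the one above. feature_set is a hypothetical helper name.

```shell
# Extract the feature set (e.g. 3.0.7) from `pacemakerd --features` output.
# Reads that output on stdin. feature_set is a hypothetical helper name,
# and the pattern assumes the "Supporting v<feature set>:" line shown above.
feature_set() {
    sed -n 's/^ *Supporting v\([0-9][0-9.]*\):.*/\1/p'
}
```

Use it as pacemakerd --features | feature_set.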

For older versions, you should refer to the definition of CRM_FEATURE_SET in crm.h, usually this will be located at /usr/include/pacemaker/crm/crm.h.

If two packages or versions share the same feature set, then the expectation is that they are fully compatible. Any other behavior is a bug which needs to be reported.

If the feature sets between two versions differ but have the same major value (ie. the 3 in 3.0.7 and 3.1.5), then they are said to be upgrade compatible.
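The two cases above can be written as a small classifier. This is only a sketch of the rules as stated here. compare_feature_sets is a hypothetical helper name, and since the text says nothing about feature sets with different major values, that case is merely reported rather than judged.

```shell
# Classify a pair of feature sets according to the rules described above.
# Usage: compare_feature_sets 3.0.7 3.1.5
# Hypothetical helper; the "different major" case is not covered by the
# text, so it is only reported, not interpreted.
compare_feature_sets() {
    if [ "$1" = "$2" ]; then
        echo "fully compatible"
    elif [ "${1%%.*}" = "${2%%.*}" ]; then
        # ${var%%.*} strips everything from the first dot: the major value.
        echo "upgrade compatible"
    else
        echo "different major"
    fi
}
```

For the values in the text, compare_feature_sets 3.0.7 3.1.5 reports "upgrade compatible".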

What does upgrade compatible mean?

When two versions are upgrade compatible, they can co-exist during a rolling upgrade, but not on an extended or permanent basis, because the newer version requires all its peers to support API feature(s) that the old one does not have.
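
The classification described so far can be sketched in a few lines of Python. This is only an illustration of the rules as stated here, not code from Pacemaker. The function name and the returned labels are made up, and the text does not say what happens when the major values differ, so that label is this sketch's own assumption.

```python
def compatibility(feature_set_a, feature_set_b):
    """Classify two feature sets given as 'major.minor.revision' strings."""
    a = tuple(int(part) for part in feature_set_a.split("."))
    b = tuple(int(part) for part in feature_set_b.split("."))
    if a == b:
        # Same feature set: expected to be fully compatible.
        return "fully compatible"
    if a[0] == b[0]:
        # Same major value: fine for a rolling upgrade, not for the long term.
        return "upgrade compatible"
    # Not covered by the text above; the label is this sketch's assumption.
    return "not upgrade compatible"


print(compatibility("3.0.7", "3.0.7"))  # fully compatible
print(compatibility("3.0.7", "3.1.5"))  # upgrade compatible
```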

The following three rules apply when mixing installations with different feature sets:

  • When electing a node to run the cluster (the Designated Co-ordinator or “DC”), the node with the lowest feature set always wins.
  • The DC records its feature set in the CIB
  • Nodes may not join the cluster if their feature set is less than the one recorded in the CIB

Example

Consider node1 with a feature set of 3.0.7 and node2 with feature set 3.0.8… when node2 first joins the cluster, node1 will naturally remain the DC.

However, if node1 leaves the cluster, either by being shut down or due to a failure, node2 will become the DC (as it is by itself and by definition has the lowest feature set of any active node).

At this point, node1 will be rejected if it attempts to rejoin the cluster and will shut down, as its feature set is lower than that of the DC (node2).
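
The node1/node2 scenario can be replayed with a toy model of the three rules. This is a sketch for building intuition only: the Cluster class, its method names and the tie-break by node name are all invented here, and the real election in the crmd involves far more than a feature set comparison.

```python
def parse(feature_set):
    """Turn '3.0.7' into (3, 0, 7) so feature sets compare numerically."""
    return tuple(int(part) for part in feature_set.split("."))


class Cluster:
    """Toy model of the three rules above; not Pacemaker's election code."""

    def __init__(self):
        self.active = {}             # node name -> feature set tuple
        self.dc = None               # name of the current DC, if any
        self.cib_feature_set = None  # feature set recorded in the CIB

    def _elect(self):
        if not self.active:
            self.dc = None           # the CIB keeps its recorded value
            return
        # Rule 1: the lowest feature set wins (ties broken by name, an assumption).
        self.dc = min(self.active, key=lambda name: (self.active[name], name))
        # Rule 2: the DC records its feature set in the CIB.
        self.cib_feature_set = self.active[self.dc]

    def join(self, name, feature_set):
        candidate = parse(feature_set)
        # Rule 3: refuse nodes whose feature set is below the recorded one.
        if self.cib_feature_set is not None and candidate < self.cib_feature_set:
            return False
        self.active[name] = candidate
        self._elect()
        return True

    def leave(self, name):
        self.active.pop(name, None)
        self._elect()


cluster = Cluster()
cluster.join("node1", "3.0.7")
cluster.join("node2", "3.0.8")
print(cluster.dc)                      # node1
cluster.leave("node1")
print(cluster.dc)                      # node2
print(cluster.join("node1", "3.0.7"))  # False
```

Note that the recorded feature set survives even when every node has left, which is why the recovery procedure further down has to edit the CIB by hand.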

Is this happening to me?

If you are affected by this, you will see an error in the logs along the lines of:

error: We can only support up to CRM feature set 3.0.7 (current=3.0.8)

In this case, the DC (node2) has feature set 3.0.8 but we (node1) only have 3.0.7.
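
To scan a log file for this condition, something like the following Python sketch may help. It assumes the message text matches the example above exactly; real log lines will carry timestamps and process names in front, which is why the pattern is searched for rather than anchored. The function name and the dictionary keys are made up for illustration.

```python
import re

# Pattern taken from the example message above; the wording in other
# Pacemaker versions may differ (assumption).
MISMATCH_RE = re.compile(
    r"We can only support up to CRM feature set (\S+) \(current=([^)]+)\)"
)


def find_feature_set_mismatch(log_line):
    """Return the local and DC feature sets if the line reports a mismatch."""
    match = MISMATCH_RE.search(log_line)
    if match is None:
        return None
    return {"local": match.group(1), "dc": match.group(2)}


line = "error: We can only support up to CRM feature set 3.0.7 (current=3.0.8)"
print(find_feature_set_mismatch(line))
```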

To get these two nodes talking to each other again:

  1. stop the cluster on both nodes
  2. on both nodes, run:
    CIB_file=/path/to/cib.xml cibadmin -M -X '<cib crm_feature_set="3.0.7"/>'
    
  3. start node1 and wait until it is elected as the DC
  4. start node2
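
If step 2 has to be repeated on several machines, the command can be assembled programmatically. The Python sketch below only builds the environment and argument list for the cibadmin call shown above and does not run anything. The helper name is hypothetical, and /path/to/cib.xml remains a placeholder for wherever your CIB lives.

```python
import os
import shlex


def feature_set_reset_command(cib_path, feature_set):
    """Return (env, argv) for the cibadmin call in step 2; nothing is executed."""
    # CIB_file points cibadmin at the on-disk file, as in the command above.
    env = dict(os.environ, CIB_file=cib_path)
    argv = ["cibadmin", "-M", "-X", '<cib crm_feature_set="%s"/>' % feature_set]
    return env, argv


env, argv = feature_set_reset_command("/path/to/cib.xml", "3.0.7")

# Print a copy-and-paste friendly version for review before running it.
print("CIB_file=%s %s" % (
    shlex.quote(env["CIB_file"]),
    " ".join(shlex.quote(arg) for arg in argv),
))
```

Once the printed command looks right, the pair could be handed to subprocess.run(argv, env=env, check=True) with the cluster stopped on that node.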

Release Candidate: 1.1.10-rc1

A funny thing happened on the way to 1.1.9…

Between tagging it on the Friday, and announcing it on the following Monday, people started actually testing it and found a couple of problems.

Specifically, there were some significant memory leaks and problems in a couple of areas that our unit and integration tests can’t sanely test.

So while 1.1.9 is out, it was never formally announced. Instead we’ve been fixing the reported bugs (as well as looking for a few more by running valgrind on a live cluster) and preparing for 1.1.10.

Also, in an attempt to learn from previous mistakes, the new release procedure involves release candidates. If no blocker bugs are reported in the week following a release candidate, it is re-tagged as the official release.

So without further ado, here are the 1.1.9 release notes as well as what changed in 1.1.10-rc1.

Details - 1.1.10-rc1

Changesets  143
Diff 104 files changed, 3327 insertions(+), 1186 deletions(-)

Highlights

Features added in Pacemaker-1.1.10

  • crm_resource: Allow individual resources to be reprobed
  • mcp: Alternate Upstart job controlling both pacemaker and corosync
  • mcp: Prevent the cluster from trying to use cman even when it is installed

Changes since Pacemaker-1.1.9

  • Allow programs in the haclient group to use CRM_CORE_DIR
  • cman: Do not unconditionally start cman if it is already running
  • core: Ensure custom error codes are less than 256
  • crmd: Clean up memory at exit
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Indicate completion of refresh to callers
  • crmd: Indicate completion of re-probe to callers
  • crmd: Only perform a dry run for deletions if built with ACL support
  • crmd: Prevent use-after-free when the blackbox is enabled
  • crmd: Suppress secondary errors when no metadata is found
  • doc: Pacemaker Remote deployment and reference guide
  • fencing: Avoid memory leak in can_fence_host_with_device()
  • fencing: Clean up memory at exit
  • fencing: Correctly filter devices when no nodes are configured yet
  • fencing: Correctly unpack device parameters before using them
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Fix memory leaks during query phase
  • fencing: Prevent empty call-id during notification processing
  • fencing: Prevent invalid read in parse_host_list()
  • fencing: Prevent memory leak when registering devices
  • crmd: lrmd: stonithd: fixed memory leaks
  • ipc: Allow unprivileged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: cl#5148 - Correctly remove a node that used to have a different nodeid
  • legacy: Support “crm_node --remove” with a node name for corosync plugin (bnc#805278)
  • logging: Better checks when determining if file based logging will work
  • Pass errors from lsb metadata generation back to the caller
  • pengine: Do not use functions from the cib library during unpack
  • Prevent use-of-NULL when reading CIB_shadow from the environment
  • Skip WNOHANG when waiting after sending SIGKILL to child processes
  • tools: crm_mon - Print a timing field only if its value is non-zero
  • Use custom OCF_ROOT_DIR if requested
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy
  • xml: Prevent use-after-free in cib_process_xpath()
  • xml: Prevent use-after-free when not processing all xpath query results

Details - 1.1.9

Changesets  731
Diff 1301 files changed, 92909 insertions(+), 57455 deletions(-)

Highlights

Features added in Pacemaker-1.1.9

  • corosync: Allow cman and corosync 2.0 nodes to use a name other than uname()
  • corosync: Use queues to avoid blocking when sending CPG messages
  • ipc: Compress messages that exceed the configured IPC message limit
  • ipc: Use queues to prevent slow clients from blocking the server
  • ipc: Use shared memory by default
  • lrmd: Support nagios remote monitoring
  • lrmd: Pacemaker Remote Daemon for extending pacemaker functionality outside corosync cluster.
  • pengine: Check for master/slave resources that are not OCF agents
  • pengine: Support a ‘requires’ resource meta-attribute for controlling whether it needs quorum, fencing or nothing
  • pengine: Support for resource container
  • pengine: Support resources that require unfencing before start

Changes since Pacemaker-1.1.8

  • attrd: Correctly handle deletion of non-existent attributes
  • Bug cl#5135 - Improved detection of the active cluster type
  • Bug rhbz#913093 - Use crm_node instead of uname
  • cib: Avoid use-after-free by correctly supporting cib_no_children for non-xpath queries
  • cib: Correctly process XML diffs involving element removal
  • cib: Performance improvements for non-DC nodes
  • cib: Prevent error message by correctly handling peer replies
  • cib: Prevent ordering changes when applying xml diffs
  • cib: Remove text nodes from cib replace operations
  • cluster: Detect node name collisions in corosync
  • cluster: Preserve corosync membership state when matching node name/id entries
  • cman: Force fenced to terminate on shutdown
  • cman: Ignore qdisk ‘nodes’
  • core: Drop per-user core directories
  • corosync: Avoid errors when closing failed connections
  • corosync: Ensure peer state is preserved when matching names to nodeids
  • corosync: Clean up CMAP connections after querying node name
  • corosync: Correctly detect corosync 2.0 clusters even if we don’t have permission to access it
  • crmd: Bug cl#5144 - Do not update the expected status of failed nodes
  • crmd: Correctly determine if cluster disconnection was abnormal
  • crmd: Correctly relay messages for remote clients (bnc#805626, bnc#804704)
  • crmd: Correctly stall the FSA when waiting for additional inputs
  • crmd: Detect and recover when we are evicted from CPG
  • crmd: Differentiate between a node that is up and coming up in peer_update_callback()
  • crmd: Have cib operation timeouts scale with node count
  • crmd: Improved continue/wait logic in do_dc_join_finalize()
  • crmd: Prevent election storms caused by getrusage() values being too close
  • crmd: Prevent timeouts when performing pacemaker level membership negotiation
  • crmd: Prevent use-after-free of fsa_message_queue during exit
  • crmd: Store all current actions when stalling the FSA
  • crm_mon: Do not try to render a blank cib and indicate the previous output is now stale
  • crm_mon: Fixes crm_mon crash when using snmp traps.
  • crm_mon: Look for the correct error codes when applying configuration updates
  • crm_report: Ensure policy engine logs are found
  • crm_report: Fix node list detection
  • crm_resource: Have crm_resource generate a valid transition key when sending resource commands to the crmd
  • date/time: Bug cl#5118 - Correctly convert seconds-since-epoch to the current time
  • fencing: Attempt to provide more information than just ‘generic error’ for failed actions
  • fencing: Correctly record completed but previously unknown fencing operations
  • fencing: Correctly terminate when all device options have been exhausted
  • fencing: cov#739453 - String not null terminated
  • fencing: Do not merge new fencing requests with stale ones from dead nodes
  • fencing: Do not start fencing until entire device topology is found or query results timeout.
  • fencing: Do not wait for the query timeout if all replies have arrived
  • fencing: Fix passing of parameters from CMAN containing ‘=’
  • fencing: Fix non-comparison when sorting devices by priority
  • fencing: On failure, only try a topology device once from the remote level.
  • fencing: Only try peers for non-topology based operations once
  • fencing: Retry stonith device for duration of action’s timeout period.
  • heartbeat: Remove incorrect assert during cluster connect
  • ipc: Bug cl#5110 - Prevent 100% CPU usage when looking for synchronous replies
  • ipc: Use 50k as the default compression threshold
  • legacy: Prevent assertion failure on routing ais messages (bnc#805626)
  • legacy: Re-enable logging from the pacemaker plugin
  • legacy: Relax the ‘active’ check for plugin based clusters to avoid false negatives
  • legacy: Skip peer process check if the process list is empty in crm_is_corosync_peer_active()
  • mcp: Only define HA_DEBUGLOG to avoid agent calls to ocf_log printing everything twice
  • mcp: Re-attach to existing pacemaker components when mcp fails
  • pengine: Any location constraint for the slave role applies to all roles
  • pengine: Avoid leaking memory when cleaning up failcounts and using containers
  • pengine: Bug cl#5101 - Ensure stop order is preserved for partially active groups
  • pengine: Bug cl#5140 - Allow set members to be stopped when the subsequent set has require-all=false
  • pengine: Bug cl#5143 - Prevent shuffling of anonymous master/slave instances
  • pengine: Bug rhbz#880249 - Ensure orphan masters are demoted before being stopped
  • pengine: Bug rhbz#880249 - Teach the PE how to recover masters into primitives
  • pengine: cl#5025 - Automatically clear failcount for start/monitor failures after resource parameters change
  • pengine: cl#5099 - Probe operation uses the timeout value from the minimum interval monitor by default (bnc#776386)
  • pengine: cl#5111 - When clone/master child rsc has on-fail=stop, ensure all children stop on failure.
  • pengine: cl#5142 - Do not delete orphaned children of an anonymous clone
  • pengine: Correctly unpack active anonymous clones
  • pengine: Ensure previous migrations are closed out before attempting another one
  • pengine: Introducing the whitebox container resources feature
  • pengine: Prevent double-free for cloned primitive from template
  • pengine: Process rsc_ticket dependencies earlier for correctly allocating resources (bnc#802307)
  • pengine: Remove special cases for fencing resources
  • pengine: rhbz#902459 - Remove rsc node status for orphan resources
  • systemd: Gracefully handle unexpected DBus return types
  • Replace the use of the insecure mktemp(3) with mkstemp(3)

Now Powered by Octopress

With Posterous being shut down (even though I wasn’t using it for TheClusterGuy), I’ve decided the time has come to take back control of my content.

As a result I’ve started using Octopress for publishing.

It is pretty nifty. Octopress generates a static site (good for performance!) that can either be hosted at GitHub or (if GitHub ever goes dark) anywhere Apache can run. As you will notice, I was even able to easily import all my old posts!

For now I’m taking the GitHub path with a custom domain name (not the same as this one so that the old links still work).

I’m quite liking the theme/layout and feature set. I’m even taking the opportunity to support comments for the first time… we’ll see how long that lasts before the spammers make it no longer worth the time.