The Cluster Guy

Highly Available Ramblings

Potential for Data Corruption Affecting Pacemaker 1.1.6 Through 1.1.9

It has come to my attention that the potential for data corruption exists in Pacemaker versions 1.1.6 through 1.1.9.

Everyone is strongly encouraged to upgrade to 1.1.10 or later.

Those using RHEL 6.4 or later (or a RHEL clone) should already have access to 1.1.10 via the normal update channels.
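If you are unsure which version you are running, either of the following should tell you (the first on RHEL and clones, the second anywhere):

# rpm -q pacemaker
# pacemakerd --version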

At issue is some faulty logic in a function called tengine_stonith_notify(), which can incorrectly add successfully fenced nodes to a list, causing Pacemaker to subsequently erase that node’s status section when the next DC election occurs.

With the status section erased, the cluster thinks the fenced node is safely down and begins starting its services on other nodes - despite those services already being active.

In order to trigger the faulty logic, the fenced node must:

  1. have been the previous DC,
  2. have been sufficiently functional to request its own fencing, and
  3. have its fencing notification arrive after the new DC has been elected, but before it invokes the policy engine

Given that this is the first we have heard of the issue since the problem was introduced in August 2011, the above sequence of events is apparently hard to hit under normal conditions.

Logs symptomatic of the issue look as follows:

# grep -e do_state_transition -e reboot  -e do_dc_takeover -e tengine_stonith_notify -e S_IDLE /var/log/corosync.log

Mar 08 08:43:22 [9934] lorien       crmd:     info: do_dc_takeover:     Taking over DC status for this partition
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Peer gandalf was terminated (st_notify_fence) by mordor for gandalf: OK (ref=10d27664-33ed-43e0-a5bd-7d0ef850eb05) by client crmd.31561
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Notified CMAN that 'gandalf' is now fenced
Mar 08 08:43:22 [9934] lorien       crmd:   notice: tengine_stonith_notify:     Target may have been our leader gandalf (recorded: <unset>)
Mar 08 09:13:52 [9934] lorien       crmd:     info: do_dc_takeover:     Taking over DC status for this partition
Mar 08 09:13:52 [9934] lorien       crmd:   notice: do_dc_takeover:     Marking gandalf, target of a previous stonith action, as clean
Mar 08 08:43:22 [9934] lorien       crmd:     info: do_state_transition:    State transition S_INTEGRATION -> S_FINALIZE_JOIN [ input=I_INTEGRATED cause=C_FSA_INTERNAL origin=check_join_state ]
Mar 08 08:43:28 [9934] lorien       crmd:     info: do_state_transition:    State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE [ input=I_FINALIZED cause=C_FSA_INTERNAL origin=check_join_state ]

Note in particular the final entry from tengine_stonith_notify():

Target may have been our leader gandalf (recorded: <unset>)

If you see this after “Taking over DC status for this partition” but prior to “State transition S_FINALIZE_JOIN -> S_POLICY_ENGINE”, then you are likely to have resources running in more than one location after the next DC election.

The issue was fixed during a routine cleanup prior to Pacemaker-1.1.10 in @f30e1e43. However, the implications of what the old code allowed were not fully appreciated at the time.

Announcing 1.1.11 Beta Testing

With over 400 updates since the release of 1.1.10, it’s time to start thinking about a new release.

Today I have tagged release candidate 1. The most notable fixes include:

  • attrd: Implementation of a truly atomic attrd for use with corosync 2.x
  • cib: Allow values to be added/updated and removed in a single update
  • cib: Support XML comments in diffs
  • Core: Allow blackbox logging to be disabled with SIGUSR2
  • crmd: Do not block on proxied calls from pacemaker_remoted
  • crmd: Enable cluster-wide throttling when the cib heavily exceeds its target load
  • crmd: Use the load on our peers to know how many jobs to send them
  • crm_mon: Add --hide-headers option to hide all headers
  • crm_report: Collect logs directly from journald if available
  • Fencing: On timeout, clean up the agent’s entire process group
  • Fencing: Support agents that need the host to be unfenced at startup
  • ipc: Raise the default buffer size to 128k
  • PE: Add a special attribute for distinguishing between real nodes and containers in constraint rules
  • PE: Allow location constraints to take a regex pattern to match against resource IDs
  • pengine: Distinguish between the agent being missing and something the agent needs being missing
  • remote: Properly version the remote connection protocol
  • services: Detect missing agents and permission errors before forking
  • Bug cl#5171 - pengine: Don’t prevent clones from running due to dependent resources
  • Bug cl#5179 - Corosync: Attempt to retrieve a peer’s node name if it is not already known
  • Bug cl#5181 - corosync: Ensure node IDs are written to the CIB as unsigned integers

If you are a user of pacemaker_remoted, you should take the time to read about changes to the on-the-wire protocol that are present in this release.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. If you haven’t already, install Pacemaker’s dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy the rpms and deploy as needed
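Assuming the build succeeds, the resulting rpms can be installed directly with yum; the exact output directory depends on your rpmbuild setup, so adjust the path as needed:

# yum localinstall --nogpgcheck pacemaker-*.rpm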

Pacemaker and RHEL 6.4 (Redux)

The good news is that as of November 1st, Pacemaker is now supported on RHEL 6.4 - with two caveats.

  1. You must be using the updated pacemaker, resource-agents and pcs packages
  2. You must be using CMAN for membership and quorum (background)

Technically, support is currently limited to Pacemaker’s use in the context of OpenStack. In practice, however, any bug that can be shown to affect OpenStack deployments has a good chance of being fixed.
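Once the updated packages are installed and CMAN has been configured (with pcs or ccs), bringing up the stack on each node is just:

# service cman start
# service pacemaker start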

Since a cluster with no services is rather pointless, the heartbeat OCF agents are now also officially supported. However, as Red Hat’s policy is to only ship supported agents, some agents are not present for this initial release.

The three primary reasons for not shipping agents were:

  1. The software the agent controls is not shipped in RHEL
  2. Insufficient experience to provide support
  3. Avoiding agent duplication

Filing bugs is definitely the best way to get agents in the second category prioritized for inclusion.

Likewise, if there is no shipping agent that provides the functionality of agents in the third category (IPv6addr and IPaddr2 might be an example here), filing bugs is the best way to get that fixed.

In the meantime, since most of the agents are just shell scripts, downloading the latest upstream agents is a viable work-around in most cases (just remember to make them executable). For example:

    agents="Raid1 Xen"
    for a in $agents; do wget -O /usr/lib/ocf/resource.d/heartbeat/$a https://github.com/ClusterLabs/resource-agents/raw/master/heartbeat/$a; chmod a+x /usr/lib/ocf/resource.d/heartbeat/$a; done

Changes to the Remote Wire Protocol in 1.1.11

Unfortunately, the current wire protocol used by pacemaker_remoted for exchanging messages was found to be suboptimal, and we have decided to change it now, before it becomes widely adopted.

We attempted to do this in a backwards compatible manner; however, the two methods we tried were either overly complicated and fragile, or not possible due to the way the released crm_remote_parse_buffer() function operated.

The changes include a versioned binary header that contains the sizes of the header, payload and total message, as well as control flags and a big/little-endian detector.

These changes will appear in the upstream repo shortly and ship in 1.1.11. Anyone for whom this will be a problem is encouraged to get in contact to discuss possible options.

For RHEL users, any version on which pacemaker_remoted is supported will have the new versioned protocol. That means 7.0 and potentially a future 6.x release.

Pacemaker 1.1.10 - Final

Announcing the release of Pacemaker 1.1.10

There were three changes of note since rc7:

  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cman: Do not pretend we know the state of nodes we’ve never seen

Along with assorted bug fixes, the major topics for this release were:

  • stonithd fixes
  • fixing memory leaks, often caused by incorrect use of glib reference counting
  • supportability improvements (code cleanup and deduplication, standardized error codes)

Release candidates for the next Pacemaker release (1.1.11) can be expected some time around November.

A big thank you to everyone who spent time testing the release candidates and/or contributed patches. However, now that Pacemaker is perfect, anyone reporting bugs will be shot :-)

To build rpm packages:

  1. Clone the current sources:

    # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make release
    
  4. Copy and deploy as needed

Details - 1.1.10 - final

Changesets  602
Diff 143 files changed, 8162 insertions(+), 5159 deletions(-)

Highlights

Features added since Pacemaker-1.1.9

  • Core: Convert all exit codes to positive errno values
  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Allow options to be set recursively
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • PE: Suppress meaningless IDs when displaying anonymous clone status
  • Turn off auto-respawning of systemd services when the cluster starts them
  • Bug cl#5128 - pengine: Support maintenance mode for a single node

Changes since Pacemaker-1.1.9

  • crmd: cib: stonithd: Memory leaks resolved and improved use of glib reference counting
  • attrd: Fixes deleted attributes during dc election
  • Bug cl#5153 - Correctly display clone failcounts in crm_mon
  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5148 - legacy: Correctly remove a node that used to have a different nodeid
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Bug cl#5152 - crmd: Correctly clean up fenced nodes during membership changes
  • Bug cl#5154 - Do not expire failures when on-fail=block is present
  • Bug cl#5155 - pengine: Block the stop of resources if any depending resource is unmanaged
  • Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • Bug cl#5161 - crmd: Prevent memory leak in operation cache
  • Bug cl#5164 - crmd: Fixes crash when using pacemaker-remote
  • Bug cl#5164 - pengine: Fixes segfault when calculating transition with remote-nodes.
  • Bug cl#5167 - crm_mon: Only print “stopped” node list for incomplete clone sets
  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • cib: Correctly read back archived configurations if the primary is corrupted
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cib: Restore the ability to embed comments in the configuration
  • cluster: Detect and warn about node names with capitals
  • cman: Do not pretend we know the state of nodes we’ve never seen
  • cman: Do not unconditionally start cman if it is already running
  • cman: Support non-blocking CPG calls
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Ensure removed peers are erased from all caches
  • corosync: Nodes that can persist in sending CPG messages must be alive after all
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Do not update fail-count and last-failure for old failures
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • crmd: Ensure we return to a stable state if there have been too many fencing failures
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crmd: Store last-run and last-rc-change for all operations
  • crm_mon: Ensure stale pid files are updated when a new process is started
  • crm_report: Correctly collect logs when ‘uname -n’ reports fully qualified names
  • fencing: Fail the operation once all peers have been exhausted
  • fencing: Restore the ability to manually confirm that fencing completed
  • ipc: Allow unprivileged clients to clean up after server failures
  • ipc: Restore the ability for members of the haclient group to connect to the cluster
  • legacy: Support “crm_node --remove” with a node name for corosync plugin (bnc#805278)
  • lrmd: Default to the upstream location for resource agent scratch directory
  • lrmd: Pass errors from lsb metadata generation back to the caller
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (i.e. promote for a clone)
  • pengine: Ensure per-node resource parameters are used during probes
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • pengine: Implement the rest of get_timet_now() and rename to get_effective_time
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd
  • systemd: Reload systemd after adding/removing override files for cluster services
  • xml: Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • xml: Prevent lockups by setting a more reliable buffer allocation strategy

Release Candidate: 1.1.10-rc7

Announcing the seventh release candidate for Pacemaker 1.1.10

This RC is the result of bugfixes to the policy engine, fencing daemon and crmd. We’ve squashed a bug in the construction of compressed messages, and stonith-ng can now recover when a configuration ordering change is detected.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc7

Changesets  57
Diff 37 files changed, 414 insertions(+), 331 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc7

  • N/A

Changes since Pacemaker-1.1.10-rc6

  • Bug cl#5168 - Prevent clones from being bounced around the cluster due to location constraints
  • Bug cl#5170 - Correctly support on-fail=block for clones
  • Bug cl#5164 - crmd: Fixes crmd crash when using pacemaker-remote
  • cib: The result is not valid when diffs fail to apply cleanly for CLI tools
  • cluster: Correctly construct the header for compressed messages
  • cluster: Detect and warn about node names with capitals
  • Core: Remove mainloop triggers that are no longer needed
  • corosync: Ensure removed peers are erased from all caches
  • cpg: Correctly free sent messages
  • crmd: Prevent messages for remote crmd clients from being relayed to wrong daemons
  • crmd: Properly handle recurring monitor operations for remote-node agent
  • crm_mon: Bug cl#5167 - Only print “stopped” node list for incomplete clone sets
  • crm_node: Return 0 if --remove passed
  • fencing: Correctly detect existing device entries when registering a new one
  • lrmd: Prevent use-of-NULL in client library
  • pengine: cl5164 - Fixes pengine segfault when calculating transition with remote-nodes.
  • pengine: Do the right thing when admins specify the internal resource instead of the clone
  • pengine: Re-allow ordering constraints with fencing devices now that it is safe to do so

Release Candidate: 1.1.10-rc6

Announcing the sixth release candidate for Pacemaker 1.1.10

This RC is a result of bugfixes in the policy engine, fencing daemon and crmd. Previous fixes in rc5 have also now been confirmed.

Help is specifically requested for testing plugin-based clusters, ACLs, the --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

There is one bug open for David’s remote nodes feature (involving managing services on non-cluster nodes), but everything else seems good.

Please keep the bug reports coming in!

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies (if you haven’t already)

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc6

Changesets  63
Diff 24 files changed, 356 insertions(+), 133 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc6

  • tools: crm_mon --neg-location drbd-fence-by-handler
  • pengine: cl#5128 - Support maintenance mode for a single node

Changes since Pacemaker-1.1.10-rc5

  • cluster: Correctly remove duplicate peer entries
  • crmd: Ensure operations for cleaned up resources don’t block recovery
  • pengine: Bug cl#5157 - Allow migration in the absence of some colocation constraints
  • pengine: Delete the old resource state on every node whenever the resource type is changed
  • pengine: Detect constraints with inappropriate actions (i.e. promote for a clone)
  • pengine: Do the right thing when admins specify the internal resource instead of the clone

GPG Quickstart

It seemed timely that I should refresh both my GPG knowledge and my keys. I am summarizing my method (and sources) below in the event that they may prove useful to others:

Preparation

The following settings ensure that any keys you create in the future are strong ones by 2013’s standards. Paste the following into ~/.gnupg/gpg.conf:

# when multiple digests are supported by all recipients, choose the strongest one:
personal-digest-preferences SHA512 SHA384 SHA256 SHA224
# preferences chosen for new keys should prioritize stronger algorithms: 
default-preference-list SHA512 SHA384 SHA256 SHA224 AES256 AES192 AES CAST5 BZIP2 ZLIB ZIP Uncompressed
# when making an OpenPGP certification, use a stronger digest than the default SHA1:
cert-digest-algo SHA512

The next batch of settings is optional, but aims to improve the output of gpg commands in various ways - particularly to guard against spoofing. Again, paste them into ~/.gnupg/gpg.conf:

# when outputting certificates, view user IDs distinctly from keys:
fixed-list-mode
# long keyids are more collision-resistant than short keyids (it's trivial to make a key with any desired short keyid)
keyid-format 0xlong
# If you use a graphical environment (and even if you don't) you should be using an agent:
# (similar arguments as  https://www.debian-administration.org/users/dkg/weblog/64)
use-agent
# You should always know at a glance which User IDs gpg thinks are legitimately bound to the keys in your keyring:
verify-options show-uid-validity
list-options show-uid-validity
# include an unambiguous indicator of which key made a signature:
# (see http://thread.gmane.org/gmane.mail.notmuch.general/3721/focus=7234)
sig-notation issuer-fpr@notations.openpgp.fifthhorseman.net=%g
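To check what preferences an existing key advertises (and therefore whether these settings took effect for keys you create later), you can ask gpg directly. A sketch using my key id; substitute your own:

gpg --edit-key 0x726724204C644D83
gpg> showpref
gpg> quit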

Create a New Key

There are several checks for deciding if your old key(s) are any good. However, if you created a key more than a couple of years ago, then realistically you probably need a new one.

I followed instructions from Ana Guerrero’s post, which were the basis of the current debian guide, but selected the 2013 default key type:

  1. run gpg --gen-key
  2. Select (1) RSA and RSA (default)
  3. Select a keysize greater than 2048
  4. Set a key expiration of 2-5 years. [rationale]
  5. Do NOT specify a comment for User ID. [rationale]

Adding Additional UIDs and Setting a Default

At this point my keyring (gpg --list-keys) looked like this:

pub   4096R/0x726724204C644D83 2013-06-24
uid                 [ultimate] Andrew Beekhof <andrew@beekhof.net>
sub   4096R/0xC88100891A418A6B 2013-06-24 [expires: 2015-06-24]

Like most people, I have more than one email address and I will want to use GPG with them too. So now is the time to add them to the key. You’ll want the gpg --edit-key command for this. Ana has a good example of adding UIDs and setting a preferred one. Just search her instructions for “Add other UID”.
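For reference, the flow looks roughly like this (the uid number will depend on your key):

gpg --edit-key 0x726724204C644D83
gpg> adduid            (prompts for the new name and email address)
gpg> uid 2             (select the uid that should become the default)
gpg> primary
gpg> save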

Separate Subkeys for Encryption and Signing

The general consensus is that separate keys should be used for signing versus encryption.

tl;dr - you want to be able to encrypt things without signing them, as “signing” may have unintended legal implications. There is also the possibility that signed messages can be used in an attack against encrypted data.

By default gpg will create a subkey for encryption, but I followed Debian’s subkey guide for creating one for signing too (instead of using the private master key).

Doing this allows you to make your private master key even safer by removing it from your day-to-day keychain.

The idea is to make a copy first and keep it in an even more secure location, so that if a subkey (or the machine it’s on) gets compromised, your master key remains safe and you are always in a position to revoke subkeys and create new ones.
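A sketch of the procedure (see the Debian guide for the full details and prompts):

gpg --armor --export-secret-keys 0x726724204C644D83 > master-backup.asc   (full backup first!)
gpg --edit-key 0x726724204C644D83
gpg> addkey            (choose "RSA (sign only)")
gpg> save
gpg --export-secret-subkeys 0x726724204C644D83 > subkeys.gpg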

Sign the New Key with the Old One

If you have an old key, you should sign the new one with it. This tells everyone who trusted the old key that the new one is legitimate and can therefore also be trusted.

Here I went back to Ana’s instructions. Basically:

gpg --default-key OLDKEY --sign-key NEWKEY

or, in my case:

gpg --default-key 0xEC3584EFD449E59A --sign-key 0x726724204C644D83

Send it to a Key Server

Tell the world so they can verify your signature and send you encrypted messages:

gpg --send-key 0x726724204C644D83
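By default gpg talks to the keyserver from your configuration; you can also name one explicitly (the server below is just an example):

gpg --keyserver hkp://pgp.mit.edu --send-key 0x726724204C644D83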

Revoking Old UIDs

If you’re like me, your old key might have some addresses which you have left behind. You can’t remove addresses from your keys, but you can tell the world to stop using them.

To do this for my old key, I followed the instructions on the gnupg mailing list.
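The short version (using my old key; the uid number will vary):

gpg --edit-key 0xEC3584EFD449E59A
gpg> uid 2             (select the address you no longer use)
gpg> revuid
gpg> save
gpg --send-key 0xEC3584EFD449E59A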

Everything still looks the same when you search for my old key:

pub  1024D/D449E59A 2007-07-20 Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@suse.de>
                               Andrew Beekhof <beekhof@gmail.com>
                               Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <abeekhof@novell.com>
     Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

But if you click through to the key details, you’ll see the addresses associated with my time at Novell/SuSE now show “revok” in red.

pub  1024D/D449E59A 2007-07-20            
     Fingerprint=E5F5 BEFC 781F 3637 774F  C1F8 EC35 84EF D449 E59A 

uid Andrew Beekhof <beekhof@mac.com>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]

uid Andrew Beekhof <abeekhof@suse.de>
sig  sig3  D449E59A 2007-07-20 __________ __________ [selfsig]
sig revok  D449E59A 2013-06-24 __________ __________ [selfsig]
...

This is how other people’s copies of gpg know not to use this key for that address anymore. It is also why it’s important to refresh your keys periodically.
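Refreshing is a one-liner that picks up new signatures and revocations for everything in your keyring:

gpg --refresh-keys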

Revoking Old Keys

Realistically though, you probably don’t want people using old and potentially compromised (or compromise-able) keys to send you sensitive messages. The best thing to do is revoke the entire key.

Since keys can’t be removed once you’ve uploaded them, you’re actually updating the existing entry. To do this you need the original private key - so keep it safe!

Some people advise you to pre-generate the revocation certificate - personally, that seems like just one more thing to keep track of.
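For the record, pre-generating one looks like this; guard the output file as carefully as the key itself, since anyone holding it can revoke your key:

gpg --armor --gen-revoke 0x726724204C644D83 > revocation.asc

Importing and publishing revocation.asc later is what actually performs the revocation.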

Orphaned keys that can’t be revoked still appear valid to anyone wanting to send you a secure message - a good reason to set an expiry date as a failsafe!

This is what one of my old revoked keys looks like:

pub  1024D/DABA170E 2004-10-11 *** KEY REVOKED *** [not verified]
                               Andrew Beekhof (SuSE VPN Access) <andrew@beekhof.net>
     Fingerprint=9A53 9DBB CF73 AB8F B57B  730A 3279 4AE9 DABA 170E 

Final Result

My new key:

pub  4096R/4C644D83 2013-06-24 Andrew Beekhof <andrew@beekhof.net>
                               Andrew Beekhof <beekhof@mac.com>
                               Andrew Beekhof <abeekhof@redhat.com>
     Fingerprint=C503 7BA2 D013 6342 44C0  122C 7267 2420 4C64 4D83 

Closing word

I am by no means an expert at this, so I would be very grateful to hear about any mistakes I may have made above.

Release Candidate: 1.1.10-rc5

Let’s try this again… Announcing the fourth-and-a-half release candidate for Pacemaker 1.1.10

I previously tagged rc4 but ended up making several changes shortly afterwards, so it was pointless to announce it.

This RC is a result of cleanup work in several ancient areas of the codebase:

  • A number of internal membership caches have been combined
  • The three separate CPG code paths have been combined

As well as:

  • Moving clones is now both possible and sane
  • Improved behavior on systemd based nodes
  • and other assorted bugfixes (see below)

Please keep the bug reports coming in!

Help is specifically requested for testing plugin-based clusters, ACLs, the new --ban and --clear commands, and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

Also any light that can be shed on possible memory leaks would be much appreciated.

If everything looks good in a week from now, I will re-tag rc5 as final.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc5

Changesets  168
Diff 96 files changed, 4983 insertions(+), 3097 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc5

  • crm_error: Add the ability to list and print error symbols
  • crm_resource: Allow individual resources to be reprobed
  • crm_resource: Implement --ban for moving resources away from nodes and --clear (replaces --unmove)
  • crm_resource: Support OCF tracing when using --force-(check|start|stop)
  • PE: Allow active nodes in our current membership to be fenced without quorum
  • Turn off auto-respawning of systemd services when the cluster starts them

Changes since Pacemaker-1.1.10-rc3

  • pengine: Bug cl#5155 - Block the stop of resources if any depending resource is unmanaged
  • Convert all exit codes to positive errno values
  • Core: Ensure the blackbox is saved on abnormal program termination
  • corosync: Detect the loss of members for which we only know the nodeid
  • corosync: Do not pretend we know the state of nodes we’ve never seen
  • corosync: Nodes that can persist in sending CPG messages must be alive after all
  • crmd: Do not get stuck in S_POLICY_ENGINE if a node we couldn’t fence returns
  • crmd: Ensure all membership operations can complete while trying to cancel a transition
  • crmd: Everyone who gets a fencing notification should mark the node as down
  • crmd: Initiate node shutdown if another node claims to have successfully fenced us
  • crmd: Update the status section with details of nodes for which we only know the nodeid
  • crm_report: Find logs in compressed files
  • logging: If SIGTRAP is sent before tracing is turned on, turn it on
  • pengine: If fencing is unavailable or disabled, block further recovery for resources that fail to stop
  • remote: Workaround for inconsistent tls handshake behavior between gnutls versions
  • systemd: Ensure we get shut down correctly by systemd

Release Candidate: 1.1.10-rc3

Announcing the third release candidate for Pacemaker 1.1.10

This RC is a result of work in several problem areas reported by users, some of which date back to 1.1.8:

  • manual fencing confirmations
  • potential problems reported by Coverity
  • the way anonymous clones are displayed
  • handling of resource output that includes non-printing characters
  • handling of on-fail=block

Please keep the bug reports coming in. There is a good chance that this will be the final release candidate and 1.1.10 will be tagged on May 30th.

Help is specifically requested for testing plugin-based clusters, ACLs and admin actions (such as moving and stopping resources, calls to stonith_admin) which are hard to test in an automated manner.

To build rpm packages for testing:

  1. Clone the current sources:

    # git clone --depth 1 git://github.com/ClusterLabs/pacemaker.git
    # cd pacemaker
    
  2. Install dependencies

    [Fedora] # sudo yum install -y yum-utils
    [ALL]    # make rpm-dep
    
  3. Build Pacemaker

    # make rc
    
  4. Copy and deploy as needed

Details - 1.1.10-rc3

Changesets  116
Diff 59 files changed, 707 insertions(+), 408 deletions(-)

Highlights

Features added in Pacemaker-1.1.10-rc3

  • PE: Display a list of nodes on which stopped anonymous clones are not active instead of meaningless clone IDs
  • PE: Suppress meaningless IDs when displaying anonymous clone status

Changes since Pacemaker-1.1.10-rc2

  • Bug cl#5133 - pengine: Correctly observe on-fail=block for failed demote operation
  • Bug cl#5151 - Ensure node names are consistently compared without case
  • Check for and replace non-printing characters with their octal equivalent while exporting xml text
  • cib: CID#1023858 - Explicit null dereferenced
  • cib: CID#1023862 - Improper use of negative value
  • cib: CID#739562 - Improper use of negative value
  • cman: Our daemons have no need to connect to pacemakerd in a cman based cluster
  • crmd: Do not record pending delete operations in the CIB
  • crmd: Ensure pending and lost actions have values for last-run and last-rc-change
  • crmd: Insert async failures so that they appear in the correct order
  • crmd: Store last-run and last-rc-change for fail operations
  • Detect child processes that terminate before our SIGCHLD handler is installed
  • fencing: CID#739461 - Double close
  • fencing: Correctly broadcast manual fencing ACKs
  • fencing: Correctly mark manual confirmations as complete
  • fencing: Do not send duplicate replies for manual confirmation operations
  • fencing: Restore the ability to manually confirm that fencing completed
  • lrmd: CID#1023851 - Truncated stdio return value
  • lrmd: Don’t complain when heartbeat invokes us with -r
  • pengine: Correctly handle resources that recover before we operate on them
  • pengine: Re-initiate active recurring monitors that previously failed but have timed out
  • xml: Restore the ability to embed comments in the cib