
The fault-tolerance mechanism: a high-level overview

One of the key advantages of the PIM approach lies in its robustness to component failures. The main idea behind the fault-tolerance mechanism is that, to attain this robustness, the CP should refer to team members in terms of their capabilities, rather than relying only on static identifiers (e.g. network addresses). This way, if a component is disabled, another component with similar capabilities can take its place.

Types of failures

Two kinds of failures can occur in the PIM model:

  • failure of a team member that was not executing the CP: when the CP tries to migrate to the missing node, the PIMRuntime detects the problem and tries to rebind the corresponding logical node to another robot of the same type. For now, this rebind phase is blocking, so the execution of the CP is suspended until a suitable robot is available to join the team. However, less restrictive policies could be desirable in some cases. For an example of how such policies could be implemented, see Node skipping behavior.
  • failure of the team member that was executing the CP: a special recovery procedure is triggered to identify the node holding the most recent version of the CP state. This is possible because each node always keeps a copy of the latest CP version that was executed on it. After this procedure, the system resumes that copy, substitutes the missing node as in the former case, and restarts the normal execution of the CP.
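The recovery step for the second case can be sketched as a simple comparison of backup version numbers: each surviving node reports the version of its local CP copy, and the node with the highest version is chosen as the resume point. This is an illustrative sketch only; the class and method names below are hypothetical, not the actual PIM API.

```java
import java.util.Map;

// Hypothetical sketch: after the node executing the CP fails, surviving
// nodes compare the version numbers of their local CP backup copies and
// resume from the most recent one.
public class CpRecovery {

    // Given each surviving node's backup version, return the node holding
    // the latest copy of the CP state.
    public static String latestBackupHolder(Map<String, Integer> backupVersions) {
        String best = null;
        int bestVersion = -1;
        for (Map.Entry<String, Integer> e : backupVersions.entrySet()) {
            if (e.getValue() > bestVersion) {
                bestVersion = e.getValue();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> versions = Map.of(
            "herder-1", 41,
            "herder-2", 43,   // most recent backup
            "scout-1", 42
        );
        System.out.println(latestBackupHolder(versions)); // prints "herder-2"
    }
}
```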

The fault-tolerance mechanism: how we realized it

As you can see in General model discussions, to obtain such behavior we included two new modules in the resident part of the PIM architecture:

  • the NodeMonitor;
  • the GroupManager.

The NodeMonitor

The NodeMonitor manages the dynamic binding of logical nodes to physical robots in the environment, handling possible failures and triggering a rapid recovery procedure. The chosen strategy is completely decentralized and peer-to-peer: an instance of the NodeMonitor component runs on each node belonging to the PIM, and each instance maintains its own view of the other nodes in the network.
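The per-node view that each NodeMonitor instance maintains can be sketched as a small map from peer UUIDs to their last known status, updated by discovery and loss events from the local GroupManager. The names below are illustrative assumptions, not the actual PIM classes.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of the local view a NodeMonitor instance might keep.
// Every node runs its own instance; nothing here is centralized.
public class NodeView {
    public enum Status { ALIVE, UNREACHABLE }

    private final Map<String, Status> peers = new HashMap<>();

    // Invoked when the local GroupManager discovers a peer.
    public void peerDiscovered(String uuid) {
        peers.put(uuid, Status.ALIVE);
    }

    // Invoked when the local GroupManager reports a lost peer.
    public void peerLost(String uuid) {
        peers.put(uuid, Status.UNREACHABLE);
    }

    public boolean isAlive(String uuid) {
        return peers.get(uuid) == Status.ALIVE;
    }
}
```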

The GroupManager

The GroupManager <bibref>Suri2006</bibref> has been taken from the Agile Computing kernel. It is a Java-based distributed component that facilitates peer-to-peer node and resource discovery. It supports peer group management, node discovery, and node loss detection. In the case of the PIM, each robot automatically receives a specific UUID (Universally Unique IDentifier) at startup, which identifies the node throughout the life of the PIM. In addition, nodes in the GroupManager can organize themselves into groups. By partitioning the nodes of the network into different groups and allowing resource sharing only between nodes in the same group, the GroupManager provides significant flexibility in resource and service management. Groups are used in the PIM as “containers of similar nodes”: robot nodes of the same type (e.g. Herder) join the same peer group, in order to be recognized as such by other nodes.
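The per-type grouping described above can be sketched as follows: each node receives a UUID once at startup and joins the peer group for its robot type, so that rebinding candidates can later be looked up by type. This is a minimal sketch under those assumptions; the class and method names are hypothetical and not the Agile Computing kernel API.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;
import java.util.UUID;

// Hypothetical sketch of per-type peer groups ("containers of similar nodes").
public class PeerGroups {
    private final Map<String, Set<UUID>> groups = new HashMap<>();

    // A node of the given type (e.g. "Herder") joins its type group.
    // The UUID is assigned once and identifies the node for the PIM's lifetime.
    public UUID join(String type) {
        UUID id = UUID.randomUUID();
        groups.computeIfAbsent(type, t -> new HashSet<>()).add(id);
        return id;
    }

    // All nodes of a given type, e.g. candidates for rebinding.
    public Set<UUID> membersOf(String type) {
        return groups.getOrDefault(type, Set.of());
    }
}
```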

The interaction between the NodeMonitor and the GroupManager

The mapping of logical nodes to node UUIDs is handled by the NodeMonitor, according to the following criteria:

  • when a new node comes up, the other nodes detect its presence through their local GroupManager instances. When the CP tries to migrate to a missing/unbound node, the NodeMonitor tries to rebind that logical node to an available robot of the same type. If no binding occurs, the execution of the CP is blocked until a suitable node comes up;
  • if a bound node dies or becomes unreachable, every GroupManager generates an appropriate event for its local NodeMonitor, which tries to determine whether the CP is still alive. To do so, every node publishes, through its local GroupManager, the version number of its local CP backup copy. In this way, after a short time, each node knows whether the CP still exists and, if not, the location of its latest backup copy;
  • the death of an unbound node is simply ignored by the PIM.
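The first criterion above, the rebinding rule, can be sketched as a lookup over the known robots: the logical node is bound to any available robot of the required type, and if none exists the caller must block (modeled here by an empty Optional). Names and data shapes are illustrative assumptions, not the actual NodeMonitor interface.

```java
import java.util.Map;
import java.util.Optional;
import java.util.Set;

// Hypothetical sketch of the rebinding rule applied by the NodeMonitor.
public class Rebinder {

    // Pick the UUID of an available robot of the required type that is not
    // already bound to another logical node; empty means the CP must block
    // until a suitable node comes up.
    public static Optional<String> rebind(String requiredType,
                                          Map<String, String> typeByUuid,
                                          Set<String> alreadyBound) {
        for (Map.Entry<String, String> e : typeByUuid.entrySet()) {
            if (e.getValue().equals(requiredType) && !alreadyBound.contains(e.getKey())) {
                return Optional.of(e.getKey());
            }
        }
        return Optional.empty();
    }
}
```

Keeping the decision purely type-based matches the capability-oriented binding described in the overview: any robot of the same type is an acceptable substitute.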

List of references