From Agentgroup
Jump to: navigation, search

New Proposed PIM model

Pages here will describe some ideas on how to improve the PIM model from both the programming model and architectural (e.g. fault-tolerance) standpoint.

Overview of the PIM model

The PIM model consists of a single Coordinating Process (CP) and a set of Components, each capable of running the CP. The CP cycles amongst the components at a speed sufficient to meet the overall coordination needs of the PIM. Each component maintains the code for the CP, so the controlling process can move from component to component by passing only a small runtime state using strong mobility.

Overall architecture for the PIM

The new PIM model is depicted in the following picture.

PIMarchitecture.png

CP

The model outlined here does not constrain the programmer of the CP. For instance, the current implementation of the CTF follows a very sequential programming style that we want to keep in the new model. Nevertheless, here we give some rules of thumb to better leverage the features of the new model and achieve an increased reactivity of the algorithm.

PIM object

The PIM is the unique object representing the whole system to the CP. It provides the CP with the following features:

  • Unified view of nodes, their number and types.
  • Access to the state of each node.
  • High-level control on the migration process (enable/disable migration).

According to the previous PIM model, the PIM object provides the CP with access to the state of each node (e.g. PIM.getPIMState()) and is also used to issue commands to the various nodes (e.g. rPIM.setXYZ() in the RobotPIM). This model can turn out to be confusing if different kinds of nodes are involved in the coordination problem. One new aspect of our model is supporting different kinds of components/nodes in the same PIM distributed system. The reason for having such diversity is that different components may provide different functionalities / capabilities or play different roles in the same coordination problem.

Node types

The current PIM model can be improved by clearly introducing the concept of Node in the object-oriented programming model of the PIM. A Node object represents a logical component taking part in the coordination algorithm and it maps dynamically to some physical node in the real system. Different kinds of nodes may be defined (and used) in a specific problem and all of them will have to inherit from the abstract Node class. The node type is defined at bootstrap time for each node and cannot change dynamically at runtime. It furthermore refers to the physical interface/capabilities of the node (e.g. p3dx robot) rather than to the role played in the CP algorithm (e.g. Defender vs. Attacker in CTF).

Example: the sheep-herding problem. Suppose we need to coordinate some robots to herding a set of independently acting sheep. 
Some of our robots will be used for herding and others for sensing in order to   distinguish sheep from the rocks and cats.  There are 
three types of robots: seers, trackers and herders. The seers are unmanned ground vehicles with shape sensors with range of 30 meters. 
Based on shape seers  cannot distinguish between animals and rocks, but can tell large (sheep or rock) from small (cat or rock). The trackers 
are aerial vehicles and have infrared sensors, so they can distinguish animals from rocks, but cannot distinguish a cat from a sheep. Finally, 
the herders have both sensors but only limited range. Given these robots to control, the task is to herd a set of sheep into the center of the 
field and keep them there.

The above example is an excerpt from our AAMAS article and it shows clearly why do we need to introduce the concept of node type in the programming model. Three node types can be identified in the previous problem: herder, seer, and tracker. In the figure, these three types are represented in a hierarchical fashion.

NodeHierarchy.png

A possible mapping between physical nodes and logical nodes for the herding problem is reported in figure.

LogicalVsPhysicalNodes.png

The expected benefits of our new PIM model are listed here:

  • It better highlights the concept of “node capability” than the mere “node index”. In other words, writing rPIM.setXYZ(2,x,y,z) is less clear than writing herderNode.setXYZ(x,y,z), where herderNode is Node object number 2 in the PIM. Grouping node functionalities into a specific class helps achieving a model closer to a component-oriented model than the previous one.
  • The Node object will be used to implement the PIM caching mechanism of the node state. For instance, operations invoked on a Node object will be performed only if the CP is running on that node, otherwise the invocation is cached and executed later (when the destination node is reached).
  • As already said and exemplified in the figure above, the logical Node object can be bound to a physical node entity joining the PIM (e.g. a robot). In order to enable Coping with Fault-tolerance in the PIM, the physical node may be replaced dynamically, e.g. when a certain node fails. The new node will have to be a node of the same type as the failed one.
The Node class

The abstract Node class is the basic super-class of all nodes. Each new node type has to inherit from Node and implement its declared abstract methods. A Node object will likely include a nodeUUID field assigned to it by the GroupManager. If no nodeUUID is set, then the logical node is bound to no physical node in the system. The binding between an unbound logical node and a new physical node may occur only at the node hosting the CP: when the new node is detected by the system and its type matches the one of a logical node (e.g. node 1 in the figure above is expected to be a Tracker node and a new node of type Tracker joins the PIM). Each node knows and manifests its node type after reading a node.cfg file deployed on the physical machine.

A possible example of node.cfg file for the Tracker node is the following:

nodeType=”us/ihmc/pim/herdingCP/Tracker”
isBootstrapNode=true
useNorthStar=true
useGarmin=true
...

According to the above node.cfg file, the node above will broadcast its presence as of the us/ihmc/pim/herdingCP/Tracker type.

Initial PIM configuration: pim.cfg vs. Node instantiation

The expected logical configuration of the coordination problem can be specified in two ways:

  1. using the pim.cfg file (current approach)
  2. programmatically, i.e. in the code of the CP (proposed approach).

This pim.cfg file describes the number of nodes, the ordering of nodes in the ring (indexes from 0 to N-1) and the type of each node (e.g. p3dx, TRIPS). In the current PIM implementation, the IP address and TCP port of the nodes is specified as well. We believe that this model should be disbanded in favor of a more flexible model for the following reasons:

  • Assigning an IP address to each node violates the basic programming principle that we claim for the PIM model: we don’t want the PIM to rely on a particular node, but simply on the capabilities of the nodes.
  • From the previous point it descends that fault-tolerant behavior is very hard to achieve, because the binding between the physical node and the logical one is static and predefined (IP address). If the physical node fails, then only a node with the same IP and index can take its place.
  • Specifying the configuration of the problem (nodes, types and ordering) in both the pim.cfg file and the CP (programmatically) is redundant, error-prone and ineffective. We reckon that the initial aim of the pim.cfg file was “having a fast and easy way to change the parameters of the coordination problem”. While this approach works well for trivial and node-unaware CPs (e.g. the MatrixMultiplication), it fails to deliver its promised benefits in more complex algorithms. Just consider, for instance, the non-trivial and node-aware CTF algorithm: the number, type and ordering of nodes is replicated in the CP code and changing the pim.cfg file impairs the behavior of the CP, which was coded with a certain node configuration. In other words, that CP will not work properly with the new nodes configuration and the usefulness of the pim.cfg file is vanished.

The programmatic method seems both simpler and more effective. No file is used to outline the configuration of nodes in the problem. The latter configuration is instead dynamically constructed at runtime by the CP itself using the Node concept. It basically means that, at bootstrap time (e.g. in the constructor of the CP), the CP instantiates as many Node objects as it needs (e.g. a Tracker node, a Seer node and a Herder node in the figure above). Then, it fills up the PIM object with those nodes, thus precisely describing the configuration of the problem at a mere logical level. We admit that this causes a loss of configurability compared to the use of a pim.cfg file, because it requires recompiling the CP each time the experiment is modified. Nonetheless, such promised configurability (with regard to the number and ordering of nodes) is illusory or overly hard to achieve in real coordination algorithms.

The PIM initialization and Node initialization processes

As said earlier, we want to isolate the definition of the coordination problem only within the CP (i.e. programmatically). Yet, the pim.cfg file comes in handy to specify initialization values not related to the CP algorithm. It will contain only configurable parameters that are read at bootstrap time by each PIM runtime. An example is the following pim.cfg:

timeslice=100
className=us/ihmc/pim/ctf/CaptureTheFlag
RuntimeListenPort=4000
GroupManagerPort=8502
...

These parameters are loaded by each runtime, regardless of the specific type of a node. Every initialization parameter specific to the node type is instead contained in the node.cfg file, as shown earlier. This file may be different on each node and it is interpreted by the Node subclass matching that type. In other words, the PIM runtime will analyze the nodeType field and then load the right class in the JVM. On that class, it will call a special public static void init(ConfigLoader cfg, String filePath) method, which will retrieve type-specific properties from the node.cfg file. For instance, a Tracker node will try to load its specific initialFlightLevel parameter from its node.cfg file. It is important that the init() method be declared as public, in order to be accessed from the PIM runtime package.

In the following code excerpt, we show an example of the TRIPS node class:

<java>

package us.ihmc.newPim.nodes;

/**
 * A node with TRIPS capabilities
 * 
 * @author  Raffaele Quitadamo
 *
 */ 
public class TRIPSNode extends Node {

	/** 
	 * The hostname (or IP address) of the TRIPS host 
	 */
	static InetAddress TRIPSHost;

	/**
	 * The TCP port of the TRIPS host
	 */
	static int TRIPSPort;
	
	/**
	 * Initialization method that carries out node specific initialization, reading data
	 * from the passed ConfigLoader object.
	 * 
	 * @param nodeConfigFile The ConfigLoader to read properties from the node specific configuration file
	 * @param configFile The full pathname of the node config file
	 * @throws UnknownHostException 
	 *  
	 */
	public static void init(ConfigLoader nodeConfigLoader, String configFile) throws UnknownHostException
	{
		TRIPSHost = InetAddress.getByName(nodeConfigLoader.getProperty("TRIPSHost"));
		TRIPSPort = nodeConfigLoader.getPropertyAsInt("TRIPSPort", 0);
		
		System.loadLibrary("TRIPSLib");

		if((TRIPSHost != null) && (TRIPSPort > 0))
		{
			TRIPS.initializeTRIPSLib(TRIPSHost.getHostName(), TRIPSPort);
		}
	}
}  

</java> This strategy allows a better and clearer separation of concerns in the new PIM model: on the one hand, the pim.cfg is simplified and made very general; on the other hand, we can add new node types with their related initialization parameters without the need to change the PIM runtime, because the logic to interpret such new parameters is delivered with the Node subclass.

The PIM class

As described earlier, each node broadcasts its type and UUID at bootstrap time using the GroupManager. The list of known nodes is maintained by the NodeMonitor component of the PIM runtime (see figure of the PIM architecture above). This component notifies the PIM object when

  • a new node comes up
  • a previous node is no more reachable (e.g. it is dead).

In the case of new physical node coming up, the PIM tries to map that node on an unbound logical node (as declared by the programmer of the CP). The mapping is based on the node type, but in the future we can implement more sophisticated mapping policies (e.g. choose the least overloaded or the one with the highest bandwidth). If the physical node does not match the expected types or is not needed at the moment, its presence is simply ignored.

StateChartNode.png
Interaction between the PIM object and the CP

From the programmer’s point of view, we can consider two possible approaches:

  1. The CP algorithm has been designed to work with a predefined set of nodes (number and types). This approach will be called the static configuration approach.
  2. The CP can work on a flexible set of nodes (flexible means that the number of nodes can grow or shrink). We will call this approach the dynamic configuration approach. However, some huge programming-level questions arise in this approach: (i) Do we really need to handle this case? (ii) Is it possible or meaningful in real coordination problems?


The following methods will be included in the PIM class independently of which approach (dynamic or static) is used:

  • Node getNodeByIndex(int index) returns the Node object identified by the passed index (ranging from 0 to N-1). If no such a node exists, the method will throw a NoSuchPimNodeException.
  • void insertNewNode(int index, Node newNode) is used by the CP to set the configuration of the current coordination problem. The index is the logical index in the problem (e.g. insertNewNode(1, new Tracker()) from the figure above).
  • void removeNode(Node oldNode) and removeNodeByIndex(int index) remove the specified node from the current configuration. These methods may be used if we want the CP to change dynamically at runtime the configuration of the coordination problem, e.g. skipping some nodes during the experiment.
  • int getCurrentNodesCount(), returns the total number of nodes currently in the PIM (including the current node).
  • int isRingClosed(), tells if we have holes (null slots) in the ring
  • void enableMigration() and void disableMigration(), used when the CP needs to stay on the same node to perform an atomic operation.
  • others to be added...
We think that Node getCurrentNode() doesn't make sense anymore because the caching mechanism is supposed to hide all the migration details to the CP. This way, the CP behaves like it is on every node at
the same time.

Please, note that in the following paragraphs we will focus only on the static configuration approach.

The “unbound” state of a Node

Consider that a Node object may be “invalidated” during the life of the CP (e.g. the node disappears), transiting into the unbound state (see figure above). In order to make this transparent to the programmer, the PIM tries to replace the lost node with another one of the same type. If no such node exists and the CP wants to migrate on that node, two behaviors of the PIM are advisable:

  1. Skipping that node and migrate on the next one in the ring (skipping behavior),
  2. The CP will have to block until a new node has replaced the previous one (blocking behavior).

The first behavior is desirable when the CP may want to skip the unbound node and continue execution with the others. Right now, the blocking approach has been implemented in the CVS code.

Handling node failure/substitution in the static configuration approach

As for node substitution, two possibilities can be explored:

  1. The new candidate node does not receive the state of its predecessor and the entire algorithm must be re-executed taking into account the state of the new node.
  2. The new candidate node is forced to assume the same state of its predecessor, so that the algorithm can go on smoothly. This case raises some non-negligible issues on how to initialize the physical node (e.g. robot) with the old state. For instance, can we really force the new robot to reach the same spatial position and velocity of the old robot?

The PIM runtime

Right above the JVM, we have the PIM runtime, which provides the architectural mechanisms for the PIM to work.

Node connection approaches

In the current PIM model, nodes are connected in a ring topology before the PIM starts executing. Socket connections among consecutive nodes are kept open throughout the life of the PIM.

In our model, connections are established right before the current node sends the CP to the next node. According to the latest measurements, the cost of opening a socket each time is always less than 1ms (negligible with respect to the overall cycle time). The advantage of this model is an increased flexibility to better support fault tolerance.

Fault tolerance

The PIM runtime of the new model has been explicitly designed to provide fault-tolerance in a minimally intrusive fashion. Further details on this matter are provided in the fault-tolerance page.

PimVM

At the lowest level, we find the JVM running the whole PIM model (mostly written in Java). Any “strong mobility enabled” JVM can be suitable and in our case both the AromaVM and Mobile JikesRVM are supported. The PimVM class defines the minimum set of abstract methods that the underlying JVM should implement to work with the PIM.

The description of such an interface will be detailed in the future.