## REIHE INFORMATIK 32/95

# Improved 3-Line Hardware Synchronization

M. Makhaniok, R. Männer Universität Mannheim Seminargebäude A5 D-68131 Mannheim

# IMPROVED 3-LINE HARDWARE SYNCHRONIZATION

#### M. Makhaniok<sup>1</sup>), R. Männer<sup>2,3</sup>)

Institute of Engineering Cybernetics, Academy of Sciences, Surganov St. 6, 220012 Minsk, Republic Belarus
Lehrstuhl für Informatik V, Universität Mannheim, A5, D-68131 Mannheim, Germany
Interdisziplinäres Zentrum für Wissenschaftliches Rechnen, Universität Heidelberg, Heidelberg, Germany

• Abstract. A new procedure is proposed to synchronize processors of a distributed system, which concurrently execute a common process consisting of a sequence of operations. The procedure is an extension of that used for the 1987 IEEE Futurebus Standard. It is based on global synchronization lines and a distributed synchronizer, and requires only minor modifications of existing hardware. The procedure allows to carry out two alternative synchronization protocols. As usual, an operation may be terminated by the last processor having finished its part of the operation. Alternatively, the operation may also be terminated by the first processor being ready. Application of this second procedure, e.g., to bus arbitration, allows to reduce the arbitration time in average by a factor of 2.

**1. Introduction.** We consider a  $\underline{\text{process}}^1$  that consists of a sequence of  $\underline{\text{operations}}$ , e.g., search, comparison or sorting, that are executed distributively by multiple processors. The processors execute their local parts of every operation. Then they are synchronized so that every operation is finished by all processors before the next one is started.

A distributed system (DS) consists of a number of processors that operate in the general case asynchronously at different speeds and with different input data. Generally a varying number of processors participate in an operation, and the duration of an operation depends on the set of participating processors and their characteristics. Assuming a high number of processors and a short duration of the operations, it is not possible to exchange these characteristics before the start of an operation. Therefore a <u>distributed hardware synchronization</u> between the processors is required so that each one can determine when the current operation will be finished; using only its own characteristics, and characteristics specified for the whole system. The proper type of synchronization is the topic of this paper.

In many cases synchronization in a DS is done using an asynchronous (hand-shake) protocol. Such a synchronization does not need a common clock and has a number of well known advantages. One of them is that the end of an operation can be determined dynamically [1] which can speed up the operation on average. This may considerably reduce unnecessary waiting time and processing overhead. The synchronization procedure introduced in this paper exploits this fact.

This work has been supported by the Deutsche Forschungsgemeinschaft (DFG) under grants Ma 1150/8-1, Ma 1150/8-2, and 436-WER-113-1-0 (438 113/117/0).

<sup>&</sup>lt;sup>1</sup> Here and below we underline terms that are newly defined.

**2. Taub's synchronization procedure.** Synchronization of processors can be done according to the following procedure that was originally designed by Taub [2,3] for the IEEE "Futurebus" backplane bus standard. In this paper we simply call it "Taub's synchronization procedure". Each processor of a DS can be considered to consist of two parts, a processing element (PE) and a synchronization unit (SU), where the lines  $L_i$  are used for data exchange between PEs (Fig. 1).



Fig. 1. A multiple processor system with distributed synchronization.

The general principle of Taub's synchronization procedure is that <u>any</u> processor can start a new operation, and the operation is finished only when <u>all</u> participating processors completed their individual parts.

Let **Z** be a set of processors in the system. It takes an <u>individual waiting time</u> t(j,W) for the **j**-th processor to complete its part of an operation **W**. Therefore the operation will be finished within the time  $Max\{t(j,W), j \in Z\}$ . We call this value the <u>global waiting time</u>.

Taub's procedure is based on a synchronizer that uses three bus lines P, Q, R. Each SU is connected to these lines and can assert onto them its individual binary signals  $p_j$ ,  $q_j$ ,  $r_j$ 

through open-collector drivers. Thus every line computes the wired-OR of the signals asserted by the SUs, and all SUs read back this value. If we assume, e.g., that a line is set to a logic 0, then it will be switched to the logic 1 state if **any** processor asserts a 1 signal. In contrast, the line will change its state from 1 to 0 only if **all** processors withdraw their 1 signals.

Suppose that the initial signals  $\mathbf{p}_j$ ,  $\mathbf{q}_j$ ,  $\mathbf{r}_j$ ,  $\mathbf{j} \in \mathbf{Z}$ , generated by all SUs, have the values of **0**, **0**, **1** respectively, and therefore also **P**, **Q**, **R**. Then the sequence of events in the synchronization procedure according to [2] is described by Tab. 1.

| #  | Protocol steps                                        |        | Next stage |                  |                 |                 |                           |       |
|----|-------------------------------------------------------|--------|------------|------------------|-----------------|-----------------|---------------------------|-------|
|    | _                                                     | signal | lines      | rrent stage of s |                 | trigger outputs | trigger outputs           |       |
|    |                                                       | run    | PQR        | Op <sub>1</sub>  | Op <sub>2</sub> | Op <sub>3</sub> | $s_p r_p s_q r_q s_r r_r$ | pqr   |
| 1  | Initial condition for the                             | 0      | 001        | 0                | 0               | 0               | 0 x 0 x 1 0               | 0 0 1 |
|    | 1st operation                                         |        | 0.0.1      | -                |                 |                 |                           | 1.0.1 |
| 2a | Start of the 1st opera-<br>tion by any PE             | 1      | 001        | 0                | 0               | 0               | 1 0 0 x 1 0               | 101   |
| 2b | Start of the 1st opera-<br>tion by other PEs          | 0      | 101        | 0                | 0               | 0               | 1 0 0 x 1 0               | 1 0 1 |
| 3  | Execution of the 1st operation (stable state)         | Х      | 101        | 1                | 0               | 0               | 1 0 0 x 1 0               | 1 0 1 |
| 4  | End of the 1st operation by any PE                    | Х      | 101        | 1                | 0               | 0               | 1 0 0 x 1 0               | 1 0 0 |
| 5  | End of the 1st operation by all PEs                   | Х      | 100        | 0                | 0               | 0               | 1 0 1 0 0 x               | 1 1 0 |
| 6  | Execution of the 2nd operation (stable state)         | Х      | 110        | 0                | 1               | 0               | 0 0 1 0 0 x               | 1 1 0 |
| 7  | End of the 2nd opera-<br>tion by any PE               | Х      | 110        | 0                | 1               | 0               | 0 1 1 0 0 x               | 0 1 0 |
| 8  | End of the 2nd opera-<br>tion by all PEs              | Х      | 010        | 0                | 0               | 0               | 0 x 1 0 1 0               | 0 1 1 |
| 9  | Execution of the 3rd operation (stable state)         | Х      | 011        | 0                | 0               | 1               | 0 x 0 0 1 0               | 0 1 1 |
| 10 | End of the 3rd opera-<br>tion by any PE               | Х      | 011        | 0                | 0               | 1               | 0 x 0 1 1 0               | 0 0 1 |
| 11 | End of the 3rd opera-<br>tion by all PEs <sup>2</sup> | Х      | 001        | 0                | 0               | 0               | 1 0 0 x 1 0               | 101   |

Tab. 1. Synchronization according to Taub [2,4].

One example for the corresponding signal waveforms is shown in Fig. 2. Here three participating PEs are assumed with numbers A, B and N.

**3.** Alternative Synchronization Procedure. To reduce the global waiting time in the DS we introduce an alternative synchronization procedure. Unlike Taub's procedure it is based on the principle that <u>any</u> processor can start a new operation, and <u>any</u> processor can terminate it. The latter is possible if a processor recognizes either that all output signals are settled or that all further changes of the signals in the DS cannot longer influence the final result of the operation<sup>3</sup>.

<sup>&</sup>lt;sup>2</sup> If **run=1** the PEs will execute the sequence of the next three operations starting from step 2a. If, however, **run=0** PEs will be waiting for initiation of the next sequence of operations that corresponds to step 1.

<sup>&</sup>lt;sup>3</sup> A trivial example is the case in which a processor with the highest possible (unique) bus access priority arbitrates for bus mastership. It can stop the arbitration process immediately since it will become bus master



Fig. 2. Signal timing example corresponding to Taub's synchronization procedure.



definitively.

#### Fig. 3. Control circuit for Taub's synchronization procedure.

The corresponding alternative synchronization procedure is also based on a synchronizer that uses three open-collector bus lines P, Q, R. Suppose as above that the initial signals  $p_j$ ,  $q_j$ ,  $r_j$ , generated by all PEs, have the values of 1, 0, 0 respectively, and therefore also P, Q, R. Then the sequence of events of the synchronization procedure is given by Tab. 2.

| #   | Protocol steps                                    | Current stage of signals |        |     |       |                                                         |                                                                                           | Next stage |
|-----|---------------------------------------------------|--------------------------|--------|-----|-------|---------------------------------------------------------|-------------------------------------------------------------------------------------------|------------|
|     |                                                   | sign<br>PE               | als fi |     | lines | PĔ                                                      | trigger outputs                                                                           | outputs    |
|     |                                                   | run                      | trm    | ack | PQR   | Op <sub>1</sub> Op <sub>2</sub> stop                    | s <sub>p</sub> r <sub>p</sub> s <sub>q</sub> r <sub>q</sub> s <sub>r</sub> r <sub>r</sub> | pqr        |
| 1   | Initialization of the system                      |                          | Х      | Х   | 100   | 0 0 0                                                   | 1 0 0 x 0 0                                                                               | 1 0 0      |
| 2   | System is initialized (stable state)              | 0                        | X      | X   | 100   | 0 0 0                                                   | 0 0 0 x 0 0                                                                               | 100        |
| 3a  | Start of the 1st operation                        | 1                        | X      | X   | 100   | $\begin{array}{c c} 0 & 0 & 0 \\ 0 & 0 & 0 \end{array}$ | 0 0 0 x 1 0<br>0 1 0 x x 0                                                                | -          |
| 21. | by any PE                                         | X                        | X      | X   | 101   |                                                         |                                                                                           |            |
|     | Start of the 1st operation by other PEs           | 0                        | х      | х   | 101   | 0 0 0                                                   | 0 1 0 11 10 0                                                                             |            |
|     | Execution of the 1st opera-<br>tion               | Х                        | Х      | Х   | 001   | 1 0 0                                                   | 0 x 0 0 x 0                                                                               |            |
|     | The 1st operation is fin-<br>ished by any PE      | Х                        | 1      | 0   | 001   | 1 0 0                                                   | 0 x 1 0 x 0                                                                               | 0 1 1      |
| 5b  | The first operation is fin-<br>ished by other PEs | Х                        | Х      | 0   | 011   | 0 0 1                                                   | 0 x 1 0 0 0                                                                               | 0 1 1      |
| 6   | Confirmation of the end of the lst operation      | Х                        | 0      | 1   | 011   | 0 0 1                                                   | 0 x x 0 0 1                                                                               | 0 1 0      |
|     | Execution of the 2nd operation (stable state)     | Х                        | Х      | Х   | 010   | 0 1 0                                                   | 0 0 x 0 0 x                                                                               | 0 1 0      |
|     | The 2nd operation is fin-<br>ished by any PE      | Х                        | 1      | 0   | 010   | 0 1 0                                                   | 1 0 x 0 0 x                                                                               | 1 1 0      |
| 8b  | The first operation is fin-<br>ished by other PEs | Х                        | X      | 0   | 110   | 0 0 1                                                   | 1 0 0 0 0 x                                                                               | 1 1 0      |
| 9   | Confirmation of the end of the 1st operation      | Х                        | 0      | 1   | 110   | 0 0 1                                                   | x 0 0 1 0 x                                                                               | 100        |
| 10  | End of a pair of operations                       | Х                        | Х      | Х   | 100   | 0 0 0                                                   | 0 0 0 x 0 0                                                                               | 1 0 0      |

Tab. 2. Synchronization protocol for the alternative procedure.

A control circuit of a SU for the alternative synchronization is presented in Fig. 4. The example of signal timing (Fig. 5) clarifies the interaction between the processors participating in the described synchronization procedure.

It is possible to combine Taub's synchronization procedure with the one suggested in this paper. This is important for applications where both synchronization principles must be used to execute one process, i.e., where some operations are finished when individual processors terminate, and others only when all participating processors are ready with the current operation. In this case a control circuit with four common lines can be used (Fig. 6).

As mentioned above, it is possible sometimes to take a decision about the result before the end of an operation, i.e., before the resulting signals appear on the output lines  $L_i$  of the DS.

This can happen, e.g., in an arbitration operation as discussed in the following section.



Fig. 4. Control circuit of a synchronization unit for the alternative synchronization procedure.

4. Synchronization of the arbitration operation. Taub's synchronization procedure was especially designed for the arbitration operation [4,5]. The objective of arbitration is to find a processor having the highest arbitration priority, i.e., the highest arbitration word, among all processors participating in the operation. The arbitration operation is executed by a distributed arbitration circuit that consists of PEs connected by open-collector arbitration lines  $L_1,...,L_N$  (Fig. 1). After begin of the operation all PEs assert the bits of their unique arbitration words  $A_j = \langle a_{j,1},...,a_{j,N} \rangle$  onto the corresponding arbitration lines. Until the end of the operation the PEs simultaneously compare their own bit values  $a_{j,i}$  with the current signals on the arbitration lines. If a line  $L_i$ ,  $i \in \mathbb{Z}$ , is set to a logic 1 and the corresponding bit  $a_{j,i}$  of the word is 0 then the PE j withdraws all less significant bits  $a_{j,i+1},...,a_{j,N}$  from the lines  $L_{i+1},...,L_N$ . If this condition is not longer true the PE asserts the withdrawn bits again.

Every PE consists of combinatorial logic which compares all bits of its arbitration word with the signals on the corresponding arbitration lines. Let T(j) be the gate delay time to process one bit for PE<sub>j</sub>. The computation of the wired-ORs on the bus lines requires an additional bus-propagation delay  $\tau$ . This delay is identical for all arbitration lines [3]. Obviously, as the arbitration scheme considered is a combinatorial circuit, each PE starts to compare every bit automatically when the signals on the lines belonging to more significant bits are changed [5-7]. Thus the operation can be considered as a sequential procedure.



Fig. 5. Signal timing example for the alternative synchronization procedure. The duration of the operation is  $t(Z,W) = \sum_{j \in Z} C_j(Z,W)T(j) + \sum_{i=1,...,N} \tau_i$ , where

 $C_j(Z,W) = \sum_{i=1,...,N} d_{i,j}$ , is the number of the individual propagation delays T(j) introduced in the operation W by the j-th PE,

 $T_L \leq T(j) \leq T_H$ , where  $T_L$  and  $T_H$  are upper and lower bounds identical for the whole DS,  $d_{i,j} = 1$  if the j-th PE takes longest to complete the i-th distributed part of the operation W,  $d_{i,j} = 0$  else, so that  $\sum_{j \in Z} d_{i,j} = 1$ .

On one hand it is very difficult to find exactly the minimal upper bound for the duration t(Z,W). The reason is the lack of information on the current set Z of PEs participating in the operation, and on their individual speed characteristics, i.e., T(j). On the other hand the PEs can find the minimal upper bound of the duration of the operation if they cooperate according to the following algorithm.

After the PEs have recognized the begin of the operation, they assert their 1 signals  $p_j$  onto the wired-OR line P. During execution each PE j waits for a specified individual period of time t(j,W) before it switches  $p_j$  from 1 to 0. After the slowest PE has switched its  $p_j$  to 0, line P changes to the 0 state and all PEs recognize the end of the operation. In modular systems

where different PEs can be combined arbitrarily, t(j,W) must be independent of Z and of the technical characteristics of other PEs. Therefore – according to Taub's synchronization procedure – has the global waiting time for the end of the operation to be chosen as  $Max\{t(j,W), j \in Z\}$ .



Fig. 6. Control circuit of a SU for combined synchronization.

After this time all output signals of the participating PEs are settled. Obviously the operation is executed correctly if the global waiting time  $Max\{t(j,W), j \in Z\}$  is not shorter than the duration of the operation. This means that the condition

$$Max \{t(j,W), j \in \mathbb{Z}\} \ge \sum_{j \in \mathbb{Z}} C_j(\mathbb{Z},W)T(j) + \sum_{i=1,\dots,N} \tau_i, \quad (1)$$

must be true for all possible T(j),  $T_L \leq T(j) \leq T_H$ . The less this value the higher is the performance of the DS. The alternative synchronization procedure allows to reduce this global waiting time.

Relation (1) guarantees the correct result of the operation. However the PEs generally know neither which PEs participate in the current operation nor their the individual propagation delays. This means that the values t(j,W) cannot be found from (1) directly. Therefore –

according to [7,8] – every PE **j** has to take into account relations (1) for every possible set of participants and to assume the worst case, i.e., the maximal individual waiting time

$$\mathbf{t}(\mathbf{j},\mathbf{W}) = \mathbf{NT}(\mathbf{j}) + \left(\left\lfloor \frac{\mathbf{N}}{2} \right\rfloor + 2\right)\boldsymbol{\tau}.$$
 (2)

In this case the global waiting time  $Max\{t(j,W), j \in Z\}$  is usually much higher than the actual duration of the operation. This can degrade the speed of the system considerably. The probability for this to happen is the higher the more PEs participate in the operation.

Let us consider for example an arbitration operation with N=4. If we use Taub's synchronization procedure then follows from (2) that the global waiting time t(j,W) is equal to  $Max\{4T(j)+4\tau, j \in Z\}^4$ . However the PE having the maximal arbitration word <1111> could correctly decide to get next bus mastership immediately after begin of the operation. Also, the PE having the arbitration word <0111> must wait only the time  $T_H+\tau$  and can then terminate the operation. For such cases the alternative synchronization procedure has considerable speed advantages.

If a PE recognizes at the time  $t_a(j,W)$  that every bit  $a_{j,i}$  of its arbitration word  $A_j$  is greater or equal than the value of the signal on the corresponding bus line  $L_i$ , it concludes that it has the highest priority among the arbitrating PEs. This happens usually before the arbitration lines are settled. The minimal waiting time is obtained if the PE acts as the next bus master from this time on. The second column of Tab. 5 gives for every PE the time  $t_a(j,W)$  after which it knows the arbitration result independently of the set Z of participating processors and their speed.

However the other PEs still have to wait for the result of the arbitration operation, i.e., they have to wait until the arbitration lines have settled. For that, the new bus master can terminate the operation using the alternative synchronization procedure. For this case the third column of Tab. 5 gives the individual waiting time  $t_b(j,W)$  for all PEs which, again, is independent of the set Z of participating processors and their speed.

If the alternative synchronization procedure is used, i.e., if every processor can terminate the operation, the average global waiting can be reduced by 50% as shown in Tab. 5. This is the case either if all processors are of comparable speed, i.e., all T(j) have similar values, or if the number of participating processors is high.

**5.** Conclusions. The suggested synchronization procedure allows to use different hardware synchronization procedures. Using this procedure, it is also possible to identify the end of the operation dynamically and to reach a minimal upper bound for the duration of asynchronous operations. This can considerably reduce the processing time in distributed processing systems. One advantage of the suggested synchronization is that it does not require modifications of standard hardware, i.e., backplanes, connectors, etc. The only requirement is to

<sup>&</sup>lt;sup>4</sup> Using a modification of Taub's synchronization procedure [7], the global waiting time can be reduced to  $T_{H}+\tau$  if only one processor participates in the arbitration.

extend slightly the logic of the distributed synchronization circuits.

|                          | T 1 · · 1 1 · · ·    | <b>T</b>               | T 1 1 1 1                        |  |
|--------------------------|----------------------|------------------------|----------------------------------|--|
|                          | Individual waiting   | Termination time       | Individual waiting time          |  |
|                          | time for alternative | for alternative syn-   | for Taub's synchro-              |  |
| Arbitration number       | synchronization      | chronization           | nization                         |  |
|                          | $t_a(j,W)$           | $t_b(j,W)$             | <b>t</b> ( <b>j</b> , <b>W</b> ) |  |
| $A_{15} = 1 1 1 1$       | 0                    | $T(15)+\tau$           | 4T(j)+4τ                         |  |
| $A_{14} = 1 1 1 0$       | $T(14)+T_H+2\tau$    | $T(14)+T_H+2\tau$      | 4T(j)+4τ                         |  |
| $A_{13} = 1 1 0 1$       | $T(13)+T_H+2\tau$    | $2T(13) + T_H + 3\tau$ | 4T(j)+4τ                         |  |
| $A_{12} = 1 \ 1 \ 0 \ 0$ | $T(12)+2T_H+2\tau$   | $T(12)+2T_H+3\tau$     | $4T(j)+4\tau$                    |  |
| $A_{11} = 1011$          | $T(11)+T_H+2\tau$    | $2T(11) + T_H + 3\tau$ | 4T(j)+4τ                         |  |
| $A_{10} = 1 \ 0 \ 1 \ 0$ | $T(10)+2T_H+3\tau$   | $2T(10)+2T_{H}+4\tau$  | 4T(j)+4τ                         |  |
| $A_9 = 1001$             | $T(9)+T_H+2\tau$     | $2T(9)+T_H+3\tau$      | 4T(j)+4τ                         |  |
| $A_8 = 1 0 0 0$          | $T(8)+2T_H+3\tau$    | T(8)+2T <sub>H</sub>   | 4T(j)+4τ                         |  |
| $A_7 = 0 1 1 1$          | $T_{H}+\tau$         | $T_{H}+\tau$           | $4T(j)+4\tau$                    |  |
| $A_6 = 0 1 1 0$          | $T(6)+T_H+2\tau$     | $T(6)+T_H+2\tau$       | $4T(j)+4\tau$                    |  |
| $A_5 = 0 1 0 1$          | $T(5)+T_H+2\tau$     | $2T(5)+T_H+3\tau$      | 4T(j)+4τ                         |  |
| $A_4 = 0 1 0 0$          | $T(4)+2T_H+3\tau$    | $T(4)+2T_H+3\tau$      | 4T(j)+4τ                         |  |
| $A_3 = 0 \ 0 \ 1 \ 1$    | $T_{H}+\tau$         | $T_{H}+\tau$           | $4T(j)+4\tau$                    |  |
| $A_2 = 0 \ 0 \ 1 \ 0$    | $T(2)+T_H+2\tau$     | $T(2)+T_H+2\tau$       | $4T(j)+4\tau$                    |  |
| $A_1 = 0 0 0 1$          | $T_{H}+\tau$         | $T_{H}+\tau$           | $4T(j)+4\tau$                    |  |
| $A_0 = 0 0 0 0$          | $T_{H}+\tau$         | $T_{H}+\tau$           | 4T(j)+4τ                         |  |

Tab. 5. Duration of the arbitration operation with 4-bit arbitration words.

### References

- M. Makhaniok, V. Cherniavsky, R. Männer, O. Stucky, Massively Parallel Realization of Logical Operations in Distributed Parallel Systems; in: H. Burkhart (Ed.): Proc. CONPAR '90 - VAPP IV, Lect. Notes in Comp. Sci. 457, Springer, Heidelberg (1990), 796-805.
- D.M. Taub, Hardware method of synchronizing processes without using a clock, Electron. Lett., No. 19, Vol. 19, 1983, 772-773.
- D.M. Taub, Improved Control Acquisition Time for the IEEE 896 Futurebus, IEEE Micro, 7 (3), 1987, 52-62.
- 4. D.M. Taub, Clockless synchronization of distributed concurrent processes, IEE Proc. E, Vol. **139**, No. 1, 1992, 88-92.
- 5. ANSI/IEEE Std. 896.1-1987, IEEE Standard Backplane Bus Specification for Multiprocessing Architectures: Futurebus.
- 6. M. Makhaniok, R. Männer, Synchronization of Parallel Processes in Distributed Systems;

in: L. Bougé, M. Cosnard, Y. Robert, D. Trystram (Eds.): Proc. CONPAR '92 - VAPP V, Lyons, France; Lect. Notes in Comp. Sci. **634**, Springer, Berlin (1992) 157-162

- Taub M.D., Corrected Settling Time of the Distributed Parallel Arbiter, IEE Proc. E, Vol. 139, No. 4, 1992, 348-354.
- 8. Kipnis S., Analysis of Asynchronous Binary Arbitration on Digital Transmission-Line Busses, IEEE Trans. Comput., Vol. **43**, No. 4, 1994, 484-489.