A FAST TAGGED SORTER USED IN 100/10 MBPS ROUTERS

Chua-Chin Wang1, Hsin-Long Wu, & Shao-Ku Huang†

Department of Electrical Engineering
National Sun Yat-Sen University
Kaohsiung, Taiwan

ABSTRACT
This paper presents a very fast circuit for a tagged sorter which has been integrated in 100M/10M bps routers to resolve the re-ordering problem of incoming packets. The utilization of both edges of the clock is presented, which can drastically improve the clock speed and the throughput.

1. INTRODUCTION
The performance of the applications with heavy I/O load such as routers might not be as fast as the their internal processing elements (PEs). One of the critical paths is to locate where the packet with maximal (or minimal) priority is [3]. It is often deemed as a sorting problem. In addition, the detection of whether the queue is full or empty is not efficiently resolved. A simple modification is presented in this paper, which not only improves the design up to 125 MHz, but also resolves the problem of detecting the queue’s status without hardware penalty.

2. THEORY OF THE TAGGED SORTER
2.1. Operations of the tagged sorter
Many prior sorting algorithms were proposed to reduce the typical $O(n)$ complexity sorting problem to $O(\log n)$ [1]. When it comes to the hardware realization, many of them become either impractical or costly [3]. A tagged up/down sorter was proposed by Moore et al. [2]. The packets are assumed to be in a (key, data) pair format. Notably, all of the sorting elements contain infinity (or a sufficiently large number) initially. The functions of a tagged sorter unit are as follows.

Insertion: On the arrival of an input value on $A_n$, a copy of $C_n$ is delivered to $B_n$, and $D_n$ is transferred to $A_{n+1}$. Then, a comparison takes place between $A_n$ and $B_n$: $\min(A_n, B_n) \rightarrow C_n$, and $\max(A_n, B_n) \rightarrow D_n$.

Extraction: $C_n$ is either removed or transferred to $B_{n-1}$, $D_n$ is copied to $A_n$, and $C_{n+1}$ is transferred to $B_n$. Thus, a comparison also takes place between $A_n$ and $B_n$: $\min(A_n, B_n) \rightarrow C_n$, and $\max(A_n, B_n) \rightarrow D_n$.

Ensured FIFO ordering: A simple method is to use a tag bit associated with each packet. When one packet arrives at the input port of the sorter, its tag bit is reset. The tag bit of a packet will be set once if the packet is moved to the right column, i.e., the output queue. However, if the packet is swapped to the left, the packet can be forced to be swapped back and keep its tag bit as set. If it is swapped to the left, then on the next cycle, regardless insertion or extraction, it will be compared with the right packet which was previously physically below it. Hence, a swap is required to maintain the order on the right column of the sorter.

2.2. Fast sorter implementation
Although Moore proposed one design of the sorter unit, his design not only has problems when given 100 Mbps inputs, but also the detection of critical boundary conditions was ignored, i.e., queue full and queue empty. Our proposed implementation is shown in Fig. 1, where only one crossbar switch is needed. However, the most important feature is that both edges of the clock are utilized to enhance the operation.

Positive edge: process the insertion and extraction

Insertion: $A_n = L_{n-1}$; $B_n = R_n$ (1)
Extraction: $A_n = L_n$; $B_n = R_{n+1}$ (2)

Negative edge: process the comparison and swap

if ($(A_n.key < B_n.key) \cup A_n.tag = 1$)
$L_n, R_n = (B_n, A_n)$, else $(L_n, R_n) = (A_n, B_n)$ (3)

Boundary conditions: Referring to Fig. 2, the queue empty condition can be monitored by $R_{O.tag}$, while the queue full situation can be judged by NANDing the bits in $L_{O.key}$.

3. PERFORMANCE SIMULATIONS
The result of the proposed implementation is shown in Table 1. The clock cycle time is drastically short-
ened to 8 ns, which indicates that the sorter can correctly operates under a 125 MHz clock. Fig. 3 demonstrates the waveforms of our sorter given a 125 MHz clock. Fig. 4 is the physical layout of our design without pads.

4. CONCLUSION

The exploitation of both the edges of the clock leads to a more efficient implementation. Though there is a tradeoff regarding the chip area, the speed of the sorter possesses a higher priority owing to that most of the sorters are employed in heavy I/O trafficking applications, e.g., routers and ATM switches.

5. REFERENCES


<table>
<thead>
<tr>
<th>our implementation</th>
<th></th>
<th></th>
</tr>
</thead>
<tbody>
<tr>
<td>Area of an element</td>
<td>967 gates</td>
<td>378,754 µm²</td>
</tr>
<tr>
<td>Area of the sorter</td>
<td>7736 gates</td>
<td>3,030,036 µm²</td>
</tr>
<tr>
<td>Minimal cycle time</td>
<td>8 ns</td>
<td></td>
</tr>
</tbody>
</table>

Table 1: The features of the our implementation.

Figure 2: Proposed sorter (n=8)

Figure 3: Results using the proposed design

Figure 1: Proposed sorter unit design

Figure 4: Chip layout (without pads)