Choosing the System Moduli of RNS Arithmetic Processors

Khaled M. Elleithy

Computer Engineering Department
King Fahd University of Petroleum and Minerals
Dhahran 31261, Saudi Arabia

Abstract

Designing an optimal Residue Number System (RNS) processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.

1: Introduction

Residue Number System (RNS) has received increased attention due to its ability to support high-speed concurrent arithmetic [1-5]. Although, Digital Signal Processing (DSP) applications utilize the efficiencies of RNS arithmetic in addition and multiplication, they do not require the difficult RNS operations such as division and magnitude comparison. RNS has been employed efficiently in the implementation of DSP processors.

Since special purpose processors are associated with general purpose computers, binary-to-residue and residue-to-binary conversions become inherently important and the conversion process should not offset the speed gain in RNS operations. While the binary-to-residue conversion does not pose a serious threat to the high speed RNS operations, the residue-to-binary conversion can be a bottleneck. Chinese Remainder Theorem (CRT) [6] is considered the main algorithm for the conversion process. Several implementations of the residue decoder have been reported [2,5,7-10].

Designing an optimal RNS processor in terms of area and speed depends on the choice of the system moduli. Most of the reported implementations in the literature for choosing the system moduli are based on a special set of moduli. The residue decoders in [7,8] are based on using three moduli in the form \( (2^n - 1, 2^n, 2^n + 1) \), where \( n \) is the number of bits. Due to the limitation imposed on the number of moduli and the choice of them, it is limited in application. In [9], the residue decoder is based on the base extension technique, it uses only modular look-up tables in its implementation. Since look-up tables are used, the choice of moduli must not be large for the implementation to be feasible. In addition, it does not support residue to 2's complement binary number system conversion. The implementation in [10] requires that one of the moduli must be a power of two; therefore, it may be limited in application. Due to the constraints imposed on the chosen moduli set, such approaches have limited applications and the final design is not optimal in area and time.

In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.

2: Residue Number System

In RNS, an integer, \( X \), can be represented by an \( N \)-tuple of residue digits,

\[
X = (r_1, r_2, \ldots, r_N)
\]

where \( r_i = \lfloor X/m_i \rfloor \) with respect to a set of \( N \) moduli set \( (r_1, r_2, \ldots, r_N) \). In order to have a unique residue representation, the moduli must be pairwise relatively prime, that is,

\[
GCD(r_i, r_j) \neq 1 \quad \forall i \neq j
\]

then it is shown that there is a unique representation for each number in the range of:

\[
0 \leq X < \prod_{i=1}^{n} m_i = M
\]

The arithmetic operation on two integers \( A \) and \( B \) is equivalent to the arithmetic operation on the residue representation, that is,
where "•" can be addition, subtraction, or multiplication. Therefore, it is desired to convert binary arithmetic on large integers to residue arithmetic on smaller residue digits in which the operations can be executed in parallel, and there is no carry chain between residue digits.

The conversion from RNS to weighted binary number system is done using the Chinese Remainder Theorem (CRT), which states that:

\[ |A \cdot B|_m = \left( |A|_{m_1} \cdot |B|_{m_1}, \ldots, |A|_{m_n} \cdot |B|_{m_n} \right) \]

System is done using the Chinese Remainder Theorem (CRT), which states that:

\[ |X|_M = \sum_{j=1}^n \left( \tilde{m}_j \right) \left( \frac{r_i}{m_j} \right) \]

where

\[ M = \prod_{j=1}^n m_j, \quad \tilde{m}_j = \frac{M}{m_j}, \quad \text{GCD}(m_j, m_k) = 1 \quad \forall j \neq k \]

3: ROM Requirements

A typical architecture for RNS-based processor is shown in Figure 1. The architecture consists of four stages. In the first stage a variable \( X \) is converted to its RNS representation. In the second stage an arithmetic operation is performed. The third and fourth stages perform the conversion from RNS representation to weighted number system representation. The third stage is responsible for the computation of \( t \)'s according to the following definition:

\[ t_i = \frac{M}{m_i} \left| \frac{r_i}{\tilde{m}_i} \right| \]

The fourth stage is a modulo adder that adds the \( t \)'s inputs and gives the final weighted value.

The ROM required for the RNS based system can be divided into the following:

1- ROM required to implement processor \( i \mod m_i \), \( 1 \leq i \leq n \). The processor implementation is shown in Figure 2. The ROM required for each processor is given by:

\[ 2^{\lceil \log m_i \rceil} \cdot 2^{\lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil \]

The total memory requirements for \( i \) processors is given by:

\[ \sum_{i=1}^n 2^{\lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil \]

2- ROM required to implement \( t \)'s. The ROM size is given by:

\[ \sum_{i=1}^n \left( 2^{\lceil \log m_i \rceil} \cdot t_i \right), \quad t_i = \lceil \frac{M}{m_i} \left| \frac{r_i}{\tilde{m}_i} \right| \rceil \]

3- If the summation unit is implemented using ROM, as shown in Figure 3, the memory required is:

\[ \{ 2^{\lceil \log m_1 \rceil} \cdot 2^{\lceil \log m_2 \rceil} \cdot \ldots \cdot 2^{\lceil \log m_n \rceil} \} \cdot l = \sum_{i=1}^n \lceil \log m_i \rceil \]

The ROM implementation of the summation unit is impossible since it needs huge memory even for a small size system.

4- The implementation of processors as well as \( t \)'s calculation can be implemented using ROMs as given in equations 1 and 2. The total size of ROM is:

\[ \sum_{i=1}^n 2^{\lceil \log m_i \rceil} \cdot \lceil \log m_i \rceil + \sum_{i=1}^n \left( 2^{\lceil \log m_i \rceil} \cdot t_i \right) \]

4: Implementation of the Chinese Remainder Theorem

Chinese Remainder Theorem (CRT) is considered the main algorithm for the conversion process. Several implementations of CRT have been reported[2,5,7-10]. In [10] Vu introduced an algorithm for computing the CRT based on manipulating the form the summands stored in ROMs. The main idea in the algorithm is representing the summands in a different form doesn't require any special real-time processing as they are stored in ROMs. The summands are stored in a form that can be efficiently used in evaluation of the final summation modulo \( M \). The following derivation is used:

\[ X = \sum_{i=1}^n \frac{S_i}{m_i} \]

\[ = \sum_{i=1}^n \left( q_i \frac{M_i}{m_i} + r_i \right) \]

\[ = \left( \frac{M_i}{m_i} \sum_{i=1}^n (q_i) \right) + \left( \frac{M_i}{m_i} \sum_{i=1}^n (r_i) \right) \]

\[ = \frac{M_i}{m_i} \sum_{i=1}^n \left( q_i \right) + \frac{M_i}{m_i} \sum_{i=1}^n \left( r_i \right) \]

211
Computing the first part of the sum is easy because modules \( m_j \) is small and can be implemented using ROMs. However, the second part is obtained most easily when \( m_j \) equal \( 2^k \) or \( 2^k - 1 \). Since it is not unusual to have a power of 2 included in most practical systems of moduli, \( m_j = 2^k \) is assumed. The following equation is valid:

\[
0 \leq \sum_{i=1}^{n} t_i \cdot \frac{nM}{m_j} \leq M, \text{ if } n \leq (m_j = 2^k)
\]

### 4.1 Implementation details

One of the moduli must be equal to \( 2^k \). A hardware implementation is given in Figure 4. There are four levels in implementing the modulo M adder as follows:

1. **At level 1** a Mod \( 2^k \) adder is used. The speed depends on the speed of the multi-operand adder. An optimal adder of \( \Theta(\log n) \) time complexity[3] can be used in this stage.

2. **Level 2** is a small size ROM.

3. **Levels 3 and 4** are 2-operand adders.

The size of buses and ROMs used are as follows:

1. \( q_i \) requires \( \log m_j \) bits (0 \( \leq q_i \leq m_j \))
2. \( r_i \) requires \( \lceil \log \frac{M}{m_j} \rceil \) bits (0 \( \leq r_i \leq \frac{M}{m_j} \))
3. \( q \) requires \( \lceil \log m_j \rceil \) bits (\( q = \sum_{i=1}^{n} q_i \) )
4. \( r \) requires \( \lceil \log M \rceil \) bits (\( r = \sum_{i=1}^{n} r_i \) )
5. \( Y \) requires \( \lceil \log M \rceil \) bits (\( Y = q \left( \frac{M}{2^k} + 2^k - M \right) \))
6. \( \tilde{Z} \) requires \( \lceil \log M \rceil \) bits

### 4.2 Choosing the System moduli

#### 4.2.1 Choosing the system moduli to minimize the ROM size: The following two cases are considered:

1. **ROM is used to obtain \( t_i \)'s** and implement the processors, then the problem is

\[
\text{minimize } \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil \cdot \lceil \log m_i \rceil} + \sum_{i=1}^{n} (2^{\lceil \log m_i \rceil \cdot l_i}) \quad \text{such that } m_1 \cdot m_2 \cdot \ldots \cdot m_n \geq 2^l - 1 \text{ and } m_1, m_2, \ldots, m_n \text{ are relatively prime}
\]

2. **ROM is used to obtain \( t_i \)'s and implement the processors, then the problem is

\[
\text{minimize } \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil \cdot \lceil \log m_i \rceil} + \sum_{i=1}^{n} (2^{\lceil \log m_i \rceil \cdot l_i}) \quad \text{such that } m_1 \cdot m_2 \cdot \ldots \cdot m_n \geq 2^l - 1 \text{ and } m_1, m_2, \ldots, m_n \text{ are relatively prime}
\]

#### 4.2.2 Special Case of \( M = 2^l \) : In this section a special case is considered. The required condition for \( M \) is that \( M \geq 2^l - 1 \). If \( M = 2^l \) the difficult problem of designing a modulo \( M \) will be reduced to the design of multi-operand adder. The integer programming problem is represented as:

\[
\text{minimize } \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil \cdot \lceil \log m_i \rceil} + \sum_{i=1}^{n} (2^{\lceil \log m_i \rceil \cdot l_i}) \quad \text{such that } m_1 \cdot m_2 \cdot \ldots \cdot m_n = 2^l - 1 \text{ and } m_1, m_2, \ldots, m_n \text{ are relatively prime}
\]

Solving this special case is as difficult as the general case. There may not be a feasible solution. If there is no feasible solution, the general problem is solved instead of the special case problem. If there is a feasible solution for the special case, it is likely that the memory requirements may be larger than the general case. In this case we have to compare between larger memory and high speed in realizing the summation network, on the other side, less memory and complicated summation network.

#### 4.2.3 Choosing one of the moduli equal to \( 2^k \) : The implementation discussed in section 4.1 requires one of the system moduli to be equal to \( 2^k \). In this case the problem is defined as follows:

\[
\text{minimize } \sum_{i=1}^{n} 2^{\lceil \log m_i \rceil \cdot \lceil \log m_i \rceil} + \sum_{i=1}^{n} (2^{\lceil \log m_i \rceil \cdot l_i}) \quad \text{such that } m_1 \cdot m_2 \cdot \ldots \cdot m_n \geq 2^l - 1 \text{ and } m_1, m_2, \ldots, m_n \text{ are relatively prime}
\]

\( m_j = 2^k \) for any \( j \) and \( k \)
As in the previous case, there may not be a feasible solution. If there is no feasible solution the general problem is solved instead of the special case problem. If there is a feasible solution for this special case, it is likely that the memory requirements may be larger than the general case. In this case we have to compare between larger memory and high speed in realizing the summation network, on the other side, less memory and complicated summation network.

4.3 New Algorithm for choosing the system moduli

From our previous analysis in section 4.2 we see that the problem of choosing the system moduli is presented as an integer programming problem with the following objective function:

$$\sum_{i=1}^{n} 2^{\left\lceil \log m_i \right\rceil} + \sum_{i=1}^{n} (2^{\left\lceil \log m_i \right\rceil} - 1)$$

4.3.1 Algorithm

Step #1: Solve objective function such that:

$$m_1 m_2 \ldots m_n \leq 2^l - 1$$

$$m_1, m_2, \ldots, m_n \text{ are relatively prime}$$

$$m_j = 2^k \text{ for any } j \text{ and } k$$

Step #2: Solve objective function such that:

$$m_1 m_2 \ldots m_n \leq 2^l - 1$$

$$m_1, m_2, \ldots, m_n \text{ are relatively prime}$$

$$m_j = 2^k \text{ for any } j \text{ and } k$$

Step #3: Solve objective function such that:

$$m_1 m_2 \ldots m_n \leq 2^l - 1$$

$$m_1, m_2, \ldots, m_n \text{ are relatively prime}$$

Step #4: Choosing solution

(a) Speed

Sol 1 (if exists) is the fastest
Sol 2 (if exists) is slower than Sol 1
Sol 3 is the slowest

(b) Memory requirements

Sol 3 needs the least memory
Sol 2 needs memory equal or greater than Sol 3
Sol 1 needs memory similar to Sol 3

(c) Implementation

Sol 1: using n-operands adder.
Sol 2: is implementation-[Figure 4].
Sol 3: is Elleithy’s implementation[2].

4.3.2 Results: The problem for choosing the system moduli has been solved using the integer programming approach with extra constraint of having relatively prime moduli set. Table 1 shows the results starting from 7-bit output to 16 bits. The approach is general and can be used to obtain results for larger bits. For each bit size a number of solutions are obtained. In Table 1 only the solutions with the minimum ROM size are shown.

5: Conclusions

Conversion from Binary Representation into RNS Representation is a time consuming process. In this paper the (CRT) is used as the main algorithm for the conversion process. An algorithm to design optimal RNS processor in terms of area and speed depends on the choice of the system moduli. In this paper an optimal algorithm for choosing the system moduli is presented. The algorithm takes into consideration several constraints imposed by the problem definition. The problem is formalized as an integer programming problem to optimize an area/time objective function.

Acknowledgments

The author would like to acknowledge King Fahd University of Petroleum and Minerals for support provided for this work.

References


![Figure 1. RNS-Based Architecture.](image1.png)

![Figure 2. Processor \(i \mod m_i\) Implementation.](image2.png)

![Figure 3. Implementation of the CRT Decoder.](image3.png)

![Figure 4. Implementation of the CRT decoder with one of the moduli equal 2^k](image4.png)

Table 1: Optimal ROM size for RNS processors where the bus size varies between 7 and 16 bits.

<table>
<thead>
<tr>
<th></th>
<th>7</th>
<th>8</th>
<th>9</th>
<th>10</th>
<th>11</th>
<th>12</th>
<th>13</th>
<th>14</th>
<th>15</th>
<th>16</th>
</tr>
</thead>
<tbody>
<tr>
<td>Sol1</td>
<td>480</td>
<td>512</td>
<td>1024</td>
<td>1088</td>
<td>1152</td>
<td></td>
<td>2300</td>
<td>5885</td>
<td>6600</td>
<td></td>
</tr>
<tr>
<td>Sol2</td>
<td>342</td>
<td>810</td>
<td>878</td>
<td>1468</td>
<td>2277</td>
<td>2409</td>
<td>2541</td>
<td>2673</td>
<td>6488</td>
<td>7648</td>
</tr>
<tr>
<td>Sol3</td>
<td>448</td>
<td>480</td>
<td>512</td>
<td>1024</td>
<td>1088</td>
<td>1152</td>
<td>2541</td>
<td>2300</td>
<td>5885</td>
<td>6600</td>
</tr>
</tbody>
</table>