In this paper, we propose several comparable reconfigurable hardware implementations of a fast radix-8 two-operand multiplier coprocessor, using the Karatsuba method and the Booth recoding method and employing carry save adders (CSA) and Kogge-Stone adders (KSA) in a Wallace tree organization. The proposed designs target the ALTERA Cyclone IV FPGA family, with chip device EP4CGX-22CF19C7, along with its simulation package. The designs were synthesized and benchmarked in terms of maximum operational frequency, total path delay, total design area, and total thermal power dissipation. The experimental results revealed that the best multiplication architecture was the Wallace Tree CSA based Radix-8 Booth multiplier (WCBM), which recorded a critical path delay of 14.103 ns, a maximum operational frequency of 90.83 MHz, a hardware design area of 14249 logic elements (LEs), and an estimated total thermal power dissipation of 217.56 mW. Consequently, the WCBM method can be efficiently employed to speed up computation in many multiplication-based applications, such as embedded system designs for public key cryptography.

Recently, rapid advances in information and communication technology (ICT), such as grid and fog computing, have increased the demand for sharing secret data over existing non-secure communication networks. This has encouraged researchers to propose different solutions that ensure safe access to and storage of private and sensitive data by employing different cryptographic algorithms, especially public key algorithms [

Indeed, a wide range of public key cryptographic systems have been developed and embedded in hardware modules because of their better performance and security. This has increased the demand for embedded and System-on-Chip (SoC) [

Computer arithmetic [

A multiplication algorithm [^{2}). [

In this paper, we report on several fast alternative designs for a radix-8 based multiplier unit, including: Radix-8 CSA Based Booth Multiplier; CSA Based Radix-8 Booth, Wallace Tree Karatsuba Multiplier; CSA Based Radix-8 Booth, KSA Based Karatsuba Multiplier; CSA Based Radix-8 Booth, With Comparator Karatsuba Multiplier; Sequential 64-Bit CSA Based Radix-8 Booth Multiplier; and 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM). The remainder of this paper is organized as follows: Section

Two-operand multiplication is a substantial arithmetic operation since it plays a major role in the design of many embedded and digital signal processors [

CSA [

Carry save Adder: (a) Top View Design (b) Internal Architecture

In this work, we have implemented the CSA adder in VHDL for bit sizes ranging from 8 bits through 64 bits [

Delay-Area analysis of CSA vs CLA implementations (8–64 bit)

The simulation results of both CSA and CLA are provided in Fig.

KSA is a fast two-operand parallel prefix adder (PPA) [

To verify the performance of all PPAs, we have implemented them on FPGA and the experimental results [

Kogge-Stone Adder: (a) Top View Design of KSA (b) KSA Stages (c) Group generation and propagation

P_{out} = P_{in1} · P_{in2} and G_{out} = G_{in1} || (P_{in1} · G_{in2}), where the generation group has only one logic equation, for carry generation: G_{out} = G_{in2} || (P_{in2} · G_{in1}).

S_{i} = P_{i} ⊕ G_{i−1}. The top view and the internal logic circuit are provided in Fig.
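As a minimal software sketch of the prefix computation described above, the following Python model applies the group propagate/generate equations stage by stage and then forms the sum bits as S_{i} = P_{i} ⊕ G_{i−1}. The function name `kogge_stone_add`, the `width` parameter, and the assumption of a zero carry-in are illustrative choices and are not taken from the VHDL implementation.

```python
def kogge_stone_add(a: int, b: int, width: int = 64):
    """Bit-level model of a Kogge-Stone adder (carry-in assumed 0).

    Returns (sum modulo 2**width, carry_out).
    """
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    p0 = a ^ b              # per-bit propagate P_i
    g = a & b               # per-bit generate G_i
    p = p0
    dist = 1
    while dist < width:
        # Combine every position with the group `dist` bits below it:
        #   G_out = G_hi | (P_hi & G_lo),  P_out = P_hi & P_lo
        g = (g | (p & (g << dist))) & mask
        p = (p & (p << dist)) & mask
        dist <<= 1
    # After the last stage, bit i of g is the carry out of bits 0..i.
    total = p0 ^ ((g << 1) & mask)      # S_i = P_i xor G_(i-1)
    carry_out = (g >> (width - 1)) & 1
    return total, carry_out


if __name__ == "__main__":
    print(kogge_stone_add(0xFF, 0x01, width=8))
```

The loop runs log2(width) times for power-of-two widths, which mirrors the logarithmic number of prefix stages that makes this adder fast at the cost of more logic.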

Addition is rarely used to add only two operands; more often, it is involved in multiplication and inner product computations [

Dot notation of Multi-operand addition for multiplication and inner-product computation

In this work, we have adopted a CSA based Wallace tree since it provides a better organization of the operands and thereby reduces the total addition delay [

Multi-operand addition for 10 operands.
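To illustrate the multi-operand reduction, the sketch below models a CSA based Wallace tree in Python: each level groups the operands in threes, replaces every group with a sum vector and a carry vector, and passes leftovers through unchanged until only two vectors remain. The helper names `csa` and `wallace_reduce` are hypothetical, and the operands are assumed to be non-negative integers of unbounded width, which sidesteps the fixed-width effects of real hardware.

```python
def csa(x: int, y: int, z: int):
    """3:2 compressor: three operands in, (sum vector, carry vector) out."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_reduce(operands):
    """Reduce a list of operands to two vectors using levels of CSAs.

    Returns (sum_vector, carry_vector, number_of_csa_levels).
    """
    ops = list(operands)
    levels = 0
    while len(ops) > 2:
        nxt = []
        for i in range(0, len(ops) - 2, 3):
            nxt.extend(csa(ops[i], ops[i + 1], ops[i + 2]))
        nxt.extend(ops[len(ops) - len(ops) % 3:])  # operands left ungrouped
        ops = nxt
        levels += 1
    ops += [0] * (2 - len(ops))  # pad if fewer than two operands were given
    return ops[0], ops[1], levels


if __name__ == "__main__":
    s, c, levels = wallace_reduce(range(1, 11))  # ten operands: 1..10
    print(s + c, levels)
```

Only the final addition of the two remaining vectors needs a carry-propagating adder; every earlier level has a constant delay regardless of operand width.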

To enhance the performance of multiplication for large operands (e.g. 1024-bit size), the multiplication operands can be re-organized to exploit the maximum possible parallelism and thereby reduce the multiplication time. Karatsuba algorithm [

Aligning Partial Products.

A more efficient implementation of Karatsuba multiplication can be accomplished as:
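As a minimal sketch, assuming the standard three-multiplication form of Karatsuba's identity, the following Python function splits each operand into a high and a low half and recombines three sub-products instead of four. The function name `karatsuba_once` and the default 32-bit split are illustrative assumptions; a hardware design would map the three products to parallel multiplier blocks.

```python
def karatsuba_once(a: int, b: int, half_bits: int = 32) -> int:
    """One level of Karatsuba multiplication on unsigned operands.

    a = a_hi * 2**half_bits + a_lo,  b = b_hi * 2**half_bits + b_lo
    """
    mask = (1 << half_bits) - 1
    a_hi, a_lo = a >> half_bits, a & mask
    b_hi, b_lo = b >> half_bits, b & mask

    z2 = a_hi * b_hi                       # high product
    z0 = a_lo * b_lo                       # low product
    # Middle term from a single extra multiplication:
    #   a_hi*b_lo + a_lo*b_hi = (a_hi + a_lo)*(b_hi + b_lo) - z2 - z0
    z1 = (a_hi + a_lo) * (b_hi + b_lo) - z2 - z0

    return (z2 << (2 * half_bits)) + (z1 << half_bits) + z0


if __name__ == "__main__":
    x, y = 0xFEDCBA9876543210, 0x0123456789ABCDEF
    print(hex(karatsuba_once(x, y)))
```

The saving of one multiplication is paid for with extra additions and subtractions, which is why the choice of adder matters for this organization.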

The magnitude (or digital) comparator is a hardware device that takes two numbers in binary form as input and determines whether one number is greater than, less than, or equal to the other. As in binary addition, an efficient comparator can be implemented using G (generate) and P (propagate) signals. Basically, a comparator for two 2-bit numbers, A_{1}A_{0} and B_{1}B_{0}, can be realized by:

For A<B, “BBig, EQ” is “1,0”. For A=B, “BBig, EQ” is “0,1”. Hence, for A>B, “BBig, EQ” is “0,0”. Here, BBig is defined as the output indicating that A is less than B (A_LT_B). Comparing Eq. (

Here, A and B are the binary inputs, Cin is the carry input, Cout is the carry output, and G and P are the generate and propagate signals, respectively. Now, after comparing equations (

Cin can be considered as G0. For this, the encoding equation is given as:

Substituting the two values from equations (

The G and P signals can be further combined to form group G and P signals. For instance, for a 64-bit comparator, B_{Big} and EQ can be computed as:

Fig. 7 shows the complete design of an 8-bit comparator as an example of this technique, where i = 0…7 and j = 0…3.

The complete design of the 8-bit comparator, including the pre-encoding circuit and the Comp circuit
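The comparator structure can be sketched in software as a pre-encoding step followed by a tree of combine cells. In the Python model below, the per-bit encoding (BBig_i = NOT a_i AND b_i, EQ_i = a_i XNOR b_i) and the combine rule are assumptions chosen to reproduce the output convention stated above (“1,0” for A<B, “0,1” for A=B, “0,0” for A>B); they are not copied from the paper's equations, and the function name `compare` is hypothetical.

```python
def compare(a: int, b: int, width: int = 8):
    """Tree comparator model. Returns (BBig, EQ) for unsigned a and b."""
    # Pre-encoding: one (b_big, eq) pair per bit position, LSB first.
    pairs = []
    for i in range(width):
        ai, bi = (a >> i) & 1, (b >> i) & 1
        pairs.append(((1 - ai) & bi, 1 - (ai ^ bi)))

    # Combine neighbouring groups; the more significant group dominates.
    while len(pairs) > 1:
        nxt = []
        for j in range(0, len(pairs), 2):
            lo = pairs[j]
            hi = pairs[j + 1] if j + 1 < len(pairs) else (0, 1)  # pad: equal
            b_big = hi[0] | (hi[1] & lo[0])   # same shape as G | (P & G_lo)
            eq = hi[1] & lo[1]                # same shape as P & P_lo
            nxt.append((b_big, eq))
        pairs = nxt
    return pairs[0]


if __name__ == "__main__":
    print(compare(3, 5), compare(5, 5), compare(9, 2))
```

Because the combine rule has the same shape as the carry generate/propagate recurrence, the depth of the tree grows logarithmically with the operand width.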

Fundamentally, the multiplication operation (along with fast addition) is a significant unit in almost all cryptographic coprocessors. For instance, in the design of SSC Crypto-processor[^{2}), the public key (p^{2}q) and the modulus (pq). Also, in the design of the RSA crypto-processor, the multiplier is used to compute the modulus (p·q) and the Euler function φ(n) = (p − 1)·(q − 1) [

Unlike the binary (radix-2) Booth encoder, the radix-8 Booth encoder encodes each group of three bits as shown in Table

RADIX-8 BOOTH ENCODING.

Multiplier bit i+2 | Multiplier bit i+1 | Multiplier bit i | Multiplier bit i−1 | Partial product i |
---|---|---|---|---|
0 | 0 | 0 | 0 | 0 |
0 | 0 | 0 | 1 | A |
0 | 0 | 1 | 0 | A |
0 | 0 | 1 | 1 | 2A |
0 | 1 | 0 | 0 | 2A |
0 | 1 | 0 | 1 | 3A |
0 | 1 | 1 | 0 | 3A |
0 | 1 | 1 | 1 | 4A |
1 | 0 | 0 | 0 | -4A |
1 | 0 | 0 | 1 | -3A |
1 | 0 | 1 | 0 | -3A |
1 | 0 | 1 | 1 | -2A |
1 | 1 | 0 | 0 | -2A |
1 | 1 | 0 | 1 | -A |
1 | 1 | 1 | 0 | -A |
1 | 1 | 1 | 1 | 0 |
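The encoding table can be exercised with a short Python sketch. The lookup list `BOOTH8` is indexed by the four examined bits (bit i+2 down to bit i−1) and holds the multiple of A from the table; the functions `booth_radix8_digits` and `booth_multiply` are hypothetical names, and the multiplier is assumed to be unsigned and zero-extended so that the top digit is never negative.

```python
# Multiple of A selected by the 4-bit window (bit i+2, i+1, i, i-1),
# in the same order as the rows of the encoding table.
BOOTH8 = [0, 1, 1, 2, 2, 3, 3, 4, -4, -3, -3, -2, -2, -1, -1, 0]


def booth_radix8_digits(multiplier: int, bits: int = 32):
    """Radix-8 Booth digits (each in -4..4), least significant digit first."""
    # Append the implicit bit x_(-1) = 0 below the least significant bit.
    m = (multiplier & ((1 << bits) - 1)) << 1
    # Windows overlap by one bit and advance three bits per digit.
    return [BOOTH8[(m >> (3 * k)) & 0xF] for k in range(bits // 3 + 1)]


def booth_multiply(a: int, b: int, bits: int = 32) -> int:
    """Sum of the shifted partial products d_k * A * 8**k."""
    return sum((d * a) << (3 * k)
               for k, d in enumerate(booth_radix8_digits(b, bits)))


if __name__ == "__main__":
    digits = booth_radix8_digits(0xDEADBEEF, 32)
    print(len(digits), digits)
    print(booth_multiply(0x12345678, 0xDEADBEEF) == 0x12345678 * 0xDEADBEEF)
```

Under these assumptions a 32-bit multiplier yields 11 digits and a 64-bit multiplier yields 22, so the number of partial products is roughly one third of the operand width. The multiple 3A is the only one that cannot be formed by shifting alone and needs an addition (A + 2A).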

Design of Radix-8 Booth 32-bit multiplier

As can be seen from fig.

Also, Fig.

State machine diagram for 32-bit Booth multiplier.

In this method, we combine the bit reduction of radix-8 Booth with the parallelism of the CSA based Wallace tree and the pipelining of Karatsuba multiplication. Thus, this design achieved the minimum path delay and a minimized area (i.e. the best performance). However, the redundancy in this design produced one critical problem, concerning the middle carry at the edges of blocks, that affects the results. Fig.

Design of 64-bit CSA Based Radix-8 Booth, Wallace Tree Karatsuba multiplier.

Thus, 10 partial products are generated. In the final stage, a CSA based Wallace tree is used to add the resulting partial products. The final result is represented redundantly as a vector sum and a vector carry. This design achieves the minimum path delay with limited area.

However, the redundancy in this design produces one critical problem that affects the results. As a rule of thumb, if we multiply two N-bit numbers (i.e. p and q), the product grows to 2N bits. This is not the case in redundant systems, since the result is stored as two 2N-bit vectors, and adding the two vectors to obtain the conventional product might result in 2N + 1 bits. This additional bit brings up a new problem in the preliminary design. It can be solved by discarding the last carry when converting back to the conventional representation. However, in the Karatsuba algorithm the numbers are split into 32-bit halves (the original size is 64 bits). The result must be 128 bits, but in the Karatsuba case it consists of 10 partial product vectors of 64 bits, shifted in such a way that adding them results in 128 bits. Thus, discarding every generated carry when converting back to the conventional system leads to an error, since only the carry generated by adding the two vectors that correspond to the same variable (or the same partial product in this case) needs to be discarded. The other generated carries must be considered. Fig.

Graphical demonstration of the carry error (the mid-carry problem), with two cases: Case I: ps1 + pc1 might result in a carry, giving a 65-bit result (wrong), so the carry must be discarded. Case II: ps1 + ps2 might result in a carry, giving a 65-bit result (correct), so the carry must be considered.
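The two cases can be reproduced with a small numeric sketch. The Python model below uses an 8-bit width instead of 64 bits for readability and assumes that negative Booth partial products are held in two's complement; the names `WIDTH`, `MASK`, and `csa_fixed` are illustrative and do not come from the design.

```python
WIDTH = 8                    # small stand-in for the 64-bit vectors
MASK = (1 << WIDTH) - 1


def csa_fixed(x: int, y: int, z: int):
    """Fixed-width carry save adder: returns (vector sum, vector carry)."""
    s = (x ^ y ^ z) & MASK
    c = (((x & y) | (x & z) | (y & z)) << 1) & MASK
    return s, c


if __name__ == "__main__":
    # Case I: sum and carry vectors of the SAME redundant number.
    # -1 (0xFF in two's complement) + 2 is held as (ps, pc).
    ps, pc = csa_fixed(0xFF, 0x02, 0x00)
    raw = ps + pc
    print(hex(raw), raw & MASK)   # the extra top bit is false: discard it

    # Case II: two DIFFERENT non-negative values.
    raw2 = 200 + 100
    print(raw2, raw2 & MASK)      # the extra top bit is real: keep it
```

In Case I the addition overflows the width even though the represented value (1) fits, so the carry is an artifact of the redundant form. In Case II dropping the carry would turn 300 into 44, which is the error described above.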

Eventually, the mid-carry problem was solved either by using the 64-bit CSA Based Radix-8 Booth, KSA Based Karatsuba multiplier or by using the 64-bit CSA Based Radix-8 Booth, with comparator Karatsuba multiplier. However, both solutions added overhead to the design cost; therefore, the Karatsuba-based approach was excluded. Both solutions are discussed in the following subsections.

Since the carry to be eliminated is the one generated by the Booth multiplier, a first thought is to replace the CSA adder with a KSA adder, which converts the two vectors back into one 64-bit number and discards any generated carry. All 8 vectors are reduced to five 64-bit vectors in parallel. This stage eliminates the false carry without any further examination. KSA is a fast adder, so this design maintains its high performance at the cost of more logic elements. The logic diagram of the design is shown in Fig.

Design of the 64-bit CSA Based Radix-8 Booth, KSA Based Karatsuba multiplier.

Another design option that can solve the mid-carry problem is to use a 64-bit comparator to test whether the two vectors will generate a carry; if so, a correction step is performed before the 10 vectors are fed to the CSA tree. After the Booth multiplication stage, the vector sum and vector carry that may produce a carry error are connected to the inputs of the 64-bit comparator unit, and the correction is performed if needed. Finally, all vectors are added using the CSA tree. The complete solution is depicted in Fig.

Karatsuba multiplication based on CSA and comparator.

Note that the 64-bit comparator can be built with 8 stages in total, giving a total delay of 13 gate-delay levels and an area of 317 gates (like the design of the 8-bit comparator discussed in Section 2.5). To predict whether the carry will be generated, we need to generate the 64-bit G (generate) and K (kill) vectors. Thus, three cases might occur, as follows:

Case I: when

Case II: when

Case III: when

To define the first case, we have used a comparator to compare the two vectors G and K as the comparator results:
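The link between carry prediction and comparison can be sketched as follows. Assuming the usual definitions G = A AND B and K = NOT A AND NOT B (which the text does not spell out), the most significant position that is not a propagate position decides the carry-out, and that position holds a 1 in exactly one of G or K. Comparing G and K as unsigned integers therefore predicts the carry-out without performing the addition. The function name `predict_carry` is hypothetical.

```python
def predict_carry(a: int, b: int, width: int = 64) -> bool:
    """Predict the carry-out of a + b (carry-in 0) by comparing G with K."""
    mask = (1 << width) - 1
    g = a & b & mask          # positions where both bits are 1 (generate)
    k = ~(a | b) & mask       # positions where both bits are 0 (kill)
    # G > K : the topmost non-propagate position generates -> carry-out
    # G < K : the topmost non-propagate position kills     -> no carry-out
    # G == K: both are zero (all positions propagate)      -> no carry-out
    return g > k


if __name__ == "__main__":
    print(predict_carry(0xFD, 0x04, width=8))   # 0xFD + 0x04 overflows 8 bits
    print(predict_carry(0x7F, 0x01, width=8))   # 0x80 fits in 8 bits
```

This is why a magnitude comparator is sufficient for the correction step: it answers the same question as the carry chain of an adder, but produces a single bit.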

We investigated both proposed design alternatives for Karatsuba based multiplication theoretically, in terms of the critical path delay (in gate delay units) and the area of the multiplier (the number of gates used in the implementation). The results are shown in Table

COMPARISON BETWEEN DESIGN II & DESIGN III.

Design solution | Delay (gate delays) | Delay optimization | Area (number of gates) | Area optimization |
---|---|---|---|---|
Solution I: using KSA adder | 23 | +15% | 6130 | |
Solution II: using comparator unit | 27 | | 3712 | +50% |

This design is accomplished by expanding the 32-bit Booth multiplier to 64 bits. The two modules (i.e. 64-bit and 32-bit Booth) differ only in the number of generated partial products. Since radix-8 is used, 22 partial products are generated in the new module instead of 11, while the other logic components remain the same. Fig.

Design of CSA based Radix-8 Booth 64-bit multiplier.

To speed up the sequential 64-bit CSA Based Radix-8 Booth Multiplier, we parallelized the addition of the partial products produced at the same level by using a Wallace CSA tree instead of a sequential CSA, exploiting the maximum possible parallelism between the partial products to gain speed and enhance the design performance. That is, we ended up implementing a 64-bit Wallace Tree CSA based Radix-8 Booth multiplier (WCBM). The block diagram for the proposed design is shown in Fig.

(a) Design architecture of WCBM (b) Top-level diagram of WCBM (c) FSM diagram for WCBM.
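The overall WCBM data flow can be approximated by a compact behavioral model: generate the radix-8 Booth partial products, reduce them with levels of carry save adders, and perform one final addition. The Python sketch below is a software analogy under stated assumptions (unsigned operands, partial products kept in two's complement at twice the operand width, a simple group-by-three tree); the names `BOOTH8` and `wcbm_multiply` are hypothetical and the model says nothing about timing or the FSM.

```python
BOOTH8 = [0, 1, 1, 2, 2, 3, 3, 4, -4, -3, -3, -2, -2, -1, -1, 0]


def wcbm_multiply(a: int, b: int, bits: int = 64):
    """Behavioral model: returns (product, number of CSA tree levels)."""
    out_mask = (1 << (2 * bits)) - 1

    # Stage 1: radix-8 Booth partial products (two's complement, 2*bits wide).
    m = (b & ((1 << bits) - 1)) << 1
    vectors = []
    for k in range(bits // 3 + 1):
        digit = BOOTH8[(m >> (3 * k)) & 0xF]
        vectors.append(((digit * a) << (3 * k)) & out_mask)

    # Stage 2: Wallace-style reduction, three vectors -> two per CSA.
    levels = 0
    while len(vectors) > 2:
        nxt = []
        for i in range(0, len(vectors) - 2, 3):
            x, y, z = vectors[i:i + 3]
            nxt.append(x ^ y ^ z)
            nxt.append((((x & y) | (x & z) | (y & z)) << 1) & out_mask)
        nxt.extend(vectors[len(vectors) - len(vectors) % 3:])
        vectors = nxt
        levels += 1

    # Stage 3: final carry-propagate addition; the carry out is discarded.
    return (vectors[0] + vectors[1]) & out_mask, levels


if __name__ == "__main__":
    p, levels = wcbm_multiply(0xFEDCBA9876543210, 0x0123456789ABCDEF)
    print(hex(p), levels)
```

In this model the 22 partial products of a 64-bit multiplication pass through a logarithmic number of CSA levels rather than a chain of 21 sequential additions, which is the source of the speed-up described above.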

The top view of our implemented WCBM unit is given in Fig.

Sample run example of WCBM process of two 64-bit numbers

The proposed multiplier implementation has been synthesized using Altera Cyclone EP4CGX-22CF19C7 FPGA kit to analyze several design factors such as design area, the total delay of the multiplication unit and the thermal power consumption of FPGA implementation. We have evaluated the performance of the 64-bit Wallace Tree CSA based Radix-8 Booth multiplier WCBM module for different data path sizes. Timing analysis of the critical clock cycle for the implemented WCBM is illustrated in Fig.

Waveform sample of the proposed WCBM data delay

Multiplication is a core operation that dominates the performance of several public key cryptographic algorithms such as RSA and SSC. In this paper, we have thoroughly discussed several design alternatives for a radix-8 based multiplier unit, employing the Karatsuba method and the Booth recoding method with carry save and Kogge-Stone adders in a Wallace tree organization. The proposed designs were evaluated in terms of maximum frequency, critical path delay, design area, and total FPGA power consumption. The proposed hardware design was carried out using Altera Cyclone FPGA technology with the help of Altera CAD packages such as Quartus II and Modelsim 10.1. To sum up, we have successfully implemented and synthesized the 64-bit Wallace Tree CSA Based Radix-8 Booth Multiplier (WCBM) module on the target FPGA technology. The synthesis results were attractive across several design factors and can improve the computation performance of many multiplication-based applications.