# End-user low-power alternatives at topological and physical levels. Some examples on FPGAs

E. Todorovich, G. Sutter, and N. Acosta

INCA, Universidad Nacional del Centro, Tandil, Argentina http://www.exa.unicen.edu.ar

E. Boemo and S. López-Buedo

ETS Informática, Universidad Autónoma de Madrid, Ctra. Colmenar Km.15, Madrid, Spain http://www.ii.uam.es

### Abstract

High-speed digital designs exhibit a moderate logic depth, gate count, and wiring capacitance. These three characteristics are also essential conditions for low-power operation. Therefore, blocks with lower area or higher bandwidth are good candidates for a moderate power figure. This fact opens a way to overcome the lack of low-power EDA tools for FPGAs: optimizing size and speed during the design cycle in order to reduce power indirectly. However, the area-time characteristics of a circuit can be modified at different levels. For example, the designer can choose between different logic schemes, physical design actions, or even architectural changes like parallelism or pipelining. In this paper, the usefulness of some of these alternatives is evaluated and quantified using Xilinx FPGAs as the technological framework.

#### 1. Introduction

Timing analysis and area estimators have existed since the beginning of IC automation, at Fairchild in 1967 [1]. Nowadays, more than three decades later, designers of FPGA-based electronics have powerful tools to optimize circuits in both area and time. But, in spite of the research effort [2], no accurate power estimation software has been added to any current FPGA design suite. The option of in-circuit measurements, inherent to reprogrammable devices, is not always available: in several cases, the chip is part of a large system that will be integrated during the last stages of the project. In this situation, two straightforward questions arise: Can the abundant area-time tools and reports help the designer to predict whether a given solution will also have low-power attributes? Can these analyses be performed as early as possible in the design cycle?

Several technologists have pointed out the relationship between bandwidth and power. One Xilinx recommendation for reducing chip consumption is to redesign the circuit to make it faster, even if the specified bandwidth has already been achieved [3]. That is, minimizing the parameters that limit speed (like fanout, CLB count, or logic depth) can also lead to lower-power operation. For instance, the effect of both pipelining and manual partitioning of binary multipliers was explored in [4]. The main results verify that these actions not only speed up the circuits but also reduce the power at a fixed operating frequency (Fig.1).

The relationship between area and power is clear, and abundant experimental data are available in the technical literature. For instance, the area-power figures of different topologies for binary multiplication [5] and addition [6] are shown in Fig.2. In most of the circuits, a lower area implies lower power consumption at a fixed operating frequency.
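The link between these speed-related and area-related parameters and power can be made explicit with the standard first-order model of dynamic power in CMOS. This is a textbook approximation, not a formula derived from the measurements in this paper, and the symbols $\alpha_i$, $C_i$, $V_{DD}$ and $f$ are introduced here only for illustration:

$$
P_{dyn} \approx f \, V_{DD}^{2} \sum_{i} \alpha_i \, C_i
$$

Here the sum runs over the circuit nodes, $C_i$ is the capacitance of node $i$, $\alpha_i$ is its average number of power-consuming transitions per clock cycle, $V_{DD}$ is the supply voltage, and $f$ is the clock frequency. Under this model, at fixed $f$ and $V_{DD}$, actions that reduce fanout, wiring length, or gate count lower the $C_i$ terms, and a shorter logic depth tends to reduce spurious transitions and hence $\alpha_i$. The same actions usually shorten the critical path, which is one plausible reason why faster or smaller circuits often consume less.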

Chip optimization can be performed at the different steps of the design cycle. First of all, at the topological level: different circuits are available to perform the same operation. For instance, ripple-carry, carry-save, carry-skip, carry look-ahead, and Brent & Kung are some alternatives for binary addition. Each of them has a unique area-time-power (ATP) figure for a given technology. The next level corresponds to architectural modifications of the selected topology, basically parallelism or pipelining. Finally, the circuit can be transformed at the physical level, by a combination of actions like manual partitioning, effort setting, floorplanning, routing time specification, etc.

This paper continues the above lines of research by exploring some topological and physical alternatives. Xilinx 4K-series FPGAs have been used as the technological framework. Section 2 summarizes the main characteristics of the benchmark circuits, and Section 3 presents the principal results. Finally, some suggestions to reduce power in FPGAs are given.



Fig.1: Bandwidth-power figure for different pipelined multipliers. Data extracted from [4].





**Fig.2:** Area-power figure of different topologies corresponding to multipliers (above) and adders (below). Data extracted from [5] and [6] respectively.

#### 2. Benchmark circuits

In order to illustrate the main ideas of this work, a set of four multipliers was selected as case studies. This operation is commonly used for benchmarking purposes: it can be implemented in different topologies, some of the circuits can be pipelined straightforwardly, and there is an extensive bibliography (multiplication can be traced through the history of computing machinery [7] over at least the last 150 years). The main characteristics of each circuit are summarized in Table I.

Each benchmark set has been implemented and tested under identical conditions. That is, all the measurements relate to the same FPGA sample (an *XC4010PC84-4C*), output pins, tool settings, printed circuit board, input vectors, clock frequency, and logic analyzer probes. As a consequence, all prototypes have almost identical off-chip and static power components. These two fractions, which cannot be manipulated at the topological, architectural, or physical levels, have been subtracted in the following charts in order to focus the study on the dynamic power components. Table II outlines the technique used to measure the power components [4]. Other alternatives have been explored in [8].

#### 3. Experimental results

*The effect of different topologies.*- The first opportunity to reduce power is the selection of the right topology. Several circuits are logically equivalent, yet each has a different area-time-power figure. Fig.3 depicts the measured time-power pairs of four binary multipliers. The group includes the synthesis result of a behavioral VHDL model (a minimal description using `P <= a * b`). The x-axis represents the critical path delay, calculated using the static timing analyzer tool. Each region in the graph demarcates the bandwidth-power area of 21 measured samples from a set of 100 automatic place-route repetitions of each multiplier. The samples comprise a group of fast, slow, and average-speed prototypes. Power for identical circuits varies by a factor of about 1.1, while the maximum gap across all the measured circuits is a factor of 1.3.



**Fig.3:** Datapath power of four 8-bit multiplier alternatives.



Fig.4: ATP figure for the benchmark circuits, measured at 2 MHz.

*Correlation Area-Time-Power.*- There is no clear correlation among these parameters in the four multipliers analyzed (Fig.4). In terms of time-power, the results can be separated into two zones; one of them groups the faster, lower-power circuits. But inside each group the opposite effect can be observed: the slower circuits also exhibit lower power consumption. The third variable, CLB occupation, is not significant in terms of power. Circuits like the VHDL and Guild sets show the minimum occupation, 54 and 66 CLBs respectively, and at the same time the maximum gap in power consumption. In contrast, the Hatamian and Guild sets have a large gap in occupation but similar power figures.

*Correlation Time-Power between versions of the same topology.*- A clearer time-power relationship can be observed in circuits with the same topology and CLB count. The only differences are the final placement and interconnection network: they have been obtained using a repetitive automatic place-route process. Fig.5 shows two examples corresponding to 21 measured versions of the VHDL and Guild circuits (the samples comprise a group of fast, slow, and average-speed prototypes). The results do not fit an exact line (the regression coefficients are 0.65 and 0.63 respectively), but the assumption "faster circuit - lower consumption" is valid in most cases. For example, of the 12 VHDL samples that exhibit a bandwidth above the average speed, 10 also have consumption below the average power. The same analysis for the Guild topology shows that 9 of the 10 faster circuits also exhibit lower consumption with respect to the mean value.
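The two checks used in this comparison (a correlation coefficient, and a count of faster-than-average samples that are also below the average power) can be sketched as follows. This is only an illustration: the sample values are hypothetical rather than the measured data, the function names are invented here, and the coefficient is assumed to be a Pearson correlation, which the text does not state.

```python
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def fast_and_low_power(delays_ns, powers_mw):
    """Count the samples that are faster than average and, among those,
    the ones that are also below the average power.
    Returns (n_fast, n_fast_and_low_power)."""
    d_mean, p_mean = mean(delays_ns), mean(powers_mw)
    # A sample is "fast" when its critical path delay is below the mean delay.
    fast = [p for d, p in zip(delays_ns, powers_mw) if d < d_mean]
    return len(fast), sum(1 for p in fast if p < p_mean)


# Hypothetical samples (critical path in ns, datapath power in mW).
# These are NOT the measured values of the paper.
delays_ns = [50.0, 52.0, 54.0, 56.0, 58.0, 60.0]
powers_mw = [10.0, 10.4, 10.2, 10.9, 10.6, 11.0]

r = pearson(delays_ns, powers_mw)
n_fast, n_fast_low = fast_and_low_power(delays_ns, powers_mw)
print(f"r = {r:.2f}; {n_fast_low} of {n_fast} fast samples are below mean power")
```

A positive coefficient between delay and power corresponds to the "faster circuit - lower consumption" trend; the count gives the same information in the form quoted for the VHDL and Guild sets.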

| Topology                  | Reference | Description language    | Number of CLBs | Logic depth in the critical path   |
|---------------------------|-----------|-------------------------|---------------|-------------------------------------|
| Wallace                   | [9]       | VHDL                    | 69            | 13 LUTs                             |
| Hatamian                  | [10]      | Gate level              | 96            | 15 LUTs                             |
| Guild                     | [11]      | VHDL                    | 60            | 15 LUTs                             |
| Behavioral VHDL Synthesis | [12]      | VHDL                    | 54            | 12 LUTs                             |

Table I: Main constructive characteristics of the benchmark circuits (XC4010PC84-4C).

| Power component       | Measurement technique |
|-----------------------|-----------------------|
| Static power          | The chip is configured, but neither stimulus nor clocking is applied. The pull-up resistors and other external elements that the FPGA requires remain connected. |
| Off-chip power        | The circuit is measured twice: first during normal operation, and then with the tri-state output buffers disabled. The off-chip component can thus be approximated by the difference between the two results. (In addition, using the tri-state buffers in low-power design helps to separate the results from a particular PCB.) |
| Synchronization power | Constant data (for example, all bits zeroed) is applied to the circuit inputs while the clock signal is running, so that only the clock tree has activity. It is important to note that FPGAs use multiplexers to emulate the effect of a clock enable. As a consequence, using the *clock enable* pin of a CLB does not interrupt the clocking of the flip-flops. |

Table II: Power component measurement in arithmetic circuits [4].
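The separation outlined in Table II amounts to a few subtractions between supply-power readings. The sketch below is one possible way to organize them, not the procedure of [4] itself. The field names and numeric values are hypothetical, and two points are assumptions not stated in the table: that the constant-input reading is taken with the output buffers disabled, and that the datapath power is what remains after removing the other three components.

```python
from dataclasses import dataclass


@dataclass
class PowerReadings:
    """Supply power readings in mW for one prototype (hypothetical names)."""
    normal_operation: float   # real input vectors, clock running, outputs enabled
    outputs_disabled: float   # same stimulus, tri-state output buffers disabled
    constant_inputs: float    # constant input data, clock running (assumed: outputs disabled)
    no_clock: float           # chip configured, no stimulus and no clock


def split_power(r: PowerReadings) -> dict:
    """Split the total power into the components named in Table II."""
    static = r.no_clock
    # Off-chip power: difference between enabled and disabled outputs.
    off_chip = r.normal_operation - r.outputs_disabled
    # Synchronization power: only the clock tree is active, minus the static part.
    synchronization = r.constant_inputs - static
    # Datapath power (assumption): the remainder with the outputs disabled.
    datapath = r.outputs_disabled - r.constant_inputs
    return {
        "static": static,
        "off_chip": off_chip,
        "synchronization": synchronization,
        "datapath": datapath,
    }


# Hypothetical readings, NOT values measured in the paper.
readings = PowerReadings(normal_operation=100.0, outputs_disabled=80.0,
                         constant_inputs=50.0, no_clock=30.0)
parts = split_power(readings)
print(parts)
```

By construction the four components add up to the reading taken during normal operation, which gives a quick consistency check on a set of measurements.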



Fig.5: Relationship time-power for identical topologies: Sets VHDL (left) and Guild (right). Datapath power measured at 2 MHz.

*The effect of timing constraints.*- Taking the previous result into account, another end-user alternative to improve bandwidth and power could be to set a timing constraint on the clock period before the P&R process. This option is included in most current FPGA design suites. In this way, delays would be reduced and hence the power figure would improve. However, the experimental data do not confirm that assumption. Fig.6 shows the results of this idea for the VHDL benchmark set. Timing constraints of 60, 54 and 52 ns were assigned to the clock period. In addition, the maximum placement effort was selected. The three versions turned out faster than the original circuit, but none exhibited a power reduction.

#### 4. Conclusions

This paper has explored some end-user alternatives to get an extra power saving in FPGA-based circuits. The main idea has been to indirectly employ the available information about area and timing, to improve the power consumption. The technique is complementary to a recent FPGA low-power analysis at the architectural level [13].

The goals have been partially fulfilled. The main conclusions are:

- For a selected topology, maximum bandwidth usually points to the best circuit in terms of power. This optimization can be obtained for free; for instance, by using a repetitive PPR process (Fig.5).
- However, if the designer must choose between different topologies, neither clock period nor occupation is by itself a primary parameter for predicting a power saving.
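The first conclusion suggests a simple selection rule for a fixed topology: repeat the automatic place-and-route process and keep the fastest result as the likely lowest-power candidate. A minimal sketch of that rule is given below. The record layout, function name, and timing values are hypothetical, and no real tool is invoked; the critical path delays are assumed to come from the static timing reports of each run.

```python
from typing import NamedTuple, Sequence


class PlaceRouteRun(NamedTuple):
    """One automatic place-and-route result (hypothetical record layout)."""
    run_id: int
    critical_path_ns: float  # taken from the static timing analyzer report


def pick_fastest(runs: Sequence[PlaceRouteRun]) -> PlaceRouteRun:
    """Return the run with the shortest critical path.

    For a fixed topology, this is used as a proxy for the lowest power;
    it is a heuristic, not a guarantee.
    """
    if not runs:
        raise ValueError("at least one place-and-route run is required")
    return min(runs, key=lambda run: run.critical_path_ns)


# Hypothetical timing results for three repetitions of the same design.
runs = [
    PlaceRouteRun(run_id=1, critical_path_ns=61.3),
    PlaceRouteRun(run_id=2, critical_path_ns=57.8),
    PlaceRouteRun(run_id=3, critical_path_ns=59.1),
]
best = pick_fastest(runs)
print(best.run_id)  # prints 2
```

As the second conclusion states, this rule should not be applied across different topologies, where clock period alone does not predict power.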

The relationship between area and power is not as clear as it is in cell-based circuits. Some techniques that trade CLBs for routing delay (like the "through-CLB" option or the duplication of hardware to reduce fanout) contribute to making CLB occupation less significant.



Fig.6: Effect of timing constraint on the power consumption. Benchmark circuit VHDL.

#### Acknowledgements

This work has been financed by Project 658001 of the *Fundación General de la Universidad Autónoma de Madrid*. The FOMEC Program of the World Bank financed the participation of Nelson Acosta, Gustavo Sutter and Elías Todorovich. The Conicet Agency of Argentina also supports G. Sutter and E. Todorovich with grants.

#### References

- [1] R. Walker, "Silicon Destiny. The History of ASICs and LSI Logic Corporation", C.M.C. Publications, Milpitas, 1992.
- [2] F. Najm, "A Survey of Power Estimation Techniques in VLSI Circuits", *IEEE Trans. on VLSI Systems*, *Vol.2*, no.4, pp.446-455, December 1994.
- [3] Xilinx Inc, "Power Considerations", in "Technical Conference and Seminar Series", 1995.
- [4] E. Boemo, "Contribution to the Design of Fine-grain Pipelined VLSI Arrays", Ph.D. Thesis, ETSI Telecomunicación, Universidad Politécnica de Madrid, 1996.
- [5] G. Keane, J. Spanier and R. Woods, "The impact of data characteristics and hardware topology on hardware selection for low power DSP", *Proc. ISLPED '98*, pp.94-96. ACM, 1998.
- [6] T. Callaway and E. Swartzlander, "Estimating the Power Consumption of CMOS Adders", *Proc. IEEE 11th Symposium on Computer Arithmetic*, pp.210-216, Windsor, Ontario, July 1993.
- [7] H. Goldstine, "The Computer. From Pascal to von Neumann", Princeton University Press, New Jersey, 1993.
- [8] L. Mengíbar, M. García, D. Martín, and L. Entrena, "Experiments in FPGA Characterization for Low-power Design", *Proc. DCIS'99*, Palma de Mallorca, 1999.
- [9] C. Wallace, "A Suggestion for a Fast Multiplier", *IEEE Trans. on Electronic Computers*, pp.14-17, February 1964.
- [10] M. Hatamian and G. Cash, "A 70-MHz 8-bit x 8-bit Parallel Pipelined Multiplier in 2.5-μm CMOS", *IEEE Journal of Solid-State Circuits*, August 1986.
- [11] H. Guild, "Fully Iterative Fast Array for Binary Multiplication and Addition", *Electronics Letters*, pp.263, Vol.5, N°12, June 1969.
- [12] Synopsys FPGA Express, Version 3.2, 1999.
- [13] A. García, W. Burleson and J. Danger, "Power Consumption Model of Field Programmable Gate Arrays", Proc. FPL'99, in *LNCS Series*, Springer-Verlag, 1999.

This paper was published in: Proc. XV Conference on Design of Circuits and Integrated Systems (DCIS'2000), Montpellier, November 21-24, 2000, pp.640-644.

It can be downloaded from http://www.ii.uam.es/~ivan