Online Scheduling for Real-Time Multitasking on Reconfigurable Hardware Devices

A Thesis submitted for the degree of Doctor of Philosophy

By Guy Wassi-Leupi

Department of New Media & Technologies, Faculty of Design, Media & Management, Buckinghamshire New University
Brunel University, London

July 2011

Abstract

Nowadays, the ever increasing algorithmic complexity of embedded applications requires designers to turn towards heterogeneous and highly integrated systems denoted as SoCs (Systems-on-a-Chip). These architectures may embed CPU-based processors, dedicated datapaths as well as reconfigurable units. However, embedded SoCs are subject to stringent requirements in terms of speed, size, cost, power consumption, throughput, etc. Therefore, new computing paradigms are required to fulfil the constraints of the applications and the requirements of the architecture. Reconfigurable Computing is a promising paradigm that probably provides the best trade-off between these requirements and constraints, and dynamically reconfigurable architectures are its key enabling technology. They enable the hardware to adapt to the application at runtime. However, these architectures raise new challenges in SoC design. On one hand, designing a system that takes advantage of dynamic reconfiguration is still very time consuming because of the lack of design methodologies and tools. On the other hand, scheduling hardware tasks differs from classical software task scheduling on uniprocessor or multiprocessor systems, as it carries an additional, complicated placement problem.

This thesis deals with the problem of online scheduling of real-time hardware tasks on Dynamically Reconfigurable Hardware Devices (DRHWs). The problem is addressed from two angles: (i) investigating novel algorithms for online real-time scheduling/placement on DRHWs; (ii) building a scheduling/placement algorithms library for RTOS-driven Design Space Exploration (DSE).

Regarding the first point, the thesis proposes two main runtime-aware scheduling and placement techniques and assesses their suitability for online real-time scenarios. The first technique studies the impact of synthesizing, at design time, several shapes and/or sizes per hardware task (denoted as a multi-shape task) in order to ease the online scheduling process. The second technique combines a looking-ahead scheduling approach with a slot-based management of the reconfigurable area that relies on 1D placement. The results show that both techniques improve scheduling and placement quality without significantly increasing the time complexity of the algorithms.

Regarding the second point, when designing SoCs that embed reconfigurable parts, new design paradigms tend to explore and validate the architectural design space as early as possible, at system level. In this way, the RTOS (Real-Time Operating System) services that manage the reconfigurable parts of the SoC can be refined. In such a context, numerous hardware task scheduling and placement algorithms, offering various complexity vs performance trade-offs, must be gathered in a kind of library. In this thesis, the proposed algorithms, in addition to some existing ones, are purposely implemented in C++ in order to ensure compatibility with any C++/SystemC based SoC design methodology.

Keywords: FPGA, Reconfigurable SoC, DES, Modelling, Scheduling, Placement, RTOS.
Acknowledgments

This work was carried out in two institutions: the Faculty of Design, Media & Management, Buckinghamshire New University, High Wycombe (UK), and the Computer Architecture team at the ETIS/CNRS lab, ENSEA, University of Cergy (France).

I would like to take this opportunity to thank Professor Geoff Lawday for giving me the opportunity to undertake this thesis. I would also like to express my gratitude to him for being always understanding and encouraging, even when I felt stressed. I am sincerely thankful to my supervisors, Dr Kevin Maher from Buckinghamshire New University and Dr Amine Benkhelifa from the ETIS lab, University of Cergy, for their valuable supervision and support over the years.

I owe a great debt to Professor François Verdier from the University of Nice Sophia-Antipolis, who advised, encouraged, and inspired my research. I have been blessed with two experts in reconfigurable computing, Amine and François. They have shown generosity and availability towards my research project.

I am grateful to Professor John Boylan and Buckinghamshire New University for supporting me with the funding necessary to carry out my research. I also gratefully thank the ETIS lab for providing me with extra funding along with a suitable research environment.

I would like to thank Professor Chris Hudson, Professor Inbar Fijalkow, Professor Bertrand Granado, Dr Peter Wilkinson, Laura Bray, Howard Bush, William Lishman, Dr Anne Evans, the staff at Buckinghamshire New University and the staff at the ETIS lab, University of Cergy, for their support and advice.

I will not forget my fellow research students and my former colleagues from Buckinghamshire New University and the ETIS lab, with whom I shared very fruitful discussions. Thanks to Paul & Lucy Tanyi, Antoine & Ariane Kamina, Bernard Mankem, Julius Ebokolle, Premkumar Elangovan, Knowledge Mpofu, Indrachapa Bandara and many others who are not personally named here. They all made my life in the UK enjoyable and sociable. Thanks to Randolph Boyd for his availability.

My family has been a constant source of understanding, encouragement and love. Thanks to my brothers and sisters, Michel, Alex, Theresine and Anne-Marie, for their constant support and love. They always believed in me.

I dedicate this work to my beloved parents Jeanne-d'Arc Beumani and Emmanuel Leupi, to my wife Lydie-Flore, to my little princesses Jane-Veronica, Lise-Nahomie and to Françoise-Soul-angel. I wish Jeanne d'Arc, Françoise and Jeanne had witnessed this achievement; I am quite sure they are all smiling on me up there.

Contents

1 Introduction 1
1.1 Precis of Embedded Systems and Research Rationale 1
1.2 Raison D'Être for using Reconfigurable Hardware Devices in Embedded SoCs 3
1.2.1 Dynamic and Online Embedded Applications 4
1.2.2 Technology Advances, Market and Costs Constraints 8
1.3 Related Research Issues 11
1.3.1 System-On-Chip Design Overview 11
1.3.2 Reconfigurable System-On-Chip Design 13
1.3.3 Operating System for Reconfigurable System-On-a-Chip 17
1.4 Contribution of the Thesis 18
1.4.1 Algorithms for Online Real-Time Scheduling & Placement 18
1.4.2 Scheduling & Placement Algorithms for OS-driven Design Space Exploration 19
1.5 Outline of the Thesis 19

2 Dynamically Reconfigurable Architectures vs Implementation Alternatives 22
2.1 Introduction 22
2.1.1 The Switch from Analog to Digital Signal Processing 23
2.1.2 The most common DSP Functions 24
2.1.3 Software vs Hardware Platforms 24
2.2 Software Implementation Platforms 25
2.2.1 General Purpose Processors (GPPs) 25
2.2.2 Programmable Digital Signal Processors (DSPs) 28
2.3 Hardware Implementation Platforms 29
2.3.1 ASIC Implementation 30
2.3.2 Fine and Coarse Grain Reconfigurable Arrays Implementation 31
2.4 ASIP/ASSP Implementation 32
2.5 Fine-grained Reconfigurable Hardware Devices 33
2.5.1 Introduction 33
2.5.2 FPGA Architectures 33
2.5.3 FPGA Technology 34
2.5.4 FPGA Structures 35
2.5.5 SRAM-based FPGA 36
2.5.6 Heterogeneous FPGAs 41
2.5.7 FPGA Design Flow 44
2.5.8 FPGA Modular Design for Runtime Partial Reconfiguration 46
2.5.9 Coupling with the Host Processor 48
2.5.10 Types of Reconfiguration 51
2.5.11 Configuration Hierarchy 53
2.6 Coarse-grained Reconfigurable Arrays 55
2.6.1 Raison D'Être 55
2.6.2 Presentation 55
2.7 Platform-Based Design 56
2.7.1 Introduction 56
2.7.2 Definition 57
2.7.3 OS for Reconfigurable Platforms 58
2.8 Conclusion of the Chapter 62

3 Background and Related Work 64
3.1 Introduction 64
3.2 Real-Time Systems 65
3.2.1 Hard vs Soft Real-Time 66
3.2.2 Requirements for Real-Time Computer Systems 66
3.3 Real-Time Scheduling 67
3.3.1 Introduction 67
3.3.2 Real-Time Tasks 67
3.3.3 Different Scheduling Problems 71
3.3.4 Objective Functions 74
3.3.5 Offline Scheduling 75
3.4 Online Scheduling 76
3.4.1 Introduction 76
3.4.2 Different Online Paradigms 76
3.4.3 Performance Analysis 78
3.4.4 Schedulability Analysis 80
3.5 RT Scheduling for Uniprocessor Systems 80
3.5.1 Rate Monotonic (RM) 80
3.5.2 Deadline Monotonic (DM) 81
3.5.3 Earliest Deadline First (EDF) 81
3.5.4 Least Laxity First (LLF) 82
3.5.5 List Scheduling (LS) 82
3.5.6 Uniprocessor Scheduling Model for Reconfigurable Hardware 82
3.6 RT Scheduling for Multiprocessor Systems 87
3.6.1 Multiprocessor Scheduling Problem 88
3.6.2 Multiprocessor Platforms 88
3.6.3 Partitioned vs Nonpartitioned Scheduling Strategies 89
3.6.4 Multiprocessor Scheduling Model for Reconfigurable Hardware 90
3.7 Online Real-Time Scheduling on Reconfigurable Hardware Devices 93
3.7.1 Online Scheduling Without-Looking-Ahead and Related Work 94
3.7.2 Online Looking-Ahead Scheduling and Related Work 98
3.8 Tasks Placement and Related Work 105
3.8.1 Online Placement Issues 105
3.8.2 Free Area Partitioning 107
3.8.3 Data Structure to store the State of the Reconfigurable Array 110
3.8.4 Fitting Strategies 113
3.8.5 Related Work 116
3.9 Fragmentation and Related Work 122
3.9.1 Internal and Intra-task Fragmentations 124
3.9.2 External Fragmentation 125
3.9.3 Related Work 125
3.10 Conclusion of the Chapter 134

4 Proposed Methodology, Models and Metrics 136
4.1 Introduction 136
4.2 Methodology 137
4.2.1 Introduction 137
4.2.2 Proposed Methodology 137
4.2.3 Running two MERs-based Algorithms on an Embedded Processor 139
4.2.4 Lessons Learnt from Preliminary Results and Conclusion 143
4.3 Models 146
4.3.1 Real-Time Tasks and Applications Modeling 146
4.3.2 Reconfigurable Devices Area Models 152
4.3.3 Scheduler Model 155
4.3.4 Placer Model 156
4.4 Metrics 158
4.4.1 Reconfigurable Hardware Resources Metrics 158
4.4.2 Tasks Metrics 158
4.4.3 Application Metrics 159
4.4.4 Scheduling Metrics 161
4.4.5 Feasible Schedule 164
4.5 Global Simulation Model and Compatibility with the OveRSoC Design Methodology 164
4.5.1 An UML Overview of the Global Simulation Model 165
4.5.2 The Importance of Using a C++ Based Simulation Model 166
4.6 Conclusion of the Chapter 168

5 Proposed Algorithms for Online Real-Time Scheduling & Placement 169
5.1 Introduction 169
5.2 Tasks Parameters Based Global Scheduling 170
5.2.1 Temporal parameters based scheduling (Basic, EDF, LLF, etc.) 173
5.2.2 Geometric parameters based scheduling (BSF, SSF, etc.) 175
5.2.3 Combining Geometric and Temporal parameters for scheduling 176
5.3 Slots-based Scheduling 177
5.3.1 n X 1D variable size slots scheduling 177
5.3.2 1D variable slots looking-ahead scheduling 181
5.3.3 1D variable slots scheduling with minimum makespan 183
5.4 Placement Strategies for 2D Looking-Ahead Scheduling 187
5.4.1 A Ternary Tree structure for Looking-Ahead Scheduling 187
5.5 Multi-shape based Tasks Scheduling 191
5.5.1 Raison d'être for multi-shape tasks 192
5.5.2 The multi-shape basic algorithm 194
5.6 Conclusion of the Chapter 196

6 Simulation Results of the Algorithms Proposed to Solve Online Real-Time Scheduling Issues 197
6.1 Introduction 197
6.2 Building the Inputs and the Testing Environment 198
6.2.1 Hardware Tasks Characterization 198
6.2.2 Estimating the Size of Tasks 199
6.2.3 Final Inputs Values for Experiments 200
6.2.4 The Running Environment 201
6.3 Tasks Parameters Based Scheduling 201
6.3.1 Chip Utilization Ratio and Tasks Rejection Ratio 201
6.3.2 Runtime Overhead 202
6.3.3 Conclusion on parameters based scheduling 205
6.4 Multi-shape Tasks Based Scheduling 205
6.4.1 Multi-shape Tasks 205
6.4.2 Chip Utilization Ratio and Tasks Rejection Ratio 208
6.4.3 Makespan and Runtime Overheads 213
6.4.4 Conclusion on multi-shape scheduling 218
6.5 Horizon Looking-Ahead Scheduling Algorithms 218
6.5.1 Horizon Looking-Ahead Scheduling using a Ternary Tree 219
6.5.2 1D Variable Slots Looking-Ahead Scheduling 221
6.6 Conclusion of the Chapter 223

7 Conclusion and Future Work 225
7.1 Discussions 225
7.2 Key Contributions 226
7.2.1 Algorithms for Online Real-time Scheduling/Placement on DPRHWs 226
7.2.2 Scheduling/Placement algorithms library for RTOS-driven design space exploration 229
7.3 Hypothesis and Limitations 230
7.4 Future Work 231

Appendix 245
7.5 Appendix A: Table classifying related work on scheduling and placement strategies 245
7.6 Appendix B: Additional Simulation Results 248
7.7 Appendix C: Tables of algorithms and data structures implemented 256
7.8 Appendix D: Size of IPs from the Xilinx core generator 260
7.9 Appendix E: DA Implementation of a Multi-shape Hardware Task: the FIR Filter 261
7.9.1 Distributed Arithmetic as an enabling technique 262
7.9.2 Implementing Y using LUT-based DA 263
7.9.3 Throughput vs reconfigurable resources trade-off 265

List of Figures

1.1 Programmable device market segment share in 2011 (Xilinx, Company reports). 3
1.2 Processing requirements for wireless access protocols (Schüler and Tan, 2004) 5
1.3 The increase in logic density in FPGA over one decade and over the corresponding process technology (Koch and Torresen, 2010). 10
1.4 Main levels in the generic design flow of a SoC 12
1.5 Generic architecture of a Reconfigurable System-On-Chip. 14
1.6 Hardware/Software Co-Design shortens the design process (Fujitsu, 2002) 15
1.7 Productivity gap according to ITRS (The International Technology Roadmap for Semiconductors, www.itrs.net) 16
2.1 A simplified representation of a Digital Signal Processing System 23
2.2 Sequential execution. (a) a single operation at a time (b) sequential execution (c) pipelined execution providing higher throughput. 26
2.3 The Von Neumann architecture. 26
2.4 Dataflow representation of one instruction performing an Nth-order (N+1 taps) FIR filtering. 30
2.5 Mask cost exponentially grows with technology. 32
2.6 Simplified structure of an FPGA. 34
2.7 Truth table of function S = f(a, b, c) and its mapping using a 3-input Look-Up-Table. 37
2.8 A logic element or configurable logic block 38
2.9 An Adaptive Logic Module in Altera Stratix V architecture (courtesy Altera). 39
2.10 A Slice (SLICEL) in a Xilinx Virtex 5 FPGA architecture (courtesy Xilinx). 40
2.11 Altera Stratix-V floor plan (Altera, www.altera.com). 43
2.12 Xilinx Virtex II Pro FPGA with up to 4 hard core embedded processors (Xilinx, www.xilinx.com). 43
2.13 Design flow for FPGA-based systems embedding a programmable processor. 45
2.14 Modular design enables dynamic module swapping. 47
2.15 FPGA as co-processor 50
2.16 FPGA based SOPC (left) and embedded FPGA (eFPGA) 50
2.17 Configuration hierarchy model. 53
2.18 A view of the OveRSoC methodology with emphasis on DRA (dynamically reconfigurable architecture) management. 59
2.19 OS services exploration in OveRSoC design methodology (Miramond et al., 2009a), which maps the system level part of the generic design flow of SoC (see figure 1.4). 60
2.20 Flexibility vs Performance of implementation platforms. 62
3.1 Model of a Real-Time system 65
3.2 Different periodic real-time tasks according to their release time 70
3.3 Aperiodic task and sporadic task. 71
3.4 Periodic, sporadic and aperiodic tasks 71
3.5 Uniprocessor model for reconfigurable hardware devices with time sharing compound tasks. 84
3.6 Compound tasks timing characteristics 84
3.7 Uniprocessor model for reconfigurable hardware devices with space sharing compound tasks 86
3.8 Equal sizes and unequal sizes partitioning of a DPRHW (dynamically and partially reconfigurable hardware device) 92
3.9 Looking-ahead vs without-looking-ahead scheduling approaches 95
3.10 Managing areas availability or occupancy in looking-ahead scheduling (e.g. horizon and stuffing algorithms, Steiger et al., 2004) 100
3.11 An example of 1D Horizon and Stuffing scheduling algorithms (Steiger et al., 2004) 100
3.12 Intelligent Stuffing (IS) scheduling algorithm using 1D placement (Marconi et al., 2008) 103
3.13 Nonoverlapping vs overlapping partition; vertical vs horizontal split for overlapping partition 108
3.14 Placing a task in an overlapping rectangle 109
3.15 Maximum empty rectangles 109
3.16 A binary tree used as a data structure that records the state of the FPGA 111
3.17 Scheduling tasks on a 7 X 6 reconfigurable array using a 1D placement model 114
3.18 2D placement model of tasks on a 7 X 6 reconfigurable array 115
3.19 3D view of the 2D placement model illustrated in figure 3.18 116
3.20 Algorithms execution time comparison between the KAMER algorithm (Bazargan et al., 2000) and the 1D Cluster-based algorithm (Ahmadinia et al., 2004) 118
3.21 The hash matrix approach (a), the hash table (b), rectangle insertion/deletion in the hash matrix (c), (d) and (e) (Walder et al., 2003) 120
3.22 Placing tasks m1 and m2 on a heterogeneous reconfigurable architecture (Koester et al., 2005) 121
3.23 Defragmentation strategies: complexity grows with performance 123
3.24 Intra-task and internal fragmentation 126
3.25 A fragmented FPGA: the free space available on the chip is sufficient to insert the arriving task, but its shape doesn't allow it. 126
3.26 Bazargan's adjacency graph: bigger rectangles restoration process (Bazargan et al., 2000) 128
3.27 Footprint Transform (Walder and Platzner, 2002) 129
3.28 Using a horizontal line to manage free space (Ahmadinia et al., 2004) 131
4.1 A simple architecture of a reconfigurable SoC 138
4.2 Scheduling one task on the reconfigurable array using a MERs-based placement algorithm. 141
4.3 Scheduling the end of a task using a MERs-based scheduling algorithm. 141
4.4 Time for finding a MER (Maximum Empty Rectangle) 143
4.5 Scheduling timing and overheads (staircase) 144
4.6 Evolution (over one decade) of the configuration time of a full FPGA when considering the fastest possible configuration speed (Koch and Torresen, 2010). 144
4.7 A hardware task model: 2D view (b) and 3D view (a) 147
4.8 Different states of a hardware task 149
4.9 An application as a set of boxes (taskgraph). 152
4.10 Simple models of homogeneous and heterogeneous reconfigurable array 154
4.11 The global simulation model 156
4.12 The placer model and its different functional parts. 157
4.13 An UML overview of the global simulation model of the DPRHW-OS for a reconfigurable platform. 166
5.1 1D-like partitioned scheduling 178
5.2 1D improved horizon scheduling algorithm, also denoted as 1D variable size slots horizon (1D-VSSH) 182
5.3 List scheduling vs optimal scheduling of n tasks on m identical processors; ei is the execution time of task Ti 184
5.4 Ternary tree structure: splitting and updating processes 190
5.5 Multi-shape tasks provide more fitting opportunities (e.g. T3 provides 5 variants). 192
5.6 Flow chart of the multi-shape algorithm that selects the task version to be placed. 194
6.1 Summarizing the scheduling problem as defined in this thesis. 198
6.2 Utilization ratio (top), rejection ratio (middle) and quality metrics (bottom): comparative results for EDF, LLF, SSF and BSF scheduling algorithms. 203
6.3 Scheduling runtime overhead, number of scheduling calls and cumulative scheduler runtime overheads: comparative results on EDF, LLF, SSF and BSF scheduling algorithms. 204
6.4 Different combinations of multi-shape tasks for variants of the multi-shape scheduling algorithm 207
6.5 Multi-shape scheduling algorithms: simulation results of the utilization ratio, comparison with the Basic scheduling. 208
6.6 Multi-shape scheduling algorithms: simulation results of the tasks rejection ratio, comparison with the Basic scheduling. 209
6.7 Utilization ratio, tasks rejection ratio and differential quality metric URqm (with α = 0.5): comparative results for basic scheduling and multi-shape scheduling algorithms. 212
6.8 The average makespan: comparative results for multi-shape and Basic scheduling algorithms. 214
6.9 Multi-shape scheduling algorithms: the simulation results of the scheduling runtime overhead, with basic scheduling as reference scheduling. 215
6.10 Differential quality metrics for horizon-EAAF, horizon-SFAF, Basic and EDF scheduling algorithms. 220
6.11 Rejection delay for horizon-EAAF, horizon-SFAF, Basic and EDF scheduling algorithms. 220
6.12 Tasks rejection ratio, reconfigurable array utilization ratio and differential quality metric for the proposed 1D variable slots horizon scheduling, compared to 1D and 2D horizon scheduling from Steiger et al. (2004) 223
7.1 The tasks rejection ratio for parameters based scheduling and multi-shape tasks based scheduling. 249
7.2 The reconfigurable array utilization ratio for parameters based scheduling and multi-shape tasks based scheduling. 250
7.3 The runtime overheads of without-looking-ahead scheduling algorithms (parameters based scheduling and multi-shape tasks based scheduling) 251
7.4 The number of scheduler invocations for without-looking-ahead scheduling algorithms (parameters based scheduling and multi-shape tasks based scheduling) 252
7.5 The cumulative runtime overheads for without-looking-ahead scheduling algorithms (parameters based scheduling and multi-shape tasks based scheduling) 253
7.6 Tasks parameters based scheduling algorithms vs the multi-shape algorithm: simulation on a large number of tasks (10 sets of 5000 tasks). 254
7.7 Serial Distributed Arithmetic 264
7.8 Serial-Parallel Distributed Arithmetic 264
7.9 Example of resources utilization for different DA implementations of a FIR filter 265

List of Tables

2.1 Comparative table of implementation platforms for DSP applications (Adam, 2002) 62
3.1 Tasks to schedule on an FPGA of size 7 X 6 114
3.2 Comparison of the simulation results with the 1D-placement approach (Koester et al., 2005) 122
4.1 Simulation parameters for tasks and the reconfigurable array (FPGA) 142
5.1 A pseudo code of the tasks parameters-based scheduling algorithm 171
5.2 A pseudo code of the basic scheduling algorithm 174
5.3 A pseudo code of the 1D variable slots scheduling algorithm 180
5.4 Tasks parameters for 1D variable size slots looking-ahead scheduling 181
5.5 A pseudo code of the 1D variable slots with minimum makespan algorithm 186
5.6 Example of tasks parameters for horizon-SFAF and horizon-EAAF scheduling algorithms 189
6.1 Approximate sizes of most common IPs (hardware tasks). 199
6.2 Simulation results for looking-ahead scheduling using a ternary tree: comparison with basic scheduling and EDF scheduling. 219
7.1 Related work on scheduling and placement strategies (homogeneous reconfigurable array model) 246
7.2 Related work on scheduling and placement strategies (heterogeneous reconfigurable array model) 247
7.3 Scheduling algorithms implemented. 256
7.4 List of placement algorithms implemented. 257
7.5 List of placement structures implemented (1): The areas partitioning (existing works are cited and those from us are highlighted). 258
7.6 List of placement structures implemented (2): Finding fitting areas and merging free areas (existing works are cited and those from us are highlighted). 259
7.7 A few IPs for Virtex2pro FPGAs from the Xilinx Core Generator System 260
Nomenclature

Afpga : Area or total amount of resources on the reconfigurable array (FPGA)
ai : Release time of a task or job Ti
Apload : Relative application load, i.e. the relative amount of resources required to complete an application Apk on a given reconfigurable array of size Afpga = W × H
Apn : Application consisting of n jobs (or set of n tasks or jobs)
ar : Aspect ratio of a hardware task
ci : Execution or processing time of task Ti
di : Absolute deadline of task Ti
Di : Relative deadline of task Ti
ei : Execution or processing time of task Ti
ftav, rtav : Average flow time or response time of a job sequence or task set
fti, rti : Response time or flow time of a scheduled task Ti
fttot, rttot : Total flow time or response time of a job sequence or task set
fi : Finishing time of a task or job
hi : Height of hardware task Ti
Hp : Hyper-period of a periodic task set
Ji : Job number i, with i = 1...n
li : Laxity of a task Ti
m : Number of processors or resources available for job processing
Mi : Processor number i in a multiprocessor system consisting of m processors M1, M2, ..., Mm
mk : Scheduling makespan or length of the schedule
n : Number of jobs to be processed
na : Number of jobs accepted by the scheduling algorithm among the k jobs in the application τk, where na ≤ k and k = na + nj
Nacc : Number of accepted tasks or jobs
nj : Number of jobs rejected by the scheduling algorithm among the k jobs in the application τk, where nj ≤ k
Nrej : Number of rejected tasks or jobs
Pi : Period of a periodic task Ti
Rdi : Rejection delay of a task Ti
ResAp : Total amount of resources required to complete all the k jobs Ji=1...k of an application Apk
ResTi : Total amount of resources required to complete an instance of a hardware task Ti
Rjk(%) : Tasks rejection ratio
rtav, ftav : Average flow time or response time of a job sequence or task set
rti, fti : Response time or flow time of a scheduled task Ti
rttot, fttot : Total flow time or response time of a job sequence or task set
res : Computation load of an application τk relative to a given reconfigurable array of size Afpga = W × H
Res : Computation load or total amount of resources required to complete application τk
si : Starting time of a task or job
td : Absolute deadline of a set of tasks or an application
tD : Relative deadline of a set of tasks or an application
Ti : Task number i, with i = 1...n
trej : Rejection time of a task or job
Ufpga(%) : Utilization ratio of an FPGA of size Afpga = W × H on which an application Apk has been scheduled
URqm : Differential quality metric of a scheduling on a reconfigurable hardware device
U(t) : Time utilization factor of a set of tasks
U(t)Ti : Time utilization factor of task Ti
wi : Width of hardware task Ti
wr : Width ratio between a task and its fitting slot or cluster
wti : Waiting time of a scheduled task Ti
W, H : Width and Height of the reconfigurable hardware device (e.g. FPGA)
xi : X coordinate of hardware task Ti
yi : Y coordinate of hardware task Ti
α : Machine(s) or processor(s) environment
β : Application to process on the machine along with its constraints (e.g. time, precedence, etc.)
γ : The objective function of a scheduling
τ, τn : Set of tasks or jobs; set of n tasks or jobs
℘i : Priority of a task Ti
∀i : For all i
∃ : There exists
∩ : Intersection
∪ : Union
∈ : Belongs to

Acronyms

ADAS Advanced Driver Assistance Systems
ASIC Application Specific Integrated Circuit
ASIP Application-Specific Instruction set Processor
ASSP Application-Specific Standard Parts
BF Best Fit
BSF Biggest Size First
CAD Computer Aided Design
CDMA Code Division Multiple Access
CLB Configurable Logic Block
CMOS Complementary Metal Oxide Semiconductor
CPLD Complex Programmable Logic Device
DDR Double Data Rate
DRHW Dynamically Reconfigurable Hardware Device(s)
DPRHW / DPRHWs Dynamically and Partially Reconfigurable Hardware Device(s)
DPRHW-OS Operating System for Partially and Dynamically Reconfigurable Hardware
DRA Dynamically Reconfigurable Architecture
DSE Design Space Exploration
DSP Digital Signal Processor/Processing
EAAF Earliest Available Area First algorithm
EDA Electronic Design Automation
EDF Earliest Deadline First
EDGE Enhanced Data Rates for GSM Evolution
EEPROM Electrically-Erasable Programmable Read-Only Memory
ETIS Équipes Traitement de l'Information et Systèmes
FF First Fit
FPGA Field Programmable Gate Array
GPP General Purpose Processor
GPRS General Packet Radio Service
GPS Global Positioning System
GSM Global System for Mobile Communications
HDL Hardware Description Language
HLL High Level Languages
HPC High Performance Computing
HSCSD High Speed Circuit Switched Data
HSTL High Speed Transceiver Logic
HW / SW Hardware/Software
IC Integrated Circuit
IDE Integrated Design Environment
IP Intellectual Property
ISP Instruction Set Processor
IT Information Technology
ITRS The International Technology Roadmap for Semiconductors
JTAG Joint Test Action Group
KAMER Keeping All Maximum Empty Rectangles
LAN / WLAN Local Area Network / Wireless LAN
LUT Look-Up Table
LVCMOS Low Voltage CMOS
LVTTL Low Voltage Transistor-Transistor Logic
MAC Multiply-Accumulate
Mbps Millions of bits per second (megabits per second)
MBps Megabytes per second
MER Maximum Empty Rectangle
MIMO Multiple Input Multiple Output
MIPS Million Instructions Per Second
MMAC Millions of Multiply-Accumulate operations
MPSoC Multi-Processor System-On-a-Chip
MSDL Merge Server Distribute Load
NF Next Fit
NRE Non-Recurring Engineering
OFDM Orthogonal Frequency Division Multiplexing
OS Operating System
OS4RHD Operating System for Reconfigurable Hardware Devices
PC Personal Computer
PDA Personal Digital Assistant
PE Processing Element
PLD Programmable Logic Device(s)
QAM Quadrature Amplitude Modulation
QoS Quality of Service
RC Reconfigurable Computing
RFT Right First Time
RISC Reduced Instruction Set Computer
RSOC Reconfigurable System-On-a-Chip
RT Real-Time
RTOS Real-Time Operating System
SDR Software Defined Radio
SERDES Serializer/Deserializer
SFAF Smallest Fitting Area First algorithm
SIMD Single Instruction Multiple Data
SoC System-On-a-Chip
SoPC System-On-a-Programmable-Chip
SRAM Static Random Access Memory
SSF Smallest Size First
SSTL Series Stub Terminated Logic
TTM Time To Market
UART Universal Asynchronous Receiver/Transmitter
UML Unified Modeling Language
UMTS Universal Mobile Telecommunications System
VLSI Very Large Scale Integration
W-CDMA Wideband Code Division Multiple Access
WCET Worst Case Execution Time
1D-VSSH 1D Variable Size Slots Horizon algorithm
1D-VSSS 1D Variable Size Slots Stuffing algorithm
3G 3rd Generation radio communications system

Chapter 1

Introduction

Dynamically reconfigurable hardware devices such as FPGAs 1 are becoming the enabling technology for true hardware multitasking in embedded systems and high performance computing. Such devices are increasingly used in heterogeneous System-on-a-Chip designs. This thesis copes with the problem of online scheduling of real-time hardware tasks on reconfigurable hardware devices. As online real-time applications are targeted, scheduling algorithms are combined with appropriate placement strategies in order to keep scheduling runtime overheads low. Online real-time scheduling requires on-the-fly reconfiguration (partial or not) of the reconfigurable device in a reasonable amount of time, depending on the targeted application domain. This work targets dataflow-oriented embedded applications, as they are the most suitable for hardware implementation.

1 Field Programmable Gate Array

1.1 Precis of Embedded Systems and Research Rationale

The following summary is a plain-English introduction to embedded systems, given to address a wide audience, and in particular a résumé of embedded reconfigurable systems that relate to the area of research set out in this PhD thesis. Embedded electronic systems cover an extensive range of topics whereby dedicated programmable electronic systems are the key mechanisms in aircraft fly-by-wire systems, satellite navigation, vehicle engine management and entertainment systems, along with space vehicle control and communication, machine tool control, cameras, mobile telecommunications, robots, toys, etc. Today the number and variety of embedded system applications appear to be endless.
However, although there are a multitude of embedded systems in production, the challenges facing designers are significant: recent technological advances are pushing the boundaries of engineering, with high-speed digital design coupled with constraints on power consumption, the increasing complexity of information communication, and the increased demands of new compliance standards for improved interoperability and reliability.

Nowadays, as technology advancement in the semiconductor industry is driven by Moore's Law 2, a complete system with numerous functionalities can be implemented on a single integrated circuit chip. Such a system is denoted as a SoC 3, or an MPSoC 4 when it integrates more than one processor core (e.g. general purpose processors, domain specific processors, etc.), together with dedicated processing blocks (voice encoding/decoding, cryptography, etc.), memory blocks, input/output blocks, inter-block communication media and other analog blocks (RF blocks, ADC/DAC blocks, antenna, MEMS sensors, etc.), as pictured in figure 1.5, on page 14. One consequence of the complexity of designing modern embedded systems has been the introduction of new university courses in signal integrity engineering (as communication bandwidth is ever increasing) and embedded system architecture, especially the so-called SoC. Concurrent with the new undergraduate courses, there is a surge in novel postgraduate research programmes associated with embedded systems engineering.

An important area of embedded systems development is telecommunications, where the next generation of cellular telephony and broadband radio are seen as significant areas of research. Future telecommunication and radio systems will require a step change in architectural design, software complexity and associated computing power to achieve the improvements envisaged for the next generation of telecommunication functionality. In the past, these challenges have been addressed by adding dedicated digital signal processing hardware, in the form of ASICs 5, to conventional General Purpose Processors (GPPs) or programmable DSPs 6. Custom hardware such as an ASIC provides the processing performance required by a given application while consuming less power, whereas a GPP or a programmable DSP brings more flexibility to the system. However, whether designed as a SoC or not, this architecture is not seen as sufficient to meet the ever increasing demands of the embedded applications presented above.

2 Gordon Moore, the co-founder of Intel, predicted in the 1960s that the number of transistors on an integrated circuit would double roughly every two years (a figure often quoted as a doubling every eighteen months)
3 System-on-a-Chip
4 Multi-Processor SoC (System-on-a-Chip)
5 Application Specific Integrated Circuits
6 Digital Signal Processor

1.2 Raison D'Être for using Reconfigurable Hardware Devices in Embedded SoCs

One solution to the step-change requirements in embedded system architectural design is the evolution of SRAM 7-based PLDs 8 such as FPGAs. Although FPGAs were first introduced by Xilinx in the mid 1980s, the recently introduced devices are highly developed. By providing some form of post-fabrication hardware programmability, these devices are the enabling technology for reconfigurable computing. Today, as shown in figure 1.1, Xilinx shares the market lead with Altera in programmable solutions.
These companies are proposing architectural solutions that bridge the gap between general programmable processors and custom integrated circuits, and that provide the computational power and flexibility required for future systems. Furthermore, the advances in reconfigurable hardware technology (e.g. SRAM-based FPGAs and similar devices), in particular dynamic runtime reconfigurability, have brought new possibilities to embedded systems design. Indeed, dynamically reconfigurable hardware devices provide more flexibility and silicon area re-use in addition to their intrinsic parallelism (spatial implementation), as illustrated in figure 2.20 on page 62. However, such architectures raise new and challenging design issues. The principal areas of interest of this research can be summarized through the two main points listed below:

Figure 1.1: Programmable device market segment share in 2011 (Xilinx, Company reports).

7 Static Random-Access Memory (SRAM) is a type of semiconductor memory that does not need to be periodically refreshed, as it uses bistable latching circuitry to store each bit. The data are however lost when the memory is not powered.
8 Programmable Logic Devices

- The lack of embedded systems design methodologies leveraging the dynamic reconfigurability of FPGA-like reconfigurable hardware devices when they are incorporated in such systems in general, and especially when they are on the same silicon die (SoC).
- The need for an Operating System-like manager to cope with the scheduling and placement of hardware tasks in heterogeneous embedded systems (on a single integrated circuit or not) featuring such a reconfigurable fabric.

There are three main, though non-exhaustive, reasons for developing reconfigurable architectures:

1. The dynamicity of new embedded applications.
2. Market and manufacturing cost constraints.
3. Advances in reconfigurable hardware device technology.

1.2.1 Dynamic and Online Embedded Applications

Embedded computer systems offer ever more possibilities to interact with their environment (e.g. the user or another computer system in their neighborhood). Hence, embedded applications are becoming more dynamic, require more computation power to process all incoming or outgoing data, and increasingly operate in a safety-critical context. Hereinafter are a few, non-exhaustive examples of dynamic embedded applications.

Example of Software Defined Radio

Next generation mobile telecommunication terminals will require flexibility and high performance under low-power and size constraints. Indeed, they will be expected to be flexible in order to dynamically adapt to any wireless infrastructure, download and run applications offered by service providers, while providing reliable functionalities, such as Personal Digital Assistant (PDA), mobile phone, MP3 player, Global Positioning System (GPS), Audio/Video streaming, messaging, and other multimedia services available on future networks. To achieve this aim, mobile terminals will need architectures that are fast enough to run complex algorithms at high data rates, typically 10 Mbps to 100 Mbps (as pictured in figure 1.2, from Schüler and Tan, 2004), and flexible enough to accommodate various new standards and protocols. The Software Defined Radio (SDR) concept is nearly the ultimate stage of this evolution. A software radio implies an embedded computer system where the functionality of the mobile terminal should be defined in
software, so that it enables full programmability and adaptability on-the-fly.

Figure 1.2: Processing requirements for wireless access protocols (Schüler and Tan, 2004)

Consequently, such an 'all-in-one' device could be used everywhere in the world, giving global mobility without adding new hardware, regardless of the number of global wireless communication standards. Nevertheless, SDR requirements are stringent: flexibility entails a powerful reconfigurable computational capability with multifunctionality. Moreover, successful SDR systems will be required to operate under severe constraints (e.g. reduced power consumption and physical size, low cost, etc.). These systems are difficult to achieve with GPPs, programmable DSPs and ASICs, even though modern microprocessors and programmable digital signal processors are quite flexible and have benefited from fabrication advances and tool development over recent years. Processor technology has reached an unexpected plateau that contradicts Moore's law: manufacturers have found that increasing the processor clock speed and reducing the processor feature size leads to undesirable heat problems and unacceptable power consumption. Modern processor manufacturers have chosen to provide multiple processors within a single integrated circuit to sustain Moore's Law, but the software complexity of parallel processing has thwarted their progress. SDR designers have chosen to use dedicated hardware and coprocessors to provide high-performance, low-power systems, but the latter solution lengthens the design process while not providing the flexibility needed in software defined radio applications. To overcome these deficiencies in flexibility, a dedicated component, such as an ASIC, could be used for each telecommunication standard or protocol. With this approach, flexibility is achieved by switching from one component to another to meet the required standard or protocol. The main advantage is that high performance is easily achieved, since each custom component is optimized for a particular standard. However, with the proliferation of standards, this solution is not cost effective because of the resulting size of the silicon die, the cost of an integrated circuit being exponentially related to die size. Such an embedded system would also suffer from weight, power consumption and time-to-market disadvantages.

Example of electronics embedded in transportation systems

In transportation systems (e.g. aerospace, avionics, cars, railways, etc.), the functionalities assigned to the embedded computer are no longer limited to basic control and multimedia tasks. Indeed, the on-board computer increasingly consists of sophisticated systems performing complex real-time analysis of data collected by numerous sensors, for instance in order to assist pilots in critical flight situations. In any case, such systems are expected to be more and more intelligent, and require ever increasing computing capabilities allowing them to process the collected data in real time, in order to take rapid decisions while ensuring security and safety for the users. The on-board computer must drive these systems, which react in real time to dangerous and unexpected situations. Examples of such systems are radar systems, auto-pilot systems, collision-detection systems, communication systems, etc.
Example of an Automatic Target Detection and Tracking System

Automatic target detection and tracking is an important area of research in video processing, thanks to its great potential in military and civil applications. Examples of such application fields are navigation, security, robotics, vehicular communications, etc. One of the most challenging application fields is aerospace and defense, where there is a need to detect, recognize and track flying targets (e.g. missiles, drones, fighters, etc.). Target detection and tracking is a good example of a dynamic and computationally intensive application. It aims to detect, identify and track targets, mostly in infrared image sequences, with a given spatial and temporal resolution. The spatial resolution gives the size of each frame in the video (e.g. 640 x 512) and the temporal resolution gives the number of frames per second (e.g. 25 Hz for 25 fps). The dynamicity of the application stems from unpredictable spatial and temporal factors such as:

- the date and position of appearance of a target (or targets) in the visual field;
- the identity of the target(s) (type, shape, size and number of objects in a frame, etc.);
- their trajectories (direction and speed of each object, etc.).

The system consists of two major parts:

1. a static part which detects and recognizes objects in a video frame (detection sub-system);
2. a more dynamic part which tracks objects in the video sequence (tracking sub-system).

Detection aims to identify all the objects and connected components in the image. It uses functions such as Sobel and/or Canny edge detection, blob detection, surface filling-in, computation of the objects' surrounding rectangles, thresholding, erosion and dilation, etc. The tracking sub-system tracks the detected objects using the following steps:

- initializing the tracking algorithm with the objects' surrounding rectangles along with the previously built blobs;
- applying the tracking algorithm (e.g. CAMSHIFT) to the detected blobs;
- updating the size of the objects' surrounding rectangles and the list of connected components.

One key feature of such a system is that many objects in a video frame must be tracked concurrently, and the number of tracked objects may evolve over time. This makes the scenario even more dynamic. Therefore, powerful and scalable computation resources are needed. As the number of objects to pursue is unknown beforehand, one approach is to implement the detection (or static) part of the system in software, and to dynamically instantiate tracking functions on demand on dynamically reconfigurable hardware. In doing so, one tracking function block per target is implemented (as a thread) on the dynamically reconfigurable hardware, thus taking advantage of the flexibility and performance of such hardware (a minimal code sketch of this one-tracker-per-target organization is given at the end of this section).

Example in car design (e.g. a system detecting driver fatigue)

Cars are becoming better equipped with systems which improve global safety by providing more assistance to the driver. Driver safety is improved by systems capable of detecting driver sleepiness (e.g. Bandara and Hudson, 2006). In addition, pedestrian safety is improved by Advanced Driver Assistance Systems (ADAS). ADAS require a high computation capability to achieve vision functionalities (stereovision, pattern recognition, complex scene analysis) in a very short timeframe, in order to detect pedestrians and thus reduce the number of accidents.
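As announced above, the following C++ sketch illustrates the one-tracker-per-target organization: a software detection part spawns one tracking thread per detected target. It is only a toy illustration under stated assumptions, not the implementation developed in this thesis; detectTargets() and camshiftStep() are hypothetical stubs standing in for the detection sub-system and for one CAMSHIFT-like tracking iteration, which on a reconfigurable SoC would be delegated to a hardware task instantiated on the FPGA fabric.

```cpp
// Hypothetical sketch: one software thread per detected target, mirroring the
// "one tracking function block per target" scheme described above. All names
// and numbers are illustrative placeholders, not the thesis's implementation.
#include <iostream>
#include <thread>
#include <vector>

struct Rect { int x, y, w, h; };                 // a target's surrounding rectangle

// Stub for the static detection part (edge detection, blobs, thresholding, ...).
std::vector<Rect> detectTargets() {
    return { {10, 10, 32, 32}, {100, 80, 48, 48} };  // pretend two targets appeared
}

// Stub for one tracking iteration; returns the updated surrounding rectangle.
Rect camshiftStep(Rect window) {
    window.x += 1;                               // pretend the target moved right
    return window;
}

// One thread per target, mimicking one hardware tracking block per target.
void trackTarget(int id, Rect window, int frames) {
    for (int f = 0; f < frames; ++f)
        window = camshiftStep(window);           // would read the current frame here
    std::cout << "target " << id << " left off at x=" << window.x << "\n";
}

int main() {
    std::vector<std::thread> trackers;
    int id = 0;
    for (const Rect& r : detectTargets())        // number of targets unknown beforehand
        trackers.emplace_back(trackTarget, id++, r, 25);  // track over 25 frames
    for (auto& t : trackers) t.join();           // wait for all tracking threads
    return 0;
}
```

In a real system, each thread would stand in for a dynamically instantiated hardware tracker, so the number of concurrently configured tracking blocks, like the number of threads here, would follow the number of detected targets.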
In all the applications cited above, cost and power consumption constraints have to be added to the aforementioned safety (reliability, hard real-time sensitivity, high computing capability, etc.) and security constraints. In addition, image processing is involved in most of these applications. Thanks to its inherent parallelism, image processing has proven to be suitable for hardware implementation: most of the time, output pixels can be computed concurrently. Reconfigurable architectures are good candidates for implementing such applications. Indeed, they suit applications where huge amounts of data (image, sound, etc.) are processed in parallel and periodically. In addition, they provide an upgradability, or hardware programmability, which increases the system lifetime and therefore decreases the non-recurring engineering costs.

1.2.2 Technology Advances, Market and Costs Constraints

1. Reducing non-recurring engineering (NRE) costs

Current and future research trends in embedded System-on-a-Chip (SoC) design are dictated by key factors such as market demands and technology advances. New research trends on reconfigurable hardware devices (especially FPGAs) result from these key factors, as these devices are currently seen as a solution for successfully designing complex embedded systems, in particular systems subject to the stringent constraints of the above-mentioned applications. The current market for wireless handsets is driven by factors such as price, flexibility, functionality and mobility; put simply, a handset has to be lightweight, all inclusive and with extended battery life. Moreover, consumer electronic products such as wireless handsets have relatively short life cycles. The merciless competition between mobile service providers and handset designers conspires to ever decrease prices and increase functionality. Consequently, the main challenge is to reduce production costs while providing more functionality and flexibility to consumers. When designing new products, shortening Time-To-Market (TTM) and increasing Right-First-Time (RFT) are important aspects that reduce non-recurring engineering (NRE) costs, as the NRE costs of integrated circuit (IC) design are rocketing today. Many research projects in Electronic Design Automation (EDA) mainly aim to achieve these two aforementioned purposes (short TTM and high RFT). For example, FPGA-based platforms have been promoted for the rapid prototyping of complex ASICs. Rapid prototyping allows designers to shorten the design process, to prevent design failures and to avoid costly redesign, by validating their designs earlier and by prototyping their concepts within a reduced timeframe. Furthermore, designers are incorporating reconfigurable hardware in their designs, leading to cost-effective products and enabling the release of new products without re-manufacturing the integrated circuit chip. FPGA-based
Although FPGAs have mainly been used for rapid prototyping and glue logic purposes, they are now challenging programmable DSP processors by embedding hardwired DSP blocks, such as multipliers and distributed memories. Also, processor cores are currently embedded within FPGA structures (figure 2.16 on page 50). In so doing, an embedded hard-core or soft-core processor allows the designer to combine on a single FPGA all the benefits provided by a Von Neumann architecture with the parallelism provided by a spatial implementation. What is more important, and a pivotal consideration in this research, are the reconfiguration capabilities of Dynamically Reconfigurable Hardware Devices like FPGAs, which allow them to bridge the gap between hardware and software implementation platforms (figure 2.20 on page 62). Even though programmable and dedicated DSPs remain the traditional implementation platforms for digital signal processing applications, many studies and demonstration platforms (e.g. Blyler, 2005; Petersen, 1995; Tessier and Burleson, 2001) have shown that using reconfigurable hardware devices, such as FPGAs, provides the best trade-off between flexibility, performance and power consumption, especially in digital signal processing applications. FPGA programmability brings a flexibility lacking in ASICs, while the FPGA's spatial structure is well suited to the intrinsic data parallelism found in DSP functions. Thus, all these factors point the way forward for the FPGA to provide a significant performance improvement over traditionally implemented embedded system platforms.

Current advances in semiconductor technology allow FPGAs to integrate more than 10 million gates on a single chip while running at a relatively low frequency (e.g. 500 MHz). For instance, Achronix Semiconductor Corp. announced (in fall 2010) strategic access to Intel Corporation's 22nm process technology, and plans to develop the most advanced FPGAs, the Achronix Speedster22i FPGA family; the device will provide more than 2.5 M LUTs, equivalent to an ASIC of over 20 million gates. Altera Stratix V FPGAs as well as Xilinx Virtex-7 FPGAs, both manufactured at 28 nm process technology, provide more than one million LUTs, which is more than double the logic and memory available in Stratix or Virtex-II FPGAs. One million LUTs are enough to instantiate on the FPGA tens of soft-core CPUs along with the required peripherals. Figure 1.3 depicts this increase in LUT density over the last decade. In consequence, unlike for the microprocessor, Moore's law will still drive the FPGA market for the near future. One example is the Stratix V FPGA from Altera, manufactured at 28 nm process technology, announced in January 2010 and planned to be available in 2011 (Altera, 2010a,b). Moreover, today complete systems are integrated on a single chip. This so-called System-on-a-Chip (SoC) allows designers to integrate on a single silicon die one or more embedded processor cores executing software, in addition to classical hardware such as ASIC blocks, In/Out devices, memory, Intellectual Property (IP) blocks and FPGA programmable hardware. Nonetheless, SoC design requires new design paradigms and represents an active area of embedded system research. The SoC design approach reduces the total silicon size required for the system, and correspondingly its cost. An exciting prospect is that the inclusion of a reconfigurable FPGA within a SoC could provide solutions to some of the demanding requirements of future embedded system applications.
Figure 1.3: The increase in logic density in FPGAs over one decade and over the corresponding process technology (Koch and Torresen, 2010).

Fortunately, the latest trends in reconfigurable hardware device technology show improvements in capacity (decreasing transistor size and consumption), performance (increasing clock frequency) and reconfigurability (partial, on-the-fly; see chapter 2, section 2.5 from page 33).

To summarize, using reconfigurable hardware in embedded systems design could reduce design costs by lowering non-recurring engineering costs and by providing a computation platform combining flexibility and performance.

1.3 Related Research Issues

1.3.1 System-On-Chip Design Overview

As stated previously, the systems to be designed are becoming increasingly complex in order to satisfy application demands. In addition, technology allows huge systems to be integrated on a single chip, making their design more difficult. Such a so-called System-on-a-Chip (SoC) creates a growing need for new design paradigms. Because of the increasing complexity, new design methodologies emphasize system-level design. The design flow can be divided into three main levels (or fewer), as shown in figure 1.4 and summarized below:

1. System Level (figure 1.4)

System performance is evaluated early and various partitioning decisions are made. While engineering a new system, the first specifications of the application are written at a high level of abstraction according to the system requirements. At that level, abstract models and templates of different implementation technologies from different vendors are first included in order to pre-define the architecture, to take partitioning decisions and to evaluate mapping needs for each target. System-level simulation then allows the designer to evaluate, in a reasonable amount of simulation time, the performance impact of different architecture alternatives. The system architecture is refined by providing the system-level simulation with cycle-accurate models of the architectural blocks and the requirements of the application. In the case of a SoC with a reconfigurable part, the reconfigurable hardware is seen as one computing resource among others, like instruction-set processors or dedicated processors. The reasons for incorporating reconfigurable parts are expressed through system requirements (e.g. a given mix of reconfigurability, upgradability and performance). So the impact of scheduling, placement and reconfiguration overhead on the global performance of the system is also evaluated. Academic and commercial research groups are proposing various system-level co-design methodologies and tools for a few application domains (e.g. wireless communication and multimedia systems).

Figure 1.4: Main levels in the generic design flow of a SoC

2. Design Level (figure 1.4)

The specification is refined and the verification process customized to suit the chosen implementation technologies and related design tools. The different parts of the system (reconfigurable and fixed hardware, software) are individually designed and then integrated in a single model in order to be model-checked by the previously customized verification process. At this level, all details of the implementation platforms are needed (processors, memories, external IPs, FPGAs, etc.)
and vendor design and simulation tools are used. For example, if dynamically reconfigurable hardware (DRHW) is involved, the FPGA technology along with its simulation and emulation tools is chosen during specification refinement with respect to the reconfigurability needed (partial, dynamic, etc.). Vendor tools are used during the integration and co-verification steps.

3. Implementation Level (figure 1.4)

Here the whole design is implemented using commercial and technology-dependent tools. The fixed hardware part (ASIC) is also manufactured, and the final test is performed on the final product for its qualification.

1.3.2 Reconfigurable System-On-Chip Design

As stated above and illustrated in figure 1.4, designing an electronic system (on chip or not) always follows a number of steps ranging from system-level specification to final implementation (including mask generation for ASIC design). Over the years, EDA (Electronic Design Automation) tools have evolved, sometimes providing push-button processes to design electronic systems. However, as SoCs are becoming more heterogeneous (figure 1.5) by integrating reconfigurable hardware devices in addition to sequential processors and dedicated hardwired components, designers are facing new computing and programming paradigms. Consequently, new automatic methods for mapping algorithms onto the platform are needed, in order to take into account the additional flexibility brought by reconfigurable hardware devices. Indeed, such a platform allows the system to modify its hardware functionalities at run-time (on-the-fly) following the changing needs of applications. As for any ASIC design, a post-manufacture failure would be very costly; one has to guarantee the expected behaviour of the system before manufacturing (Right-First-Time). Unfortunately, dynamic hardware reconfigurability in addition to the traditional software flexibility leads to systems with less predictable behaviour, making the integration and validation of the software parts (running on programmable processors) and the hardware parts (running on dedicated and reconfigurable hardware) more challenging. Current tools do not enable an accurate assessment of the dynamicity brought by both the software and the reconfigurable hardware. Hence, design methodologies along with tools still have a long way to go before fully exploiting the potential capabilities of reconfigurable hardware devices.

As illustrated in figure 1.5, a SoC containing one or many reconfigurable hardware fabrics is denoted as an RSoC (Reconfigurable System-on-Chip). SoCs are turning into Multi-Processor SoCs (MPSoCs) integrating on a single chip many programmable and dedicated processors, customized blocks (voice encoder/decoder, cryptography, etc.), memory blocks, I/O blocks, various communication media (hierarchical buses, Networks-on-Chip, etc.), Analog-to-Digital and Digital-to-Analog Converter (ADC/DAC) blocks, Radio Frequency (RF) front ends, heterogeneous blocks like sensors and Micro-Electro-Mechanical Systems (MEMS), and sometimes reconfigurable hardware blocks like FPGAs (figure 1.5).

Figure 1.5: Generic architecture of a Reconfigurable System-On-Chip.

In order to refine the aims and objectives of this research, a few additional challenges brought by the increasing use of DRHW (Dynamically Reconfigurable Hardware Devices, e.g. FPGAs) in SoC design are discussed below. These issues arise at different stages of the design process.
1. System specification and Hardware/Software Co-design

Obviously, using DRHW adds more complexity to traditional hardware/software integration and verification. Emerging hardware/software co-design methodologies allow designers to specify and refine their systems in unified environments (Ptolemy, Polis) and languages (SystemC (www.systemc.org), SystemVerilog (www.systemverilog.org), Celoxica (2000), ImpulseC (www.impulseaccelerated.com), etc.). As illustrated in figure 1.6, hardware/software co-design shortens the design time by enabling concurrent design of the hardware and software parts of the system. Unlike in the traditional design methodology, the trend today is to delay the partitioning stage of the design, so that a task can be moved from hardware (dedicated or reconfigurable) to software and conversely, keeping the design as flexible as possible throughout the design process. Previously, hardware/software partitioning was clearly defined as a spatial (or functional) partitioning. But the advent of DRHWs has provided the motivation for new research in hardware/software partitioning. Indeed, dynamically reconfigurable hardware has filled the gap between hardware and software, making the partitioning process both temporal and spatial (e.g. Chehida and Auguin, 2002; Kaplan et al., 2003).

Figure 1.6: Hardware/Software Co-Design shortens the design process (Fujitsu, 2002)

As stated above, at a time when Moore's law is becoming less profitable for microprocessor improvement, it becomes meaningful to increase performance by adding more reconfigurable hardware along with distributed memories and by using pipelining. Hence, like most research challenges in embedded SoC design which tend to leverage silicon technology advances, FPGA research also falls within the global problem known as 'the productivity gap'. This gap arises because, thanks to Moore's law, integration technology is growing faster than the ability of engineers and design tools to benefit from it, with chip capacity doubling roughly every eighteen months. As shown in figure 1.7, the gap between the number of logic gates that can be integrated on a single chip and the number of logic gates designers can actually integrate in their designs using existing design methodologies and tools increases over the years at about the rate predicted by Gordon Moore.

Figure 1.7: Productivity gap according to ITRS (The International Technology Roadmap for Semiconductors, www.itrs.net)

2. C to hardware compilation

Unlike microprocessor implementation using high-level C-like languages, mapping algorithms onto fine-grain reconfigurable hardware (e.g. FPGAs) is still a complex and almost manual task. Indeed, this is done using low-level Hardware Description Languages (HDLs). The designer must specify the design almost at bit level, reminiscent of the early days of microprocessor programming. Taking the FPGA as an example, since it remains the finest-grain reconfigurable hardware device, its design process is quite similar to ASIC design. In addition, compiling an algorithm to target an FPGA is noticeably complicated since an FPGA, unlike a microprocessor, does not have an instruction set. To overcome those problems and exploit the overriding FPGA advantages, numerous researchers (e.g. Athanas and Silverman, 1993; Gokhale and Stone, 1998; Gokhale et al., 2000; Tripp and Hutchings, 2002) and companies (e.g.
MathWorks, www.mathworks.com) have addressed High-Level Languages (HLLs) and their compilers to target reconfigurable hardware. One example is the Simulink HDL Coder from MathWorks, which generates bit-true, cycle-accurate, synthesizable Verilog and VHDL code from Simulink models and MATLAB code; the generated HDL code can be simulated and synthesized using industry-standard tools and then implemented on FPGAs and ASICs. The main advantage of this so-called Model-Based Design approach is that it relies on the same model throughout the design, enabling FPGA-in-the-loop co-simulation and accelerating verification and time-to-market. Furthermore, another solution consists of designing coarse-grained reconfigurable architectures, which are easier to target with High-Level Languages, as detailed in the next paragraph.

3. Need for coarse-grained reconfigurable architectures

To overcome some of the previously discussed limitations, new coarse-grain reconfigurable architectures with limited configurability compared with FPGAs have been investigated. Coarse-grained architectures are customized to suit a specific class of applications. In such a so-called word-level granularity architecture, the reconfiguration is done at the functional level (instead of the gate level as in FPGAs). The reconfigurable processor in Schüler and Tan (2004), consisting of an array of configurable and interconnected arithmetic logic units (ALUs) and enabling a high level of parallelism, is a good example of this type of architecture. Such application-oriented architectures are not as well suited as fine-grained FPGAs to implementing a multipart function at the bit level. However, coarse-grained reconfigurable hardware saves a significant amount of circuit area, power and reconfiguration overhead compared to a traditional FPGA-based system. As discussed in Gokhale and Graham (2006) and Ebeling et al. (1996), the trend is to develop high-level languages and dedicated compilers for such architectures, with the hope that they will be easier to target than fine-grain architectures such as FPGAs. Later, in Chapter 2, these concepts and architectures are briefly presented alongside their associated research areas.

1.3.3 Operating System for Reconfigurable System-On-a-Chip

As stated in previous sections, many of today's embedded applications, such as signal and image processing for mobile telecommunications, wireless multimedia, automotive electronics, avionics and robotics, demand increasingly dynamic performance and variable functionality. To meet these needs, FPGA developers provide large-scale devices that allow embedded system designers to consider the use of an FPGA as a computing resource at the same level as a microprocessor or a programmable DSP processor. A consequence of FPGA runtime partial reconfiguration and embedded dedicated blocks is the need for a resource manager acting like an Operating System. This resource manager must be capable of efficiently managing both the microprocessor and the reconfigurable hardware part of the architecture (e.g. implemented on an FPGA), which run software tasks and hardware tasks respectively. In a traditional Operating System (OS), software tasks are sequentially created, run, pre-empted and/or deleted, allowing them to sequentially use a single microprocessor on a time-shared basis.
In the case of an FPGA/ISP (Instruction Set Processor) architecture, the OS may be viewed as an additional abstraction layer which hides the details of the underlying microprocessor and reconfigurable hardware from the software designer (e.g. Steiger et al., 2004). By extending this principle to reconfigurable hardware, an OS can provide a hardware abstraction layer which eases the sequential and spatial implementation of hardware tasks on reconfigurable blocks (e.g. Mignolet et al., 2003). This allows multitasking on a reconfigurable hardware platform. Therefore, in a reconfigurable system with both hardware and software tasks, the OS concept is extended to an Operating System for Reconfigurable Systems. As stated in Nollet et al. (2003), the principal aim of an Operating System for a Reconfigurable SoC is to simplify the design of reconfigurable hardware platforms and to manage the associated computing resources efficiently. What is exciting are the new research areas opened up by Operating Systems for RSoC and, more broadly, by reconfigurable computing system design. Resource allocation, HW/SW partitioning, task scheduling, placement, routing, pre-emption and migration (hardware/software, hardware/hardware) are a few of the challenges involving numerous groups (e.g. Compton et al., 2002; Walder and Platzner, 2002; Noguera and Badia, 2004) investigating new reconfigurable FPGA/ISP architectures along with their programming models and implementations.

1.4 Contribution of the Thesis

There are two major contributions in this thesis:

1.4.1 Algorithms for Online Real-Time Scheduling & Placement

As just stated above, designing an operating system for reconfigurable systems is a key challenge in RSoC design, as it raises a number of issues. Among these issues, this thesis focuses especially on algorithms suited to the scheduling and placement of online real-time hardware tasks on dynamically reconfigurable architectures. Hardware task scheduling and placement aim to efficiently and dynamically schedule and place modules on dynamically reconfigurable hardware devices. Therefore, the algorithms always seek the best trade-off between two sometimes conflicting objectives: the algorithm's runtime overhead and the scheduling/placement quality in terms of chip utilization ratio, chip fragmentation, task rejection ratio, makespan, etc. To enable multitasking and hardware virtualization on partially reconfigurable hardware devices, reconfiguration time overheads and area fragmentation are among the issues to overcome. The algorithms proposed in this thesis are mainly applied to partially reconfigurable hardware devices. In the proposed methodology, the runtime overheads of the scheduling algorithms are sometimes evaluated on real platforms featuring an embedded processor, instead of desktop or laptop computers. For example, two algorithms (Cui and Deng, 2007; Handa and Vemuri, 2004c) are implemented on a MicroBlaze soft-core processor instantiated on a Xilinx Spartan-3E FPGA. In this way, as the methodology relies on more accurate experiments, the online real-time constraints are more likely to be accurately assessed. Different models of the elements involved (reconfigurable fabric, scheduler, placer, application) are also proposed and discussed in Chapter 4. However, these models do not consider inter-task communication or communication between the CPU part and the reconfigurable hardware device, as communication is beyond the scope of this thesis.
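To make the quality criteria above concrete, the following minimal C++ sketch shows how such metrics could be computed from an execution trace. The record layout and the formulas (e.g. utilization measured as busy area-time over the makespan) are illustrative assumptions, not the exact definitions given in Chapter 4.

#include <algorithm>
#include <cstdio>
#include <vector>

struct TaskRecord {
    double arrival;            // arrival date of the request
    double start, finish;      // scheduled execution window (if accepted)
    double area;               // fraction of the reconfigurable area used
    bool rejected;             // true if no feasible placement was found
};

struct Metrics { double rejectionRatio, utilization, makespan; };

Metrics evaluate(const std::vector<TaskRecord>& trace) {
    if (trace.empty()) return {0, 0, 0};
    double rejected = 0, busyAreaTime = 0;
    double firstArrival = trace.front().arrival, lastFinish = 0;
    for (const auto& t : trace) {
        firstArrival = std::min(firstArrival, t.arrival);
        if (t.rejected) { ++rejected; continue; }
        busyAreaTime += t.area * (t.finish - t.start);
        lastFinish = std::max(lastFinish, t.finish);
    }
    const double makespan = lastFinish - firstArrival;
    return { rejected / trace.size(),
             makespan > 0 ? busyAreaTime / makespan : 0,
             makespan };
}

int main() {
    std::vector<TaskRecord> trace{ {0, 1, 5, 0.25, false},
                                   {2, 0, 0, 0.50, true} };
    Metrics m = evaluate(trace);
    std::printf("rejection=%.2f utilization=%.2f makespan=%.1f\n",
                m.rejectionRatio, m.utilization, m.makespan);
}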
1.4.2 Scheduling & Placement Algorithms for OS-driven Design Space Exploration

System-level co-design methodologies are widely accepted as an approach to overcoming the increasing complexity of SoC design. As an Operating System aims to manage all the resources of a given platform, the architecture definition can be done from an OS perspective. An OS-driven methodology for RSoC design has been proposed in Miramond et al. (2009a). Initiated by the French national research council, the OveRSoC project (Miramond et al., 2009a) aims to develop a complete model of an RSoC platform in order to investigate the services a real-time Operating System (RTOS) for RSoC should provide. Such an Operating System for RSoC should manage the processing resources on the chip, including the dynamically reconfigurable hardware parts of the RSoC. Consequently, as dynamically reconfigurable hardware is involved, hardware task scheduling and placement, hardware context switching and hardware/software task migration are pivotal to this research. Our contribution to the OveRSoC design methodology provides DRHW models and a set of scheduling and placement algorithms. Indeed, at system-level simulation, while performing architecture definition and mapping (figure 1.4, page 12), these models and algorithms help in exploring various mapping solutions and in tuning the system partitioning accordingly. As methodologies such as OveRSoC rely on a SystemC-based simulation engine, the model descriptions and most of the scheduling and placement algorithm implementations in this work are done in C++.

1.5 Outline of the Thesis

This first chapter has presented a snapshot of embedded systems that use reconfigurable hardware devices like FPGAs. Put simply, this introduction has striven to explain how state-of-the-art reconfigurable hardware devices can be used as processing resources in embedded systems to enhance their performance; however, their use raises significant new issues, especially in embedded RSoC design. These issues range from hardware/software partitioning to hardware task scheduling and placement. The main focus of this thesis is on the latter problem, with an emphasis on the online real-time context. There are two main aspects of RSoC design to be highlighted as the contribution of the thesis:

1. The first aspect acts at runtime, where appropriate algorithms for online real-time scheduling and placement of hardware tasks on dynamically and partially reconfigurable hardware devices are required. Novel methods for improving the quality of the online real-time scheduling and placement of hardware tasks on reconfigurable hardware devices are proposed and assessed.

2. The second aspect acts at design or compilation time, and consists of feeding a given system-level design methodology with scheduling and placement algorithms for dynamically reconfigurable architectures. This thesis assumes that the Design Space Exploration (DSE) of the RSoC is OS-driven. Several scheduling and placement algorithms are implemented in addition to the newly proposed ones. The suitability of these algorithms for a C++/SystemC-based RSoC design methodology (e.g. the OveRSoC methodology) is established; a minimal interface sketch follows this list.
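As a hedged illustration of what a library entry point could look like in C++, the sketch below defines a common interface that each scheduling/placement algorithm might implement so that a SystemC-based simulator can swap algorithms during design space exploration. All names are illustrative assumptions, not the actual OveRSoC interfaces.

#include <cstdint>

struct HwTask {
    int id;
    std::uint32_t width;               // columns required (1D area model)
    double arrival, exec, deadline;    // timing parameters of the request
};

struct Placement {
    std::uint32_t column;              // leftmost column of the chosen slot
    double startTime;                  // scheduled start of execution
};

class SchedulerPlacer {                // one subclass per library algorithm
public:
    virtual ~SchedulerPlacer() = default;
    // Returns true and fills 'p' when the task can be placed in time to
    // meet its deadline; returning false means the task is rejected.
    virtual bool schedule(const HwTask& t, Placement& p) = 0;
    virtual void release(int taskId) = 0;  // free the area when a task ends
};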
The rest of the thesis consists of six chapters, organized as follows.

Chapter 2 gives a comparative presentation of the different implementation architectures for signal processing, in order to situate dynamically reconfigurable hardware devices among them. The chapter concludes with an explanation of how the above-mentioned second aspect of this research can be applied to a Platform-Based Design approach.

Chapter 3 starts with a theoretical background on scheduling problems. In order to point out the similarities between scheduling on microprocessors and scheduling on reconfigurable hardware devices, the chapter first presents both problems. It then emphasizes the specific challenges raised by the scheduling of hardware tasks on dynamically and partially reconfigurable arrays. The chapter ends with a literature review on placement strategies and the resulting reconfigurable hardware fragmentation.

Chapter 4 relies on the previous chapters and on preliminary experiments to set out the proposed methodology for scheduling and fitting online real-time hardware tasks on DPRHWs (dynamically and partially reconfigurable hardware devices). Based on these accurate experiments, the chapter demonstrates the need for a trade-off between the scheduling algorithm and the underlying placement strategy. The chapter emphasizes the importance of providing more than one size and/or shape per hardware task. In addition, Chapter 4 presents the different models and related metrics used to assess the performance of the aforementioned scheduling and placement algorithms. The end of the chapter discusses how these models and metrics can be used at the system-level design stage with any SystemC-based SoC architecture exploration (e.g. the OveRSoC methodology for Design Space Exploration).

Chapter 5 is an in-depth study of the designed algorithms that deal with the online real-time scheduling of hardware tasks. The proposed multi-shape scheduling algorithms are backed up with novel reconfigurable area management strategies which are suitable for online looking-ahead scheduling algorithms. Throughout the chapter, whenever possible, the suitability of the proposed scheduling algorithms for online real-time problems is emphasized.

In Chapter 6, the performance of the scheduling algorithms and placement strategies proposed to schedule online real-time tasks on DPRHWs is presented and discussed using simulation results. The latter are expressed through metrics such as the algorithm runtime overhead, the task rejection ratio, the reconfigurable hardware utilization ratio and the scheduling makespan.

Chapter 7 concludes the thesis. The work presented in the thesis is reviewed, highlighting the contribution to the body of knowledge. The results from all the experiments are summarized to establish the primary hypothesis of the thesis regarding the online real-time scheduling of hardware tasks on DPRHWs. This chapter also suggests possible future directions for the research, and emphasizes the RSoC design challenge aspect.

Chapter 2

Dynamically Reconfigurable Architectures vs Implementation Alternatives

2.1 Introduction

The previous chapter introduced recent advances in embedded digital signal processing (DSP) applications and discussed various design challenges brought by these advances. The chapter also pointed out the exponential increase in the bit rates required by current and future communications applications (up to 100 Mbit/s on 4G wireless channels).
The chapter also presented how, through FPGAs, the development of reconfigurable architectures is bringing new considerations into embedded SoC design. This chapter will present the best-known digital architectures used to implement Digital Signal Processing functions in embedded systems, along with their distinctive features. The chapter briefly discusses the assets and drawbacks of each architecture, according to criteria such as time-to-market, performance, price, silicon area, development ease, power consumption and flexibility. Processor platforms for DSP implementation can be classified into two main families:

1. Software or sequential processors

2. Hardware or parallel processors (sometimes used as co-processors).

However, the new trend is to combine both sequential and parallel processors in an appropriate proportion in order to achieve a given goal. The chapter will also point out the suitability of reconfigurable computing systems for future embedded System-on-a-Chip applications. The chapter ends with an in-depth presentation of reconfigurable hardware architectures, with an emphasis on FPGA architecture.

2.1.1 The Switch from Analog to Digital Signal Processing

In its early days, Signal Processing was performed using analog circuits. These circuits processed signals in their continuous form. They consisted of passive (resistors, capacitors, diodes, inductors, etc.) and active (transistors, integrated circuits, etc.) components. But thanks to advances in digital circuit design in general and microprocessor technology in particular, Signal Processing progressively moved to the digital domain. Today, Digital Signal Processing is used in a wide range of applications (wireless communications, medical imaging, avionics, automotive, etc.). Figure 2.1 illustrates the basic principle of a DSP system, in which an analog input signal is first converted to a sequence of numbers, which are digitally processed to achieve a given purpose. The numeric results are then converted back to produce the desired analog output.

Figure 2.1: A simplified representation of a Digital Signal Processing System

Processing digital data brings many advantages over analog signal processing. In particular, digital data are easier to manipulate (multiplexing, filtering, compression, error detection and correction, arithmetic operations, etc.). Indeed, programmable processors provide all the flexibility of software programming. However, the frequency range of DSP processors is limited compared to analog circuits because of the frequency limitations of the Analog-to-Digital Converter (ADC) and the Digital-to-Analog Converter (DAC). In addition, DSP solutions tend to be more complex and to consume more power.

2.1.2 The most common DSP Functions

Nowadays, DSP applications are ubiquitous. They can be found in many consumer devices including mobile handsets (such as mobile phones, personal assistants, digital cameras, GPS, etc.), TVs, DVD players, games consoles, etc. Hence, the main applications of DSP are audio signal processing, digital image processing, speech processing and digital communications. The most common DSP functions found in those applications are listed below:

- Filtering
- Transforms (e.g. Bilinear, Fast Fourier, Discrete Cosine)
- Convolution
- Modulation and demodulation (e.g. MIMO, QAM, etc.)
- Multiplexing and demultiplexing
- Signal generation

2.1.3 Software vs Hardware Platforms

DSP algorithms can be implemented in embedded electronic systems using different platforms. Choosing a given platform depends on the design trade-offs to be achieved, such as performance, power and price. The most common implementation platforms for DSP functions are:

- General Purpose Processors (e.g. Pentium, PowerPC) and Microcontroller Units (e.g. ARM Cortex-M3); RISC architectures are the most widely used.
- Programmable Digital Signal Processors (DSPs), which are microprocessors oriented towards digital signal processing.
- Application-Specific Integrated Circuits (ASICs), which are dedicated components optimized to run specific DSP functions at the best performance and the lowest power consumption.
- Programmable hardware devices (e.g. FPGAs or PLDs), which are components capable of implementing any DSP algorithm in a parallel way and which provide a kind of hardware programmability.
- Other variants such as Application-Specific Instruction set Processors (ASIPs) or Application-Specific Standard Parts (ASSPs), coarser-grain reconfigurable devices and RISC/GPP architectures. These variants combine the specific implementation platforms cited above in order to overcome some of their drawbacks.

These implementation platforms can be globally divided into two main families, depending on how computations are performed: software (or sequential) implementation and hardware (or parallel) implementation.

2.2 Software Implementation Platforms

Software implementation platforms are based on the Von Neumann machine, in which a single instruction (or operation) is performed at a time, as shown in figure 2.2(a). General Purpose Processors (GPPs), Microcontrollers (MCUs) and Digital Signal Processors (DSPs) are such platforms. They are denoted as sequential-architecture processors since, in principle, each operation is executed in sequence on a single Arithmetic and Logic Unit (ALU) circuit controlled by an instruction memory (figure 2.3). By changing the content of the instruction memory in an appropriate way (using the instruction set of the processor), almost any function or algorithm can be implemented as a sequence of basic operations (figure 2.2(a)). As shown in figure 2.2(b), each instruction is executed following the traditional Von Neumann computer cycles of Instruction Fetch (IF) and Decoding (D), operand Read (R), instruction Execute (EX) and result Write-back (W). Hence, five cycles (one at a time) are needed to complete a single instruction, in contrast with parallel architectures. Sequential processors are thus totally flexible and capable of handling tasks from a very wide range of applications. Depending on the application, this full flexibility may come at the expense of power consumption and/or low performance or quality of service. Hence, trade-offs between the flexibility and performance of sequential processors have led to several architectures, each optimized for a given goal or a given class of applications.

2.2.1 General Purpose Processors (GPPs)

A GPP is the most popular sequential processor. It is a single integrated circuit that mainly contains a Central Processing Unit (CPU). As shown in figure 2.3, a GPP relies on a Von Neumann architecture that consists of a CPU (hosting a Control Unit and an Arithmetic and Logic Unit), a single central memory which holds both instructions and data, and an Input/Output unit for external communication.
The Control Unit (CU) extracts instructions from memory, decodes and executes them, calling on the Arithmetic and Logic Unit (ALU) if necessary in order to perform arithmetic and logical operations on the read operands, and then writes the results back to data memory.

Figure 2.2: Sequential execution. (a) a single operation at a time (b) sequential execution (c) pipelined execution providing higher throughput.

Figure 2.3: The Von Neumann architecture.

One limitation of the Von Neumann architecture is known as the Von Neumann bottleneck, whereby the throughput (data transfer rate) between the CPU and memory is limited compared to the amount of available memory. On one hand, CPU processing speed and memory size have constantly increased following Moore's law. Pentium and PowerPC processors are a few examples of such architectures that can easily be clocked at 2 GHz or more. Hence, they are used in systems such as personal computers, workstations or any other application where power consumption is not a predominant concern. On the other hand, the Von Neumann architecture derives less benefit from Moore's law, as both data and instructions are accessed in memory through the same port, as illustrated in figure 2.3, thus limiting the transfer rate. This architecture has evolved over the years to reduce this bottleneck.

One way of reducing the Von Neumann bottleneck is to store data and program in two separate memories. This is the philosophy behind the Harvard architecture, which enables concurrent instruction and data access and raises the bandwidth between the CPU and the memory. Nowadays, the Harvard architecture (or a modified Harvard architecture) is used in DSP processors and most microcontrollers. This architecture eases Instruction-Level Parallelism (ILP, the so-called pipelining), as some of the five cycles (cited above and illustrated in figure 2.2(b)) can be performed concurrently if they do not belong to the same instruction. For instance, the Instruction Fetch (IF) and operand Read (R) cycles need to access the instruction memory and the data memory respectively. As depicted in figure 2.2(c), if they belong to two different instructions, these two cycles can be run concurrently thanks to the fact that the instruction memory and the data memory have separate access ports. Even the Execute (EX) cycle of a third instruction can be running at the same time. This overlapping of instruction cycles is limited by dependencies between instructions, especially when a given instruction is fed by data resulting from another one. Pipelining adds more parallelism to the architecture. The resource (silicon die) utilization ratio and the instruction execution throughput are increased accordingly. To summarize, in a GPP, very high performance at high clock rates comes at the cost of power consumption.

Microcontroller units (MCUs) integrate on a single chip a processor core, memory, and programmable input/output peripherals. They are processors tailored for control purposes. Indeed, increasing the clock rate of CPU-based architectures to meet the rising requirements of embedded applications will never suffice. As previously stated, CPU-based architectures can easily run at high clock speeds and provide a flexibility that allows them to handle any signal processing function. But they are not suitable for embedded applications because of the resulting power consumption.
Thus, many vendors have designed lower-speed MCUs (compared to GPPs) in order to target power-aware embedded applications. As they are control-oriented, MCUs are capable of running embedded real-time operating systems. The new trend, however, is to derive many sub-families of MCUs, where each sub-family offers specific extensions for a given application domain. For example, some MCUs provide a good system trade-off for basic DSP applications by coupling some DSP functionality with their CPU; their instruction set is therefore enriched with dedicated DSP instructions. The ARM Cortex-A8 is an example of a processor where the ARM NEON SIMD (Single Instruction Multiple Data) accelerator is attached to the CPU. Even with these improvements, MCUs are far from meeting the processing requirements of highly intensive DSP applications, since they are not designed for this purpose. One example among many is their numeric accuracy, which is usually very low and rarely reaches 32 bits. This is not sufficient for most embedded DSP applications. Multiprocessor architectures are another way of increasing the performance level by adding more parallelism to the architecture while limiting the clock rate. They enable more instruction-level parallelism (ILP), but remain unsuitable for embedded devices because of the resulting power consumption.

2.2.2 Programmable Digital Signal Processors (DSPs)

Digital Signal Processors are programmable processors optimized for digital-signal-processing-oriented applications. Thanks to their ever-decreasing supply voltage, DSP processors are also power efficient. Thus, they are very well suited to the mathematically intensive applications embedded in SoCs. Their Harvard architecture, along with other architectural improvements, attempts to achieve a kind of parallelism in order to offer high MIPS (Millions of Instructions Per Second) and MMACS (Millions of Multiply-ACcumulates per Second) signal processing performance that is competitive with FPGAs and ASICs. In addition, DSP processors have proven excellent for short time-to-market. Indeed, most digital signal processing algorithms are written in C and C-like programming languages, which are easier to implement on sequential processors. As with other software-programmable processors, many tools (compilers, code generators) and libraries (DSP functions, IPs) have been developed over the years to ease DSP implementation and code reuse. Hence, a DSP processor is orders of magnitude easier to target than a hardware implementation platform, even if in some exceptional cases some portions of the code are written in assembly language to optimize the implementation.

Nowadays, unlike the above-mentioned general-purpose DSP processors, DSP vendors are moving to a more market-specific approach in DSP architectures. The new trend is to design DSPs that target a more specific market. In doing so, these DSP architectures provide the required performance and propose various solutions for the specific market without sacrificing their programmability and their reusability. Examples of DSP processors are TI's TMS320C6000 (with its multi-MAC VLIW architecture), TI's C55 and C64 families, ADI's Blackfin and the CEVA-X family, etc.

Summarizing, sequential processors are flexible and easy to program.
They provide the best silicon area efficiency, since a fixed structure (e.g. a single ALU) is used sequentially to perform a set of micro-instructions under the control of a Control Unit. They are slower in terms of computation throughput and more power-hungry, as increasing the operating speed (clock rate) increases the power consumption accordingly. As architectures are continuously improved, the gap between the different sequential processors presented above is narrowing, since most of these processors are based on the Harvard architecture (see figure 2.20 on page 62). In this thesis, an application (resp. a task) designed to run on a software platform is referred to as a software application (resp. software task).

2.3 Hardware Implementation Platforms

In a hardware implementation, processing is undertaken in parallel. Each instruction is implemented as a custom logic circuit. Hence, several instructions mapped directly in hardware can be executed at the same time (in a single cycle), in contrast with a sequential processor. One example is illustrated in figure 2.4, where a whole Finite Impulse Response (FIR) filter is implemented as a single instruction which is performed in one clock cycle. The operations needed to implement the N-th-order filter (N + 1 multiplications and N additions) occur in parallel at each clock cycle, the only constraint being the signal propagation through the longest path between the input Xn and the output Yn of the logic circuit. One output sample is delivered per clock cycle, in contrast with a software implementation. Indeed, in a Von Neumann architecture the above 2N + 1 arithmetic operations are done sequentially on the single Arithmetic and Logic Unit (ALU) as basic instructions which cannot overlap.

Figure 2.4: Dataflow representation of one instruction performing an N-th-order (N + 1 taps) FIR filtering.

This example of a FIR filter (figure 2.4) seen as an instruction also shows that the size and complexity of an instruction are arbitrary. Many instructions can be pipelined in order to implement a bigger instruction or operation. In doing so, instructions are performed in parallel at each stage of the pipeline. The depth of the pipeline is only limited by the data dependencies within the implemented application and the implementation resources available on the circuit (e.g. the number of gates in an ASIC or the number of Configurable Logic Blocks in an FPGA).

Because a hardware implementation allows designers to tailor their circuit for a specific application with maximum parallelism, it yields the highest performance in terms of throughput, but lacks flexibility. Fortunately, today's FPGAs tend to overcome this lack of flexibility by enabling runtime reconfigurability. In any hardware implementation platform, algorithms or functionalities are mapped using Hardware Description Languages (e.g. VHDL, Verilog, etc.). Such languages are not well suited to transcribing algorithms, as they contradict the traditional sequential way of human thinking. Even if HDLs have been moving to higher levels of abstraction over the years, writing HDL programs is still a low-level and time-consuming task for designers. HDLs, along with their development environments and the available designer expertise, are far from being as mature as the high-level C-like sequential languages targeting sequential processors (e.g. automatic code generation, compilers, etc.). In this thesis, an application (resp. a task) designed to be implemented on a hardware platform is referred to as a hardware application (resp. hardware task).
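The contrast drawn above can be made concrete with a small C++ sketch (illustrative only): computing one FIR output sample on a sequential processor issues the 2N + 1 operations one after another in a loop, whereas the spatial implementation of figure 2.4 evaluates all taps concurrently.

#include <cstddef>
#include <vector>

// One output sample y[n] = sum_{k=0}^{N} h[k] * x[n-k] of an N-th-order
// (N + 1 taps) FIR filter. On a sequential processor the loop issues the
// N + 1 multiplications and N additions one at a time on the ALU; the
// spatial implementation of figure 2.4 evaluates every tap concurrently
// and delivers one sample per clock cycle.
double firSample(const std::vector<double>& h,   // the N + 1 coefficients
                 const std::vector<double>& x,   // input sample history
                 std::size_t n)                  // index of current sample
{
    double y = 0.0;
    for (std::size_t k = 0; k < h.size() && k <= n; ++k)
        y += h[k] * x[n - k];                    // one MAC per iteration
    return y;
}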
The following sections present a couple of hardware implementation platforms.

2.3.1 ASIC Implementation

ASIC implementation is the perfect example of a parallel implementation. Today, the ASIC remains the best candidate for computationally intensive applications (even embedded real-time DSP ones). Algorithms are directly mapped into silicon through hardware gates. These gates are tuned to achieve the highest performance and the lowest power consumption while occupying the smallest silicon area. For example, if both are designed at 90 nm technology, a cell-based ASIC is faster, can be up to 40 times smaller and consumes about 10 times less power than an SRAM-based FPGA (Bolsens, 2005; Kuon and Rose, 2006). However, the ASIC suffers from its lack of flexibility, since each ASIC chip is designed and manufactured for a specific purpose and cannot be changed or upgraded afterwards. For example, as pointed out in Chapter 1, using ASIC technology to meet Software Defined Radio requirements means designing one ASIC chip for each standard (or function) in the system. In a system that contains many ASICs, only one ASIC would be in use at any given time, while the others remain unused or in an idle state. This approach leads to products with a very large silicon area and high power consumption, which are consequently not cost-effective. In addition, modifying a standard or adding a new feature leads to a complete redesign of its ASIC.

Embedded applications are facing exponential growth in their requirements, leading to more complex and denser ASIC and SoC design, integration and validation. This growing complexity lengthens the design process and consequently delays the time-to-market (TTM). The verification process has become the key point of the design, since any failure leads to a costly complete redesign. Furthermore, the non-recurring engineering (NRE) cost of ASICs is increasing drastically, as the costs of creating masks in today's deep submicron geometries are becoming unaffordable, as shown in figure 2.5. Design tools are struggling to handle the challenges posed by each new technology generation. Consequently, an ASIC product could become obsolete before getting onto the market. This makes ASIC implementation cost-effective only for high-volume products. However, in contrast with Von Neumann-like architectures, where the clock frequency is reaching a plateau, ASICs will still benefit from Moore's law for a while.

Figure 2.5: Mask cost exponentially grows with technology.

2.3.2 Fine and Coarse Grain Reconfigurable Arrays Implementation

Field Programmable Gate Arrays (FPGAs) are known as the finest-grain reconfigurable hardware. They can implement fairly small pieces of logic (at the 1-bit level). Hence, they are the main alternative to ASICs. If big enough, an FPGA can implement any digital system. Consequently, FPGAs have been used for ASIC prototyping. As previously stated in Chapter 1, TTM and RFT are the two factors which fostered FPGA development. The two main technologies are SRAM-based FPGAs and antifuse-based FPGAs. Depending on the technology used, an FPGA may be configurable only once or many times. For example, the main feature of SRAM-based FPGAs is that they are reconfigurable an unlimited number of times. They hold their configuration in a static memory. They combine flexibility and performance.
Their programmability is not as good as that of sequential processors, but their performance is not far below that of ASICs. Dynamically reconfigurable SRAM-based FPGAs are thus seen as an enabling technology for reconfigurable computing. Antifuse FPGAs will not be emphasized in this section, as they do not enable dynamic reconfiguration. A trend in programmable logic devices is to develop coarse-grained reconfigurable architectures. In the latter, the granularity is at word level (8-, 16- and/or 32-bit level). This granularity provides better performance (thanks to embedded hardwired blocks) and reduces the costs in terms of configuration logic, but at the cost of flexibility. A more detailed study of fine-grain (FPGA) and coarse-grain reconfigurable array technology is presented later in this chapter, in sections 2.5 and 2.6.

2.4 ASIP/ASSP Implementation

Even though it is agreed that SoCs including one or many CPU subsystems and reconfigurable hardware, in addition to specific hardware blocks, are the future of embedded systems design, many studies (e.g. Hartenstein, 2001a,b; Rabaey, 2001; Jerraya, 2004) argue that a universally efficient implementation platform is an illusion. According to them, a general-purpose computing technology platform can never meet the different combinations of requirements of different applications. In addition, Gatherer et al. (2004) state clearly that FPGAs and reconfigurable processors are only a stop-gap solution in the march towards an all-soft SDR, despite their numerous advantages. Hence, even dynamically reconfigurable platforms have to be optimized for a given class of applications. The Platform-Based Design concept presented below follows this trend. Similarly, one could say that the configurable processors cited above are used to generate Application-Specific Instruction set Processors (ASIPs). ASIPs and ASSPs (Application-Specific Standard Parts) are circuits optimized for a class of products or a particular application domain. ASIPs/ASSPs are becoming an alternative to ASIC design thanks to their shorter time-to-market. Indeed, once they are designed, different versions or new generations of a product can be derived from the same design. ASIPs/ASSPs are increasingly designed as SoCs integrating multiple cores (processor cores and semi-programmable circuits). Some application-oriented platforms, like the TI OMAP cited in section 2.7 below, are referred to as ASSPs. One can say that ASIPs/ASSPs lie between ASICs and FPGAs, since they are less flexible than FPGAs but provide better performance.

2.5 Fine-grained Reconfigurable Hardware Devices

2.5.1 Introduction

As previously said, in a fine-grain reconfigurable architecture the granularity is at the bit level. This means that even a one-bit logic function can be implemented on such an architecture. This section presents fine-grain reconfigurable architectures, mainly SRAM-based FPGAs, as they are by far the most widely used to implement dynamic and partial reconfiguration. However, this study is also valid for other devices like CPLDs or antifuse-based FPGAs, as they rely on the same principle.

2.5.2 FPGA Architectures

A basic FPGA consists of three major elements, as shown in figure 2.6:

(i) Combinational logic blocks
(ii) Programmable interconnects
(iii) Programmable Input/Output pins

Combinational logic blocks are configurable blocks used to implement custom logic functions.
They are denoted as logic elements (LEs) or configurable logic blocks (CLBs). These blocks are generally arranged in a two-dimensional structure and are surrounded by programmable interconnects. The circuit to be implemented is divided into small modules, each fitting in a logic block. If the circuit is too big to fit in a single block, several blocks are then interconnected using the programmable interconnects in order to implement the whole circuit.

Figure 2.6: Simplified structure of an FPGA.

If the FPGA is reprogrammable, the implemented function can be changed by updating the content of the configuration memories. Input and output pads are assigned using programmable Input/Output blocks (IOBs). IOBs can also be programmed as low-power or high-speed connections. If there are enough implementation resources on the chip, any function can be implemented by interconnecting basic blocks. There are several types of interconnect, depending on the distance between the logic blocks to be interconnected. Special interconnects are dedicated to clock signals.

2.5.3 FPGA Technology

The two main technologies are:

1. Antifuse-based FPGAs

In antifuse-based FPGAs, interconnections rely on antifuses, whose two terminals are separated by a dielectric. Hence, they have a high impedance in their normal state and can be switched to their low-impedance (fused) state by applying a high voltage which melts the dielectric and reduces the resistance. The nodes are then connected for good. For this reason, they are also known as one-time-programmable FPGAs. The desired logic function is obtained by connecting basic logic elements in the appropriate way. Antifuse technology, which is cheaper than SRAM, can achieve higher speeds and occupies less space on the die. In addition, antifuse technology makes FPGAs more reliable and secure than SRAM-based ones, safeguarding against cloning, overbuilding, reverse engineering and radiation bombardment. Consequently, antifuse FPGA technology is the best option for satellite applications. However, as antifuse FPGAs are only one-time-programmable, they are not suitable for reconfigurable computing and are therefore not detailed further in this thesis.

2. Memory-based FPGAs

In memory-based technology, the memory parts are either SRAM (Static Random Access Memory) or flash EEPROM (Electrically-Erasable Programmable Read-Only Memory). SRAM or flash EEPROM cells are used to configure both the interconnections and the logic blocks. Configurations are held in the memory. Consequently, unlike flash-EEPROM-based FPGAs, which are permanently programmed, SRAM-based FPGAs are volatile and the configuration is lost at power-off. Nevertheless, the SRAM approach is the most widely used for FPGA configuration.

2.5.4 FPGA Structures

No matter which technology is used, the overall structure of figure 2.6 remains the same.

Island-style architecture

This type of architecture was chosen by Xilinx from the beginning, when it introduced FPGAs in 1985. The FPGA consists of a planar array of programmable logic blocks with vertical and horizontal programmable routing resources.

Sea-of-gates architecture

In this architecture, logic blocks are spread all over the IC chip and the routing resources provide logarithmic connectivity. Indeed, interconnections are mainly made through neighbor-to-neighbor routes, which are faster, in addition to other general routing resources. The sea-of-gates topology was used by Xilinx (in its 6000 series) and by Actel (in its ProASIC family).
Hierarchical architecture

The hierarchical architecture is the philosophy of Altera. There are several planes in the FPGA, but these planes are not physical: they correspond to different logic levels. For example, one element of a logic level may contain elements of a lower logic level, leading to the notion of a logic hierarchy. Each level uses the topology of the island-style architecture, with dedicated routing for each level. This hierarchical approach, both in logic and in interconnects, provides smaller and more predictable routing delays, and therefore a higher operating frequency.

2.5.5 SRAM-based FPGA

This section focuses on SRAM-based FPGA technology, as such devices can be reprogrammed an unlimited number of times, even during system operation, enabling dynamic (on-the-fly) reconfiguration. These two features are the main advantages of SRAM technology. In addition, unlike other technologies, SRAM technology uses standard CMOS (Complementary Metal Oxide Semiconductor) cells, and the whole FPGA can be fabricated with standard VLSI (Very Large Scale Integration) processes. However, SRAM-based technology occupies more chip area than other technologies, making such devices relatively expensive. Furthermore, the SRAM configuration memory consumes power even when its content remains unchanged. As the SRAM contents are lost at power-off, an external non-volatile storage memory is needed to keep the configuration data in order to reconfigure the FPGA at power-on.

LUT-based Logic Elements

In SRAM-based FPGAs, each logic element or configurable logic block (LE or CLB in figure 2.8) is built around a LUT (Look-Up Table). A LUT can implement any n-input combinational function (or sequential function, by adding a flip-flop at the output). For example, in figure 2.7 a LUT implements a function by storing its truth table in SRAM memory. Hence, an n-input function requires an SRAM with 2^n locations (respectively 3 inputs and 2^3 = 8 locations in the example of figure 2.7). The values of function S in the truth table are pre-stored in the SRAM in such a way that each value lies at the address corresponding to its combination of inputs (e.g. figure 2.7, where the values 0 and 1 are stored at addresses 101 and 110 respectively). The SRAM is then connected to a decoder which uses the input combination to access the corresponding location and route the correct result of the function to the output. One advantage of the LUT implementation over static logic gates is that no matter which function is implemented by the LUT, the delay through the logic element is the same. Thanks to this fine granularity, full parallelism (spatial implementation) is achieved, as much as in ASICs. Hence, one can recreate a complete ASIC design in an FPGA by programming and interconnecting basic blocks. This feature made FPGAs suitable for complex ASIC prototyping.

Figure 2.7: Truth table of the function S = f(a, b, c) and its mapping using a 3-input Look-Up Table.

As shown in figure 2.8, a LUT-based logic element usually contains an output register which synchronizes, if necessary, the output with a clock. An n-input logic element can implement up to 2^(2^n) different functions simply by changing the content of the SRAM. The number of LUT inputs is typically 4.
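As a minimal software model of the mechanism just described, the following C++ sketch stores the 2^3 = 8 configuration bits of a 3-input LUT and uses the input combination as the read address; "reconfiguring" the logic element amounts to rewriting these bits. It mirrors the example above (0 stored at address 101, 1 at address 110) but is otherwise illustrative.

#include <bitset>
#include <cstdio>

// A 3-input LUT modelled in software: the truth table of S = f(a, b, c)
// occupies 2^3 = 8 configuration bits, and the input combination forms
// the read address, exactly as described above.
struct Lut3 {
    std::bitset<8> config;                        // the configuration SRAM
    bool eval(bool a, bool b, bool c) const {
        unsigned addr = (a << 2) | (b << 1) | c;  // inputs form the address
        return config[addr];
    }
};

int main() {
    Lut3 lut;
    lut.config[0b101] = 0;                        // S(1,0,1) = 0
    lut.config[0b110] = 1;                        // S(1,1,0) = 1
    std::printf("S(1,0,1)=%d S(1,1,0)=%d\n",
                (int)lut.eval(1, 0, 1), (int)lut.eval(1, 1, 0));
}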
Many studies have shown that a LUT size between 3 and 6 provides the best trade-off between area optimization and delay (e.g. Rose et al., 1990; Singh et al., 1992; Ahmed and Rose, 2000). More precisely, a 3- to 4-input LUT improves area, while a 5- to 6-input LUT minimizes delay.

Examples of LUT-based basic building blocks in commercial FPGAs
In most commercial SRAM-based FPGAs, basic logic elements are grouped in a kind of cluster which provides faster connections between the LUTs inside the cluster, in addition to registers (flip-flops and latches), multiplexers (used as input and output decoders) as well as some combinational logic for basic computations (e.g. XOR gates, adders, fast carry chains, etc.).

Figure 2.8: A logic element or configurable logic block.

1. Adaptive Logic Module in the Altera Stratix V architecture (figure 2.9)
Figure 2.9 depicts a high-level block diagram of a LUT-based Adaptive Logic Module (ALM). An ALM is the basic building block of logic in the Altera Stratix V architecture. It combines advanced features with efficient logic utilization. Each ALM contains a variety of LUT-based resources that can be viewed as two combinational adaptive LUTs (ALUTs) followed by two registers. With up to eight inputs for the two combinational ALUTs, one ALM can implement various combinations of two functions. However, the ALM is completely backward-compatible with four-input LUT architectures. One ALM can also implement any function with up to six inputs and certain seven-input functions. In addition to the adaptive LUT-based resources, each ALM contains two programmable registers, two dedicated full adders, a carry chain, a shared arithmetic chain, and a register chain. Through these dedicated resources, an ALM can efficiently implement various arithmetic functions and shift registers. Each ALM drives all types of interconnects: local, row, column, carry chain, shared arithmetic chain, register chain, and direct link.

Figure 2.9: An Adaptive Logic Module in the Altera Stratix V architecture (courtesy Altera).

2. A Slice in the Xilinx Virtex 5 FPGA architecture (figure 2.10)
In Xilinx FPGAs, Configurable Logic Blocks (CLBs) are the main logic resources. Each CLB element consists of two slices (shown in figure 2.10) and is connected to a switch matrix for access to the general routing matrix. CLBs, and therefore slices, are arranged array-wise. The two slices within the same CLB belong to different columns of slices and have no direct connections to each other. However, each slice in a column has an independent carry chain which can be connected to the neighbouring slice of the same column. As pictured in figure 2.10, every slice contains four look-up tables for logic-function generation, four storage elements, wide-function multiplexers, and carry logic. These elements provide logic, arithmetic, and ROM functions. With up to six independent inputs (e.g. A1 to A6) and two independent outputs per LUT in the slice, each of the four function generators can implement any arbitrarily defined six-input Boolean function, or two arbitrarily defined five-input Boolean functions, as long as these two functions share common inputs. Thanks to multiplexers, the four LUTs can be combined to generate any Boolean function of 7 or 8 inputs within a slice.
Furthermore, some slices (SLICEM, not shown in figure 2.10) support additional functions such as storing data using distributed RAM and shifting data with 32-bit registers.

Figure 2.10: A Slice (SLICEL) in the Xilinx Virtex 5 FPGA architecture (courtesy Xilinx).

Routing resources
Most of the chip area of an FPGA (60 to 90%) is occupied by the routing resources used to interconnect logic blocks and to connect them to the I/O blocks. These programmable interconnects are of different kinds, ranging from short local wires and general-purpose wires to global interconnects and dedicated clock distribution networks. The main reason for using different interconnects is to minimize the delay along the wires. This is achieved by finding the best trade-off between the area occupied by the wires, the distance separating the buffers inserted along the wires, etc. As they provide low impedance, antifuse-based programmable interconnects are faster than those used in SRAM-based FPGAs (pass transistors or three-state buffers) or flash-EEPROM-based FPGAs.

Programmable IOBs
Input/Output blocks control the data flow between the FPGA's external pins and the internal user logic through programmable interconnects. Input/Output blocks are multistandard, providing different logic levels (e.g. 3.3-V LVTTL and multi-voltage LVCMOS (Low Voltage TTL/CMOS I/O logic switching levels), multi-voltage PCI, HSTL (High Speed Transceiver Logic, for fast SDRAM), SSTL (Series Stub Terminated Logic, for reduced-latency RLDRAM), etc.), LVDS (Low Voltage Differential Signaling) channels, SERDES (serializer/deserializer) support, ESD (electrostatic discharge) protection, DDR (double data rate) compatibility, and numerous programmable features (input, output, tri-state, delay, slew rate, etc.).

Digital Clock Manager
Among the routing resources, dedicated clock distribution networks appear as trees of drivers with buffers distributed throughout them in order to minimize delay skew. The DCM (Digital Clock Manager) digitally manages clock signal generation and distribution. It also enables delay skew compensation, as well as clock multiplication and division for multi-clock designs.

2.5.6 Heterogeneous FPGAs

Today, FPGAs are increasingly used in many computationally intensive applications. Their architecture has become more heterogeneous, embedding hardwired blocks as shown in figures 2.11 and 2.12 in order to respond to market demand. Indeed, the FPGA's high flexibility comes at the cost of efficiency compared to ASICs. As more gates have to operate in the FPGA than in an ASIC for the same functionality, it consumes more power, clocks slower and requires more silicon area. (Compared to an ASIC achieving the same functionality and manufactured in the same process technology (e.g. 90nm), an FPGA clocks about 10 times slower and uses about 50 times more silicon area per gate, due to configuration overhead, logic gates and configuration memories, routing delays, etc.) But more than ASICs and processors, FPGAs leverage the logic density improvements arising from technology scaling (Moore's law). Hence, they are becoming larger and faster, with on-chip dedicated blocks (memories, multipliers, processors, etc.). These well-designed and well-tested ASIC blocks are more area-efficient and faster than their CLB-based counterparts. They avoid the use of a considerable amount of logic to implement CLB-resource-greedy functions such as
multipliers or memories. Furthermore, one or several hard-core processors are directly integrated within the FPGA (e.g. the PowerPC 405-based processor(s) integrated in Xilinx FPGAs, figure 2.12, or the ARM-based processors in Altera FPGAs), along with various network interfaces and communication modules (e.g. built-in PCI Express and 100 Gigabit Ethernet in the Altera Stratix V family and the Xilinx 7 series, both leveraging a new 28nm process and design innovations to reduce power consumption). This new trend could be denoted as multi-grain FPGAs, as it combines fine-grain and coarse-grain reconfigurable architectures on the same silicon die. The most common embedded hard blocks are:

Memory Blocks
In dataflow-oriented applications like image processing, huge amounts of data need to be regularly and temporarily stored during processing. Consequently, embedding memory blocks within the FPGA became crucial for reducing memory access delays. As depicted in figures 2.11 and 2.12, these blocks are distributed all over the device, providing a very high memory bandwidth suitable for highly parallel applications. These so-called BlockRAM (BRAM) modules, which are fundamentally 36 Kbits in size, can also be used as two independent 18-Kbit blocks, and provide single-port and dual-port access. A single FPGA chip may contain up to 16 Mbits of memory spread over the chip (e.g. the Xilinx Virtex-5 XC5VFX200T integrates up to 456 36-Kbit blocks, or 912 18-Kbit blocks).

Embedded DSP Blocks
Embedded DSP blocks (figure 2.11) provide many MAC (multiply-accumulate) units. Associated with the aforementioned BlockRAM embedded memory modules, they ease the implementation of digital signal processing functions like filters. Hence, a DSP-oriented FPGA can run hundreds of MAC units concurrently (up to 512 18-bit x 18-bit multiplier blocks in some Stratix V sub-families) and thereby far exceed the performance of programmable DSP processors.

High-speed I/O transceivers
High-speed serial I/Os are increasingly used today, since serial buses are progressively preferred to parallel buses. Indeed, parallel buses need to be synchronized to a clock line, while in serial buses the clock signal can be implicitly included in the data signal. The GX series of the Altera Stratix FPGA family (figure 2.11) and the FX/TX series of the Xilinx Virtex FPGA family (figure 2.12) are examples of FPGAs optimized for high-speed serial connectivity. They enable data rates of up to 12.5 Gbps (10.3125 Gbps for the Xilinx Kintex-7 family and 12.5 Gbps for the Altera Stratix V family), particularly well adapted to high-throughput telecommunication systems.

Figure 2.11: Altera Stratix-V floor plan (Altera, www.altera.com).

Figure 2.12: Xilinx Virtex II Pro FPGA with up to 4 hard-core embedded processors (Xilinx, www.xilinx.com).

Embedded hard/soft-core processors
Figures 2.11 and 2.12 depict the floor plans of an Altera Stratix V FPGA and of the Xilinx Virtex II Pro FPGA. The latter can embed up to four IBM PowerPC 405 RISC hard-core processors within its structure (figure 2.12). The other option is to embed instantiated soft-core processors instead (e.g. the Altera Nios II soft core in figure 2.11), as they provide more flexibility in the architecture and therefore in the instruction set.
By merging hard or soft microcontroller cores into FPGA chips, FPGA manufacturers have launched the SOPC (System on a Programmable Chip) age. A SOPC integrates on one die one or several programmable processors (GPP/RISC, DSP), memory, reconfigurable fabrics and some dedicated blocks. Hence, it combines the programmability and well-tried connectivity of microprocessors (buses, UART, Ethernet) with the high performance of ASICs and the reconfigurability of FPGAs. Besides, the merged soft-core microprocessor is customizable (pipeline depth, cache size, etc.) by the designer. Further, as shown in Meyer-Bäse et al. (2006), such a soft core allows the designer to extend its instruction set by implementing acceleration units in the form of custom instructions. This kind of processor is referred to as a configurable processor. Tensilica's Xtensa (www.tensilica.com), Altera's Nios and Xilinx's MicroBlaze soft-core processors are a few examples of configurable processors. In the literature, the coarse-grained reconfigurable arrays presented below are also referred to as configurable processors.

2.5.7 FPGA Design Flow

Traditional design flow
In the traditional FPGA design flow, pictured in the grey-coloured part of figure 2.13, a design is created using an HDL (e.g. VHDL or Verilog) or a schematic capture environment, then synthesized, placed and routed for a specific FPGA. The minimum steps of the design flow are listed below:
1. Design description in an HDL (VHDL, Verilog, SystemC), or through a graphical entry tool that generates the corresponding HDL code.
2. Simulation to verify the correct behaviour.
3. Synthesis.
4. Mapping, placement and routing for a specific FPGA.
5. Binary configuration file (bitstream) generation and loading onto the targeted FPGA.

Nowadays, most of the above-mentioned steps are push-button operations. However, as previously stated, the first step (design entry) using HDLs is still an almost manual and time-consuming task. Indeed, even if C-like languages along with their C-to-hardware compilers are increasingly investigated (e.g. SystemC (www.systemc.org), Celoxica (2000), ImpulseC (www.impulseaccelerated.com), etc.), the resulting FPGA implementation (in case of a successful synthesis) is far from competing with manual design in terms of area efficiency and timing.

The generated bitstream is downloaded into the FPGA through configuration interfaces like JTAG, SelectMAP or Slave Serial ports. These ports enable numerous reconfiguration techniques, bitstream encryption and bitstream readback (through the SelectMAP and JTAG interfaces). As SRAM-based FPGAs are volatile, bitstreams are stored in an external non-volatile memory and are used by a CPLD (Complex Programmable Logic Device) to configure the FPGA at each power-on.

Figure 2.13: Design flow for FPGA-based systems embedding a programmable processor.

SOPC design flow
As FPGAs enable a SOPC design approach, the design flow in that case is slightly different. The design of the software part of the SOPC is integrated in the flow, and the whole process is depicted in figure 2.13. The main FPGA manufacturers' tools and other third-party companies provide complete IDEs (Integrated Design Environments) for SOPC design.
Such IDEs are capable of handling both the hardware part and the software part of the design by:
- supporting the traditional design flow that automatically runs the whole process, resulting in a single binary file (e.g. a .bit file) which contains the configuration data for the FPGA;
- providing the environment for writing the code and developing the application to run on the embedded processor (generating the executable file);
- generating all communication media (bus, UART, JTAG, Ethernet, etc.);
- and finally producing the binary file containing the configuration data for the FPGA and the binary code for the processor.

2.5.8 FPGA Modular Design for Runtime Partial Reconfiguration

The modular design flow is an incremental design approach that differs from the above-mentioned traditional design flow. The main drawback of the latter approach is that if a slight change is made to an implemented design, the five design steps (grey-coloured part of figure 2.13) have to be completely redone for the entire design, making the redesign or modification process very long. Originally, the modular design flow aimed to partition an application into its natural functional units in such a way that each unit corresponds to an independent module. Hence, each module can be separately designed, tested, modified, validated, implemented and even reused in another design. Modules can also be third-party IPs (Intellectual Properties) released either as HDL-described circuits or as pre-synthesized netlists. As modular design requires a clear partitioning of the design into different modules, a team of designers can work independently and concurrently on different modules and later merge them into one FPGA. This concurrent and hierarchical approach saves time and allows independent module modification and validation while leaving the others unchanged and stable. The latter feature is exploited when designing dynamically and partially reconfigurable designs for SRAM-based FPGAs. Indeed, reconfiguring a module corresponds to changing a functionality by swapping functional units on the FPGA, each functional unit performing a specific task.

Figure 2.14: Modular design enables dynamic module swapping.

An example of dynamic reconfiguration is pictured in figure 2.14, where the FPGA can contain at most two modules at a time. One can switch from one design to another by reconfiguring at least one module of the FPGA. However, because of current shortcomings in FPGA design tools for partial runtime reconfiguration, the FPGA is partitioned into blocks and the reconfigurable modules are position-constrained at design time. In the example of figure 2.14, the FPGA is partitioned into two blocks to accommodate two classes of reconfigurable modules, module class A and module class B. Modules of the same class (e.g. modules A, A' and A'') are allocated the same block, and the size of a module cannot exceed the size of the containing block. There is at most one module in a block at a time, and only modules of the same class can be swapped on their allocated and position-constrained block. Hence, at a certain point in time and if necessary, one can switch from top design 1 to top design 2 by swapping module A for module A' without affecting module B. The Xilinx ISE suite, encompassing the PlanAhead tool, is one example of a design environment enabling partial and dynamic reconfiguration of Xilinx's FPGAs.
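To make this block-partitioned swapping discipline concrete, here is a minimal C++ sketch (illustrative types and names, not a vendor API) in which each position-constrained block hosts at most one module of its own class:

#include <optional>
#include <stdexcept>
#include <string>
#include <vector>

// Illustrative model of a block-partitioned FPGA : each block is
// position-constrained at design time, accepts only modules of its
// own class, and holds at most one module at a time.
struct Module {
    std::string name;
    char        cls;   // module class, e.g. 'A' or 'B'
    int         size;  // must not exceed the size of the hosting block
};

struct Block {
    char                  cls;     // class of modules this block hosts
    int                   size;    // fixed at design time
    std::optional<Module> loaded;  // at most one module at a time
};

// Swap a module into its allocated block, leaving all other blocks
// untouched, as in partial runtime reconfiguration.
void swap_in(std::vector<Block>& fpga, const Module& m) {
    for (Block& b : fpga) {
        if (b.cls != m.cls) continue;        // wrong block class
        if (m.size > b.size)
            throw std::runtime_error(m.name + " does not fit its block");
        b.loaded = m;                        // replaces the previous module
        return;
    }
    throw std::runtime_error("no block accepts class " + std::string(1, m.cls));
}

int main() {
    std::vector<Block> fpga = {{'A', 100, {}}, {'B', 80, {}}};
    swap_in(fpga, {"A", 'A', 90});   // top design 1
    swap_in(fpga, {"A'", 'A', 70});  // switch to top design 2; block B untouched
}

With m such blocks and n pre-placed modules per block, the count of n to the power m reachable designs discussed below follows directly.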
In addition, the Xilinx Application Note (Xilinx, 2004) provides details on two ways of designing partial runtime reconfiguration : difference-based and module-based. The latter approach uses the above-mentioned modular design flow and generates partial bitstreams for the different modules.

Let F be an FPGA partitioned into m blocks, which can therefore host m modules at a time, and let n be the number of physically pre-placed modules per block. If it is assumed that all the modules are reconfigurable, then the FPGA can implement $n^m$ different designs or parts of designs, as shown in figure 2.14. (This is rarely the case : in real designs and with current design tools (e.g. Xilinx ISE), there is always at least one fixed part of the design which is not reconfigurable, and which hosts, for example, a hard-core or soft-core processor that manages the reconfiguration process.)

Hardware virtualization and hardware multitasking
When the dynamically loaded and unloaded modules belong to the same application, this process is referred to as hardware virtualization, otherwise as hardware multitasking. Hardware virtualization means that, thanks to dynamic reconfigurability, the total amount of logic on an FPGA virtually appears bigger than it really is. Indeed, if the modules of an application are properly scheduled, an FPGA of size $W \times H$ can fit the application even if the sum of the areas of the modules composing the application is bigger than the size of the FPGA, as expressed in equation 2.1:

$\mathrm{Virtualization} \;\Rightarrow\; Appl_{total\_size} = \sum_{i=1}^{m} w_i \times h_i \;>\; FPGA_{size} = W \times H \qquad (2.1)$

where W and H (resp. $w_i$ and $h_i$) are respectively the width and the height of the FPGA (resp. of module i), $W \times H$ the total size of the FPGA (resp. $w_i \times h_i$ the size of module i), and m the number of modules composing the application Appl.

As stated in the previous chapter, Software Defined Radio is an example of a promising application field for reconfigurable architectures. The flexibility and the high performance required by the radio may be achieved by module swapping as described earlier. Hence, with the proliferation of standards, the radio can dynamically switch from one standard to another by swapping the corresponding functional units in and out of a limited and cost-effective silicon die, the cost of an integrated circuit being exponentially related to its die size.

2.5.9 Coupling with the Host Processor

Reconfigurable hardware devices are used in several ways as hardware acceleration units to boost the performance of a computing system. Shown below are different scenarios for coupling the reconfigurable hardware fabric to the host processor. These coupling scenarios also define the communication cost. Indeed, communication is costly in embedded systems, as it has a great impact on the overall performance of a system. Coupling scenarios tend to complicate the design process, as the designer faces a new programming model targeting a new computing resource model. One of the main issues is to identify which coupling approach is likely to yield particular performance benefits in an application domain, and to know whether or not using reconfigurable hardware is the most profitable option. A significant challenge is to find a suitable architecture for the communication media which interface the different parts of the system (FPGA, programmable processors, etc.).
Here below are three ways of using reconfigurable logic in a SoC:

1. Reconfigurable hardware used as a peripheral co-processor (figure 2.15)
In this design approach, any processing-intensive code (encryption/decryption, pattern recognition, etc.) within a given application migrates to the reconfigurable hardware device (e.g. FPGA). The reconfigurable fabric accelerates the computations via devices such as an FPGA board or a multi-FPGA board. Consequently, a programmable GPP or DSP is relieved of some complex processing tasks (mostly dataflow-oriented tasks), increasing the overall system throughput. In this case, the programmable processor and the reconfigurable logic may be connected to the same bus (or some kind of I/O bus) and therefore communicate through the bus protocol. Communicating through a bus leads to a higher communication cost, and is consequently worthwhile only if the speed improvement brought by the reconfigurable logic exceeds the overhead of transferring data through the communication media. This is achieved either by implementing applications which do not need regular transfers of huge amounts of data between the programmable processor and the co-processor, or by implementing a whole computationally intensive algorithm on the co-processor. Peripheral single-function co-processors have been implemented using dedicated ASIC blocks, which provide faster, more power-efficient and more area-efficient implementations than the corresponding FPGA implementations. However, improvements in FPGA technology fill these gaps by providing flexibility (static or dynamic reconfigurability) and by improving upgradability and time-to-market.

2. System-on-a-Programmable-Chip (SOPC - figure 2.16, left)
In this approach, the whole system is designed on a single reconfigurable hardware device (e.g. FPGA) embedding some hardwired and optimized IP cores such as soft-core and hard-core processors, Input/Output devices, memories, and DSP blocks within its structure. A low communication cost can be achieved, especially when using configurable processors where parts of the functional units are made of reconfigurable logic, enabling a configurable instruction set. Thanks to advances in FPGA technology, the main FPGA manufacturers (Xilinx, Altera, etc.) have made such reconfigurable system designs possible by releasing products such as the Virtex-II Pro, Virtex 4, Virtex 5, Stratix II and Stratix V, along with soft-core processors (Nios II, MicroBlaze, etc.).

Figure 2.15: FPGA as co-processor.

Figure 2.16: FPGA-based SOPC (left) and embedded FPGA (eFPGA).

3. Embedded FPGA (figure 2.16, right)
An FPGA fabric is embedded in a complex heterogeneous chip containing other elements such as GPP and/or DSP processors, memories, I/O and communication modules, along with hardwired IP cores. Such an FPGA co-processor can be used as a hardware extension for custom instructions. The reconfigurable fabric behaves like an extended datapath of the processor. Once again, the main advantage of integrating the reconfigurable fabric and the processor on the same silicon die is to reduce the communication cost, as stated earlier. This category could also be classified as Reconfigurable System-on-a-Chip (RSoC), i.e. a heterogeneous SoC containing a reconfigurable fabric.
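The bus-coupling trade-off of the first scenario reduces to a simple break-even condition: offloading pays off only if the hardware speed-up outweighs the data transfer overhead. The following C++ sketch makes this explicit; all names and figures are illustrative assumptions, not measurements from the thesis:

#include <iostream>

// Break-even check for offloading a task to a bus-attached FPGA
// co-processor: offloading is worthwhile only if the hardware
// speed-up outweighs the cost of moving data over the bus.
bool offload_is_worthwhile(double sw_time_ms,        // pure-software runtime
                           double hw_time_ms,        // runtime on the fabric
                           double data_bytes,        // data moved per run
                           double bus_bytes_per_ms)  // sustained bus bandwidth
{
    const double transfer_ms = data_bytes / bus_bytes_per_ms;
    return hw_time_ms + transfer_ms < sw_time_ms;
}

int main() {
    // Example: a 20x hardware speed-up on a 100 ms kernel, moving
    // 4 MB over a 100 MB/s bus (about 0.1 MB per ms).
    const bool ok = offload_is_worthwhile(100.0, 5.0, 4e6, 1e5);
    std::cout << (ok ? "offload" : "keep in software") << '\n';
}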
2.5.10 Types of Reconfiguration

There are mainly two types of reconfiguration : static and dynamic.

Static (or compile-time) reconfiguration
Static reconfiguration is the most common way of implementing an application on reconfigurable logic, using a classical design flow (HDL description or schematic capture, synthesis, place and route, bitstream download into the FPGA) and conventional CAD (Computer-Aided Design) tools. Each application consists of one configuration. Once implemented, the application is supposed to run to completion without being interrupted. Static reconfiguration is mainly used when high performance, cost advantages (e.g. NRE costs) and upgradability are the main goals, and when there is no need to switch from one configuration to another in a relatively short amount of time (e.g. less than an hour).

Dynamic (or runtime) reconfiguration
Dynamic reconfiguration is the ability to reconfigure the reconfigurable hardware, totally or partially, while it is running. Hence, dynamic reconfiguration implies dynamic re-allocation of hardware blocks at runtime. It enables the concept of hardware virtualization, where the physical hardware is smaller than the sum of the resources required to implement the whole application. Hardware virtualization is performed by time-sharing the same reconfigurable hardware (partially or totally) between different functions of the application which do not need to run concurrently. This is known as temporal partitioning, and is still very challenging because of the lack of CAD tool support. In addition, swapping from one configuration to another brings additional challenges similar to the context-switching problem in traditional operating systems. In real-time constrained applications, the reconfiguration time is the main bottleneck. It should be short enough to enable runtime hardware task switching. Despite the above-mentioned problems in implementing dynamically and partially reconfigurable systems, reconfigurable computing is referred to as the most suitable platform for DSP applications (e.g. jui Chou et al., 1993; Petersen, 1995; DeHon, 2000; Tessier and Burleson, 2001), which largely justifies the global research effort on OSs for reconfigurable systems.

Single-context reconfiguration
In single-context reconfiguration, the reconfigurable hardware device holds only one configuration at a time. In this case, any change to the reconfigurable hardware functionality requires the complete reconfiguration of the entire chip, and therefore leads to a high reconfiguration overhead. This makes this reconfiguration scheme more suitable for static reconfiguration, where the reconfiguration overhead is not a big concern.

Multi-context reconfiguration
In multi-context reconfiguration, several configurations are downloaded onto the device and stored as planes of configuration information. Only one configuration plane is active at a time, and the architecture can quickly switch from one configuration to another just by activating another configuration plane. Obviously, switching from one configuration plane to another reduces the configuration overhead, as reconfiguration is not done sequentially as during a download. Here, configuration switching is a matter of nanoseconds, where a single-context device needs milliseconds or more. Multi-context reconfiguration can be viewed as a kind of configuration-prefetch approach which drastically reduces the configuration overhead, but which requires more silicon area to build as many on-chip configuration planes as there are contexts.
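The contrast between the two costs can be summarized by the following toy C++ model (an assumption-laden sketch, not a real device interface): loading a plane costs time proportional to the bitstream size, whereas a context switch merely selects an already-resident plane:

#include <cstddef>
#include <vector>

// Toy model contrasting single-context reloading with multi-context
// plane switching. Loading a full bitstream is proportional to its
// size (milliseconds range); activating a resident plane is a
// constant-time selection (nanoseconds range).
class MultiContextDevice {
public:
    explicit MultiContextDevice(std::size_t planes)
        : planes_(planes), active_(0) {}

    // Sequential download of a bitstream into a plane: O(bitstream size).
    void load_plane(std::size_t plane, const std::vector<char>& bitstream) {
        planes_.at(plane) = bitstream;  // models the slow transfer
    }

    // Context switch: just select another resident plane, O(1).
    void activate(std::size_t plane) { active_ = plane; }

private:
    std::vector<std::vector<char>> planes_;  // one memory plane per context
    std::size_t active_;                     // currently active context
};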
Partial reconfiguration
Some reconfigurable hardware devices enable partial reconfiguration. Indeed, sometimes either only a part of a configuration requires a change, or the incoming configuration is not big enough to fill the complete chip. By providing the ability to target a specific region of the chip while keeping the other regions unaffected, partially reconfigurable hardware improves area efficiency and reduces the configuration overhead: the amount of reconfiguration data is smaller when it targets only a portion of the chip. One interesting feature of partially reconfigurable hardware is the small granularity of the reconfigurable unit. For example, in the Xilinx Virtex II Pro FPGA, the smallest reconfigurable unit is a full column of the reconfigurable array. With its Virtex family (Virtex II Pro, Virtex-4, Virtex-5, Virtex-6, Virtex-7, and other families such as Artix-7 and Kintex-7), Xilinx has been leading the partially reconfigurable FPGA market for years. However, Altera entered this market in 2010 with its first 28nm FPGA chip, the Stratix V. The latter includes partial reconfiguration along with 28 Gbps transceivers and embedded hard IP blocks.

In this thesis, the proposed scheduling strategies assume that the reconfigurable hardware device enables partial and runtime reconfiguration.
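As an elementary illustration of placement at column granularity, the following C++ sketch (an illustrative helper, not one of the thesis's placement algorithms, which are presented in later chapters) searches for the leftmost contiguous run of free columns able to host an incoming module:

#include <optional>
#include <vector>

// Minimal 1D placement helper for column-granularity partial
// reconfiguration: find the leftmost run of free columns wide
// enough for an incoming module.
std::optional<int> find_free_columns(const std::vector<bool>& occupied,
                                     int width_needed) {
    int run_start = 0, run_len = 0;
    for (int col = 0; col < static_cast<int>(occupied.size()); ++col) {
        if (occupied[col]) {            // column busy: restart the run
            run_len = 0;
            run_start = col + 1;
        } else if (++run_len == width_needed) {
            return run_start;           // leftmost fitting position
        }
    }
    return std::nullopt;                // no contiguous gap wide enough
}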
2.5.11 Configuration Hierarchy

A hierarchical model of reconfiguration is meaningful for the heterogeneous reconfigurable device architectures presented above. This hierarchical approach aims to exploit the possibility of instantiating soft-core IP blocks (e.g. processors) or hosting hardwired blocks (processors, memory, multipliers) on FPGAs. Indeed, an FPGA instantiating a soft-core processor is capable of running both hardware tasks and software tasks. Hence, a task may be implemented in its hardware form (a synthesized, placed and routed digital circuit) on the FPGA, or run in its software form (an executable) on the processor instantiated on the FPGA.

Figure 2.17: Configuration hierarchy model.

As stated throughout this thesis, reconfiguration overheads are among the main bottlenecks in real-time multitasking on reconfigurable hardware devices. However, reconfiguration overheads can be reduced by using coarse-grained reconfigurable array operators (presented in section 2.6 below) as the basis of reconfigurable computing machines. Such operators or blocks, pre-built or pre-instantiated into the reconfigurable fabric, can be more easily programmed at runtime with a minimal amount of configuration data. Reconfiguration overheads are thus reduced. This is especially true when the pre-instantiated IP blocks are soft-core processors. Configuration hierarchy denotes here the fact that the functionality of the reconfigurable hardware device can be changed at different hierarchy levels. Figure 2.17 maps an example where there are three levels of reconfiguration : at the bitstream level by reconfiguring the FPGA, at an intermediate level by instantiating another processor core, or at the software level by changing the program executed by the instantiated soft core.

Related work
Through the hierarchical configuration concept, a few works (Schaumont and Verbauwhede, 2003; Nollet et al., 2006) have demonstrated the usefulness of this feature, which could improve the flexibility and performance of heterogeneous platforms that include reconfigurable tiles. In addition, hierarchical configuration provides the possibility of running some hardware tasks in their software form at the cost of QoS (Quality of Service), instead of rejecting them or keeping them in an ever-growing waiting queue. Furthermore, some applications use fewer hardware resources when implemented as a sequential machine on an instantiated soft-core processor, while still meeting the necessary, or at least reasonable, performance requirements.

Schaumont and Verbauwhede (2003) illustrate the configuration hierarchy in an FPGA through many figures. Figure 2.17 summarizes these figures in a single graph (it is deduced from the ThumbPod system of Schaumont and Verbauwhede, 2003). The ThumbPod system presented in Schaumont and Verbauwhede (2003) is a perfect application example which points out the assets of the configuration hierarchy concept. It consists of an embedded Java virtual machine executed by a Leon2 soft-core processor instantiated on a Virtex-II FPGA. The Leon2 soft IP core acts as program code (bitstream) for the Virtex-II FPGA and as hardware (microprocessor) for the user program (Java code). Configuration hierarchy provides a range of design and programming solutions, from harder ones (hardware design process) to easier ones (software design process). Schaumont and Verbauwhede (2003) demonstrated that a hierarchical approach to configuration can reduce the reconfiguration time. Indeed, by moving up the configuration hierarchy, the amount of configuration data to be processed is reduced. For example, reconfiguring a system designed according to figure 2.17 at the highest level (by changing the user program executed by the Java virtual machine) is faster and easier than reconfiguring the FPGA fabric to implement a new soft core or a new design, since this operation does not require a new bitstream to be sent to the device.

More widely, by instantiating freely available soft IP cores on FPGAs, designers take advantage of an easier and faster software design process (instead of a hardware design process) and leverage the communication facilities (e.g. buses) provided by microprocessors. This is also pointed out by Nollet et al. (2006) through a literature review on exploiting hierarchical configuration to improve runtime task assignment on a Multi-Processor System-on-a-Chip (MPSoC). Unlike the above-cited works on configuration hierarchy, most similar works use configuration hierarchy at design time. Hence, when implementing an application, functionalities are mapped at design time onto building blocks pre-placed within the reconfigurable logic. The purpose of using such compiler techniques to map functionalities is to make the design process more efficient, by separating complex issues, by relying on the faster software design process, and by using resources more efficiently.

2.6 Coarse-grained Reconfigurable Arrays

2.6.1 Raison D'Être

While presenting FPGAs above, some drawbacks of fine-grain reconfigurable hardware devices were pointed out; they are listed below:
i) Area utilization, power consumption and speed : programmability at bit level implies using exclusively configurable logic to implement logic functions, which leads to lower area efficiency, a lower operating frequency and a higher power consumption.
ii) Configuration overhead : the finer the configuration granularity, the more configuration data there is to process, and the longer the (re)configuration time.
iii) Development ease : it is more difficult to target such devices with high-level languages.

The aforementioned drawbacks can be reduced or overcome by using coarse-grained reconfigurable array operators as the basis of reconfigurable computing machines. In DSP applications, operations are performed on word-size data. Therefore, there is a need for word-size operators and datapaths. Implementing such functionalities using fine-grain configurable resources requires building them at bit level. This leads to a tremendous use of logic blocks and programmable interconnections, along with the configuration data needed for their configuration. The configuration data are consequently bigger.

2.6.2 Presentation

A coarse-grain reconfigurable architecture consists of hardwired word-size operators achieving nearly ASIC-level features (high throughput, low power consumption, better area utilization, etc.), along with special-purpose interconnections providing enough flexibility for targeting a given class of applications. This approach is justified by the fact that, for a given class of applications or programs to run, about 90% of the computation effort in terms of execution time or power consumption is due to about 10% of the code describing the application. Hence, great performance is achieved by designing high-performance operators to cope with these repetitive and computationally-intensive parts of the application. Moreover, these parts are made of common functions (e.g. DSP) which are coarse enough to be identified even at a high abstraction level, making coarse-grain hardware easier to target with hardware description languages than its fine-grain counterparts.

Furthermore, the good performance of ASSPs is currently inspiring FPGA vendors. Indeed, FPGA architectures are increasingly integrating more hardwired functionality (e.g. high-speed transceivers, DSP blocks, etc.) in order to accommodate specific markets. This market-focused approach is seen as a move toward the ASSP path in terms of the targeted applications and the integration of cost-optimized peripherals that meet the needs of those targeted applications. These embedded coarse-grain customized modules optimize one or several criteria, and the FPGAs are classified into sub-families accordingly.

Summarizing, coarse-grain reconfigurable hardware partly sacrifices its flexibility to provide better performance for a given class of applications, while overcoming the drawbacks of fine-grain reconfigurable architectures listed above. These drawbacks explain the attention paid to coarser-grain dynamically reconfigurable hardware as a way to leverage the runtime dynamic reconfiguration feature and improve performance. Raw (Taylor et al., 2002), PipeRench (Goldstein et al., 2000), RaPiD (Ebeling et al., 1996), ADRES (Mei et al., 2003), PACT XPP (Baumgarte et al., 2003) and Montium (Heysters et al., 2003) are a few examples of these architectures. They are classified as application-domain-specific coarse-grain reconfigurable systems. Today, commercial FPGA manufacturers are following the trend by proposing different application-domain-specific FPGAs at each new family release.

2.7 Platform-Based Design

2.7.1 Introduction

Choosing the right platform to implement an embedded application is getting more complicated. Indeed, as stated before, advances in technology bring new considerations. On the one hand, those advances are the enabling technology for the System-on-a-Chip (SoC) design approach.
Hence, the implementation platforms cited above can all be integrated on a single silicon die and, thereby, the architectural design space is getting larger. On the other hand, novel architectures are proposed. They combine features of those implementation platforms in order to overcome their drawbacks, or to achieve a given trade-off.

2.7.2 Definition

Platform-Based Design is a SoC design methodology rather than an implementation platform. According to the VSIA (Virtual Socket Interface Alliance) working group, a platform is "an integrated and managed set of common features, upon which a set of products or product family can be built. A platform is a virtual component (VC)". Hence, the VSIA working group defines Platform-Based Design as "an integration-oriented design approach emphasizing systematic reuse, for developing complex products based upon platforms and compatible hardware and software virtual components (VCs), intended to reduce development risks, costs and time to market". (The VSI Alliance, founded in 1996, was an open international organization comprising representatives from all segments of the SoC industry. Its mission was to dramatically enhance the productivity of the SoC design community by providing leading-edge commercial and technical solutions and insight into the development, integration, and reuse of IP. After 12 successful years of developing IP and electronics standards, the VSI Alliance dissolved operations in 2008 and transferred its ongoing work to other industry organizations. Source : http://www.vsia.org)

For Martin (2003), "Platform-Based design is an organized method to reduce the time required and risk involved in designing and verifying a complex SoC, by heavy reuse of combinations of hardware and software IP. Rather than looking at IP reuse in a block by block manner, platform-based design aggregates a group of components into a reusable platform architecture". Vincentelli and Martin (2001) state that a platform is built to provide the designer with libraries of hardware and software components, software drivers, hardware and software design environments, and reference designs that can easily be used as a basis to rapidly develop many products within a reduced application space.

In the light of those definitions, the Platform-Based Design approach is essentially an IP-reuse-based design of SoCs. It leverages the development of the IP-reuse methodology. The IP-reuse methodology aims to ease the design of complex ASICs by partitioning the design into smaller IP blocks with well-defined functionalities. That way, a validated block can be reused in many designs instead of building the whole system from scratch. Design reuse is the generalization of the IP-reuse principle, and the main pillar of the Platform-Based Design methodology. Here below are a few examples of existing platforms:
- Philips Nexperia (for digital TV applications);
- TI OMAP (from Texas Instruments, for mobile terminal applications);
- Nomadik (for the mobile terminal application domain);
- ARM PrimeXsys (for processor-centric applications);
- Altera's SOPC (e.g. the Excalibur ARM or NIOS reconfigurable platforms);
- Xilinx Virtex Platform FPGAs (e.g. the Virtex II Pro, Virtex 4 and 5 reconfigurable platforms).

With these application-oriented platforms, the design space is easier to explore.
The designers directly derive their end product(s) from an existing platform; they use the complete hard IPs and packaged software solutions (soft IPs, drivers, etc.) provided by the platform and modify the application software according to their needs. They have to customize hardware and software components and/or develop new ones to achieve the given purpose. They work at the application level, developing software and using the available IPs from existing libraries.

2.7.3 OS for Reconfigurable Platforms

The use of the Platform-Based Design approach in so-called Reconfigurable SoC design raises the challenge of designing a real-time operating system (RTOS) for the platform. At a technology-independent level, and no matter which PEs (Processing Elements) are targeted, an RTOS provides services such as scheduling different tasks on different targets under certain constraints, managing memory and communication media, and providing special services for the reconfigurable hardware part of the platform if it exists. As pictured in figure 2.18, the set of services provided by the RTOS allows the designer to abstract the underlying platform details while implementing an application. The application to be implemented is divided into tasks to be scheduled on the PEs of the platform.

The OveRSoC methodology for DSE
There are two main approaches to designing the RTOS for an RSoC. In the first approach, the RTOS is derived from an existing RTOS which is tailored with additional services that manage the reconfigurable part of the platform. The second approach is to design the RTOS from scratch, by introducing RTOS services exploration at system level. This thesis relies on the approach presented in Miramond et al. (2009a), which corresponds to the second approach. In this way, the RTOS is viewed as a flexible component whose features are explored and tuned in the same way as the application and the platform architecture. Furthermore, in the OveRSoC design methodology (depicted in figures 2.18 and 2.19 and presented in Miramond et al., 2009a), each PE likely to be on the platform is modeled as an RTOS which is connected to the rest of the system. Hence, the methodology (figure 2.19) automatically explores task distributions on a scalable multi-RTOS architecture with respect to the application requirements and system constraints, in order to assess and refine the different services which allow the RTOS(s) to efficiently manage the PEs of the platform. In this thesis, such a methodology is denoted as OS-centric or OS-based, as it relies on a third element, the OS, unlike other design methodologies, which rely only on the application and the architecture.

Figure 2.18: A view of the OveRSoC methodology with emphasis on DRA (dynamically reconfigurable architecture) management.

Figure 2.19: OS services exploration in the OveRSoC design methodology (Miramond et al., 2009a), which maps the system-level part of the generic SoC design flow (see figure 1.4).

The concept of a scalable multi-RTOS architecture is further detailed in figure 2.19 through the design flow of the OveRSoC methodology (Miramond et al., 2009a) and DOGME (Miramond et al., 2009b), its dedicated graphical user environment.
This thesis focuses on the management of the dynamically reconfigurable hardware part of the platform and its corresponding DPRHW-OS (Dynamically and Partially Reconfigurable Hardware Operating System), as shown in figure 2.18. Numerous services such as task creation, task scheduling, task placement, task migration, task preemption and quality-of-service assessment may be involved here. A task preemption arises when a task Ti running on a processor PEi is stopped in order to assign PEi to another task Tj of higher priority; Ti is then resumed later. A task migration arises when the task Ti preempted on processor PEi is later resumed on a different processor PEj. As stated at the end of Chapter 1, the thesis deals with the scheduling and placement services in DPRHW, as they are among the OS services that need a different approach compared to conventional OS services for programmable processors. The two objectives remain as follows:

1. Proposing new online real-time scheduling and placement strategies on DPRHW (denoted as DRA, Dynamically Reconfigurable Architecture, in figure 2.19) that suit the online real-time context. This aim is achieved by finding a reasonable trade-off between the complexity of the scheduling and placement algorithms and their performance in terms of chip utilization ratio, task rejection ratio, and runtime overhead.

2. Designing a set of scheduling and placement algorithms for DPRHW which can be used in the OveRSoC design methodology. Figure 2.19 shows the complete design flow of the methodology. Simulation is done at system level in order to refine the scalable architecture along with the distributed OS, with respect to the system and application constraints. In order to allow the methodology to perform a more accurate partitioning of the application, this work also aims at providing scheduling and placement algorithms for the DPRHW part of the architecture (denoted as DRA in figure 2.19), along with the associated DPRHW models and metrics (utilization ratio, task rejection ratio, algorithm runtime overhead, partitioning and defragmentation strategies, etc.). These algorithms are of various complexities and runtime overheads, and help in finding the best trade-off depending on the parameter(s) to optimize. In addition, as the simulation environment of the OveRSoC methodology (Miramond et al., 2009b) is SystemC-based, the C++ language is used to design and implement the models and the algorithms in order to ensure full compatibility.
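As an illustration of the metrics named above, the following C++ sketch (illustrative field and function names, not the thesis's actual API) computes a task rejection ratio and a chip utilization ratio over an observation window:

#include <vector>

// Sketch of two scheduling/placement quality metrics, in the C++
// style the thesis adopts for SystemC compatibility.
struct ScheduledTask {
    int  area;       // resources occupied by the task
    int  exec_time;  // time the task runs on the fabric
    bool rejected;   // true if the task could not meet its deadline
};

// Task rejection ratio: rejected tasks over all submitted tasks.
double rejection_ratio(const std::vector<ScheduledTask>& tasks) {
    int rejected = 0;
    for (const ScheduledTask& t : tasks) rejected += t.rejected;
    return tasks.empty() ? 0.0
                         : static_cast<double>(rejected) / tasks.size();
}

// Chip utilization ratio over an observation window: occupied
// area-time of accepted tasks over the total area-time available.
double utilization_ratio(const std::vector<ScheduledTask>& tasks,
                         int chip_area, int window) {
    long long busy = 0;
    for (const ScheduledTask& t : tasks)
        if (!t.rejected)
            busy += static_cast<long long>(t.area) * t.exec_time;
    return static_cast<double>(busy) /
           (static_cast<long long>(chip_area) * window);
}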
2.8 Conclusion of the Chapter

This chapter presented the most common DSP implementation technologies and their derivatives, and discussed the strengths and weaknesses of each technology. Table 2.1, proposed by Adam (2002), shows a quick comparison. Thanks to the experience earned over the years in designing operating systems and compilers, GPPs and MCUs are the best in terms of time-to-market and flexibility. However, they are suited neither to high-performance DSP applications with hard real-time constraints (figure 2.20) nor to power-aware systems.

Figure 2.20: Flexibility vs Performance of implementation platforms.

            Performance  Flexibility  Power  TTM   Price  Development ease
GPP/MCU     fair         exc.         fair   exc.  exc.   good
DSP         exc.         exc.         exc.   exc.  good   exc.
FPGA        exc.         good         poor   good  poor   exc.
ASSP/ASIP   good         poor         exc.   fair  good   fair
ASIC        exc.         poor         good   poor  exc.   fair

(exc. : excellent)

Table 2.1: Comparative table of implementation platforms for DSP applications (Adam, 2002)

Programmable DSP processors are enhanced for DSP operations and low power consumption. Today, they are still the best solution for many DSP applications. However, the gap between them and GPPs is narrowing, and programmable processors are reaching a plateau in their performance increase. High performance and power efficiency are the realm of ASICs, but the lack of flexibility, in addition to the high Non-Recurring Engineering (NRE) costs, has limited the use of this technology to high-volume products. FPGA technology combines programmability and high performance. Using FPGA-based reconfigurable computing systems or coarse-grained reconfigurable systems to meet the requirements of future embedded systems is a promising solution. A modern trend is to manufacture application-oriented families of reconfigurable hardware devices. The increasing success of configurable processors also confirms this trend. But as appears in table 2.1, research has to improve the two main weaknesses of configurable processors : time-to-market and development ease.

Since architectures are becoming more heterogeneous, spreading the DSP applications within a system across an ever-increasing architectural design space has complicated the design process. This chapter briefly introduced a new trend in embedded SoC design, the Platform-Based Design approach, on which the OveRSoC methodology relies. The chapter discussed how introducing scheduling and placement algorithms for DPRHW into a Platform-Based Design methodology like OveRSoC contributes to efficiently scouring the design space in search of a good solution (e.g. a more accurate system partitioning and RTOS services refinement). The next chapter gives a background on real-time scheduling and then emphasizes online real-time scheduling for reconfigurable hardware devices through a wide literature review.

Chapter 3

Background and Related Work

3.1 Introduction

This chapter gives a background and presents related work on real-time scheduling. The discussion starts by reviewing the scheduling problem in general and subsequently emphasizes the online scheduling of real-time hardware tasks on reconfigurable platforms. In the latter, an underlying placement problem for the reconfigurable hardware arises. The scheduling and placement problems are difficult to study separately, as they always interfere. Different paradigms of real-time scheduling are also presented, especially in the online context. The review constantly emphasizes the similarities and differences between programmable processor scheduling and reconfigurable hardware scheduling. Hence, whenever possible, the review relies on the knowledge previously gained in uniprocessor and multiprocessor scheduling. The reason is that scheduling hardware tasks on reconfigurable hardware devices shows some similarities with multiprocessor scheduling, as many tasks can run concurrently. However, reconfigurable hardware scheduling is more complex to study because of the ever-changing number and size of the tasks that can concurrently fit on the reconfigurable fabric. Since the research domain is very wide, the second half of the literature review is restricted to topics in the various layers that are relevant to online real-time scheduling for dynamically and partially reconfigurable hardware devices.
Figure 3.1: Model of a Real-Time system.

3.2 Real-Time Systems

A real-time system is a system that is subject to real-time constraints. Its primary purpose is to perform critical operations within a set of user-defined critical time constraints (Locke, 1986). As pictured in figure 3.1, a real-time system is primarily reactive, as it continuously reacts to stimuli coming from its external environment. In reaction to these external events, the results provided by a real-time system are only valid if they are delivered within a predetermined time frame. Hence, the correctness of a real-time system depends on two conditions (Stankovic, 1988):
1. its logical correctness : the system must compute correct outputs based on its inputs;
2. its temporal accuracy : output results must be delivered at the right time (by a specified deadline).

Failure to do so results in invalid results. In other words, a result arriving after its deadline is necessarily false or useless, and can lead to serious consequences in some cases. Real-time computer systems are becoming ubiquitous in many applications, such as process control systems in factories, in aeronautics through on-board flight systems for aircraft and satellites, in IT (information technology) systems through multimedia communication, game development and virtual reality, and even, increasingly, in finance through HPC (high-performance computing) for real-time market data processing.

3.2.1 Hard vs Soft Real-Time

In a real-time system, the time scale is application-dependent and can range from a few milliseconds (e.g. airbag systems, automatic pilot systems, etc.) to several hours (e.g. weather forecast). This means that real-time is not only a matter of the average speed of the system: all timing constraints have to be met, otherwise the system fails. Depending on whether failing to meet a time constraint is vital or not, one can distinguish hard real-time and soft real-time systems:

Hard real-time does not tolerate any violation of the time constraints, as such an overrun can lead to critical situations with catastrophic consequences. For example, if an airplane autopilot system, a nuclear power station monitor, an airbag system or a medical system such as a heart pacemaker reacts beyond its strict time limits, it can seriously endanger the safety of human lives. In order to avoid such dangerous situations, the designer of a hard real-time system should be able to prove that the time limits will never be exceeded, whatever the situation (even in the worst case, regardless of the system load). Hence, designing a hard real-time system which interacts with its environment assumes that all possible behaviours of the system are predictable and time-bounded. A single task that misses its deadline constitutes a failure of the whole system. Hard real-time systems are submitted to acceptability tests (e.g. threshold tests, feasibility analysis, admission control, etc.) for their validation.

Soft real-time is less restrictive; it tolerates deadline overruns at the cost of quality, as long as they remain within certain limits beyond which the system becomes useless. Obviously, unlike hard real-time, missing a deadline affects the QoS (e.g. telephony, video conferencing, network games, etc.) without leading to catastrophic consequences. Sometimes, the system allows some exceptional deadline overruns to be compensated during the next execution (e.g.
video frames). What matters in most cases is the average number of deadline overruns, which needs to stay below a given threshold in order to ensure a given QoS in a given situation. Most of the time, a statistical study of the system behaviour is enough to design a trustworthy soft real-time system.

3.2.2 Requirements for Real-Time Computer Systems

As time constraints are essential, designing a real-time system assumes that each element of the system itself is real-time constrained. Hence, service response times and algorithm runtime overheads are necessarily time-bounded. It also assumes that time flows in the system and can be measured. A real-time system consists, on one hand, of resource consumers (tasks, applications) and, on the other hand, of resource providers (processors, memories, etc.). As detailed in figure 2.18 and summarized in figure 3.1, the different parts of a real-time (reconfigurable) system may be seen as layers, with a central part denoted as the OS or RTOS sandwiched between the application layer and the resources layer. The OS, which hosts the scheduler, acts as an interface between the application and the processing elements.

3.3 Real-Time Scheduling

3.3.1 Introduction

Scheduling problems are present in many systems, ranging from factory systems through process control to embedded systems. They appear in any domain where there is a need to organize the allocation of finite resources to a given application consisting of a sequence of tasks. It is then necessary to coordinate the use of the resources in order to run the application to completion as efficiently as possible. This efficiency means optimizing one or several criteria. Such criteria could be minimizing the schedule length (or makespan, defined in section 3.3.4), maximizing the resource utilization ratio, maximizing the number of accepted tasks (e.g. tasks which meet their deadline), etc. In general, a scheduling problem is described by a triplet $\{\alpha, \beta, \gamma\}$, where $\alpha$ represents the machine or processor environment, $\beta$ the application to process on the machine along with its time constraints, and $\gamma$ the objective function to be optimized. Put simply, the application consists of n tasks that have to be processed on m processors while optimizing the objective function $\gamma$. Many scheduling problems have been shown to be NP-complete optimization problems, and many scheduling heuristics of lesser complexity have been proposed.

3.3.2 Real-Time Tasks

A task is a set of instructions to be executed on a processor. It provides a given service to the application. In order to perform an efficient scheduling without violating any time constraints, the scheduler has to know the timing characteristics of the tasks. These characteristics are of different importance depending on the scheduling policy used. In general, a timing parameter is given as a positive integer, a multiple of the smallest indivisible time unit (denoted as a tick in time-aware
In most hard real-time scheduling problems, the execution time e_i is assumed to be equal to the WCET (Worst Case Execution Time) on the considered processor. The WCET of a task T_i is the maximum time that the task will require to run to completion on the considered processor.

P_i : period, for a periodic task.

deadline : the time at which the task must have been completed. There are two types of deadline:
- the relative deadline D_i, if the deadline is relative to the release time of the task instance;
- the absolute deadline d_i = a_i + D_i.

l_i : laxity (or slack time) of a task T_i is the difference between its relative deadline and its execution time, and is defined as

l_i = D_i − e_i (3.1)

Once released at time a_i, a task T_i cannot wait more than its laxity l_i before starting, otherwise it will not meet its deadline.

p_i : priority, if priorities are used. Scheduling algorithms based on the priority of the tasks are denoted as priority-driven or priority-based.

s_i : start time corresponds to the date at which task T_i will start its execution on the processor. s_i is assigned by the scheduler.

f_i : finishing time (or completion time c_i) corresponds to the date at which task T_i will end. Obviously it depends on the start time.

rt_i : response time is the difference between the arrival time and the finishing time, and is given by rt_i = f_i − a_i.

u_{T_i} : the utilization ratio of task T_i is the ratio of its execution time e_i to its period (or its minimal inter-arrival time) P_i, and is given by equation 3.2:

u_{T_i} = e_i / P_i (3.2)

u_{T_i} reflects the fraction of time occupied by task T_i when executed on a single processor.

A job is an instance of a task. Hence the jth instance of task T_i is denoted either as T_{i,j} or as J_{i,j}. In this thesis, a job T_{i,j} or J_{i,j} may sometimes be referred to as task T_i or J_i if there is no possible ambiguity.
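The task model above translates directly into code. The following is a minimal C++ sketch (type and member names are illustrative assumptions, not taken from the thesis library) gathering the main timing parameters and the derived quantities of equations 3.1 and 3.2:

```cpp
#include <cstdint>

// Illustrative sketch of the real-time task model described above.
struct RTTask {
    uint32_t a;  // arrival (release) time a_i, in ticks
    uint32_t e;  // execution time e_i (the WCET on the target processor)
    uint32_t D;  // relative deadline D_i
    uint32_t P;  // period P_i (or minimal inter-arrival time)

    uint32_t absoluteDeadline() const { return a + D; }  // d_i = a_i + D_i
    int32_t  laxity() const {                            // l_i = D_i - e_i (eq. 3.1)
        return static_cast<int32_t>(D) - static_cast<int32_t>(e);
    }
    double   utilization() const {                       // u_i = e_i / P_i (eq. 3.2)
        return static_cast<double>(e) / P;
    }
};
```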
Periodic Real-Time Tasks

A task T_i is denoted as periodic if the instances T_{i,j} of the task are released periodically, with a fixed period P_i, as shown in figure 3.2. In real-time computing systems, an application is sometimes made of computation tasks that have to be performed periodically. Therefore, the model of periodic tasks proposed by Liu and Layland (1973), based on a WCET, is widely used (e.g. in Danne, 2006). If the first release date of a periodic task is unknown (resp. known), the task is denoted as non-concrete (resp. concrete). In addition, concrete periodic tasks are said to be synchronous if their first release dates are identical (e.g. T_i and T_j in figure 3.2), otherwise they are asynchronous. Systems with non-concrete tasks (i.e. with some unknown first release times) show an event-driven behaviour, while concrete task systems are time-driven. If equation 3.3 (where P_i is the period of task T_i) is verified, the real-time system is said to be harmonic:

∀i, j ∈ ℕ: P_i > P_j ⟹ ∃n ∈ ℕ: P_i = n·P_j (3.3)

If D_i = P_i (resp. D_i < P_i), then T_i is said to have an implicit deadline (resp. a constrained deadline). A system exclusively made of tasks with implicit deadlines (resp. with constrained deadlines) is denoted as an implicit-deadline system (resp. constrained-deadline system). In an arbitrary-deadline system, some tasks could have their deadline greater than their period (D_i > P_i). In periodic task systems, the implicit-deadline model is the most widely used (e.g. Danne, 2006).

Let Γ_n be a set of n tasks [T_1, T_2, ..., T_n] and [P_1, P_2, ..., P_n] the corresponding periods. The hyper-period H_p of the periodic task set is the least common multiple of the periods of the task set, as defined in equation 3.4:

∃H_p : ∀T_i ∈ Γ_n, H_p mod P_i = 0; the smallest such H_p is the hyper-period, i.e. H_p = lcm(P_1, ..., P_n). (3.4)

The pattern of job activations is repeated identically in time intervals equal to the hyper-period.

Figure 3.2: Different periodic real-time tasks according to their release time

As a feasible schedule found and validated for the first hyper-period can be reused indefinitely, a system consisting only of periodic tasks is therefore much easier to schedule.

Aperiodic and Sporadic Real-Time Tasks

A practical real-time system rarely consists exclusively of periodic tasks. Indeed, in a real-life scenario some non-periodic tasks could appear at any time in addition to the periodic ones. External events such as sensor activations or target detections trigger such tasks, which pop up and have to be urgently handled by the system. A task is aperiodic when instances of the task are released randomly in time (figure 3.3, bottom). However, as shown in figure 3.3 (top), a task is denoted as sporadic if there is a minimum inter-release time P_i between instances of the task. As summarized in figure 3.4, a periodic task is a special case of a sporadic task where the inter-arrival time remains constant, and a sporadic task is itself a special case of an aperiodic task.

Figure 3.3: Aperiodic task and sporadic task
Figure 3.4: Periodic, sporadic and aperiodic tasks

3.3.3 Different Scheduling Problems

As previously said, a scheduling problem generally consists of processing a sequence of n jobs J_{1..n} on m machines M_{1..m} with respect to a given objective function to optimize. Consequently, a scheduling problem is expressed according to the jobs, the processor(s) and the objective function.

Uniprocessor vs Multiprocessor Scheduling

The distinction between uniprocessor and multiprocessor scheduling depends on whether there is one or many processors available to process the jobs in the system. In both cases, each processor can process only one job at a time, and each job can only be processed by one processor at a time. Uniprocessor scheduling is the simplest scheduling approach and has been widely studied. Scheduling is much more difficult in a multiprocessor system. The number of processors is usually denoted as m. Processors may or may not be identical.

Soft vs Hard Real-Time Scheduling

Hard and soft real-time systems have been presented in section 3.2.1 page 66 above. Depending on which kind of real-time system is considered, one distinguishes hard real-time scheduling from soft real-time scheduling.

Static vs Dynamic Scheduling

In static scheduling, task parameters (e.g. priority) are assigned beforehand and remain unchanged during the system lifetime. In contrast, dynamic scheduling assumes that scheduling relies on task parameters that vary at runtime. For example in priority-driven scheduling, the scheduling is denoted as static if a priority assigned to a task on release cannot be changed throughout its life (e.g. the RM (Rate Monotonic) scheduling algorithm). However, if a dynamic scheduler is used, then the priority assignment of the task can be changed at runtime (e.g. priority inversion, the EDF (Earliest Deadline First) algorithm, etc.). When applied to reconfigurable hardware scheduling and placement, Ahmadinia et al.
(2004) defined a static scheduling and placement as one where the same scheduling and placement rules apply to every single arriving task and the entire reconfiguration area is available for the placement of any task. Their algorithm is dynamic in the sense that their scheduling and placement heuristics are adjusted at runtime depending on some parameters of the arriving tasks. In addition, their FPGA is divided into clusters, and a task could be placed in a given cluster depending on its completion time.

Fixed vs Dynamic Priority Scheduling

Fixed priority scheduling algorithms can operate either at task level or at job level. In the first case, the priority of each task is assigned at design time and remains unchanged (e.g. Rate Monotonic). In the second case, different jobs of the same task may be assigned different priorities; however, each job's priority remains unchanged during its execution (e.g. the EDF scheduling algorithm). Unlike in fixed priority algorithms, in dynamic priority algorithms the priority of any job can be changed at any time (e.g. the LLF, least laxity first, scheduling algorithm).

Preemptive vs Nonpreemptive Scheduling

Preemption arises when a task is interrupted before its completion (and resumed later) in order to assign its processor to another task of higher priority. Preemption only happens when every processor is busy. The preempted task is resumed later. In a nonpreemptive task system, once a task starts its execution it runs to completion. Nowadays, preemption is a mechanism widely used in operating systems. However, preemption makes scheduling much more difficult, as the scheduler has more operations to perform. Indeed, resuming a task that has been suspended earlier requires a context switch, which consists of the task context saving and restoring necessary to ensure continuity in scheduling. Preemptive algorithms can only be applied to tasks that are preemptable (some tasks are preemptable at any time, others at precise times, and others are simply too difficult to preempt properly, because context saving and restoring is difficult to ensure). In addition, a task migration is necessary if the task is to be resumed on a different processor. Task preemptions, context switches, task migrations, and the scheduling itself are system overheads, as they take time in addition to the application task execution times. Furthermore, these system operations are very challenging and time consuming on heterogeneous platforms (e.g.
hardware and software tasks running on heterogeneous processors, including programmable, dedicated and reconfigurable ones: hardware task preemption is a very challenging research concern in reconfigurable computing, and so are software-to-software and hardware-to-software task migrations on heterogeneous processors). In general, a cost is associated with each of these system operations when coping with preemptive scheduling. However, most studies assume that the preemption cost can be neglected as long as task execution times are far higher than context switching runtime overheads. Obviously, preemptive scheduling is much more difficult than nonpreemptive scheduling, and even worse in a heterogeneous multiprocessor system, especially when a suspended task is resumed on a different kind of processor. Throughout this thesis, it is assumed that system overheads are included in the task execution time, which is the WCET.

Precedence Constraints

In general, a system can be represented as a functional block diagram. The functional blocks are identified and characterized tasks. They are sometimes dependent, as some consume (consumers, successors) outputs computed by others (producers, predecessors), imposing a network of interdependencies. Precedence constraints require that one or more jobs be completed before another job is allowed to begin its processing. Precedence constraints clearly appear in the dataflow graph (DFG) modelling commonly used in signal processing. Given two jobs J_i and J_j, if J_i has to be completed before J_j starts, then J_j is the successor of J_i and J_i the predecessor of J_j. Constraints are referred to as chains if each job has at most one predecessor and at most one successor. If each job has at most one successor, the constraints are referred to as in-tree or join type. They are fork type or out-tree if each job has at most one predecessor. This thesis mostly considers independent tasks. However, it also considers the jobs arrive over time online task model (described later in section 3.4.2), which may implicitly model some precedence.

3.3.4 Objective Functions

To measure the quality of a schedule, different indicators are at our disposal. Depending on the aim, these indicators appear as objective functions. Hence, scheduling problems usually consist of optimizing (by minimizing or maximizing) a given objective function. Objective functions could be either the sum of these factors or their average values over all the jobs. The most common of them are:

(i). Makespan or length of the schedule; the makespan C_max (or mk) is the finishing time of the last job completed in the system. The objective function is generally a function of the completion times of the jobs, hence the makespan, which is to be minimized. A minimum makespan leads to a high processor utilization ratio, which is desirable for optimal use of resources. The makespan can also be defined as the length of the schedule or the maximum completion (or finishing) time, not to be mistaken for the sum of the completion times of all jobs, defined as Σ_{i=1}^{n} c_i for a set of n completed jobs.

(ii). Response time or total flow time; it reflects the time that flows from the release of a job in the system to its completion time.

(iii). Waiting time; the time that flows from the release of a job in the system to its starting time.

(iv). Lateness and Tardiness; let J_i be a job with an absolute deadline d_i, completed at time c_i; its lateness L_i and its tardiness τ_i are defined respectively in equations 3.5 and 3.6 as follows:

L_i = c_i − d_i (3.5)
τ_i = max{L_i, 0}, with τ_i ≥ 0 (3.6)

where L_i is positive when J_i completes late and negative when J_i completes before its deadline. In the latter case, J_i can be associated with some value v_i which is obtained only if the task is completed prior to its deadline. Therefore, the higher the accumulated value (and the smaller the makespan), the better the schedule.
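As an illustration, equations 3.5 and 3.6 can be computed by a small C++ helper (names are illustrative):

```cpp
#include <algorithm>
#include <cstdint>

// Lateness and tardiness of a job, per equations 3.5 and 3.6.
struct Lateness {
    int64_t lateness;   // L_i = c_i - d_i, negative if the job finished early
    int64_t tardiness;  // tau_i = max(L_i, 0)
};

inline Lateness evaluate(int64_t c, int64_t d) {   // completion time, deadline
    const int64_t L = c - d;                       // eq. 3.5
    return { L, std::max<int64_t>(L, 0) };         // eq. 3.6
}
```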
However, lateness (hence tardiness) is very meaningful in soft real-time systems, where the system can still ensure a given QoS even when some tasks miss their deadline. One example is when a deadline violation of a job J_i can be tolerated as long as it does not affect or delay the release of the next job J_{i+1} of the same task. The QoS can then be expressed by appointing a tardiness bound to every task in the system. That is, each job J_i of task T_i must be completed at most τ_i time units after its deadline in the worst case. The tardiness bound could also be the same for all the tasks in the system. Viewed as an objective function, the tardiness of a job can be expressed with respect to a given algorithm. In preemptive task systems, minimizing the number of times a job is preempted can also be among the objective functions.

3.3.5 Offline Scheduling

Depending on whether scheduling decisions are taken offline or online, scheduling problems can be divided into two categories: offline and online. A schedule is denoted as offline if the flow of the program is known beforehand. This assumes that the parameters (e.g. release time, execution time, deadline, size, etc.) of all the tasks are known in advance, allowing the scheduler to plan resource allocation at design time. Most scheduling problems are NP-complete. High performance scheduling algorithms and heuristics are used to optimize objective functions (e.g. makespan, resource utilization, waiting time, response time, etc.). However, one can afford to run such computationally intensive algorithms, as analysis and computations are done offline (e.g. on a host PC or a high performance computer). One example of offline scheduling arises when an application is made of a set of well-known periodic tasks. The scheduling simply consists of finding a feasible schedule over the hyper-period. Consequently, at each time interval equal to the hyper-period, the same feasible schedule is applied. The feasible schedule is statically stored in a scheduling table that indicates, at each clock cycle, which job is to be run. Offline scheduling is not in the scope of this thesis. However, it can be used to provide an optimal solution for the competitive analysis of its online counterpart. An interesting survey of offline scheduling can be found in Graham et al. (1979).

3.4 Online Scheduling

3.4.1 Introduction

Online scheduling is used when the flow of the program is either totally or partially unknown beforehand. In other words, the parameters of the tasks to be scheduled are partially or totally unknown prior to their release time. Hence, tasks are scheduled as they arrive, without a priori knowledge of future tasks. The scheduler relies on incomplete information to assign jobs to resources. Such a scenario is much more realistic than the offline scenario. Indeed, even though offline scheduling algorithms provide optimal solutions, they are less suitable for most real-time systems. Generally, as such systems are mainly reactive, there is a lack of information on future tasks (e.g. their release time) and sometimes even on tasks not yet completed (e.g. their execution time). This lack of information is the main difference between offline and online scheduling. It explains why, unlike online scheduling, offline scheduling leads to an optimal schedule. In general, online scheduling is priority-driven: the highest priority job is chosen by the scheduler among the ready jobs in order to be executed on the available resource. The scheduler is invoked each time a job is released, or a running job is stopped or finished, or even at regular time intervals for quantum-based scheduling.
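A minimal C++ sketch of such a priority-driven, event-triggered scheduling loop is given below; the Job type, the priority convention and the dispatch hook are illustrative assumptions, not the thesis implementation:

```cpp
#include <queue>
#include <vector>

struct Job { long priority; long id; };   // smaller priority value = more urgent

struct ByPriority {
    bool operator()(const Job& a, const Job& b) const {
        return a.priority > b.priority;   // min-heap on the priority value
    }
};

class OnlineScheduler {
public:
    // The scheduler is invoked on each job release and each job completion.
    void onRelease(const Job& j) { ready_.push(j); dispatch(); }
    void onCompletion()          { dispatch(); }
private:
    void dispatch() {
        if (ready_.empty()) return;
        Job next = ready_.top();          // highest-priority ready job
        ready_.pop();
        (void)next;                       // ... assign `next` to the free resource ...
    }
    std::priority_queue<Job, std::vector<Job>, ByPriority> ready_;
};
```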
3.4.2 Different Online Paradigms

As stated above, online scheduling relies on incomplete information about input instances to take scheduling decisions. Depending on the type of information lacking and on the way new information becomes known, different online paradigms are possible. Several are presented in Sgall (1998); below are a few that are relevant to the scope of this work:

(i). Scheduling jobs one by one; also denoted as jobs arrive over list, there is no job release time in this online paradigm. Jobs appear as an ordered list and are presented to the decision maker (the scheduling algorithm) as is, one by one. However, as soon as a job is presented, all its characteristics are known, including its running time. Therefore, the job could be assigned to a time slot with an immediate or a delayed start. When dealing with the currently presented job in the list, the assignment of the previous job cannot be changed anymore, so previously scheduled jobs cannot be rescheduled. To summarize, jobs are scheduled strictly in list order, but are not necessarily executed in that order. The online feature is the fact that the parameters of future tasks are unknown before they are presented. The LS (list scheduling) algorithm, section 3.5.5 on page 82, is based on this paradigm.

(ii). Clairvoyant Scheduling; unlike in the previous paradigm, release time is meaningful in clairvoyant scheduling algorithms. The algorithm only becomes aware of a job when it arrives. Hence, it knows the running time of a job (along with its other parameters, e.g. its size in the case of hardware tasks) as soon as it arrives. Clairvoyant scheduling is also denoted as the jobs arrive over time online paradigm, as jobs become available to the algorithm over time, as soon as they are released.

(iii). Non-Clairvoyant Scheduling; in non-clairvoyant scheduling, the online algorithm only becomes aware of a job when it arrives. In addition, unlike in clairvoyant scheduling, the processing time of a running job remains unknown until the job finishes. Indeed, the algorithm only becomes aware of the end of the job at its completion. Consequently, non-clairvoyant scheduling is sometimes described as blind scheduling. This paradigm is also referred to as unknown processing times, as the latter is its main online feature. Jobs are released over time according to their arrival time or their precedence constraints. However, other characteristics of a task (e.g. the size of a hardware task) are known when the job arrives. Sometimes in this paradigm, it is assumed that at every moment the number of pending jobs is known. Non-clairvoyant scheduling has been widely studied (e.g. Motwani et al., 1994; Edmonds, 2000). Its online features rely on the lack of knowledge of future tasks (e.g. arrival time and running time).

(iv). Interval Scheduling; this paradigm assumes that each job has to be executed in a precisely given time interval, otherwise it will be rejected. Consequently, the number of accepted tasks is much more meaningful here than the makespan or the total completion time.

This thesis mainly considers clairvoyant scheduling as the online paradigm. Hardware jobs become available over time according to their release time. Other parameters of each job, such as execution time, deadline and size (width and height),
are available as soon as the job arrives.

3.4.3 Performance Analysis

Optimal solutions are not expected in online problems because of the lack of a priori knowledge of the problem. However, it is important to measure the performance of a schedule with respect to a given objective function. To achieve this, one way is to perform an average-case analysis by considering the average performance of the scheduling algorithm or heuristic over all its possible inputs. This analysis may even be performed on classes of inputs, in order to classify the algorithm's behaviour with respect to the range of task parameters. Indeed, even in online scheduling, the range of values of some task parameters could be partially or totally known beforehand (e.g. number of jobs, range of execution times, range of sizes, minimum inter-task arrival time, etc.). In addition, as the algorithm or heuristic learns about jobs piece by piece over time, another way to improve its behaviour is to make some probabilistic assumptions on future jobs, based on current and past jobs. However, an average-case analysis cannot capture worst-case scenarios where the algorithm performs very poorly.

Competitive Analysis

A failure in an online system is avoidable by measuring the minimum guarantee that an online scheduler can provide, even in its worst-case behaviour. This is achievable using competitive analysis, first introduced by Sleator and Tarjan (1985). Based on a worst-case analysis, the competitive analysis of an algorithm consists of comparing the worst solution (with respect to a given objective function) provided by the algorithm with the optimal solution. For example, in an online scenario, the worst solution of the online algorithm A_on will be compared with the optimal solution of A_opt obtained in the corresponding offline case. Consequently, competitive analysis is a more intuitive way of assessing the performance of an online algorithm.

Competitive Ratio

The performance of an online algorithm is measured by its competitive ratio. As scheduling aims at optimizing a given objective function, the latter could be a cost to minimize (e.g. makespan) or a benefit to maximize (e.g. utilization ratio, number of accepted tasks, etc.). Let σ(A_on, Γ_n) be the objective function value resulting from the online algorithm A_on applied to Γ_n, a set of n tasks T_1, T_2, ..., T_n. Let I_n be an input job sequence J_1, J_2, ..., J_n of task set Γ_n. Algorithm A_on is said to be c-competitive if, for any input instance I_n, the objective value produced by A_on is at most c times worse than that obtained with the optimal algorithm A_opt, as shown in equation 3.7:

c = sup over all I_n of σ(A_on, I_n) / σ(A_opt, I_n), if σ is a cost to minimize
c = sup over all I_n of σ(A_opt, I_n) / σ(A_on, I_n), if σ is a benefit to maximize (3.7)

As the competitive ratio is based on a worst-case analysis, c is chosen as the supremum of the ratio in equation 3.7 over all possible input instances I_n. In such a case, c is also denoted as the upper bound of the competitive ratio of algorithm A_on. Equation 3.7 also shows that c is at least equal to 1 (c ≥ 1). The closer c is to 1, the better the online algorithm A_on. The competitive ratio c of a given algorithm A_on expresses the fact that no other algorithm is capable of performing more than c times better than A_on, even in the offline case where the entire problem instance is known in advance; note that the latter case enables optimal solutions.
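In practice, an empirical estimate of the competitive ratio for a cost objective can be obtained by taking the supremum of the online-to-optimal cost ratio over a set of benchmark instances. A minimal C++ sketch follows; instance generation and both solvers are assumed to exist elsewhere:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Empirical competitive ratio for a cost objective (eq. 3.7): the supremum,
// over the tested instances, of online cost divided by optimal offline cost.
// Both vectors hold one objective value per instance, produced elsewhere.
double empiricalCompetitiveRatio(const std::vector<double>& onlineCost,
                                 const std::vector<double>& optimalCost) {
    double c = 1.0;   // c >= 1 by definition
    for (std::size_t i = 0; i < onlineCost.size() && i < optimalCost.size(); ++i)
        c = std::max(c, onlineCost[i] / optimalCost[i]);
    return c;
}
```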
The competitive ratio thus clearly reflects the advantage of knowing the entire problem beforehand. The upper bound reflects the worst competitive ratio that the algorithm could achieve on any input instance. Hence, to obtain a more accurate upper bound, it may be necessary to carefully build input instances that make the online algorithm perform as poorly as possible. One way of improving the algorithm design is then to reduce its upper bound. In contrast, the lower bound (of the competitive ratio) of an online algorithm indicates that, for an online problem, the competitive ratio of that algorithm cannot be less than this lower bound. In other words, the lower bound of an algorithm reflects the best performance achievable by the algorithm on an online problem.

3.4.4 Schedulability Analysis

Feasible Schedule: The basic challenge of scheduling is to assign a starting time and a processor to every single task or task instance in the system. For example, in a microprocessor-based system, the scheduler decides in which order and at which date the single-task processor will be assigned to the different tasks of the application. This aim is not always achievable, as processing resources are finite and may not be able to run all the tasks in the system to completion while meeting their deadlines, no matter which scheduling policy is used. Hence, a schedule is denoted as feasible on given computing resources (e.g. one or many processors) if it allows all the tasks of the application to be successfully scheduled on the resources without violating their time constraints (e.g. deadlines). A set of tasks is schedulable on a real-time system if there is a feasible schedule. In the latter case the system is underloaded, otherwise it is overloaded. The schedulability of the system highly depends on:

(i). the constraints of the application (tasks);
(ii). the available resources (number, size and type of PEs, etc.);
(iii). the scheduling policy used, which consists of algorithms and heuristics.

3.5 RT Scheduling for Uniprocessor Systems

Scheduling theory has been intensively studied over the years, and uniprocessor scheduling takes the lion's share of the rich literature. Liu and Layland (1973) provide a foundational reference on the subject. Many optimal uniprocessor scheduling algorithms have been proposed along with their schedulability analyses. Some of them are presented below. The end of this section discusses how microprocessor scheduling could be applied to reconfigurable hardware scheduling.

3.5.1 Rate Monotonic (RM)

RM scheduling was first introduced by Liu and Layland (1973). It is a preemptive, fixed-priority scheduling scheme for periodic and independent task systems. Priorities are assigned to tasks in such a way that the shorter the task period, the higher the priority. Liu and Layland (1973) have proven RM optimal among all fixed-priority schemes when applied to the above-mentioned preemptive, periodic and independent task systems. They also provided a sufficient but not necessary schedulability test for the RM algorithm on a set of n tasks, as defined in expression 3.8 below, where c_i is the processing time of task T_i, P_i its period, and Σ_{i=1}^{n} c_i/P_i the processor utilization ratio:

Σ_{i=1}^{n} c_i/P_i ≤ n(2^{1/n} − 1) (3.8)

However, RM is not optimal for non-preemptive scheduling. RM scheduling is not used in this thesis, as periodic task systems are not considered.
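The test of expression 3.8 is straightforward to implement. The C++ sketch below (illustrative names) checks the Liu and Layland bound; since the condition is only sufficient, a set that fails the test may still be schedulable:

```cpp
#include <cmath>
#include <vector>

struct PeriodicTask { double c; double P; };   // WCET c_i and period P_i

// Sufficient (not necessary) RM schedulability test, expression 3.8:
// sum(c_i / P_i) <= n * (2^(1/n) - 1).
bool rmUtilizationTest(const std::vector<PeriodicTask>& ts) {
    if (ts.empty()) return true;               // nothing to schedule
    double u = 0.0;
    for (const PeriodicTask& t : ts) u += t.c / t.P;
    const double n = static_cast<double>(ts.size());
    return u <= n * (std::pow(2.0, 1.0 / n) - 1.0);
}
```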
3.5.2 Deadline Monotonic (DM)

DM is a scheduling algorithm that gives the highest priority to the task with the smallest relative deadline D_i. In general, DM is optimal for systems that consist of preemptive, synchronous, independent tasks whose relative deadline is less than or equal to their period (D_i ≤ P_i). DM can be used with periodic, aperiodic and sporadic task systems. A sufficient schedulability test for DM scheduling of n tasks Γ_n = [T_1, T_2, ..., T_n] is expressed as:

∀i, 1 ≤ i ≤ n: c_i + Σ_{j=1}^{i−1} ⌈D_i/P_j⌉·c_j ≤ D_i (3.9)

3.5.3 Earliest Deadline First (EDF)

Also first introduced by Liu and Layland (1973), EDF can be applied to various task models (preemptive, non-preemptive, periodic, non-periodic, etc.). EDF scheduling is a dynamic assignment scheme where the highest priority is assigned to the task with the closest absolute deadline. Priorities are reassessed and updated at runtime if necessary (e.g. on each task arrival). EDF has been proven to be optimal for many task models, including preemptive scheduling of periodic and independent task sets on a microprocessor. The necessary and sufficient scheduling condition is given by:

Σ_{i=1}^{n} c_i/P_i ≤ 1 (3.10)

with P_i = D_i, where P_i (resp. D_i) is the period (resp. the relative deadline) of task T_i. Σ_{i=1}^{n} c_i/P_i is the time utilization factor of the processor and reflects the fraction of time the processor is actually running a task. Obviously, this fraction may not exceed 100%. Later, EDF was also shown to be optimal in the case of non-periodic tasks. EDF scheduling outperforms RM and produces fewer preemptions than RM.

3.5.4 Least Laxity First (LLF)

LLF is a priority-based scheduling scheme where the task with the smallest laxity is scheduled first. This algorithm is quite close to the EDF algorithm, with a similar necessary and sufficient scheduling condition Σ_{i=1}^{n} c_i/P_i ≤ 1 for a set of n preemptive tasks, and for P_i = D_i. However, with the LLF algorithm many tasks are likely to have the same laxity (which means the same priority), leading to situations where many preemptions are performed in a short time frame, which is not desirable. The latter situation is less problematic in a multiprocessor system.

3.5.5 List Scheduling (LS)

Ronald L. Graham was the first to prove the competitiveness of an online scheduling algorithm, in 1966, through the so-called list scheduling algorithm. LS is the most commonly used online scheduling approach thanks to its simplicity. Indeed, it relies on the scheduling jobs one by one online paradigm described on page 76. In the LS algorithm, the tasks to be processed are listed in order. When a resource is available, the first task in the list is elected, computed and removed from the list. A task is available if it heads the list and is free of precedence constraints, if any. As there is no prior knowledge of the tasks in the list, LS is a suitable and simple model for online problems.
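For m identical machines, Graham's LS amounts to assigning each job, in list order, to the machine that becomes available first. A minimal C++ sketch, with illustrative names, is given below:

```cpp
#include <functional>
#include <queue>
#include <vector>

// Graham-style list scheduling on m identical machines: each job, taken in
// list order, starts on the machine with the earliest availability time.
// Returns the start time of each job, in list order.
std::vector<long> listSchedule(const std::vector<long>& jobLengths, int m) {
    // Min-heap of machine availability times, all machines free at t = 0.
    std::priority_queue<long, std::vector<long>, std::greater<long>> avail;
    for (int i = 0; i < m; ++i) avail.push(0);
    std::vector<long> startTimes;
    for (long len : jobLengths) {
        long t = avail.top(); avail.pop();   // earliest-available machine
        startTimes.push_back(t);
        avail.push(t + len);                 // machine busy until t + len
    }
    return startTimes;
}
```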
3.5.6 Uniprocessor Scheduling Model for Reconfigurable Hardware

In this scheduling model, the entire reconfigurable hardware device is viewed as a single processor. Therefore the device is assigned to only one task (or one set of tasks taken as a whole and active at nearly the same time) at a time. One example is shown in figure 3.5, where tasks T_1, T_2, T_3 and T_4 are sequentially executed on the reconfigurable hardware, on a time-sharing basis. In order to achieve this, the tasks or jobs (τ_1, τ_2, ..., τ_10) are time-partitioned into four tasks T_1, T_2, T_3 and T_4. The reconfigurable hardware device is not space-partitioned in this model. Hardware task scheduling on a time-sharing model of the reconfigurable hardware device is the simplest, thanks to its similarities with the well-studied scheduling on uniprocessor systems. Hence, the scheduling problem is reduced to a uniprocessor scheduling problem. The same above-mentioned scheduling algorithms (EDF, LLF, RM, etc.), along with their results, can be directly applied. However, the intrinsic parallelism provided by hardware implementation is not exploited here. Moreover, this approach leads to a high internal fragmentation (presented later in section 3.9.1, page 124) and a low reconfigurable hardware device utilization ratio, as the whole device is assigned to a single task at a time, no matter how small the size of the task compared to the size of the reconfigurable array.

Compound tasks

A compound task is a task composed of many tasks or subtasks. In order to exploit the parallelism provided by reconfigurable hardware and improve the reconfigurable device area utilization ratio, one solution is to concurrently run many tasks instead of one at a time as previously described. This can be achieved by forming subsets of tasks that can be treated as one global task, as long as they share some timing and/or geometric similarities. Let Γ_n be a set of n jobs τ_1, τ_2, ..., τ_n to be scheduled on the DRHW. As depicted in figures 3.5 and 3.7, there are two ways of forming subsets of jobs that can be executed concurrently on the partially reconfigurable hardware device.

1. Time sharing (space overlapping) compound tasks

As pictured in figure 3.5, one way of overcoming the aforementioned drawbacks is to partition Γ_n into m subsets of compound tasks T_1, T_2, ..., T_m in such a way that:

(i). m ≤ n

(ii). ∀τ_i ∈ T_j, all arrival times a_i in the subset T_j lie within a given interval; t_a = max(a_i) is the arrival time that is assigned to the task subset T_j, and t_d = min(d_i), the smallest deadline in the task subset T_j, is assigned as the deadline of T_j; where a_i (resp. d_i) is the arrival time (resp. deadline) of job τ_i, and t_a (resp. t_d) the arrival time (resp. deadline) of the task subset T_j.

(iii). ∀i ∈ [1, n], ∀j ∈ [1, m]:
(a). w_j × h_j ≤ W × H ⟹ the reconfigurable hardware must be able to fit the compound task T_j;
(b). if τ_i ∈ T_j then w_i × h_i ≤ w_j × h_j ⟹ compound tasks T_j are bigger;
where w_i × h_i (resp. w_j × h_j) is the size of task τ_i (resp. task T_j) and W × H the total size of the reconfigurable hardware.

(iv). τ_1 ∪ τ_2 ∪ ... ∪ τ_n = T_1 ∪ T_2 ∪ ... ∪ T_m

(v). T_1 ∩ T_2 ∩ ... ∩ T_m = ∅

Figure 3.5: Uniprocessor model for reconfigurable hardware devices with time sharing compound tasks
Figure 3.6: Compound tasks timing characteristics

Condition (1i) expresses the fact that each subset T_j is a compound task that contains one or many tasks τ_i, therefore making T_j bigger (as also expressed by condition (1(iii)b)). This can be seen in figure 3.5, where each task T_j contains 2 or 3 jobs τ_i and therefore improves the utilization ratio of the reconfigurable array. Furthermore, the number of reconfigurations is bounded by m. Conditions (1ii) express the fact that the timing similarities of the jobs τ_i are the main factors that govern task grouping.
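As an illustration of the timing conditions in (1ii), the short C++ sketch below groups jobs whose arrival times fall within a window delta of each other and assigns t_a = max(a_i) and t_d = min(d_i) to each group; delta is an assumed tuning parameter, and the area conditions (1iii) are deliberately ignored here:

```cpp
#include <algorithm>
#include <vector>

struct HwJob { long a; long d; };   // arrival time a_i, absolute deadline d_i
struct CompoundTask { long ta; long td; std::vector<HwJob> jobs; };

// Jobs are assumed sorted by arrival time; `delta` bounds the arrival window.
std::vector<CompoundTask> groupByArrival(const std::vector<HwJob>& sorted,
                                         long delta) {
    std::vector<CompoundTask> out;
    for (const HwJob& j : sorted) {
        if (out.empty() || j.a - out.back().jobs.front().a > delta) {
            out.push_back({ j.a, j.d, { j } });   // open a new compound task
        } else {
            CompoundTask& t = out.back();
            t.ta = std::max(t.ta, j.a);           // t_a = max(a_i)
            t.td = std::min(t.td, j.d);           // t_d = min(d_i)
            t.jobs.push_back(j);
        }
    }
    return out;
}
```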
Hence, all the jobs τ_i of a given compound task T_j must be active at nearly the same time in order to be managed as a single task T_j. This is also illustrated in figure 3.6, where Δ_i is the time interval within which all the subtasks of compound task T_i arrive. Conditions (1iv) and (1v) emphasize that the compound tasks T_1, T_2, ..., T_m differ from each other. As T_1, T_2, ..., T_m do not overlap in time, they can be sequentially executed on the reconfigurable array on a time-sharing basis, as illustrated in figure 3.5. These timing constraints could lead to situations where the reconfigurable array is poorly occupied. The reconfigurable array is fully reconfigured between two task executions.

However, on one hand, partitioning a set of n tasks into m distinct subsets of tasks as described above is not easy to perform. On the other hand, if most of the subtasks τ_i of task T_j are mostly idle during the running time of T_j (as they are probably not strictly all active at the same time), the real utilization ratio of the reconfigurable array will remain poor.

One main advantage provided by this time-sharing uniprocessor model is that the fragmentation problem is eschewed. Each compound task T_j is designed as a whole, and its bitstream file is used to fully reconfigure the device on demand. Each compound task T_j can contain as many jobs τ_i as possible, as long as the total amount of resources does not exceed the amount of resources available on the reconfigurable hardware device, so a high device utilization ratio is achievable. As this design approach is free of the rectangular partition constraints that govern the module-based design methodology described in Chapter 2 section 2.5.8, there is no internal fragmentation. Figure 3.5 shows that a job τ_i could be of any shape without preventing the placement of other jobs τ_k that overlap with its surrounding rectangle (see section 3.9.1, page 124 on internal and intra-task fragmentation). Partial reconfiguration is of no use in this model. Full reconfiguration results in higher reconfiguration time overheads and complex task switching. However, the number of full reconfigurations (which is m as shown in figure 3.5, and bounded by n in the worst case if preemption is not allowed) is lower compared to the following uniprocessor model.

2. Space sharing (time overlapping) compound tasks

Figure 3.7 depicts another variant of the uniprocessor model. Once again, let Γ_n be a set of n jobs τ_1, τ_2, ..., τ_n. Γ_n is partitioned into m subsets of compound tasks T′_1, T′_2, ..., T′_m in such a way that:

(i). m ≥ n if at most one job is released at a time.

(ii). T′_1 ∩ T′_2 ∩ ... ∩ T′_m ≠ ∅

(iii). ∀i ∈ [1, n], ∀j ∈ [1, m]:
(a). w_j × h_j ≤ W × H ⟹ the reconfigurable hardware must be able to fit the compound task T′_j;
(b). if τ_i ∈ T′_j then w_i × h_i ≤ w_j × h_j ⟹ compound tasks T′_j are bigger;
where w_i × h_i (resp. w_j × h_j) is the size of task τ_i (resp. task T′_j) and W × H the total size of the reconfigurable hardware.

Figure 3.7: Uniprocessor model for reconfigurable hardware devices with space sharing compound tasks

As each compound task T′_j contains one or many jobs τ_i, T′_j is therefore bigger (as also expressed by condition (2(iii)b)) and improves the device area utilization ratio. Condition (2i) expresses the fact that there are at least as many compound tasks (T′_1, T′_2, ..., T′_m) as jobs (τ_1, τ_2, ..., τ_n).
The main reason is that two or more compound tasks can share some common jobs τ_i. This is also shown by condition (2ii), which means that compound tasks are very likely to be partly similar. For example, in figure 3.7 each of the four compound tasks T′_1, T′_2, T′_3 and T′_4 contains job τ_1, making them slightly similar. However, they are considered as four distinct tasks that need a complete reconfiguration of the device at every task switch. The idea is to treat any change in the list of currently running tasks as a task switch. Hence, as τ_2 is replaced by τ_3 and then τ_3 by τ_4, these replacements correspond to task switches from T′_1 to T′_2 and then from T′_2 to T′_3, as pictured. One advantage of fully reconfiguring the device at every task switch is that some tasks can be relocated in order to make bigger contiguous free spaces, hence improving the device utilization ratio.

The main drawback of this model is that the device is often and fully reconfigured, which is time consuming. As shown in figure 3.7, the similarities between T′_1 and T′_2 are not taken into account during the reconfiguration. Indeed, the device is fully reconfigured every time a new job τ_i is removed from and/or added to the device (4 full reconfigurations for 5 jobs, compared to 4 full reconfigurations for 10 jobs in figure 3.5). For example, for n jobs τ_1, τ_2, ..., τ_n, if we assume that at most one job arrives at a time, then the device needs to be fully reconfigured at least n times throughout the execution of the n jobs (or the m compound tasks). In any case, each task ending and each task beginning (and even each task preemption, if allowed) is likely to induce a full reconfiguration of the device. Thus, the number of reconfigurations drastically increases with the number of tasks, and each new task corresponds to a reconfiguration, i.e. a bitstream file to be downloaded onto the device. As shown in Chapter 2 section 2.5.8 while describing the FPGA modular design flow, each compound task T′_j is designed and synthesized off-line and stored as a bitstream file that fully reconfigures the reconfigurable array at the appropriate time. Thus, all possible scenarios of tasks that are likely to run concurrently on the device are prepared beforehand. As the number of reconfigurations is huge (at least equal to m, with m ≥ n), the resulting bitstream file storage memory is tremendous.

However, if the reconfigurable device enables runtime partial reconfiguration, the amount of configuration data, and therefore the reconfiguration time overhead, can be reduced. Depending on the similarities between two tasks T′_j and T′_{j+1} to be sequentially executed on the device, the partial bitstream can be orders of magnitude smaller than the full bitstream and can be quickly loaded. In fact, for a set of n tasks τ_{i=1..n}, one only needs to generate and store beforehand n partial bitstreams, one per task. In such a case, the switch from task T′_j to task T′_{j+1} is performed by reconfiguring only the part of the reconfigurable hardware device that needs to be changed. If applied to the example in figure 3.7, throughout the execution of T′_1, T′_2, T′_3 and then T′_4, only the area of the device where the new task τ_i is to be hosted is reconfigured on-the-fly, while the remaining part occupied by τ_1 is not reconfigured and remains unchanged. Partial reconfiguration capability fully exploits the task redundancies that may appear in some compound tasks.
Either the module-based or the difference-based design methodology presented earlier in Chapter 2 section 2.5.8 can be used here to implement partial reconfiguration. However, the module-based methodology induces more fragmentation, as tasks are constrained to rectangular shapes.

3.6 RT Scheduling for Multiprocessor Systems

The problem of scheduling tasks on many processors cannot be seen as a simple extension of uniprocessor scheduling. Before focusing on reconfigurable hardware device scheduling in the next section, the current section gives a short taxonomy of multiprocessor system scheduling along with its most significant results. The reason is that, in a certain way, reconfigurable hardware scheduling shares some similarities with multiprocessor scheduling. These similarities will be pointed out as they appear.

An increasing number of real-time applications require many processors to achieve their performance goals. In uniprocessor systems, performance is improved by increasing the speed and, unfortunately, the power consumption. However, a multiprocessor system achieves better performance at a lower cost compared to its fastest uniprocessor counterpart. Consequently, on one hand advances in technology are enabling Multiprocessor Systems-on-a-Chip (MPSoC), and on the other hand, in embedded systems, many processors can be distributed in different parts of the system, each dedicated to a specific task. As stated in the first chapter, examples can be found in modern car control systems, which are usually divided into fifty-plus processor-controlled subsystems (e.g. ABS (Anti-lock Braking System), airbag systems, fuel injection system, etc.).

3.6.1 Multiprocessor Scheduling Problem

In a multiprocessor system there are m ≥ 2 processors available for computations. These processors are denoted as M_1, M_2, ..., M_m. In a multiprocessor scheduling problem, a list of n jobs J with nonnegative processing times has to be scheduled on m processors, with the aim of completing all the jobs without violating their time constraints and while minimizing a given objective function (e.g. makespan). Additional constraints are:

(i). the type of tasks (periodic, aperiodic or sporadic);
(ii). preemption (allowed or not);
(iii). task migration;
(iv). precedence constraints (independent tasks or not);
(v). task parallelism: if allowed, different jobs of a task can be executed concurrently on different processors;
(vi). processor M_i's capability of executing a single task at a time (e.g. sequential processors) or more (e.g. reconfigurable hardware devices).

By default, this thesis assumes that tasks are aperiodic, non-preemptive and independent.

3.6.2 Multiprocessor Platforms

Unlike in uniprocessor systems, it is possible to execute many jobs of the application concurrently. There are mainly two types of multiprocessor platforms:

1. Homogeneous Processor System: Processors in the platform are identical in terms of speed. In such a system, any task can be executed on any processor with an execution time which is therefore processor-independent, as the speeds of the processors are strictly similar. The identical processors approach is the simplest multiprocessor model, but it would lead to a huge internal fragmentation (as discussed below in section 3.9.1) if applied to hardware task scheduling on partially reconfigurable hardware devices.

2.
Heterogeneous Processor System: Processors in the platform are of different types and speeds. Task processing times are therefore processor-dependent. On one hand, if any task can be run on any processor with different execution times, the platform is called a uniform processors platform. In this case, each processor M_i is characterized by its speed s_i. On the other hand, when some tasks can only be run on some processors, the heterogeneous platform is denoted as an independent processors platform. Consequently, each task is characterized by the processors capable of running it, along with the execution time required on each processor. As stated in the introduction of the thesis, embedded platforms are turning into heterogeneous MPSoCs. As this thesis considers SoCs with reconfigurable parts, a heterogeneous processors system can perfectly model such a platform in some special cases where the reconfigurable fabric contains a fixed number of m slots or clusters. The tasks could therefore be as heterogeneous as the platform, and consist of software and/or hardware versions.

3.6.3 Partitioned vs Nonpartitioned Scheduling Strategies

Traditionally, there are two classes of scheduling algorithms for multiprocessor platforms: partitioned scheduling and global scheduling. Partitioned scheduling assigns each task to a particular processor. The task is scheduled on the local processor using a uniprocessor strategy. Task migration is not allowed. Global or nonpartitioned scheduling shares all tasks across all processors. At every moment, the m highest-priority tasks ready for execution are scheduled on the m processors. Thanks to task migration, global scheduling is likely to achieve higher schedulability than partitioned scheduling. However, task migration and preemption increase the runtime overhead. This section presents partitioned scheduling and non-partitioned scheduling. Throughout this section, it is assumed that Γ_n is a set of n tasks to schedule on m processors M_1, M_2, ..., M_m.

1. Partitioned or Local Scheduling

Let Γ_n be a set of n jobs τ_1, τ_2, ..., τ_n. Γ_n is partitioned into m subsets of tasks T_1, T_2, ..., T_m in such a way that:

(i). ∀T_j, j ∈ [1, m], T_j contains at least one task τ_i;
(ii). T_1 ∩ T_2 ∩ ... ∩ T_m = ∅;
(iii). Γ_n = τ_1 ∪ τ_2 ∪ ... ∪ τ_n = T_1 ∪ T_2 ∪ ... ∪ T_m.

Each subset T_j is scheduled on the same processor M_j using a monoprocessor scheduling approach. Task migration is not allowed. Hence, all instances of a task are scheduled on the same processor by monoprocessor scheduling algorithms such as EDF, LLF or RM. Indeed, as monoprocessor scheduling has been widely studied in terms of schedulability analysis, the same results are easily transposed to each task subset T_j and its associated processor M_j. Moreover, a monoprocessor approach avoids task migration and the resulting context switching, which is very challenging and time consuming in multiprocessor systems.

2. Non-partitioned or Global Scheduling

Unlike in the local approach presented above, any task or any job in the set Γ_n can be executed on any processor. Furthermore, a task can be preempted on one processor and resumed later on another processor (which amounts to task migration, if preemption is allowed). The strategy is global as there is no partitioning done beforehand, and at every decision time, the m processors of the platform are assigned to the m highest-priority tasks.
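A global decision point can be sketched in a few lines of C++: at every scheduling event, the m most urgent ready tasks are selected for the m processors. The priority convention (smaller is more urgent, e.g. the absolute deadline for global EDF) and the names are illustrative assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

struct Ready { long priority; long id; };   // smaller priority value = more urgent

// At every scheduling event, keep the m most urgent ready tasks, one per
// processor (e.g. priority = absolute deadline for a global EDF scheme).
std::vector<Ready> pickForProcessors(std::vector<Ready> ready, std::size_t m) {
    std::sort(ready.begin(), ready.end(),
              [](const Ready& a, const Ready& b) { return a.priority < b.priority; });
    if (ready.size() > m) ready.resize(m);  // the rest keeps waiting
    return ready;
}
```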
3.6.4 Multiprocessor Scheduling Model for Reconfigurable Hardware

The previous section presented the partitioning strategies from both the multiprocessor platform and the task system perspectives. This section discusses how these strategies can be adapted to hardware task scheduling on reconfigurable hardware devices.

Partitioned Scheduling

Partitioned scheduling can be transposed to dynamically reconfigurable hardware devices. Indeed, this can be done by partitioning the reconfigurable fabric into m equal slots, as pictured in figure 3.8 (a), and by partitioning the task system Γ_n accordingly. If we assume that any slot can fit any hardware task (e.g. if the thinnest slot can fit the widest task, or if the width of the slots is equal to the width of the tasks), then a monoprocessor assignment scheme can be applied to each slot. The set of n tasks Γ_n is partitioned into m subsets following the three conditions (1i), (1ii) and (1iii) described above, m being the number of slots. Each slot is therefore assigned a subset of tasks, as shown in figure 3.8 (a), where arriving tasks are divided into three subsets. In such an equal-slot partitioned strategy, the number of slots m corresponds to the number of processors, and the whole reconfigurable array is seen as the aforementioned homogeneous multiprocessor system. With the assumption that each slot is big enough to fit any task of its subset, monoprocessor scheduling algorithms, along with their theory, can be directly and successfully applied to execute each subset on its corresponding slot.

This assignment scheme can be extended to the case where the above-mentioned assumption does not hold. Indeed, as illustrated in figure 3.8 (b), the reconfigurable array can be unequally partitioned with a fixed number of slots m. If m is fixed, the model is much closer to the heterogeneous multiprocessor system presented earlier. In addition, if some tasks do not fit some slots, then the reconfigurable array may be viewed as an independent processors platform. Each task is then characterized by the processors (or slots) that are capable of processing it. Tasks are assigned to different slots according to their size and possibly a given scheduling strategy. Roman et al. (2006) is an example where the area of the reconfigurable array is divided into four unequal partitions, each holding one task at a time. A queue of tasks Q_i is associated with each partition P_i. The size of each partition is adjusted at run-time according to the profile of the task set being processed. Depending on its size, each arriving task is added to the queue of the partition that fits it best. Each task queue Q_i is then scheduled on its corresponding partition P_i. Another example is shown in Ahmadinia et al. (2004), where the reconfigurable array is 1D-partitioned and each task is associated with a partition according to its finishing time. A monoprocessor assignment strategy, along with its theory, can then be separately applied to each slot and its corresponding task queue.

To summarize, if applied to scheduling for multitasking or hardware virtualization on reconfigurable hardware devices, the partitioned scheduling strategy is similar to an m-monoprocessor scheduling, where each processor or slot is locally managed by a monoprocessor scheduling scheme. Hence, hardware task scheduling strategies can be directly derived from partitioned scheduling theory for heterogeneous multiprocessor systems.
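In the spirit of the slot-based partitioned approaches just described, the following C++ sketch appends each arriving hardware task to the queue of the smallest slot that can fit it (a best-fit rule); the structures and names are illustrative, not those of the cited works:

```cpp
#include <deque>
#include <vector>

struct Slot { long area; std::deque<long> queue; };   // pending task ids per slot

// Append the arriving task to the queue of the smallest slot that fits it.
bool assignToBestFitSlot(std::vector<Slot>& slots, long taskArea, long taskId) {
    Slot* best = nullptr;
    for (Slot& s : slots)
        if (s.area >= taskArea && (best == nullptr || s.area < best->area))
            best = &s;                 // smallest slot able to host the task
    if (best == nullptr) return false; // no slot can ever fit this task
    best->queue.push_back(taskId);
    return true;
}
```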
Global Scheduling

As global or non-partitioned scheduling is meant to use a global assignment scheme, the tasks are not partitioned, and any task can be executed on any processor of the multiprocessor platform. Hence, a global scheduling scheme can be used even if the reconfigurable array is managed in a slot-based approach. Once again, the basic assumptions are, on one hand, the fixed number of slots and, on the other hand, the ability of the smallest slot to accommodate the biggest task. Such a situation would correspond to the well-studied global scheduling of software tasks on m identical processors.

Figure 3.8: Equal sizes and unequal sizes partitioning of a DPRHW (dynamically and partially reconfigurable hardware device)

Unfortunately, hardware task scheduling on partially reconfigurable hardware devices is more challenging. In the model in figure 3.8 (b), the slots are not equally partitioned. Furthermore, the number of slots could be dynamically changed over time. For example, Ahmadinia et al. (2004), who used a 1D partitioning, dynamically modify the number and the width of the slots according to the characteristics of the incoming tasks. In Lu et al. (2008), the reconfigurable array is first pre-partitioned into slots of various heights. Afterwards, slots are merged vertically and/or horizontally on demand in order to fit wider or taller tasks (e.g. in figure 3.8 (b), slot 3 and slot 4 can merge to fit a wider task). Figure 3.8 (c) depicts another model where there are no predefined partitions. As tasks are placed and removed, the array is split and merged accordingly. The number and size of the partitions (if one can still speak of partitions) are continuously variable, depending on the number and the shape of the placed tasks. Hence, unlike in the above-mentioned array partitioning strategies, neither the number (m) nor the size of the processors is known beforehand.

It is much more difficult to clearly map any multiprocessor scheduling approach onto these models. However, if the array is considered as a heterogeneous multiprocessor system with a variable number (m) and size of processors, a non-partitioned or global scheduling strategy is more likely to be used. Indeed, the latter strategy reflects a more realistic behaviour of a reconfigurable array which is not pre-partitioned into slots. As will be seen later, the main drawback is that the underlying area management is very difficult to handle. To summarize, hardware task scheduling on a dynamically and partially reconfigurable hardware device is akin to scheduling on a multiprocessor system, but with a continuously variable number and speed (size) of processors over time. It is therefore made more complicated by this dynamicity. In addition, the resulting area management is highly complex because of the numerous splitting and merging operations. A partitioned approach, however, is more likely to fit a multiprocessor scheduling strategy and leads to simpler area management, at the cost of a higher reconfigurable hardware device area fragmentation and a lower utilization ratio.

3.7 Online Real-Time Scheduling on Reconfigurable Hardware Devices

Introduction

As previously stated, scheduling policies can be classified as static or dynamic.
In static scheduling, task parameters are assigned beforehand and remain unchanged during the task's lifetime, unlike in dynamic scheduling, where the scheduling relies on task parameters that change over time. Offline scheduling is also differentiated from online scheduling. The latter introduces a certain dynamicity into the system, making it difficult to find optimal scheduling solutions. Online here means that the flow of the program is unknown in advance, and hence the characteristics of a task (arrival time, shape, size, execution time, deadline, etc.) are unknown before its arrival. This corresponds to the clairvoyant paradigm presented earlier. Hence, the scheduler has to dynamically reassess at runtime the task(s) to be placed (e.g. Diessel and Schmidt, 2000; Ahmadinia et al., 2004).

In this thesis, online real-time scheduling algorithms are classified into two main families, depending on whether or not the scheduling algorithm exploits the fact that online real-time tasks are clairvoyant. Indeed, as the execution time of each task is known at its release, the scheduler can rely on the currently running tasks to determine the current and future states of the reconfigurable array, in order to properly place or plan each new task. This family of scheduling is denoted as looking-ahead scheduling, in opposition to without-looking-ahead scheduling.

3.7.1 Online Scheduling Without-Looking-Ahead and Related Work

When scheduling without looking ahead, ready tasks are scheduled only on areas that are currently available on the reconfigurable device. If there is not enough free area to accommodate a task at the current time and it can still meet its deadline, the task is managed according to the scheduling policy (e.g. simply rejected, or kept in a waiting list in order to be placed later if worthwhile). This approach is the most used in online and offline scheduling of hardware tasks on reconfigurable hardware devices (e.g. Bazargan et al., 2000; Ahmadinia et al., 2004; Danne and Platzner, 2005). The without-looking-ahead scheduling approach is detailed in figure 3.9, where T_1, T_2, T_3 and T_4 are to be scheduled on the FPGA. a_i, e_i, l_i, s_i and f_i are respectively the arrival time, execution time, laxity, starting time and finishing time of task T_i. Tasks T_1 and T_2 are placed at time t = 1, as soon as they arrive. At time t = 2, tasks T_3 and T_4 arrive and cannot fit immediately on the reconfigurable hardware. In the without-looking-ahead approach, the scheduler (through the placer or area manager) checks only the areas that are available at the current time t = 2, without prospecting the future state of the reconfigurable hardware. Hence, tasks T_3 and T_4 are kept in a ready list for further attempts, as long as they can still meet their deadline. At time t = 6, when T_2 is completed, it can be replaced by T_3 or T_4, depending on the scheduling policy. If T_3 is the next task elected and the system is submitted to hard real-time constraints, T_3 will be placed at time t_p3 = 6, while T_4 will be rejected at time t_rej4 = 7, because T_4 will not be able to meet its deadline anymore.
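The without-looking-ahead policy just illustrated can be sketched as follows in C++: at each scheduler invocation, every pending task is tried against the currently free area, and a task is dropped once its laxity is exhausted (this rejection condition is formalized in equation 3.11 below). The placer query fitsNow is an assumed hook, and the names are illustrative:

```cpp
#include <functional>
#include <list>

struct Pending { long a; long l; long id; };   // arrival time, laxity, task id

// Invoked on every scheduling event at current time t.
void onSchedulerTick(std::list<Pending>& pending, long t,
                     const std::function<bool(long)>& fitsNow) {
    for (auto it = pending.begin(); it != pending.end(); ) {
        if (fitsNow(it->id)) {
            // ... place and start the task on the free area ...
            it = pending.erase(it);
        } else if (t > it->a + it->l) {
            it = pending.erase(it);   // rejected: its deadline can no longer be met
        } else {
            ++it;                     // keep waiting, a later attempt may succeed
        }
    }
}
```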
One of the main drawbacks of the without-looking-ahead approach in hard real-time scheduling is that tasks that will never be placed anyway are kept in a ready or waiting list and rejected too late, at a time t_{rej} expressed by equation 3.11:

t_{rej} = a_i + l_i \Rightarrow Rd_i = t_{rej} - a_i = l_i \qquad (3.11)

where t_{rej} is the rejection time of the rejected task T_i, a_i its arrival time, l_i its laxity, and Rd_i its rejection delay (the waiting time for a rejected task, detailed in Chapter 4, equation 4.23, page 162). Such a late rejection is not desirable, as it prevents the operating system from considering the implementation of the task on a resource other than the reconfigurable hardware^12. In addition, keeping these tasks lengthens the ready task list and, the longer the ready task list, the more difficult it is to scan and/or sort, especially in an online real-time scenario.

12. Alternative resources could be hardcore and/or softcore CPUs. Sometimes the remaining reconfigurable resource may not be enough to accommodate a given hardware task, but may fit an instantiated softcore processor that can run the task.

Figure 3.9: Looking-ahead vs without-looking-ahead scheduling approaches

Related Work

Bazargan et al. (2000) presented online and offline scheduling of hardware tasks on reconfigurable hardware devices through a bin-packing approach. For online scheduling, their studies mainly focus on 2D placement strategies along with area partitioning and management. The simulation results are provided with respect to different classes of tasks, where a class refers to the sizes (width, height) of the tasks. However, there is no time-based scheduling strategy (e.g. EDF), and the runtime overheads of the online algorithms are not measured. As these studies mainly coped with placement, they are presented in depth further below.

Danne (2006) dealt with the problem of scheduling real-time periodic tasks on partially reconfigurable hardware devices and formalized the real-time scheduling problem. In his model, the reconfigurable hardware device is seen as a homogeneous multiprocessor platform that consists of m identical processors, each processor being capable of running any task. Consequently, m task instances can run concurrently on the device as long as the sum of their areas does not exceed the area of the device (expressed by equation 3.12 for the 1D model). Danne (2006) proposed three preemptive scheduling algorithms: global EDF (Earliest Deadline First), partitioned EDF and server-based. These algorithms are adaptations of well-known software task scheduling algorithms for multiprocessor platforms.

Danne (2006) presented two variants of global EDF scheduling for reconfigurable hardware devices, denoted respectively as EDF-First-k-Fit and EDF-Next-Fit. At each scheduling time, EDF-First-k-Fit selects the first k jobs (in the sorted list of active jobs) that can fit on the array. They are selected in such a way that the k-th job J_k may be placed on the array only if all the jobs J_1 to J_{k-1} preceding J_k in the list can also fit on the array. This leads to situations where some remaining free area is kept idle while there are active jobs J_{k+i} that could fit on it. EDF-Next-Fit overcomes this limitation of EDF-First-k-Fit: the algorithm scans the whole list of active jobs and places as many jobs as possible on the array, as long as they fit.
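The difference between the two selection rules can be made concrete with a short C++ sketch. The Job type is an assumption, the job list is assumed to be already sorted in EDF order, and the width test anticipates the 1D feasibility condition of equation 3.12 given below.

// Sketch of the two global-EDF selection rules under the 1D width
// constraint (sum of selected widths <= W). Not Danne's exact data model.
#include <vector>

struct Job { int id; int width; int deadline; };

// EDF-First-k-Fit: stop at the first job in EDF order that does not fit,
// even if later jobs would still fit in the remaining width.
std::vector<int> firstKFit(const std::vector<Job>& jobs, int W) {
    std::vector<int> chosen;
    int used = 0;
    for (const Job& j : jobs) {
        if (used + j.width > W) break;   // the prefix property of First-k-Fit
        used += j.width;
        chosen.push_back(j.id);
    }
    return chosen;
}

// EDF-Next-Fit: keep scanning the whole list and place every job that fits.
std::vector<int> nextFit(const std::vector<Job>& jobs, int W) {
    std::vector<int> chosen;
    int used = 0;
    for (const Job& j : jobs) {
        if (used + j.width <= W) {       // skip non-fitting jobs, continue
            used += j.width;
            chosen.push_back(j.id);
        }
    }
    return chosen;
}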
As this comparison suggests, EDF-Next-Fit outperforms EDF-First-k-Fit since, in the latter, a free area that could fit an active job may remain unused. However, Danne (2006) relied on the 1D model of reconfigurable hardware devices. This model requires rectangular-shaped tasks and assumes that each hardware task spans the entire device height, its width being chosen proportionally to its area. The model is simpler to manage than the 2D model: it eases task relocation and therefore reduces external fragmentation, but it induces internal fragmentation that leads to a lower device utilization ratio.

Using a 1D model makes the condition for simultaneously running n jobs J_{i=1...n} on the reconfigurable array simple, as follows:

\sum_{i=1}^{n} w_i \le W \qquad (3.12)

where w_i and W are respectively the width of task J_i and the width of the reconfigurable hardware device. Therefore, the schedulability analysis is simpler and can rely on schedulability analyses for multiprocessor scheduling. For example, Danne and Platzner (2006b) relied on the schedulability analysis of the EDF algorithm on multiprocessors to provide an efficient, though pessimistic, schedulability analysis for global EDF scheduling. This scheduling test is of linear complexity O(n), n being the number of tasks. The test guarantees at design time that no deadline will be missed; hence, any task set that passes the test will be feasibly scheduled by the EDF algorithm. However, as the test gives a sufficient but not necessary condition, a task set that fails the test may still be feasibly scheduled.

At that time, the other advantage of 1D placement lay in the technological limitations of FPGAs, which provided only column-wise partial runtime reconfiguration. Hence, the only way of performing a 2D placement was to totally reconfigure all the columns (and therefore the tasks in those columns) that interfered with the task to place. Such a process was likely to affect many tasks at each task placement, making reconfiguration overheads higher. However, as discussed later in section 3.9.1, page 124, any advantage of 1D placement (including its lower algorithmic complexity) comes at the cost of an internal fragmentation that lowers the reconfigurable array utilization ratio and increases task rejection. Fortunately, using a 2D placer is becoming more meaningful as, nowadays, many FPGAs enable 2D partial reconfiguration.

Partitioned EDF scheduling is presented in Danne and Platzner (2006a), where an extended ILP (Integer Linear Programming) model of the bin-packing problem is developed in order to compute the optimal partitioned schedule for a given set of tasks. The reconfigurable array is partitioned horizontally (as opposed to the vertical partitions mapped in figure 3.8-(a), page 92), following the 1D area model. At any time, at most one hardware task is allocated to a slot. Consequently, the width of the task spans the entire width of the slot, leading to a low device utilization ratio. Each task is modeled using several variants. The ILP then selects a variant for each task and a partitioning that minimize the overall required device area; that is, it finds the smallest area on which the task set can be feasibly scheduled. Although the optimal partitioning for medium-sized task sets can be computed in reasonable time, ILP-based partitioned scheduling is not suitable for online scheduling.
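For illustration, a generic variant-selection bin-packing ILP of this kind can be written as follows. This is a sketch in the spirit of, but not identical to, Danne and Platzner (2006a)'s model; all symbols (binary variables x_{i,v,s} assigning variant v of task i to slot s, slot widths W_s, per-variant utilizations e_{i,v}/p_i) are assumptions made for this sketch.

\begin{align*}
\min\ & \sum_{s=1}^{S} W_s && \text{total width occupied by the slots}\\
\text{s.t.}\ & \sum_{s=1}^{S} \sum_{v \in V_i} x_{i,v,s} = 1 \quad \forall i && \text{one variant and one slot per task}\\
& w_{i,v}\, x_{i,v,s} \le W_s \quad \forall i,\ v \in V_i,\ s && \text{the slot is wide enough for the chosen variant}\\
& \sum_{i} \sum_{v \in V_i} \frac{e_{i,v}}{p_i}\, x_{i,v,s} \le 1 \quad \forall s && \text{EDF feasibility of each slot seen as a uniprocessor}\\
& x_{i,v,s} \in \{0,1\}
\end{align*}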
The server-based approach is detailed in Danne et al. (2006) through the MSDL scheduling technique. In the MSDL technique, tasks are grouped and executed on the reconfigurable array according to the time-sharing monoprocessor scheduling model presented earlier and mapped in figure 3.5, page 84, and figure 3.7, page 86. The reconfigurable device is fully reconfigured, and a server corresponds to what was earlier denoted as a compound task. As stated when discussing that model, one of the challenges here is to minimize the number of device configurations, which corresponds to the number of required bitstream files; this number of configurations is bounded by the number of tasks. This scheduling technique offers an offline schedulability test. Hence, the different combinations of tasks that have to run concurrently are computed beforehand and synthesized in order to be loaded at the appropriate time, following a given scheduling policy (e.g. global EDF).

Ahmadinia et al. (2004) presented a cluster-based dynamic scheduling in which the FPGA is partitioned into slots of dynamic size. Tasks that will complete at nearly the same time are placed in the same slot, in order to free up contiguous spaces at the same time and create large empty rectangles for later placements. Even though this scheduling strategy takes the future state of the reconfigurable array into account by assigning slots to tasks according to their completion times, it remains a without-looking-ahead approach, as at any time the tasks are scheduled only on the currently available areas.

Conclusion on online scheduling without-looking-ahead

To summarize, this section discussed different works on online without-looking-ahead scheduling of real-time tasks on reconfigurable hardware devices. Bazargan et al. (2000) first presented both optimal and nonoptimal approaches to area management for online and offline placement. Danne (2006) formalized the real-time scheduling of periodic and preemptive tasks, and provided a few algorithms along with efficient schedulability analyses derived from multiprocessor scheduling. Ahmadinia et al. (2004) proposed a cluster-based scheduling approach that tends to free contiguous areas at nearly the same time in order to accommodate the next tasks. These different approaches showed the tightness that exists between the scheduling and placement problems. The coming section introduces looking-ahead scheduling and emphasizes the aforementioned tightness.

3.7.2 Online Looking-Ahead Scheduling and Related Work

Presentation

In looking-ahead scheduling, if there is not enough space at the current time to place a given task, the scheduling algorithm goes further by looking into the future, simulating the completion of certain running tasks to see whether the space thus freed can fit the task (e.g. Steiger et al., 2004; Chen and Hsiung, 2005; Marconi et al., 2008). By doing so, the scheduler either accepts or rejects the task immediately and therefore allows the operating system to find alternative implementation solutions in case of rejection. In an online real-time context, this immediate rejection of any task T_i that cannot fit on the array is the main advantage of a looking-ahead approach. The rejection time t_{rej} of a task T_i scheduled with a looking-ahead scheduler is expressed in equation 3.13, where a_i is the release time of T_i and \delta_i the scheduling runtime overhead, i.e. the time required by the scheduling algorithm to schedule T_i. Rd_i is the corresponding rejection delay (or the waiting time for a rejected task, as detailed later in equation 4.23, page 162).
t_{rej} = a_i + \delta_i \Rightarrow Rd_i = t_{rej} - a_i = \delta_i \qquad (3.13)

Equation 3.13 implies that the rejection delay equals the runtime overhead of the looking-ahead scheduling algorithm, which must be far smaller than a scheduler tick or time unit (\delta_i \ll T_{tick}). Looking-ahead algorithms certainly improve the online real-time scheduling quality by taking rapid scheduling decisions. However, when a 2D placement is used, as in Steiger et al. (2004), numerous area splitting and merging operations are performed at runtime, which is detrimental in online real-time scenarios.

Related Work

Steiger et al. (2004) proposed two looking-ahead scheduling algorithms denoted as Horizon and Stuffing. Both algorithms are online and clairvoyant. They schedule tasks onto arbitrary areas that are either currently free or that will be free at a given time (or within a given time interval) in the future. A task that is assigned a future free area is put on a reservation list, while the currently running tasks are recorded on an execution list. As pictured in figure 3.10, the algorithms differ in the way they manage the areas. At any scheduling time, before assigning an area to a task, the area manager makes sure that the area does not conflict with any other area currently occupied or already booked for another task. This verification is made using the two above-mentioned lists, in addition to a third list that records the state of the reconfigurable hardware device.

In horizon scheduling, once an area is assigned to a task, no matter when the task starts its execution, the area is marked as occupied from the current time until the end of the task. Hence, as shown in figure 3.10(a), areas are marked either as "totally free" or as "occupied from the current time until a given point in time". This leads to situations, pictured on the left of figure 3.11, where some reserved areas are available within a time interval but cannot be used; the shaded parts in figure 3.11(a) are such lost areas. As depicted in the latter figure, task T7, which arrives at time 3, is planned to start at time 18, when task T6 will end. Indeed, T6 was planned earlier, at time 2, to start at time 15. Hence, from time 2 when T6 was planned, the whole space that will be occupied by T6 cannot be assigned to any other newly arrived task within the interval spanning from the reservation of T6 (at time 2) until its end at time 18.

Figure 3.10: Managing areas availability or occupancy in looking-ahead scheduling (e.g. horizon and stuffing algorithms, Steiger et al., 2004)

Figure 3.11: An example of 1D Horizon and Stuffing scheduling algorithms (Steiger et al., 2004)

To draw a parallel between hotel management and horizon scheduling, let us assume that a room is booked for a client who is arriving in one week's time and who plans to stay one week in the hotel. In horizon scheduling, the room is no longer available from the booking time until the end of the stay, that is, for two weeks. This approach prevents the room from being rented during the week preceding the client's arrival. Therefore, the occupation ratio of the hotel is lower, where this ratio corresponds to the reconfigurable hardware utilization ratio in our present case. This shortcoming of horizon scheduling is overcome by stuffing scheduling.
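The contrast between the two reservation policies can be sketched in a few lines of C++. The Booking type and the single-area 1D view are assumptions made for this sketch, not Steiger et al. (2004)'s actual data structures.

// Illustrative contrast between horizon and stuffing reservations on one
// area: horizon blocks the area from the booking instant, stuffing only
// blocks the actual execution interval.
#include <vector>

struct Booking { int from, to; };            // area unusable in [from, to)

std::vector<Booking> bookings;

void reserveHorizon(int now, int start, int finish) {
    // Horizon: the area is blocked from the moment it is booked, even if
    // the task only starts later; the gap [now, start) is lost.
    (void)start;
    bookings.push_back({now, finish});
}

void reserveStuffing(int start, int finish) {
    // Stuffing: only the execution interval is blocked, so another task
    // may still be "stuffed" into the area before 'start'.
    bookings.push_back({start, finish});
}

bool isFree(int from, int to) {              // can [from, to) still be booked?
    for (const Booking& b : bookings)
        if (from < b.to && b.from < to) return false;   // intervals overlap
    return true;
}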
Indeed, as illustrated in figure 3.10(b), areas on the reconfigurable hardware are marked either as "totally free" or as "occupied in a precise time interval". Figure 3.10(c) shows the same information from the availability point of view. This is similar to room management in hotels, where rooms are marked according to their availability; many clients (resp. tasks) may share the same room (resp. area), as long as it is on a time-sharing basis. In figure 3.11 the same sequence of tasks is scheduled using the horizon and stuffing algorithms, and task T7 starts earlier with the stuffing algorithm. Consequently, as shown by Steiger et al. (2004) both for 1D and 2D placement, for the same placement strategy the stuffing algorithm reduces the makespan and the task rejection ratio while increasing the device utilization ratio, compared to horizon scheduling. These improvements come at the cost of a higher scheduling algorithm runtime overhead.

Chen and Hsiung (2005) proposed the Classified Stuffing (CS) algorithm as an improvement of the 1D version of the original stuffing scheduling (Steiger et al., 2004). In Classified Stuffing, hardware tasks are classified into two types, and the placement location differs for each class of tasks. This differs from the original stuffing (OS) algorithm where, when an available area is found, the task is placed at its leftmost position in the case of 1D placement (at its bottom-left corner in the case of 2D placement). Classified Stuffing can place a task either at the leftmost or at the rightmost side of the available area, depending on the task's Space Utilization Rate (SUR), given by:

SUR_i = w_i / e_i

where SUR_i is the space utilization rate of the task T_i to place, and w_i and e_i are respectively its width and its execution time. Tasks with SUR_i > 1 are placed starting from the leftmost available area, while tasks with SUR_i <= 1 are placed from the rightmost available columns. As shown through simulations performed on a large number of task sets with various SUR values (4, 1 and 0.25), Chen and Hsiung (2005)'s 1D stuffing approach reduces the fragmentation (by 5.5% to 23.3%) and the makespan (by 4% to 21%) for the same task rejection ratio and almost the same algorithm runtime overhead, compared to the original stuffing. Unsurprisingly, the best results are obtained with task sets of low SUR: as Chen and Hsiung (2005) use a 1D placement, the smaller the width of the task, the lower the resulting internal fragmentation.
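The SUR-based side selection of Classified Stuffing reduces to a one-line rule, sketched below in C++; the Task1D type and the Side enumeration are illustrative assumptions.

// Sketch of Classified Stuffing's side-selection rule: the space
// utilization rate SUR_i = w_i / e_i decides on which side of a free area
// the task is placed.
struct Task1D { double width; double exec; };

enum class Side { Leftmost, Rightmost };

Side pickSide(const Task1D& t) {
    double sur = t.width / t.exec;      // SUR_i = w_i / e_i
    // Wide, short-lived tasks (SUR > 1) go to the left; narrow, long-lived
    // tasks (SUR <= 1) go to the right.
    return (sur > 1.0) ? Side::Leftmost : Side::Rightmost;
}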
Cui and Deng (2007) proposed a One-Level Look-Ahead scheduling strategy. Since fragmentation is one of the main problems to overcome, their work focused on a fragmentation-based scheduling policy, as presented later in section 3.9, page 122. Their ultimate proposition is a looking-ahead approach to task placement, which aims at reducing the area fragmentation of the reconfigurable hardware device by delaying the placement of a task until the next event, thereby increasing the solution search space. Consequently, the placement of an arriving task may be delayed even if there is enough space on the device to place the task at its release time. The basic idea behind this strategy is that such a delay is sometimes more beneficial in terms of area fragmentation.

In their so-called one-level looking-ahead, a task released at time t_a may be delayed at most until a time t_a + \Delta if the task can still meet its deadline and if the resulting overall fragmentation of the device is lower than with an immediate placement at time t_a. One-level here means that, within the time interval [t_a, t_a + t_l], where t_l is the laxity of the task, the algorithm only checks the first time instant t_a + \Delta at which one (or more) currently running task finishes its execution and therefore frees more space. The fragmentation resulting from a placement at time t_a is then compared with the one produced by a placement at time t_a + \Delta, and the starting time that is more beneficial in terms of fragmentation is chosen accordingly.

Cui and Deng (2007) was actually an improvement of a similar scheduling approach previously proposed by Tabero et al. (2006). Both approaches are fragmentation-based, as they both try to minimize fragmentation; however, they differ in the way they manage their free areas. Tabero et al. (2006) used a Vertex List to keep track of the free areas in the FPGA, while Cui and Deng (2007) used a MER-based approach. Albeit fragmentation-based, neither Tabero et al. (2006) nor Cui and Deng (2007) is fully a looking-ahead approach, as the algorithm only looks at a single (and closest) time instant in the future, instead of considering all the events in the time interval [t_a, t_a + t_l] that are likely to free more space on the reconfigurable device. Nevertheless, Cui and Deng (2007) outperforms Handa and Vemuri (2004a) and Tabero et al. (2006) in terms of task rejection ratio and device utilization ratio.

Marconi et al. (2008) proposed the Intelligent Stuffing (IS) algorithm as an improvement of the original stuffing (Steiger et al., 2004) and the classified stuffing (Chen and Hsiung, 2005). IS is thus a 1D looking-ahead stuffing algorithm which provides better results than the previously mentioned 1D stuffing algorithms. As in classified stuffing, the IS algorithm differs from the original 1D stuffing in the criterion used to place a task either at the leftmost or at the rightmost side of the available area. While task positions are SUR-based in the CS algorithm (Chen and Hsiung, 2005), Marconi et al. (2008)'s approach introduces the so-called alignment status of a free space. Each free space (FS) is assigned a boolean parameter (the alignment status) which indicates the side (leftmost or rightmost) of the free area on which the next task accommodated by this area will be placed.

Figure 3.12: Intelligent Stuffing (IS) scheduling algorithm using 1D placement (Marconi et al., 2008)

Every time a task is placed in a free area, the alignment status of the remaining free area is toggled. One example is mapped in figure 3.12-(a-1), where the whole area of an empty FPGA is represented by a single free space denoted FS1, spanning from column CL1 to CR1 and with an alignment status set to leftmost. Accordingly, the first arriving task T1 is placed at the leftmost position of FS1, as pictured in figure 3.12-(a-2). This action reduces the size of FS1, toggles the alignment status of FS1 from leftmost to rightmost, and creates a new free space FS2 spanning from column CL2 to CR2. As for any newly created area, the alignment status of FS2 is set to leftmost. If a new task T2 arrives, the free space FS1 accommodates it at its rightmost edge, according to its current alignment status. Consequently, as illustrated in figure 3.12-(a-3), the size of FS1 is reduced, its alignment status is toggled back to leftmost, FS2 is resized accordingly, and a new free area FS3 is created with a leftmost alignment status.
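The alignment-status mechanism itself is very cheap, as the following C++ sketch suggests; the FreeSpace type is an assumption made for this sketch.

// Sketch of Intelligent Stuffing's alignment status: each free space keeps
// a boolean that alternates the side on which the next task is placed.
struct FreeSpace {
    int left, right;        // column range of the free space
    bool placeLeft = true;  // alignment status: next task goes leftmost?
};

// Returns the first column assigned to a task of the given width and
// toggles the alignment status of the remaining free space.
int placeTask(FreeSpace& fs, int width) {
    int start;
    if (fs.placeLeft) { start = fs.left;          fs.left  += width; }
    else              { start = fs.right - width; fs.right -= width; }
    fs.placeLeft = !fs.placeLeft;   // alternate sides on every placement
    return start;
}

Compared with the SUR rule, this decision requires no division: a single boolean is read and toggled, which is consistent with the lower computation cost claimed for the alignment status.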
Figures 3.12-(b-1) and (b-2) show two scenarios of three tasks (T1, T2 and T3, not dotted) scheduled using the IS algorithm. In figure 3.12-(b-1), the dotted tasks T2 and T3 show how the original stuffing (OS) would have scheduled them; the makespan resulting from OS scheduling is greater than with IS scheduling (M_{OS} > M_{IS}). A similar result (M_{OS} = M_{CS} > M_{IS}) arises from the second scenario, mapped in figure 3.12-(b-2), where the dotted version of the tasks mimics classified stuffing. IS provides a smaller makespan as it allows future tasks to be placed earlier, thanks to its area management based on the alignment status. The latter is also easier to compute than the SUR used by the CS algorithm. Consequently, IS reduces the average waiting time (by up to 12.8%) and decreases the total wasted area (by up to 53%) for roughly the same algorithm runtime overhead, compared to CS scheduling.

Conclusion on online looking-ahead scheduling

To summarize, this section discussed different works on online looking-ahead scheduling of real-time tasks on reconfigurable hardware devices. Steiger et al. (2004) first presented two main approaches, denoted respectively as horizon and stuffing, using both 1D and 2D placement strategies. For the reasons recalled earlier, the stuffing algorithm outperforms the horizon algorithm for the same placement strategy. In addition, as it produces less fragmentation, 2D placement provides better results than 1D placement in terms of reconfigurable hardware utilization ratio and task rejection ratio.

Chen and Hsiung (2005) and Marconi et al. (2008) respectively proposed the CS and IS algorithms as improvements of the 1D version of the stuffing algorithm. Their algorithms provide a better placement quality and a shorter makespan for roughly the same runtime overhead. However, the results remain far from optimal, as the placement quality of a scheduling strategy is highly tied to the quality of the underlying placement algorithm (1D here) along with its area management strategy.

Looking-ahead scheduling requires regularly simulating future task placements and withdrawals in order to assess future states of the reconfigurable array. This leads to numerous area splitting and merging operations that can be highly complex, depending on the placement algorithm. Consequently, unlike Steiger et al. (2004), other related works on looking-ahead scheduling in an online real-time context tend to use a simple 1D placer. The resulting runtime overhead and overall complexity are then affordable in this context, but come at the cost of the reconfigurable hardware device utilization ratio.

3.8 Tasks Placement and Related Work

3.8.1 Online Placement Issues

Placement aims at efficiently allocating reconfigurable area to the different modules (tasks) of an application to implement. The problem of placing modules on the reconfigurable device is simpler if the device is big enough to concurrently fit all the modules of the application. In this case, a placed module stays permanently on the device, and the placement problem consists of finding a position (if one exists) for each task such that all the tasks fit on the reconfigurable array.
However, if the chip is not big enough, there is a need to perform hardware virtualization (or multitasking), where modules are dynamically swapped in and out of the chip according to their idle or operating times. This process of assigning both a position and a starting time to each task makes the placement problem more challenging by turning it into a scheduling problem, as previously described.

To achieve a good placement, one needs a model of each component involved, as detailed in Chapter 4. In most studies, the reconfigurable fabric (e.g. an FPGA) is seen as an array of resources of a given size (area), and hardware tasks as rectangular modules to be placed on the array. Basically, both the modules and the reconfigurable area are rectangular-shaped. Hence, the problem of placing modules on an FPGA is similar to the well-known 1D and 2D bin-packing problems as presented in Coffman et al. (1997). 1D and 2D bin-packing correspond respectively to 1D and 2D placement; a third dimension is commonly added for the flow of time, in order to bring out the scheduling problem.

In hardware task scheduling for reconfigurable platforms, the placement problem appears as a scheduling sub-problem. Hence, a Scheduler delegates the management of the reconfigurable array to a Placer, sometimes denoted as the resource manager; the placer reacts to scheduler requests. In general, the placement problem is divided into two main parts (detailed below, depicted in figure 4.12, page 157, Chapter 4, and sketched after this list):

(i.) Area manager. Most of the time, the area manager is a free-space manager, as it manages the free space still available on the reconfigurable fabric. It maintains a data structure that stores information about the state of the reconfigurable fabric (free spaces, occupied spaces). Consequently, the manager is invoked to update the state of the chip at every change (task placement or withdrawal). Most works on placement differ in the data structure that stores the state of the reconfigurable area, and in the strategy used to manage it. The data structure and its update deeply impact the complexity of placement algorithms and heuristics.

(ii.) Area finder. The area finder checks for and assigns a rectangular area of the reconfigurable array that can accommodate a task, on request of the scheduler. It uses a given rule (e.g. bin-packing rules) to fit the task into one of the free spaces maintained by the free-space manager. In addition, it uses different fitting strategies (see section 3.8.4, page 113) to decide which area among the possible candidates will be assigned to the task to be placed.

The area finder and the free-space manager are tied, as the strategy of finding and choosing one area among others depends on the data structure that keeps a record of the available areas. Consequently, related work on placement will first be presented through the different free-area management strategies, sometimes referred to as area partitioning strategies. Afterwards, another comparative study of placement strategies will be carried out from a reconfigurable-area fragmentation perspective.
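A minimal C++ sketch of this two-part decomposition is given below; the Placer class, its first-fit rule and the Rect type are illustrative assumptions, not an interface taken from an existing RTOS.

// Sketch of the scheduler/placer decomposition: an area finder answering
// placement requests, backed by a free-space manager holding the chip state.
#include <optional>
#include <vector>

struct Rect { int x, y, w, h; };

class Placer {
public:
    // Area finder: pick a free rectangle that can accommodate a w x h task,
    // here with a simple first-fit rule.
    std::optional<Rect> findArea(int w, int h) const {
        for (const Rect& r : freeRects_)
            if (w <= r.w && h <= r.h) return Rect{r.x, r.y, w, h};
        return std::nullopt;
    }
    // Area manager: update the stored state on each change.
    void place(const Rect& r)  { /* split the hosting free rectangle (omitted) */ (void)r; }
    void remove(const Rect& r) { /* merge r back into the free space (omitted) */ (void)r; }
private:
    std::vector<Rect> freeRects_;   // data structure holding the chip state
};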
The special case of offline placement

If the flow of the program is predictable at design time, and all the parameters (timing information, size, etc.) of the different tasks or modules to be inserted are known beforehand, the placement of the modules on the reconfigurable hardware can be planned at compilation time. The reconfigurable resources are then statically and deterministically assigned to the different modules at compilation time. In offline scheduling and placement, high-performance algorithms and heuristics can be developed to optimally manage the reconfigurable fabric area. Even if these very efficient algorithms compute slowly, they remain affordable, as the computations are done offline (e.g. on a host PC before the system starts running).

Today, compilation times of FPGA-based designs are dominated by placement and routing. FPGAs are becoming denser, and the designs to implement are more complex, requiring tremendous amounts of interconnections to be routed. EDA tools require great amounts of CPU time (even hours) to achieve a high-quality placement and routing. Different high-quality P&R^13 heuristics have been developed over the years to solve placement and routing in FPGAs. Further, EDA tools enable incremental placement, and modular placement for module-based designs. In a modular placement, a rectangular portion of the FPGA is assigned to each module, and the module is placed and routed within this portion. This feature can be exploited in online placement. However, the optimal P&R heuristics used in EDA tools are too compute-intensive to be used in an online scenario. They are suited to offline placement (e.g. running on a desktop PC) and are not in the scope of this thesis.

13. Place and Route.

3.8.2 Free Area Partitioning

As stated earlier, the area manager keeps and updates a data structure representing the current state of the reconfigurable area in terms of empty spaces and inserted modules. It also divides the empty space into empty rectangles. As detailed below, there are many ways of splitting and managing an area that accommodates a smaller task.

Nonoverlapping vs overlapping area partition

The area manager keeps track of the available free areas as a list of rectangles. These rectangles result from the different partitions performed after task placements. Depending on the partition strategy applied to the remaining empty space, two types of rectangles are generated, as pictured in figure 3.13:

(i)- Nonoverlapping rectangles, resulting from a nonoverlapping partition, as depicted in figures 3.13-(a) and 3.13-(b).

(ii)- Overlapping rectangles, resulting from an overlapping partition, as depicted in figure 3.13-(c).

As seen in figures 3.13-(a) and 3.13-(b), when a task T1 is placed in a rectangle, the remaining free area is split up into several (two, here) nonoverlapping rectangles, R11 and R12. The split can be either vertical or horizontal. In a nonoverlapping partition, the number of rectangles to manage grows linearly with the number of placed tasks. Hence, the time complexity of placing or removing a task is in general O(n), where n is the number of tasks currently placed, the reason being that only the accommodating rectangle is involved in the placement/withdrawal process.

Figure 3.13-(c) depicts another option, where the resulting rectangles overlap. Indeed, as detailed in figure 3.14, generating overlapping rectangles improves the placement quality. The example shows that task T2 would have been rejected if the rectangles R11 and R12 resulting from the placement of task T1 were not overlapping; more precisely, T2 would have been rejected in the event of a vertical split. T2 is placed in rectangle R11, which is then split up into two rectangles, R21 and R22.
This simple example shows that overlapping partitioning decreases the number of rejected tasks by providing more fitting solutions. However, overlapping partitioning induces numerous resizing operations. For example, as task T2 overlaps with rectangle R12, the latter has to be resized. Consequently, the number of rectangles involved in a task placement/withdrawal is one order of magnitude greater than in the nonoverlapping case. In addition, with overlapping partitioning the resulting rectangular space partition is quadratic with respect to n, the number of tasks currently placed on the chip; the time complexity of inserting/removing a task is therefore O(n^2).

Figure 3.13: Nonoverlapping vs overlapping partition; vertical vs horizontal split for overlapping partition

MER (Maximum Empty Rectangles)

A maximum empty rectangle is an empty rectangle that cannot be completely covered by any other empty rectangle. Figure 3.14-(c) shows one way of managing the remaining free area by splitting it up using the overlapping partition approach. Another overlapping approach is mapped in figure 3.14-(d), where all the possible maximum empty rectangles that can be built in the remaining free area are identified: R12, R21 and R22 are these so-called MERs (Maximum Empty Rectangles). The difference between figure 3.14-(c) and figure 3.14-(d) lies in the size of R22, which is bigger in the latter thanks to the MER-based partition. A MER-based partition therefore provides more candidates for placing a new task. The placement is optimal, since the list of MERs will contain a placement solution for any task if one exists. However, inserting/removing a task has time complexity O(n^2), n being the number of currently placed tasks.

Figure 3.14: Placing a task in an overlapping rectangle

Figure 3.15: Maximum empty rectangles

Another example of task placement in MERs is pictured in figure 3.15. It shows how the number of rectangles involved in each task placement or withdrawal grows drastically compared to the nonoverlapping case. Again, the number of rectangles grows quadratically with respect to the number of placed tasks: in figure 3.15-(d) there are 5 rectangles for two currently placed tasks, T1 and T2.

3.8.3 Data Structure to store the State of the Reconfigurable Array

The most challenging part of the placement problem is probably to build a data structure that represents the state of the reconfigurable array, e.g. in terms of available resources and their location. While trying to place a task, the algorithm checks the existing free areas in the data structure. In addition, at each task placement or withdrawal on the reconfigurable array, the data structure is updated in order to reflect the new state of the array. The placement algorithm's complexity and performance highly depend on this data structure. A good data structure should keep a complete record of the existing areas, should be easy to scan when looking for free areas, and should be easy to update. These objectives are conflicting. Indeed, the quantity of information required to represent the free rectangles grows with the accuracy of the representation; for example, recording all the MERs requires far more information than recording nonoverlapping rectangles. Hence, a reasonable trade-off has to be found when choosing the area management strategy (the simplest option is sketched below).
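As an illustration of the simplest trade-off, the following C++ sketch keeps a flat list of nonoverlapping free rectangles and splits the hosting rectangle on each placement. The Rect type and the bottom-left placement convention are assumptions made for this sketch.

// Sketch of a flat list of nonoverlapping free rectangles: placing a task
// in a rectangle replaces it by at most two nonoverlapping remainders.
#include <cstddef>
#include <vector>

struct Rect { int x, y, w, h; };

std::vector<Rect> freeRects = { {0, 0, 10, 6} };   // initially the whole chip

// Place a w x h task in free rectangle 'i' (assumed large enough): the task
// takes the bottom-left corner; the remainder becomes a right strip and a
// top strip, which never overlap.
void splitOnPlace(std::size_t i, int w, int h) {
    Rect r = freeRects[i];
    freeRects.erase(freeRects.begin() + i);
    if (r.w > w) freeRects.push_back({r.x + w, r.y, r.w - w, h});     // right of task
    if (r.h > h) freeRects.push_back({r.x, r.y + h, r.w, r.h - h});   // above the task
}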
No matter which data structure is used to represent the free rectangles, the number of rectangles involved in a task placement or withdrawal depends on the partitioning strategy used, as mentioned earlier, and the updating process is consequently closely related to it. Depending on the way the data structure keeps the state of the available areas and on the resulting update process, one distinguishes three main data structures of varying algorithmic complexity: the list of overlapping/nonoverlapping rectangles, the list of MERs, and the Vertex-List.

List of overlapping and nonoverlapping rectangles

In this approach, the reconfigurable array is represented by a list of overlapping or nonoverlapping rectangles, each rectangle being represented by its coordinates and its size.

A binary search tree as adjacency graph: Bazargan et al. (2000) mentioned a binary search tree used as an area adjacency graph that stores the state of the reconfigurable array. Figure 3.16 depicts an example of this tree; in this example, the tree stores nonoverlapping rectangles. When a task is placed in a rectangular area bigger than the task, the area is partitioned into two or three parts according to a vertical, horizontal or overlapping split. In the tree, each node represents an area and can generate up to two children nodes when a task is placed on it. R1 represents the whole reconfigurable array and is the root of the tree; its size is 10 × 6, where 10 is its width. Task T1 is first placed at the bottom left of R1. R1 is then split up into three parts, denoted T1, R2 and R3. R2 and R3 are inserted in the tree as children rectangles of R1, while the part denoted T1 hosts the task T1. Then T2 arrives and is placed in rectangle R2. As T2 spans the entire width of R2, only one child rectangle, R4, is generated and inserted in the tree. The main observation at this stage is that each internal node of the tree corresponds to an area hosting a task, while the leaves of the tree represent the current free areas on the chip. At this point in time, after placing tasks T1 and T2, the available areas are R3 and R4. Then T3 arrives and is placed on R3; R3 is vertically split, and R5 and R6 are generated. At this point, the free areas are R4, R5 and R6, and the running tasks are T1, T2 and T3. T4 arrives and is placed on R4, which generates R7. The tree is updated accordingly: at this point, the tree is in the state pictured in figure 3.16, without the area RX, which is the second child area of R2.

Figure 3.16: A binary tree used as a data structure that records the state of the FPGA

Storing the states of the reconfigurable array in such a binary tree structure makes merge and split operations more intuitive. Indeed, the initial area of a node can be retrieved when all of its children areas are free. For example, when task T3, placed on node R3, completes, the original size of R3 is retrieved on the chip by moving the tree back to a former state; this is done by simply deleting the children area nodes R5 and R6. However, a situation can arise where the task hosted in a node finishes while one or more tasks are still running in its subtree.
In this situation, an area node RX is generated, as pictured in figure 3.16, where the size of RX equals the size of the finished task. Indeed, if T2 finishes before T4, R2 cannot be retrieved immediately; RX is generated as an extra child node in order to make the freed area available for further placements.

The complexity of finding an internal node (one that hosts a task) is bounded by O(n) in the case of a plain binary tree and by O(log(n)) in the case of a binary search tree, n being the number of tasks currently on the array; in addition, if the binary search tree is balanced, the complexity drops to O(log2(n)). A binary search tree can be properly built based on the coordinates of the generated areas. For example, in the case depicted in figure 3.16, a vertical split is used: two children rectangles (e.g. area-down R2 and area-up R3) generated from the same father rectangle (R1) differ from each other by their X-coordinate. As shown on nodes (a) and (a'), rectangle R3, along with its descendants, will always be to the left of the X-coordinate of rectangle R2. The X-coordinate may thus be a building criterion of the binary search tree. However, in the case of overlapping rectangles, the criterion differentiating the area-up from the area-down while building the tree is much harder to define.

A tree structure is usually used together with an additional list that contains the available free areas (the leaves of the tree), because checking such a shorter list to find a feasible placement is much easier than scanning the tree. To summarize, the complexity of managing a simple list is interwoven with the complexity of the partitioning strategy: a simple list provides a low complexity but leads to a higher fragmentation, especially in the case of nonoverlapping partitioning, whereas a tree structure eases the merging process and therefore limits fragmentation.

List of MERs

Building a list of MERs at each task placement and withdrawal leads to a high-quality placement. The free area is partitioned into MERs, as pictured in figure 3.15. In principle, finding all the MERs requires scanning the whole reconfigurable array; hence, the worst-case performance for finding all the MERs is O(w × h), w and h being respectively the number of columns and the number of lines of the reconfigurable array. However, many techniques have been proposed to identify and build all the MERs. Most of them rely on the fact that, each time a task is placed, the reconfigurable array is affected only locally; hence, only the MERs that overlap with the MER accommodating the task are updated.

Handa and Vemuri (2004c) described the Staircase algorithm, a technique that finds MERs using a staircase structure. The main advantage of the Staircase algorithm is that it scans, on average, only 15% of the whole array in order to update the list of MERs; it is therefore of lower time complexity. This lower time complexity is obtained by concurrently keeping a table that records the free and occupied cells of the array, and an encoding table that contains some coefficients. These data are used to build stairs of available areas from which the MERs are deduced.

Cui and Deng (2007) also proposed the ScanLine Algorithm (SLA), an algorithm that efficiently finds MERs in a reconfigurable array. It uses the same principle as Staircase, by concurrently maintaining a table of the occupied and free cells of the array and a table of coefficients.
However, the way of computing the MERs is completely different. While looking for a MER, each column is checked in order to find its key element, if one exists. If found, its MKE (Maximal Key Element) is then determined, and the MER is deduced. The main advantage of the SLA algorithm is its shorter updating process at each task placement or withdrawal. The SLA and Staircase algorithms are quite similar in that they only need a partial scan of the array when refreshing the MERs, and therefore have lower time complexity. The simulation results showed that Staircase achieves a speedup of 2.5 times in comparison with SLA, as discussed in section 4.2, Chapter 4.

Vertex-List

To the best of our knowledge, Tabero et al. (2004) first used a Vertex-list to manage the free areas of a reconfigurable array. As stated previously, Tabero et al. (2006) implemented a one-level looking-ahead scheduling algorithm which used a Vertex-list to record the state of the chip. Tabero's works emphasized the fact that a Vertex-list-based placement approach allows the algorithm to take into account, and therefore minimize, the chip fragmentation. More details are given below in section 3.9, which presents a few fragmentation-based scheduling algorithms, including Tabero's Vertex-list. It is shown there that the Vertex-list approach is slow because it is basically MER-based.

3.8.4 Fitting Strategies

Packing rectangular-shaped hardware tasks on a reconfigurable chip is similar to the 2D bin-packing problem, which is an extension of the classical 1D bin-packing problem.

1D Placement

One-dimensional bin-packing consists of placing rectangular modules in rows. An example applied to hardware task placement on a reconfigurable array is mapped in figure 3.17. Table 3.1 shows the parameters of six real-time tasks T1...T6 to be placed. As they arrive, the tasks are placed as shown in figure 3.17. A 1D placement assumes that each task spans the entire height of the device, whatever the real height of the task may be. Consequently, task T6 is rejected despite the fact that there is enough contiguous space to accommodate it.

Tasks parameters            T1   T2   T3   T4   T5   T6
a_i : arrival time           1    1    3    3    8   10
e_i : execution time        14   17   18   16   14   12
d_i : deadline              20   20   23   23   23   23
w_i : width of the task      2    1    1    2    1    1
h_i : height of the task     4    2    5    2    2    2

Table 3.1: Tasks to schedule on an FPGA of size 7 × 6

Figure 3.17: Scheduling tasks on a 7 × 6 reconfigurable array using a 1D placement model

1D bin-packing algorithms are used to perform 1D placement. In 1D placement, modules are placed and routed in a vertical manner, as depicted in figure 3.17(a); hence, modules cannot interfere on the X-axis. The placement problem is simplified to the allocation of one interval on the X-axis to each incoming module. Heuristics implementing 1D placement are simpler and hence better suited to online placement. Until recently, because of technological restrictions, FPGAs hardly supported 2D placement: the devices were only reconfigurable column-wise. Hence, even though 1D and 2D heuristics were developed and simulated, most prototyping works were limited to a 1D scenario. Fortunately, these restrictions are progressively being overcome, and arbitrary parts of FPGAs are now independently reconfigurable.

Best Fit (BF) and First Fit (FF) are two well-known fitting strategies for bin-packing algorithms (Coffman et al., 1997). The FF algorithm tends to be faster, putting the arriving module in the lowest-indexed bin that can accommodate it. The BF algorithm minimizes the wasted space by choosing the bin that has the smallest room able to accommodate the arriving module. Both algorithms require O(n) time for each insertion operation in the worst case, n being the number of bins.
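Both rules are easily sketched in C++ over a list of free areas, reduced here to 1D capacities for brevity; the names and the capacity vector are illustrative assumptions.

// Sketch of First Fit and Best Fit over a vector of bin capacities.
#include <cstddef>
#include <vector>

constexpr std::size_t NONE = static_cast<std::size_t>(-1);

// First Fit: the lowest-indexed bin that can hold the module.
std::size_t firstFit(const std::vector<int>& freeCap, int need) {
    for (std::size_t i = 0; i < freeCap.size(); ++i)
        if (freeCap[i] >= need) return i;
    return NONE;
}

// Best Fit: the fitting bin leaving the least leftover space.
std::size_t bestFit(const std::vector<int>& freeCap, int need) {
    std::size_t best = NONE;
    for (std::size_t i = 0; i < freeCap.size(); ++i)
        if (freeCap[i] >= need &&
            (best == NONE || freeCap[i] < freeCap[best]))
            best = i;
    return best;
}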
2D Placement

In the 2D bin-packing problem, the rectangular module to be inserted can be placed anywhere on the reconfigurable area (figure 3.18). 2D bin-packing heuristics are used, with some restrictions, to perform the 2D placement of rectangular modules on a reconfigurable chip. As depicted in figures 3.18 and 3.19, the modules to be inserted cannot be rotated and must therefore be positioned with a fixed orientation.

Figure 3.18: 2D placement model of tasks on a 7 × 6 reconfigurable array

3D Placement for scheduling

When considering scheduling, a third dimension has to be added for time (figure 3.19). The 3D placement is then similar to packing boxes into a container of size W × H × D, where W is the width, H the height and D the depth. In the case of a 3D placement, the depth is replaced by a time axis t. Hence, contrary to container loading, the rectangular modules can only be placed in limited positions, because some positions are lost as time elapses.

Figure 3.19: 3D view of the 2D placement model illustrated in figure 3.18

3.8.5 Related Work

Bazargan et al. (2000) presented a fast online placement method for dynamically reconfigurable systems, as well as 3D placement algorithms for statically reconfigurable architectures. The reconfigurable fabric is homogeneous and consists of a 2D array of reconfigurable functional units (RFUs). This work provided a valuable framework for research on the online and offline scheduling and placement of tasks on a reconfigurable computing system. Their placement algorithms are divided into the classical two main parts: an empty-space manager for partitioning, and a search engine with a bin-packing rule for task insertion/deletion.

(i). The first part (the partitioning manager) uses both overlapping and nonoverlapping approaches and a binary tree, as described and depicted earlier in figure 3.16, page 111. Bazargan et al. (2000)'s overlapping approach is MER-based and is proposed through the KAMER algorithm, where KAMER stands for Keeping All Maximum Empty Rectangles. This first method takes quadratic space in terms of the number of modules on the reconfigurable array: it has to manage O(n^2) empty rectangles for n tasks placed on the array. Indeed, inserting a new task splits empty rectangles into smaller ones, while removing a task merges empty rectangles to form bigger ones. This increases the time needed to insert an arriving module on the chip. However, it provides a high-quality placement (when combined with a bin-packing rule like Best Fit), in spite of being slower. Consequently, the KAMER approach is much more suitable for offline problems.

In the nonoverlapping approach, Bazargan et al. (2000) proposed the Keeping Nonoverlapping Empty Rectangles method. This method induces O(n)-complexity algorithms, as the number of empty rectangles to manage is linear in the number of placed tasks. Therefore this approach, which comes in a variety of forms, is not optimal but faster.
For example, by giving up slightly on the placement quality, a variant of this approach, combined with the Best Fit (BF) fitting strategy, achieves a speedup of 16 times compared to KAMER (Bazargan et al., 2000). The Keeping Nonoverlapping Empty Rectangles approach is consequently more suitable for online problems, even though it increases the fragmentation of the chip. The interesting part of this work is the one devoted to online placement. Thanks to its superior placement quality, the KAMER method is used as the baseline against which the other placement algorithms are compared, in terms of the placement quality that is lost versus the amount of speedup that is gained.

Many nonoverlapping partitioning variants are assessed by Bazargan et al. (2000), in addition to the static vertical and horizontal partitions pictured in figure 3.13. Shorter Segment (SSEG), Longer Segment (LSEG) and Square Empty (LSQR) are examples of partitioning strategies that dynamically split vertically or horizontally depending on the size and the shape of the resulting free rectangles. None of them significantly outperforms the others. However, SSEG-BF (the SSEG partition strategy coupled with a Best Fit fitting strategy) shows some improvement over the other nonoverlapping variants and therefore provides a good trade-off for online placement when compared to the KAMER algorithms.

(ii). The second part (the search engine) of Bazargan et al. (2000)'s algorithms uses bin-packing fitting strategies as described above (Best Fit, Bottom Left, First Fit, etc.) to select an empty space among those that can accommodate the module whose insertion is requested.

Ahmadinia et al. (2004) proposed a new 1D online dynamic task scheduling algorithm using 1D FPGA partitioning, in order to provide better results than the KAMER and Keeping Nonoverlapping Empty Rectangles methods presented by Bazargan et al. (2000). Their FPGA area model is homogeneous and is divided into slots (or clusters). The arriving tasks are placed inside one of the slots depending on their completion time. The main idea behind this so-called cluster-based algorithm is to avoid the fragmentation occurring in the Keeping Nonoverlapping Empty Rectangles method, by freeing contiguous regions (removing contiguous tasks) of the FPGA at nearly the same time. Indeed, if the tasks placed in the same slot (or neighbourhood) have nearly the same end-of-execution time, they will be removed at nearly the same time, and a large empty space will be created at a precise location; hence, even larger arriving tasks can easily be placed in the newly created empty space. In addition, a new free-space manager is proposed which requires linear memory (O(n)) and runs faster, since it does not need to divide or merge empty rectangles. In their example, the FPGA is divided into 3 slots. The experimental results are compared with the KAMER algorithm in terms of execution speed on one hand, and of the number of rejected tasks on the other hand.

Figure 3.20: Algorithms execution time comparison between the KAMER algorithm (Bazargan et al., 2000) and the 1D cluster-based algorithm (Ahmadinia et al., 2004)

Figure 3.20 shows the comparison result obtained by simulating 1000 tasks with width and height uniformly distributed in three intervals corresponding to the three slots.
Their algorithm shows an improvement of 15 to 20% on the execution time of the KAMER algorithm (Bazargan et al., 2000), while both algorithms have nearly the same percentage of rejected tasks (15.5% for KAMER vs 16.2% for the cluster-based algorithm).

Walder et al. (2003) improved Bazargan's nonoverlapping partitioner proposed in Bazargan et al. (2000). These improvements aim at limiting the chip fragmentation and therefore decreasing task rejection; they are discussed later in this chapter, in section 3.9.3, from a fragmentation perspective. In addition, Walder et al. (2003) proposed a hash matrix as the data structure that represents the state of the reconfigurable array (figure 3.21-(b)). As depicted in figures 3.21-(c), (d) and (e), the matrix stores the nonoverlapping rectangles that are available for placement. Each free rectangle is referenced by a key, and a hash function maps a key to the entry of the matrix that holds the information on the free rectangles referenced by that key (see figure 3.21-(b)). Each entry (a, b) of the table consists of a list of free rectangles (of width a and height b) and a pointer to the elected rectangle in the list. Hence, the entry stores information about any rectangle capable of accommodating a task T[W,H] of size W × H. A rectangle is elected among the others according to a fitting strategy (e.g. best fit, first fit, etc.). With this structure, a feasible placement is found for an arriving task in constant time; the area finder algorithm thus runs in time complexity O(1). These features make the hash matrix particularly suitable for online placement.

However, the main drawback of the hash matrix is its size, which is equal to the size of the reconfigurable array; updating the matrix after each task placement or deletion is therefore a time-consuming operation. For example, in figure 3.21-(c), the reconfigurable array is empty and is assumed to be a free rectangle R1. When a task of size 3 × 5 is placed (see figure 3.21-(d)), R1 is deleted from the matrix and R2 and R3 are inserted. The entries then have to be updated in order to take into account the newly inserted or deleted rectangle(s). After inserting or deleting a task T_i of width w_i and height h_i, the matrix update process may scan and update up to w_i × h_i cells. Despite this long update process, the main advantage of the hash matrix is that, as soon as an area has been found in O(1) time for a task, the task can be placed and started immediately, and the hash matrix updated afterwards. This feature makes Walder et al. (2003)'s placement method valuable for online real-time systems that require short waiting times.

Using a hash matrix that stores nonoverlapping rectangles affects the scheduling/placement algorithm runtime overheads, but does not affect the chip utilization ratio or the task rejection ratio. Indeed, searching for the area that can accommodate a given task in a list or in a hash matrix provides the same result in terms of the selected area; only the time complexity depends on the data structure.
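The O(1) query at the heart of this approach can be sketched as follows in C++, with a plain 2D array standing in for the hash function; the HashMatrix type and its layout are assumptions made for this sketch, not Walder et al. (2003)'s exact structure.

// Sketch of a hash-matrix style lookup: entry (w, h) points at the list of
// free rectangles elected for tasks of size w x h, so a query is O(1).
#include <vector>

struct Rect { int x, y, w, h; };

struct HashMatrix {
    int W, H;                                           // device dimensions
    std::vector<std::vector<std::vector<Rect>>> entry;  // entry[w][h]

    HashMatrix(int W_, int H_)
        : W(W_), H(H_),
          entry(W_ + 1, std::vector<std::vector<Rect>>(H_ + 1)) {}

    // O(1) query: any rectangle recorded for a w x h task?
    const Rect* find(int w, int h) const {
        const auto& lst = entry[w][h];
        return lst.empty() ? nullptr : &lst.front();
    }
    // The costly part is maintenance: after a placement or deletion, every
    // entry referencing a changed rectangle must be rewritten.
};

The sketch also makes the stated drawback visible: the structure is as large as the device, so keeping the entries consistent dominates the update cost, while queries stay constant-time.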
Roman et al. (2006) also proposed a partition-based management of the empty space. The FPGA model is a homogeneous FPGA, made of a 2D grid of identical basic cells, as depicted in Chapter 2, figure 2.6, page 34. The FPGA area is divided into four partitions of different sizes, where the tasks are executed. The size of the partitions is adjusted at runtime according to the profile of the task set being processed, in order to adapt to variations in the task profile.

Figure 3.21: The hash matrix approach (a), the hash table (b), rectangle insertion/deletion in the hash matrix (c), (d) and (e) (Walder et al., 2003)

Each task T_i is modelled by a rectangle and is expressed by a tuple T_i = (w_i, h_i, t_{arr,i}, t_{ex,i}, t_{max,i}), where w_i is the task width, h_i its height, t_{arr,i} the clock cycle at which it arrives, t_{ex,i} the execution time of the task, and t_{max,i} the maximum time allowed for the task to run to completion. With such a fixed and reduced number of partitions, managing the data structures representing and organizing the FPGA space is far simpler. This leads to a fast algorithm (of constant complexity O(1)) particularly suitable for the runtime scheduling and allocation of incoming hardware tasks on runtime-reconfigurable FPGAs. External fragmentation is also avoided. Their O(1) algorithm may compete in performance with common algorithms like First Fit (FF) with exhaustive search, which is O(n^4) for the 2D allocation problem (where n is the dimension of the square FPGA). The experiments carried out have shown that this is particularly true for task sets of heterogeneous sizes. Hence, the strength of this work is to have shown that a simple O(1) algorithm is a feasible solution to the scheduling/allocation problem for heterogeneously sized tasks in multitasking on FPGAs. But this approach is not suited to homogeneously sized task sets, and its FPGA area model does not take into account the ever-increasing heterogeneity of new FPGAs.

Considering the heterogeneity of the reconfigurable array

Homogeneous and heterogeneous reconfigurable arrays have been pictured in figures 2.6, page 34, and 2.11, page 43, respectively. Many researchers (e.g. Koester et al., 2005, 2006) attempted to improve placement algorithms by taking this hardware heterogeneity into account in order to optimize the resource utilization. Depending on the logic it uses, a given module (hardware task) then has a limited number of feasible placement positions on the reconfigurable area. For example, a module using only configurable cells (CLBs) has a large number of feasible positions, while a module using a great amount of memory has feasible positions restricted to the positions of the memory blocks.

Figure 3.22: Placing tasks m1 and m2 on a heterogeneous reconfigurable architecture (Koester et al., 2005)

Koester et al. (2005) proposed a heterogeneous model of reconfigurable hardware. In this model, the reconfigurable area consists of two types of components or cells, as shown in figure 3.22: configurable cells (e.g. CLBs), which can implement any logic function, and embedded static cells (e.g. multipliers, embedded memories, processors, etc.), which are dedicated high-performance IP blocks merged into the FPGA. In their model, pictured in figure 3.22, each task (e.g. tasks m1 and m2) is assigned a set of feasible positions on the reconfigurable architecture before runtime. The feasible positions of a hardware task that uses static cells (e.g. m1) depend on the location of those static cells on the reconfigurable architecture, while a hardware task that only uses configurable cells (e.g. m2) cannot be placed in positions where static cells are located.
Considering the heterogeneity of the reconfigurable array

Homogeneous and heterogeneous reconfigurable arrays have been pictured in figures 2.6, page 34 and 2.11, page 43, respectively. Many researchers (e.g. Koester et al., 2005, 2006) attempted to improve placement algorithms by taking this hardware heterogeneity into account in order to optimize resource utilization. Consequently, depending on the logic it uses, a given module (hardware task) has only a limited set of feasible placement positions on the reconfigurable area. For example, a module using only configurable cells (CLBs) would have a large number of feasible positions, while a module using a great amount of memory would have its feasible positions restricted to the positions of the memory blocks. Koester et al. (2005) proposed a heterogeneous model of reconfigurable hardware. In this model, the reconfigurable area consists of two types of components or cells, as shown in figure 3.22: configurable cells (e.g. CLBs), which can implement any logic function, and embedded static cells (e.g. multipliers, embedded memories, processors, etc.), which are dedicated high-performance IP blocks merged into the FPGA.

Figure 3.22: Placing tasks m1 and m2 on a heterogeneous reconfigurable architecture (Koester et al., 2005)

In their model, pictured in figure 3.22, each task (e.g. tasks m1 and m2) is assigned a set of feasible positions on the reconfigurable architecture before run-time. The feasible positions of a hardware task that uses static cells (e.g. m1) depend on the location of the static cells on the reconfigurable architecture. A hardware task that only uses configurable cells (e.g. m2) can only be placed in positions where no static cells are located. According to these two rules, task m1 has 4 feasible positions, and so does task m2. On one hand, each cell of the reconfigurable fabric has a utilization probability induced by the feasible positions of all hardware tasks. This utilization probability is determined before run-time (at design time), since all hardware tasks and the reconfigurable fabric model are known beforehand. On the other hand, for each feasible position of a requested task m, the mean utilization probability of the corresponding cells gives a position weight wpos that is used to decide which feasible position to select for the hardware task placement. Depending on whether the position weights wpos are generated before run-time or dynamically updated at run-time according to the currently allocated tasks, two placement algorithms were proposed: the Static Utilization Probability Fit (SUP Fit) and the Runtime Utilization Probability Fit (RUP Fit). In the light of metrics such as the average device utilization, the hardware task rejections and the average priority of the rejected tasks, simulation results of the SUP Fit and RUP Fit algorithms for 1D and 2D placement have shown some improvement over the standard Best Fit (BF) placement algorithm presented in Coffman et al. (1997). Table 3.2, from Koester et al. (2005), summarizes the simulation results of the 1D placement approach compared with the standard Best Fit (BF) placement algorithm. The SUP Fit algorithm has a low run-time complexity and rejects fewer tasks than the Best Fit algorithm. The RUP Fit algorithm produces the highest device utilization (least fragmentation) and a lower relative task rejection, but it has a high run-time complexity, since the position weights have to be updated after each task removal or insertion. Because of technological restrictions of FPGAs at that time, Koester et al. (2005) prototyped a 1D placement approach in their work.

Metrics vs Algo                    Best Fit   SUP Fit   RUP Fit   Homogeneous Best Fit
Av. device utilization (%)         49.21      47.94     50.46     56.28
Tasks rejection (%)                74         66        68        33
Relative tasks rejection (r) (%)   24.43      23.94     22.92     14.88
Av. priority rejected tasks        1.68       1.68      1.73      1.88

Table 3.2: Comparison of the simulation results with the 1D-placement approach (Koester et al., 2005)
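A minimal sketch of the position-weight computation is given below. The names (util, positionWeight) and the preference for the smallest weight are assumptions made for illustration, since only the published behaviour of SUP/RUP Fit is reported here; the weight of a candidate position is simply the mean utilization probability of the cells it covers:

    #include <vector>

    // util[i][j]: probability that cell (i, j) is needed by the feasible
    // positions of all tasks (fixed for SUP Fit, refreshed at run-time for RUP Fit).
    double positionWeight(const std::vector<std::vector<double>>& util,
                          int x, int y, int w, int h) {
        double sum = 0.0;
        for (int i = x; i < x + w; ++i)
            for (int j = y; j < y + h; ++j)
                sum += util[i][j];
        return sum / double(w * h);   // wpos: mean utilization probability
    }

The placer would then rank a task's feasible positions by wpos, e.g. preferring positions built on weakly contended cells so that heavily demanded cells remain free for future tasks.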
3.9 Fragmentation and Related Work

Partial and runtime reconfiguration allows a single FPGA chip to concurrently execute many tasks on a space-sharing basis. These tasks of arbitrary sizes are dynamically swapped in and out of the chip over time, leading to an increasingly fragmented FPGA, as shown in figure 3.25, page 126. Fragmentation arises when the available resources are spread over the FPGA and consist of noncontiguous small areas. This can prevent tasks from being placed even if there are enough available resources to accommodate them. Indeed, a task can only fit in contiguous resources that cover an area at least as big as the task area. In an offline placement on partially reconfigurable FPGAs, as the sequence of tasks to be executed is known in advance, greedy algorithms or heuristics can be found to determine the optimal fitting strategy that minimizes the fragmentation. For example, strategies of the highest complexity (e.g. a MER-based defragmentation combined with a cell-scan-based defragmentation strategy, as illustrated in figure 3.23) can be afforded. In an online context, since the sequence of modules to be loaded on the reconfigurable area is not known beforehand, insertion and deletion of modules lead to fragmentation of the available space. In order to avoid task rejection because of this fragmentation, low-complexity algorithms or heuristics are needed that defragment the FPGA at runtime. Depending on the way fragmentation is measured and on the way free areas are managed, there are different ways of coping with the chip fragmentation problem, leading to different results and complexities, as detailed below and mapped in figure 3.23:

(i). The nonoverlapping rectangles approach
As previously stated, available areas are kept as nonoverlapping rectangles. While placing a task, the placer chooses, among all the rectangles which could fit the task, the rectangle which minimizes the fragmentation (e.g. Best Fit or more complicated heuristics). Hence, as for placement heuristics, fragmentation heuristics are of lower complexity thanks to the limited number of rectangles, but are less efficient.

Figure 3.23: Defragmentation strategies: complexity grows with performance

(ii). The MER (Maximum Empty Rectangle) approach
Managing nonoverlapping rectangles leads to more fragmentation, as any such rectangle is included in a MER if it is not a MER itself. As already stated, the advantage of managing MERs (Maximum Empty Rectangles) is that placement algorithms or heuristics always find a placement solution if one exists. Since every candidate area is identified, the best candidate which optimizes a given criterion (e.g. minimizes the final fragmentation in the present case) can also be identified, leading to a better placement quality. A way of reducing the chip fragmentation is therefore to deal with MERs while placing a task: one chooses the MER which minimizes the fragmentation, which generally means choosing, among the MERs which could fit the task, the one which contributes the most to the final fragmentation of the chip. Consequently, optimal defragmentation heuristics are MER-based. However, the runtime overhead is very high because of the greedy scanning processes performed while updating the list of MERs at each task placement or withdrawal.

(iii). The cell-level approach
The two approaches (MER and nonoverlapping) presented above directly derive from the way free areas are managed. The resulting complexity of fragmentation heuristics follows a trend quite similar to the complexity of placement heuristics in the two cases. However, the final complexity of determining the fragmentation also depends on whether the contribution of each free area to the final fragmentation is computed from the contribution of each cell in the area or not. If the fragmentation is determined at cell level, all the cells in the area have to be scanned. A cell-by-cell scanning process remains the only way to obtain optimal results in terms of minimum fragmentation, but leads to greedy computing operations with high runtime overheads. A fragmentation metric combining the MER approach with a cell-level fragmentation is definitely optimal (e.g. Cui and Deng, 2007), but is the greediest solution in terms of complexity and runtime overheads.

There are two types of fragmentation, internal and external.
3.9.1 Internal and Intra-task Fragmentation

In 1D placement, internal fragmentation occurs when a task does not utilize the full height of the reconfigurable device area (see the pictures on the left of figures 3.24 and 3.25). This internal fragmentation leads to a lower FPGA utilization ratio in 1D placement compared to 2D. One way of minimizing internal fragmentation is to maximize the height of a hardware task during its design. By doing so, some resources can be gained (saved), as shown on the right of figure 3.24. Indeed, tasks T1 and T′1 are similar with respect to total area and functionality but are of different shape. T′1 occupies the full height of the device for a smaller width, and therefore produces less internal fragmentation than T1. Such an improvement is obtained by using appropriate shape constraints in place and route tools at design time. In addition, because modules are rectangular-shaped, there is also intra-task fragmentation. Intra-task fragmentation is a loss of resources that arises when the hardware task is assumed to be rectangular. As pictured in figure 3.24, within the rectangle bounding all the resources needed by task T1 or T2, some resources remain unused and are consequently lost. However, as rectangular tasks are much easier to manipulate by placement algorithms, this assumption is made by almost all studies on hardware task scheduling.

3.9.2 External Fragmentation

In figure 3.25 (b), there are currently 4 modules spread over the reconfigurable array, but almost half of the array is still free. However, because of its shape, the incoming module T5 cannot be inserted even though its size is smaller than the total remaining free space. This fragmentation of the free space is known as external fragmentation. It increases the task rejection ratio and decreases the chip utilization ratio because of the waste of resources.

3.9.3 Related Work

The main challenge of placement heuristics is to manage the free space and schedule arriving tasks so that fragmentation is avoided. This is quite difficult to achieve in an online scenario. As stated earlier, offline algorithms or heuristics are very efficient but slow, while online placement heuristics are faster but less efficient. In some systems with recurring idle times (e.g. an automotive electronic system during the night, when the car is parked), an offline placement component can optimally relocate the modules on the FPGA, in order to defragment the available free space during the idle time. Unused or non-critical modules can be interrupted and relocated in order to free the maximum contiguous space. This offline defragmentation eases the job of the online placement component while the system is in use. Hence, one can overcome fragmentation problems in some specific applications by combining online placement with offline defragmentation. Veen et al. (2005) present such a strategy. Their online placer has an offline component called the defragmenter, which performs an offline relocation of the currently placed tasks with the objective of maximizing the areas of contiguous free resources. This defragmentation eases the online task placement process, which is done at runtime.

Figure 3.24: Intra-task and internal fragmentation

Figure 3.25: A fragmented FPGA: the free space available on the chip is sufficient to insert the arriving task, but its shape doesn't allow it.
This approach is extremely useful for FPGAs in which partial reconfiguration can only be performed columnwise (e.g. Xilinx Virtex-II Pro FPGAs). Indeed, in such an FPGA, while performing a 2D placement, reconfiguring (or placing) a module affects all the modules interfering columnwise. If defragmentation were done online, all the interfering modules would have to be reconfigured, which is not desirable at runtime. The KAMER algorithm used in Bazargan et al. (2000), combined with various fitting strategies, also aimed at minimizing the fragmentation by providing the optimal fitting solution. Hence, if there is more than one solution, one can choose the fitting strategy that optimizes (minimizes) the fragmentation (e.g. best fit). This optimality comes at the cost of algorithm complexity, as explained earlier in this chapter. Therefore, such a MER-based area management is difficult to combine with defragmentation strategies such as the partial rearrangement approach (O. Diessel and Schmidt, 2000) discussed below. However, as noticed earlier, the adjacency graph (or binary tree, Bazargan et al., 2000) avoids unlimited fragmentation by retrieving previous states of the array, as shown in figure 3.26. Overlapping or nonoverlapping areas are inserted in or deleted from the binary tree at task placement or deletion. Figure 3.26 depicts the case of nonoverlapping rectangles. It shows how the area of the reconfigurable array is fragmented as tasks are placed and restored as tasks end. When an available area is selected for task placement, if the task is smaller than the area, the remaining area is divided into two nonoverlapping rectangles, bounding the time complexity of inserting a task to O(n). As illustrated in figure 3.26-(a), rectangle A0 is chosen to accommodate task T1. Two nonoverlapping children rectangles A1 and A2 are then generated, resulting from a vertical split of the remaining area (figure 3.26-(b)). As tasks T1 and T2 are successively placed (states (a) → (b) → (c) of the area splitting process), the tree is updated accordingly (states (a′) → (b′) → (c′) of the tree update process). Assuming that tasks T1 and T2 successively finish (which is not always the case), the original area A0 is rebuilt along the reverse process (states (c) → (b) → (a) of the bigger-area recovery process, states (c′) → (b′) → (a′) of the tree update process). At any time, the currently free rectangles are the leaves of the tree.

Figure 3.26: Bazargan's adjacency graph: bigger rectangles restoration process (Bazargan et al., 2000)

Storing successive states of the array and restoring them eases the process of rebuilding areas which have been split during a task placement. However, no matter which split is used (vertical, horizontal, overlapping, etc.), a situation may arise where a task is rejected because the contiguous space which could fit it does not belong to the same rectangle or sub-graph. Indeed, the tree intrinsically produces a fragmentation problem, which is lower for overlapping rectangles. The tree is non-optimal, as a task can be rejected because of this fragmentation problem. The main advantages of the tree approach are its automatic restoration of bigger areas and its algorithmic complexity.
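The split at the heart of this tree can be sketched in a few lines of C++; the names (Rect, Node, placeWithVerticalSplit) are illustrative, and only the vertical split of figure 3.26 is shown:

    // Sketch of the vertical split behind Bazargan's tree (figure 3.26):
    // placing task T in rectangle A0 replaces leaf A0 by children A1 and A2.
    struct Rect { int x, y, w, h; };

    struct Node {
        Rect area;
        Node* left = nullptr, * right = nullptr;   // children after a split
        bool used = false;                         // leaf hosting a running task
    };

    // The task occupies the bottom-left corner of n->area; the remainder
    // becomes two nonoverlapping free rectangles (new leaves of the tree).
    void placeWithVerticalSplit(Node* n, int tw, int th) {
        Rect a = n->area;
        n->used = true;
        n->left  = new Node{{a.x + tw, a.y, a.w - tw, a.h}};   // strip right of the task
        n->right = new Node{{a.x, a.y + th, tw, a.h - th}};    // area above the task
    }
    // When the hosted tasks end and both children are free again, deleting
    // them restores A0: the "bigger areas recovering" process of figure 3.26.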
O. Diessel and Schmidt (2000) proposed a quad-tree structure which stores the information on the available resources of the FPGA. Three methods are studied (local repacking, ordered compaction and a genetic algorithm) for placing tasks on a partially and dynamically reconfigurable FPGA through a so-called partial rearrangement process. Partial rearrangement aims at rearranging some of the already executing tasks in order to fit an arriving task which normally could not fit immediately. Partial rearrangement hence significantly reduces queuing delays, as tasks are placed as soon as possible and complete earlier, freeing space for other incoming or waiting tasks. The three proposed defragmentation methods are of different but still high complexity. The genetic algorithm approach is more efficient in finding fitting areas, at the cost of complex rearrangements, but is worthwhile only if the reconfiguration delay of the task (the time needed to configure it) is small compared to its execution time; there is then more time for processing complex rearrangements without causing execution delays. The two other approaches (local repacking, ordered compaction) are less complex, and thus suit tasks with shorter execution times and longer reconfiguration delays. The computation times of the three methods are not included in their study, as it is assumed that these computations can be executed in the background during inter-task arrival periods. In addition, these methods assume hardware task preemption, which is not that obvious in an online real-time context.

Contrary to O. Diessel and Schmidt (2000), who coped with preemptive tasks, Walder and Platzner (2002) presented a transformation method (Footprint Transform, figure 3.27) for non-preemptive multitasking on FPGAs. What is new in their case is that their task model also considers coarse-granularity tasks of non-rectangular shape. However, each task consists of a set of rectangular tasks denoted as subtasks (e.g. figure 3.27). Hierarchically, as shown in figure 3.27, S1, S2 and S3 are rectangular subtasks constituting the coarse-grain task T. The relative positions of subtasks can be modified during allocation (footprint transforms), in order to find a shape of task T (shape 1, 2, 3, etc. in figure 3.27) which fits in the available free area. A footprint transform is performed on a task if its placement fails at the first attempt. The footprint transform consequently reduces queuing delays, as tasks are placed as soon as possible. Also presented are first-fit and best-fit placement techniques. Among a list of positions in the free area that may accommodate the task to be placed, the best-fit algorithm selects the best candidate according to a given criterion. The selection criterion in the present case is the fragmentation of the residual area, which is to be minimized; the residual area here is the remaining free area on the array after placing the current task. Accordingly, Walder and Platzner (2002) introduced a new fragmentation metric denoted as the fragmentation grade and expressed by equation 3.14:

F = 1 - \frac{\sqrt{\sum_i n_i\, a_i^2}}{\sum_i n_i\, a_i} \qquad (3.14)

Figure 3.27: Footprint Transform (Walder and Platzner, 2002)

The best-fit algorithm first determines the fragmentation grade F (of the residual area) for each possible rectangle or position where the task could be placed. It then assigns to the task the position with the minimum F.
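Read as code, the metric is straightforward. The sketch below (with hypothetical names; the construction of the histogram itself, which is the expensive part, is assumed to be done) returns 0 when the residual area is one large rectangle and tends towards 1 when it is shattered into many small ones:

    #include <cmath>
    #include <utility>
    #include <vector>

    // classes: histogram of the residual free area, one (n_i, a_i) pair per
    // class, i.e. n_i rectangles of area a_i (equation 3.14).
    double fragmentationGrade(const std::vector<std::pair<int,int>>& classes) {
        double num = 0.0, den = 0.0;
        for (auto [n, a] : classes) {
            num += double(n) * a * a;     // sum of n_i * a_i^2
            den += double(n) * a;         // sum of n_i * a_i (total free area)
        }
        return den > 0 ? 1.0 - std::sqrt(num) / den : 0.0;
    }

For a single rectangle of area A the grade is 1 − √(A²)/A = 0, while for k equal rectangles it is 1 − 1/√k, which grows towards 1 as the free space splinters.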
As F is calculated for every possible fitting position of the task T on the free part of the FPGA array, the bigger T (and/or the smaller the available area on the array), the fewer the possible fitting positions. Equally, the lower F, the more likely a future task is to fit. However, determining the fragmentation grade remains a computationally heavy iterative process: for a given allocation, all the possible rectangles of the residual area are identified and classified according to their size. The resulting histogram of free areas consists of i classes of ni rectangles of size ai, and the fragmentation grade for one allocation is given by equation 3.14. Their results are mixed, as the footprint transform in combination with a first-fit strategy sometimes outperforms best fit in terms of the task set's total execution time (scheduler makespan), and sometimes does not. Moreover, these results are obtained with only 25% of the task set being footprint-transformed. Such results are interesting as they show how different combinations can lead to different algorithm complexity vs scheduling quality trade-offs. However, the complexity and runtime overhead of the algorithms are not assessed in order to measure their suitability for online scheduling.

In a later work, Walder et al. (2003) proposed an improvement of Bazargan's partitioner presented above (Bazargan et al., 2000) in order to limit the fragmentation and improve placement quality. They gave up the idea of dealing with non-rectangular tasks, and the novelty is to delay as much as possible the split decision when placing a new task in a bigger rectangle. Walder et al. (2003) relied on the same binary tree structure introduced by Bazargan et al. (2000), presented above in section 3.8.3, to manage the areas. No matter which heuristic is used by Bazargan et al. (2000) to decide how to split the hosting rectangle (vertical, horizontal, etc.), a situation can arise where the next task does not fit in any of the resulting rectangles because of the split previously done. In order to minimize such cases, one solution is to delay the split decision until the arrival of the next task to be placed, and then to perform the right split accordingly. For example, in their so-called on-the-fly partitioning (OTF), Walder et al. (2003) manage overlapping rectangles in innovative ways: an overlapping rectangle is resized only if it overlaps with the just-placed task, and the generated children rectangles are kept overlapping as much as necessary. Two other variants of partitioning algorithms are assessed. Walder et al. (2003)'s partitioning approaches outperform Bazargan et al. (2000)'s partitioner by up to 70% in terms of average task waiting time.

In order to limit the fragmentation of the reconfigurable array, Ahmadinia et al. (2004) proposed an algorithm that uses two horizontal lines to manage the free space on the array. Instead of maintaining a list of empty rectangles, they manage two horizontal lines, one above and one below the already running tasks, as depicted in figure 3.28. As no significant free area should remain between the two horizontal lines, newly arriving tasks can be placed either above Horizontal_line_1 or below Horizontal_line_2, and the corresponding line is modified accordingly.

Figure 3.28: Using horizontal lines to manage free space (Ahmadinia et al., 2004)
The overall idea behind this approach is to place side by side tasks with nearly the same finishing time. Hence, contiguous spaces become free at nearly the same time, reducing the fragmentation, and the resulting empty space is more likely to fit future, even larger, tasks. Furthermore, the principle is extended to a clustered approach, where the reconfigurable array is split into a number of 1D clusters, each cluster accommodating tasks with similar finishing times. What is interesting in this algorithm (which does not manage MERs) is its comparison to the MER approach. Indeed, as already stated, managing MERs certainly leads to a better placement quality in terms of task rejection ratio, but at the cost of high complexity and algorithm runtime overhead. Results show that their algorithm takes roughly 18% less computation time than the KAMER algorithm of Bazargan et al. (2000) for almost the same task rejection ratio (16.2% and 15.5% respectively).

In Handa and Vemuri (2004a) and Handa and Vemuri (2004b), MER-based approaches are proposed for avoiding fragmentation while scheduling non-preemptive tasks. In Handa and Vemuri (2004b), free areas are kept as a list of MERs. In order to avoid area fragmentation, scheduling decisions are deferred as much as possible to accommodate dynamically changing task priorities. Results show that delaying scheduling decisions leads to reduced area fragmentation (better area utilization) only when tasks are not data-dependent upon each other (out-of-order processing, unlike in-order processing) and can be executed in any order. In Handa and Vemuri (2004a), fragmentation is calculated at cell level. In their approach, the total fragmentation contribution TFC_C of each cell C in the MER is obtained by summing its contribution in the horizontal dimension (FC_{C,x} on the x axis) and in the vertical dimension (FC_{C,y} on the y axis), as shown in equation 3.15. The total fragmentation TF of a MER is the average value of TFC_C over all the cells in the MER, as given by equation 3.16:

FC_{C,x} = \begin{cases} 1 - v_x / L_x & \text{if } v_x \le L_x \\ 0 & \text{otherwise} \end{cases} \qquad
FC_{C,y} = \begin{cases} 1 - v_y / L_y & \text{if } v_y \le L_y \\ 0 & \text{otherwise} \end{cases}

\forall \text{ cell } C: \quad TFC_C = FC_{C,x} + FC_{C,y} \qquad (3.15)

\forall \text{ MER of } k \text{ cells}: \quad TF = \frac{1}{k} \sum_{i=1}^{k} TFC_C(i) \qquad (3.16)

Lx (resp. Ly) represents twice the average width (resp. height) of the tasks currently placed on the array, and vx (resp. vy) is the number of contiguous empty cells horizontally (resp. vertically) aligned with the cell considered. FC_{C,x} and FC_{C,y} express the fact that the more empty cells surround a cell of the MER, and the smaller the average size of the currently placed tasks, the lower TFC_C. During a placement process, if several MERs can fit the task, the MER with the maximum TF is chosen. Hence, the MER with the highest fragmentation impact on the FPGA is preferably destroyed (by placing a task on it), preserving the MERs with a lesser contribution to the fragmentation, so as to reduce the overall FPGA fragmentation. Once the hosting MER is found, the task is placed in the one of its four corners with the highest TF for a rectangle of the size of the task. At all levels, this algorithm is based on a scanning approach: on one hand it deals with MERs (which is computationally greedy), and on the other hand all the cells of the MERs are scanned, and a huge amount of time is spent computing the right corner where to place the task.
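Using the reconstruction of equations 3.15 and 3.16 above, the per-cell computation can be sketched as follows (illustrative names; a real implementation would also have to extract vx and vy from the array state, which is precisely the expensive scanning step):

    #include <utility>
    #include <vector>

    // FCC along one axis: 1 - v/L if v <= L, 0 otherwise (equation 3.15).
    double cellContribution(double v, double L) {
        return (v <= L) ? 1.0 - v / L : 0.0;
    }

    // TF of a MER (equation 3.16): average of the per-cell contributions,
    // given one (vx, vy) pair per cell of the MER.
    double merFragmentation(const std::vector<std::pair<double,double>>& cells,
                            double Lx, double Ly) {
        double tf = 0.0;
        for (auto [vx, vy] : cells)
            tf += cellContribution(vx, Lx) + cellContribution(vy, Ly);  // TFC of the cell
        return cells.empty() ? 0.0 : tf / cells.size();
    }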
Tabero et al. (2004) use a vertex list structure to keep track of the available free area. They propose a fragmentation-based heuristic (among other criteria-based heuristics) that prevents the proliferation of holes in the FPGA. Before assigning one position among the possible candidates, the fragmentation produced by each candidate area is first evaluated, and the area that minimizes the fragmentation is chosen. This approach is similar to the one used by Walder and Platzner (2002), even though they use different fragmentation metrics. Indeed, Tabero et al. (2004) first calculate the fragmentation produced by each MER, which is obtained by summing the contributions of each cell of the MER. Obviously, this slows down the placement process.

Koester et al. (2006) adopted a defragmentation-by-module-relocation approach to deal with the continuous fragmentation of the reconfigurable array over time. As hardware tasks can be placed and removed at runtime (at restricted or feasible positions, as presented in Koester et al. (2005)), an increasing fragmentation of the FPGA prevents subsequent tasks from being placed. Their solution is to relocate the currently placed tasks at runtime so that the requested task can be placed; their runtime defragmentation algorithm implements this approach. Prototyped on dynamically and partially reconfigurable Xilinx Virtex-II FPGAs, their algorithm applied to 1D placement shows some improvement in placement quality. For example, the total execution time of the task set is reduced to 87.1% in the best case, compared with a scheduling algorithm without any runtime defragmentation. However, this defragmentation-by-relocation approach is worthwhile only if the task reconfiguration time is negligible compared to the task execution time. In addition, as noticed earlier, their approach takes the heterogeneity of FPGAs into account by identifying feasible positions for tasks according to the types of resources they use (logic cells or embedded memory).

Cui and Deng (2007) proposed an online task placement algorithm which aims at minimizing fragmentation on partially reconfigurable FPGAs. They introduced a 2D area fragmentation metric that takes into account the probability distribution of the sizes (width and height) of future task arrivals. Their assumption is that dedicated embedded applications are targeted, and consequently task arrival patterns are predictable. Cui and Deng (2007) is also a MER-based approach, which uses a Scan Line Algorithm (denoted as SLA and studied in a previous work of the same authors) to determine the set of MERs and maintain a fragmentation matrix (FM). As in Handa and Vemuri (2004a), the contribution to the fragmentation is calculated at cell level and then at MER level. However, apart from assuming that the probability distribution of the sizes of arriving tasks is known, what is also new is that they introduced a Time-Averaged Area Fragmentation (TAAF) metric. The TAAF metric is used to evaluate the fragmentation of a MER over present and future time. Hence, the contributions of MERs to fragmentation are not compared only at the present time, but averaged over a time interval [t_present, t_future]. Indeed, in order to compute the TAAF, one has to mimic future events (end of some tasks and beginning of planned tasks, modification of MERs along with cell fragmentations), which is computationally intense.
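The time-averaging itself can be sketched as below. This is only an interpretation of the TAAF idea, with hypothetical names; fragAt stands for the costly replay of the area state at a given time, which is where the computational intensity noted above comes from:

    #include <vector>

    // Time-weighted average of a piecewise-constant fragmentation value over
    // [t_now, t_future]; eventTimes holds the sorted task start/end instants.
    double taaf(double t_now, double t_future,
                const std::vector<double>& eventTimes,
                double (*fragAt)(double t)) {
        double acc = 0.0, prev = t_now;
        double frag = fragAt(t_now);
        for (double t : eventTimes) {
            if (t <= t_now || t >= t_future) continue;
            acc += frag * (t - prev);      // fragmentation held since prev event
            prev = t;
            frag = fragAt(t);              // replay the area state at time t
        }
        acc += frag * (t_future - prev);
        return acc / (t_future - t_now);
    }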
As for almost all similar works coping with the fragmentation problem, algorithm running times are barely measured; in Cui and Deng (2007), only the running time of the algorithms that update the list of MERs is measured, and on a Solaris workstation rather than on an embedded target. Combining a MER-based area management with a cell-based defragmentation strategy and a one-level looking-ahead scheduling approach leads to a time-consuming placement that is not suitable for an online context on an embedded processor.

3.10 Conclusion of the Chapter

This chapter reviewed the background literature relating to the research. The chapter discussed the real-time scheduling problem for uniprocessor and multiprocessor systems. From there, it drew similarities between this well-known scheduling problem and the problem of scheduling real-time hardware tasks on dynamically and partially reconfigurable hardware devices. The main advantage of this approach was to see how the models used in microprocessor scheduling can be transposed to reconfigurable hardware device scheduling. Afterwards, the chapter presented different works on hardware task scheduling through the two main scheduling strategies, denoted as looking-ahead and without-looking-ahead respectively. The review also discussed the underlying placement problem, which is specific to reconfigurable hardware device scheduling. The fragmentation problem was presented separately from the placement problem in order to point out its importance in task placement. This separation sometimes led to some redundancy, as the same related work could be presented both from a scheduling/placement perspective and from a fragmentation perspective. A summary and a classification of related work on scheduling and placement for reconfigurable hardware devices is presented in table 7.1, page 246, Appendix A.

The next chapter presents the methodology, along with the models and the metrics proposed in this thesis, in order to cope with the problem of scheduling online real-time hardware tasks on dynamically and partially reconfigurable hardware devices. In this thesis, most of the scheduling algorithms proposed in Chapter 5 and simulated in Chapter 6 use placement strategies that rely on the hash matrix (Walder et al., 2003) and the tree (Bazargan et al., 2000) discussed earlier in this chapter (on pages 118 and 110 respectively). If not, it is clearly indicated. This does not affect the comparative study of scheduling algorithms, as two algorithms can be compared only if they rely on the same placement strategy.

Chapter 4
Proposed Methodology, Models and Metrics

4.1 Introduction

This thesis relies on a formalism well known in real-time scheduling modeling. According to this formalism, discrete time is measurable and central, as it characterizes each element in the system. In the previous chapter, the formalism was customized whenever possible to take into account the specificity of systems that include dynamically reconfigurable parts. Some models and metrics for microprocessor scheduling were already introduced in the previous chapter. However, the latter mainly discussed real-time scheduling problems in general, and exhaustively presented related work on scheduling and placement targeting reconfigurable hardware devices. A part of this chapter deals with models and metrics that are meaningful for hardware task scheduling on reconfigurable hardware devices.
Application and hardware task models are presented, followed by a resources model, then a scheduler model along with the underlying placer model. As stated previously, the thesis mainly considers the problem of scheduling a set of aperiodic real-time tasks on a reconfigurable array. The proposed models and the corresponding metrics are derived from the multiprocessor platform models described in the previous chapter. Hereinafter is the proposed methodology, which results from the understanding gained in the literature review of the previous chapter, the primary objective of this thesis being "the online scheduling of real-time hardware tasks on partially and dynamically reconfigurable hardware devices".

4.2 Methodology

4.2.1 Introduction

The problem of scheduling online real-time tasks for multitasking or hardware virtualization on a reconfigurable hardware device is coupled with a placement problem. Real-time scheduling aims to define how to schedule the elementary tasks of an application on a limited computing resource in order to complete the application within a given time frame. Placement aims to use area management algorithms and heuristics in order to efficiently allocate the reconfigurable array to tasks. Therefore, scheduling and placement are closely linked. The proposed methodology first takes into account the targeted architectures, which are embedded reconfigurable systems subject to many constraints, including online real-time requirements. As a whole, online scheduling and placement heuristics should be light enough to run on an embedded processor and to take fast placement decisions as tasks arrive. Consequently, fast scheduling/placement heuristics may be preferred to those that provide high-quality placement at the cost of a high runtime overhead. Different scheduling and placement approaches were discussed in the previous chapter and summarized in tables 7.1 and 7.2, pages 246 and 247. The following methodology aims to provide quick guidance towards acceptable trade-offs between runtime overhead and placement quality, trade-offs which are likely to enable online scheduling of real-time hardware tasks on a runtime and partially reconfigurable device.

4.2.2 Proposed Methodology

As scheduling tasks on a reconfigurable platform brings an additional placement problem, the methodology consists of assessing the runtime overhead of placement algorithms in order to see which ones are suitable for online real-time scheduling. Indeed, in all the previously cited work (Chapter 3) on reconfigurable hardware scheduling and placement, algorithm simulations are performed on desktop computers. Desktop computers are usually clocked at more than 1 GHz, are power hungry, and host sophisticated cache organizations (e.g. 1 MB L2 cache) and memory management units. Furthermore, unlike embedded processors, they concurrently run numerous services and applications, most devoted to ergonomics and human-machine interaction. These factors make runtime overheads very difficult to measure accurately (e.g. at cycle level). The obtained results may not reflect what the real runtime overhead would have been on an embedded processor running these scheduling and placement algorithms. In general, embedded processors are rarely clocked at frequencies higher than 250 MHz.
Figure 4.1: A simple architecture of a reconfigurable SoC

The first step in the methodology was to implement MER-based placement algorithms on an embedded platform in order to measure the real runtime overhead of such placement algorithms. This step relied on a simple model of an embedded reconfigurable system-on-chip, depicted in figure 4.1. This simple model consists of a CPU-based processor and a reconfigurable fabric. The CPU-based processor runs the scheduling and placement algorithms that manage the reconfigurable part of the chip. Therefore, scheduling and placement algorithms were mapped as software programs written in C language. As illustrated in figure 4.1, the placer maintains a data structure that reflects the state of the reconfigurable matrix. When the placer finds a place to fit a task Ti, a loader (also run by the CPU and not explicitly represented here) loads the corresponding bitstream from memory and partially configures the reconfigurable matrix through a configuration interface. However, the configuration time overhead is technology dependent, as shown later in figure 4.6. Here, the focus was on assessing the runtime overheads of scheduling and placement strategies. The next step was to perform timing measurements on the embedded CPU-based processor while running placement algorithms. The aim of the experiments was to:

1. see what the timing limitations are for systems that could be scheduled online, in real time, by an embedded processor running a MER-based placer, and
2. identify which combinations of scheduling and placement algorithms are likely to be usable in an online real-time context.

The next section presents these experiments and the obtained results, along with the subsequent conclusions.

4.2.3 Running two MER-based Algorithms on an Embedded Processor

One important factor is the time taken by the scheduler to place a task on, or remove it from, the FPGA structure. In a MER-based algorithm, what is time consuming is the process of scanning the array in order to find all the MERs within it. This update is done at each placement or removal.

1. A basic Scheduler/Placer
A basic scheduler that manages two lists of tasks has been implemented: a running tasks list and a waiting tasks list. The scheduler is invoked by two events: either a task has just arrived (figure 4.2-a) or a task has just completed (figure 4.3-c). In the event of task release, as shown in figure 4.2-a, when a task arrives, the scheduler calls the placer (figure 4.2-b), which checks whether there is a MER to accommodate the task. If there is one, it is assigned to the task and the placement is successful. The placer updates the list of MERs (figure 4.2-b) and the task is added to the running tasks list (figure 4.2-a). However, if there is no MER that accommodates the task and the task can still meet its deadline, the task is added to the waiting tasks list for further attempts; otherwise the task is rejected. In the event of task termination, as depicted in figure 4.3-c, when a task ends, the scheduler calls the placer (figure 4.3-d), which removes the task from the array and updates the list of MERs. Afterwards, the scheduler (figure 4.3-c) tries to place as many tasks as possible from the waiting list, and updates the list of MERs accordingly after each successful placement. This scheduling algorithm is quite simple, but sufficient to assess the efficiency of MER-based free area management.
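A condensed C++ sketch of these two event handlers is given below. The Placer interface and the helper names are assumptions standing for SLA or Staircase together with their MER-list update; this is not the exact code run on the MicroBlaze:

    #include <list>
    #include <optional>

    struct Task { int id, w, h, deadline; };
    struct MER  { int x, y, w, h; };

    struct Placer {                                            // SLA or Staircase
        virtual std::optional<MER> findMER(int w, int h) = 0;  // fitting strategy
        virtual void place(const Task&, const MER&) = 0;       // + MER-list update
        virtual void remove(const Task&) = 0;                  // + MER-list update
        virtual ~Placer() = default;
    };

    void onArrival(const Task& t, int now, Placer& p,
                   std::list<Task>& running, std::list<Task>& waiting) {
        if (auto m = p.findMER(t.w, t.h)) { p.place(t, *m); running.push_back(t); }
        else if (now < t.deadline)        waiting.push_back(t);   // retry later
        // otherwise the task is rejected
    }

    void onCompletion(const Task& t, Placer& p,
                      std::list<Task>& running, std::list<Task>& waiting) {
        p.remove(t);                                              // free the area
        running.remove_if([&](const Task& r) { return r.id == t.id; });
        for (auto it = waiting.begin(); it != waiting.end(); )    // arrival order
            if (auto m = p.findMER(it->w, it->h)) {
                p.place(*it, *m); running.push_back(*it); it = waiting.erase(it);
            } else ++it;
    }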
The scheduler maintains two lists of tasks: the running list and the waiting list. The tasks in the waiting list are sorted according to their arrival time. When an area is freed, the algorithm attempts to place the tasks in the waiting list, beginning with the head task. Hence, a free area is assigned to a task in the list only if the area cannot fit any of its preceding tasks (which normally arrived earlier). The scheduling algorithm is detailed in the next chapter, where it is denoted as basic scheduling. Following the same principle, other priority policies can be applied (e.g. EDF, LLF, etc.).

2. Two MER-based algorithms
The SLA (Scan Line Algorithm, Cui and Deng, 2007) and the Staircase algorithm (Handa and Vemuri, 2004c) have been implemented. SLA and Staircase are improved versions of Bazargan et al. (2000)'s MER-based algorithms, and are quite similar in algorithmic complexity. The process of placing a task on the array or removing it induces a time-consuming operation: the update of the list of MERs, highlighted in figures 4.2-b and 4.3-d respectively. The average time taken by the two algorithms to perform the update operation was measured. Staircase scans an average of 15% of the reconfigurable array in order to update the list of MERs after a task placement or removal. Therefore, it runs faster than SLA, as confirmed in the final results shown in figure 4.4.

3. Design environment and tools
In order to carry out accurate timing measurements on an embedded platform, the two algorithms were implemented in C language, and then cross-compiled to target an embedded processor, the Xilinx MicroBlaze soft core. The soft core, along with a timer (for timing measurement purposes), was instantiated on a Xilinx Spartan-3E FPGA hosted by the Spartan-3E Starter board. Both the hardware and software parts of the design were developed using the Xilinx EDK design environment. 1 Mbyte of external memory, storing the program, data and stack, was attached to the embedded processor. The heap and the stack were sized to 12 kbytes each. These sizes were chosen according to the size of the reconfigurable array and the number of tasks to be placed.

4. Simulation parameters and results
The parameters of the generated tasks are detailed in table 4.1-(a). These parameters are uniformly distributed in the interval [min, max]. Table 4.1-(b) shows the results for the task rejection ratio and the number of calls to the MER update function. The array is scanned and the list of MERs updated every time a task is added to or removed from the array. The task rejection ratio and the number of calls to the MER update function were measured through simulations conducted with a set of 2000 tasks; since these results are platform independent, the simulations were simply run on a desktop computer clocked at 1.8 GHz. The task rejection ratio is almost identical for the best fit and first fit fitting strategies.

Figure 4.2: Scheduling one task on the reconfigurable array using a MER-based placement algorithm.

Figure 4.3: Scheduling the end of a task using a MER-based scheduling algorithm.
Tasks parameters    min    max
Arrival time        1      2000
Width               1      19
Height              1      19
Execution time      5      35
Deadline            2      8
(a) Task parameters for simulation

SLA/Staircase    First Fit    Best Fit
Rj (%)           13           12
NMERs            3470         3502
Rj: task rejection ratio (%); NMERs: number of calls to the MER update function; FPGA: width = 50, height = 40
(b) Partial simulation results

Table 4.1: Simulation parameters for the tasks and the reconfigurable array (FPGA)

In fact, in most cases best fit rejects only 1 to 2% fewer tasks than first fit. However, the global runtime overhead is greater for the best fit algorithm, because the greater the number of tasks that are placed and then removed, the greater the number of calls to the MER update function. This improvement of up to 2% in the task rejection ratio has been confirmed through identical experiments conducted with different sets of 2000 tasks. Conversely, measuring the runtime overhead is more sensitive and platform dependent. The embedded FPGA platform presented above was therefore used in order to get accurate results, with the MicroBlaze soft-core embedded processor clocked at 50 MHz. The application scenario consisted of a set of 80 tasks with parameters similar to those in table 4.1-(a). Figure 4.4 details the obtained results. As MER-based algorithms use a scanning approach, the time taken to update the MERs is proportional to the size of the array. The size of the FPGA is assumed to be 50 × 40, which is relatively small compared to the ever increasing size of current FPGAs. According to the results, the Staircase algorithm outperforms SLA in terms of runtime overhead, as it runs more than twice as fast: it takes an average of 6 ms to update the list of MERs, and an average of 15 ms in the worst cases. Looking in depth at the scheduling process described in figure 4.2, it can be observed that updating the MERs (performed by the placer and highlighted in figures 4.2-(b) and 4.3-(d)) is only a part of the process, albeit the most time-consuming one. The time taken to reconfigure the reconfigurable array is not shown here and cannot be neglected.

Figure 4.4: Time for finding a MER (Maximum Empty Rectangle)

4.2.4 Lessons Learnt from Preliminary Results and Conclusion

The above preliminary results clearly show one thing: MER-based algorithms cannot be used in real-time systems that have to respond within a time frame below a given threshold. For example, in a hard real-time system where the tick of the scheduler is 20 ms, it would not be practically possible to respect the real-time constraint. As shown on the left of figure 4.4, the average time for updating the list of MERs is 6 ms for the Staircase algorithm and 15 ms for the SLA algorithm. Considering the worst case (figure 4.4, right), the update time is 15 ms for Staircase and 37 ms for SLA. Figure 4.5 depicts the detailed timing of a scheduling run that uses the Staircase algorithm for area management. It shows that each task placement is preceded by a MER detection that takes up to 15 ms to update the list of MERs, in addition to the configuration time, which does not explicitly appear here. Indeed, considering the trend of FPGA configuration times over the last decade (shown in figure 4.6), as well as the other scheduler runtime overheads, the time needed to schedule one hardware task could easily exceed a hundred milliseconds.
For example, a Xilinx Virtex-6 FPGA may be fully reconfigured in about 50 ms at the fastest possible configuration speed. Recall that the reconfiguration time of a hardware task on the array is proportional to the size of the task. Configuration overhead may thus rapidly become too long for some real-time systems, e.g. where scheduling intervals are 100 ms. Furthermore, the sizes of the reconfigurable device (50 × 40) and of the tasks (19 × 19 at most) used in the simulations and reported in table 4.1 are relatively small. Hence, the timing measurements of figure 4.4 above are quite optimistic, in the sense that the scanning process for updating the list of MERs depends on the sizes of the reconfigurable array and of the tasks.

Figure 4.5: Scheduling timing and overheads (Staircase)

Figure 4.6: Evolution (over one decade) of the configuration time of a full FPGA when considering the fastest possible configuration speed (Koch and Torresen, 2010).

Consequently, it is better to focus on non-optimal but faster placement algorithms. MER-based algorithms will essentially be used to assess how badly non-optimal algorithms behave compared with the optimal solution. Different trade-offs between scheduling/placement algorithm complexity and placement quality were made, in order to reach algorithm complexities and runtime overheads reasonable enough to enable online real-time scheduling. Three main options were identified:

(i). Non-optimal placement strategies
As much as possible, it is preferable to use non-optimal placement algorithms which are not based on maximum empty rectangles (MERs). Detecting the latter requires a scan of the array that may be of complexity O(w × h) in the worst case, where w and h are the width and the height of the reconfigurable array. The placement quality can nevertheless be improved by using overlapping rectangles (overlapping rectangles are not necessarily MERs), which are of quadratic complexity with respect to the number of placed tasks.

(ii). Looking-ahead scheduling for the online clairvoyant paradigm
As much as possible, looking-ahead scheduling algorithms (presented in section 3.7.2, Chapter 3) should be used in order to derive benefit from the online clairvoyant paradigm. Indeed, the online clairvoyant paradigm allows a looking-ahead scheduling algorithm to prospect future states of the processing resources, and therefore improves the scheduling by taking rapid scheduling decisions. However, looking-ahead scheduling algorithms are more complicated, as prospecting future states of the reconfigurable array implies mimicking many task startings and/or completions on the array. Task starting and ending induce placement operations that had better not be of high complexity or runtime overhead. Therefore, MER-based area management is not advised.

(iii). Multi-shape tasks
Another way of improving the scheduling/placement quality without increasing the algorithm complexity and runtime overhead is to generate several variants per task. Such an approach is also discussed in Danne and Platzner (2006a) for offline scheduling. Mahr et al. (2011) also considered more than one variant per task while studying online scheduling on a reconfigurable array. Their online module selection evaluates various scheduling approaches that mainly rely on the size of the modules. However, they do not correlate the module selection with the underlying placement strategy.
Conversely, this thesis evaluates the impact of the number of variants per task, along with their aspect ratios, with respect to the underlying placement strategy. In this thesis, this approach is denoted as multi-shape-based placement. It increases the probability of fitting more tasks on the reconfigurable array, which can be very valuable in a real-time online context. Variants of the same task differ from each other by their size and shape and the corresponding execution time. Multi-shape-based placement therefore requires an extra effort at design time to generate and store several variants (bitstreams) per task, but in return eases the runtime placement process.

For online real-time scheduling, (i) and (ii) above suggest combining scheduling and placement in such a way that they are not both of high complexity and runtime overhead. For example, if the placer relies on a MER-based algorithm, then the scheduler must be of low complexity, i.e. a without-looking-ahead assignment policy (section 3.7.1, Chapter 3). Conversely, when using looking-ahead scheduling, a simple placer is advised (e.g. using nonoverlapping rectangles managed by a binary or ternary tree). As multi-shape-based placement does not really increase the placement algorithm complexity, it can be combined with looking-ahead scheduling to achieve a better scheduling. A few priority-driven scheduling algorithms are proposed, where the task parameters from which the priority is computed are either geometric (width, height, size, shape ratio) or temporal (deadline, laxity, etc.), or a combination of both. Furthermore, priority-driven scheduling strategies are combined with both looking-ahead and without-looking-ahead scheduling approaches.

4.3 Models

An overview of scheduling problems targeting uniprocessor and multiprocessor systems was presented in Chapter 3. That chapter gave some basic processor and task models that suit these problems, and then drew some similarities with scheduling on reconfigurable hardware systems. This section focuses on the problem of scheduling an application on a partially and dynamically reconfigurable array. The application consists of a set of independent aperiodic real-time hardware tasks that have to meet their deadlines. The section presents models which reflect as faithfully as possible an application, the processing resources, the scheduler and the placer.

4.3.1 Real-Time Tasks and Applications Modeling

In this thesis, an application is a sequence of tasks. Each task Ti fulfils a specific and identifiable function in the application. A task corresponds to a sequence of operations to be run on the computing resources. The same task may be run several times. A job J is a running instance of a task; Ji,j denotes the jth instance of task Ti, and will sometimes simply be referred to as a task. From a modular perspective, an application is made of one or several modules, and each module itself consists of one or several tasks. Hence, a task is the smallest identifiable block of an application which fulfils a clear specific function. This task could be a home-made IP or a manufacturer's IP provided as a pre-synthesized and technology-independent netlist. It may also be an HDL (e.g. VHDL or Verilog) description of an electronic functionality.
Figure 4.7: A hardware task model: 3D view (a) and 2D view (b)

Basic Hardware Task Model

In this thesis, a hardware task Ti is an electronic functionality to implement in a reconfigurable device. Hence, it is a bitstream ready to be downloaded into the reconfigurable hardware device. A bitstream is generated after synthesis, placement and routing of a digital circuit, and contains information about the position of the circuit (placement) on the chip. Hereinafter are some basic assumptions, widely accepted in related work, that make the study simpler.

Structural and temporal characteristics
A hardware task Ti shares with a software task similar temporal characteristics, namely:
- the arrival or release time ai,
- the computation time ci or execution time ei,
- the finishing or completion time fi,
- the absolute (resp. relative) deadline di (resp. Di),
- the period (resp. minimum inter-release time) Pi for a periodic task (resp. for a sporadic task), etc.

In the aperiodic task model, each task arrives only once, making the definition of a period meaningless. However, it is assumed that aperiodic tasks share the same period Pi = tD, where tD is equal to the biggest absolute deadline in the system, as expressed in equation 4.4, page 151. In addition, a hardware task has functional and structural characteristics. Functional characteristics reflect the behaviour or the functionality of the task; in the present model, they are not taken into account for its placement. Structural characteristics provide geometric information on the task (e.g. area size, shape, etc.). This information is the most valuable for finding a place that may fit the task, while the timing characteristics provide the information needed for scheduling. In general, as shown in figure 4.7-(b), a hardware task Ti is a rectangular module to be implemented on the reconfigurable device; its most common geometric characteristics are its width wi and its height hi. A rectangular shape simplifies task placement, but at the cost of intra-task fragmentation, as shown in figure 3.24, page 126 and discussed in section 3.9.1. Indeed, each task is represented by the rectangle that encompasses all the resources actually used by the task on the reconfigurable array (figure 3.24); what is denoted as intra-task fragmentation is the resulting loss of resources within the encompassing rectangle. By mixing geometric and temporal characteristics, hardware tasks may be seen as 3D boxes, as mapped in figures 4.7-(a) and 4.9.

States of a hardware task
As pictured in figure 4.5 (page 144), a task Ti is active within the interval that spans from its release time ai to its finishing time fi. Within this time interval, a task may be ready-to-run, running or waiting. Figure 4.8 depicts the different states of a task. Once tasks are created, they are in the idle state. They are then released depending on the task model (periodic, aperiodic, sporadic). Once released, a task becomes active and available (ready) for scheduling. A task in the ready state is ready to run and just waits for the scheduler to give it access to the computing resources: either other tasks with higher (or equal) priority are running on the reconfigurable array, or there is not enough contiguous free area to accommodate the ready task. Tasks in the running state are those that are currently running on the reconfigurable array.
A running task may move to the waiting state if it waits for an event or for resources before continuing its execution. Events may be temporal (delay) or external (interaction with the environment). A task which is in the waiting state can move to the ready state after the expected event occurs, or can move to the idle state if the event comes too late (e.g. if the waiting task can no longer meet its deadline). In this thesis, the waiting state, along with the state transitions drawn in dotted lines, is not taken into account; this means that there is no preemption and no task moves to the waiting state.

Figure 4.8: Different states of a hardware task

Aspect ratio, standing vs laying task
The aspect ratio of a task Ti is given by equation 4.1 and shown in figure 4.7-(b); the task is square-shaped for a ratio equal to 1.

ar_i = \frac{h_i}{w_i} \qquad (4.1)

Figure 4.7-(b) also depicts a standing and a laying task: a task Ti is denoted as a standing task if ari > 1 (resp. a laying task if ari < 1). The synthesis of a hardware task provides some flexibility in choosing the desired aspect ratio.

Multi-shape same-size task
As previously stated in section 4.2.4 above, there may be several variants of the same task. In figure 4.7-(b), the standing and laying tasks may be similar with respect to everything (functionality and timing) but their shape (width and height). This is detailed later in section 5.5, Chapter 5, where multi-shape-based scheduling algorithms are presented.

Relocatability, rotatability
A hardware task is a synthesized and pre-routed electronic circuit, which is assumed to be relocatable anywhere on the chip, as long as there are enough contiguous free resources to accommodate it. This full relocatability assumes that the reconfigurable hardware device is homogeneous. If there is some heterogeneity on the device area, the task may be relocatable only on limited parts of the chip where the resources are similar to the resources used by the task. Such a case is depicted later in figure 4.10, page 154. However, hardware tasks are not rotatable, meaning that an area accommodates a task only if there is no need to rotate the task. For example, although the laying task and the standing task of figure 4.7-(b) are similar in terms of area size, an area whose width and height match those of the laying task cannot fit the standing task.

Online clairvoyant paradigm, preemption and precedence constraints
Unlike offline models, which are deterministic, this thesis copes with more realistic application patterns or scenarios where it is not always possible to know the complete application flow beforehand. To reflect this situation, unknown information is expressed by randomly generating task parameters (arrival time, processing time, width, height, etc.) with a known probability distribution. Hence, the arrival time of each task is arbitrary and unknown beforehand. As long as a task is not released, its parameters are kept unknown to the scheduler. This behaviour corresponds to the online clairvoyant paradigm.

Preemption and precedence constraints
This thesis considers the case of independent and non-preemptable real-time tasks that are subject to deadline constraints. Therefore, the execution time and the deadline of real-time tasks are required.

Software vs hardware versions of tasks
It is assumed that for any hardware task there exists a software version.
Software vs hardware versions of tasks

It is assumed that for any hardware task there exists a software version. Hence, a task that cannot be run in hardware can be run in software with the corresponding execution time. In general, hardware tasks run faster, making hardware implementation suitable for computing acceleration. Nevertheless, this work does not deal with the scheduling of the software versions of tasks.

Online Application Model

In many applications, tasks arrive aperiodically. For example, nowadays, embedded systems are becoming more and more interactive. Tasks may arrive because an event occurred or because a sensor reading became available. The system may then accurately estimate on the fly the resources and the time required to process the newly acquired input information. Such an application model, where the information on tasks becomes known as they arrive, corresponds to the online clairvoyant paradigm. In addition, if the tasks in the application are subject to deadlines, the application is denoted as an online real-time application. This thesis mainly considers online real-time applications featuring an online clairvoyant paradigm (detailed in Chapter 3, section 3.4.2).

There are various kinds of online applications, including semi-online ones. Indeed, in most cases, some partial information is available that could help improve the scheduling. For example, some information may be available on the tasks (e.g. their maximum and/or minimum size, their maximum and/or minimum execution time, the total number of tasks, etc.) or on the application (e.g. the total number of resources needed by the application, etc.).

An application Apk or Γk is a set of k tasks Γk = [T1, T2, ..., Tk]. Γk may be viewed either as a random task graph (without precedence constraints) or as a data flow graph with precedence constraints, as mapped in figure 4.9. It is assumed that the tasks arrive as they are ordered in Γk. This is expressed in equation 4.2 below, where ai is the arrival time of task Ti:

∀Ti, Tj ∈ Γk, i < j ⇒ 0 < ai ≤ aj    (4.2)

Implicitly, the release time of the application is denoted as ta and corresponds to the release time of the first task or job in the application Γk. Let Ti be the first task released in Γk:

ta = ai ⇒ ∀Tj ∈ Γk, Tj ≠ Ti : ai ≤ aj    (4.3)

Identically, the absolute deadline of the application Γk is denoted as td and corresponds to the latest absolute deadline of the tasks in the application. Let di be the latest deadline of a task Ti in Γk:

td = di ⇒ ∀Tj ∈ Γk, Tj ≠ Ti : di ≥ dj    (4.4)

tD = td − ta is the relative deadline of the application. Therefore, Γk may be expressed either as Γk = [T1, T2, ..., Tk; td] or as Γk = [T1, T2, ..., Tk; tD].

Figure 4.9: An application as a set of boxes (task graph)
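Since task parameters are drawn from known probability distributions under the online clairvoyant paradigm, an application Γk can be produced by a small generator such as the following C++ sketch, which reuses the hypothetical HardwareTask structure above; the uniform distributions and their bounds are arbitrary placeholders, not values used in the thesis experiments.

    #include <algorithm>
    #include <random>
    #include <vector>

    // Generate an online application of k aperiodic tasks whose parameters
    // are drawn from uniform distributions (any known distribution may be
    // substituted). Arrival times are sorted so that i < j implies
    // a_i <= a_j, as required by equation 4.2.
    std::vector<HardwareTask> makeApplication(int k, unsigned seed) {
        std::mt19937 gen(seed);
        std::uniform_real_distribution<double> arrival(1.0, 100.0); // placeholder bounds
        std::uniform_real_distribution<double> exec(2.0, 20.0);
        std::uniform_int_distribution<int>     side(1, 8);          // width/height in cells

        std::vector<HardwareTask> app(k);
        for (auto& t : app) {
            t.a = arrival(gen);
            t.e = exec(gen);
            t.D = 3.0 * t.e;        // laxity margin, arbitrary factor
            t.w = side(gen);
            t.h = side(gen);
        }
        std::sort(app.begin(), app.end(),
                  [](const HardwareTask& x, const HardwareTask& y) { return x.a < y.a; });
        for (int i = 0; i < k; ++i) app[i].id = i + 1;  // number tasks in release order
        return app;
    }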
4.3.2 Reconfigurable Devices Area Models

The area models of reconfigurable hardware devices drastically impact the complexity of the proposed task placement solutions. Therefore, these models cannot be studied separately. Depending on how the diversity of reconfigurable resources is taken into account while placing a task, one mainly distinguishes homogeneous from heterogeneous area models. Furthermore, the hierarchical model discussed in Chapter 2, section 2.5.11, is derived from the latter. In addition, depending on whether the reconfigurable technology enables column-wise partial reconfiguration or random partial reconfiguration, one distinguishes 1D placement from 2D placement.

Homogeneous model vs heterogeneous model

Chapter 2, section 2.5.2 gives a clear definition of homogeneous and heterogeneous reconfigurable hardware devices, along with some commercial examples. This section proposes a simple model of these architectures. It was previously said that a DPRHW may be viewed as a 2-dimensional array of Configurable Logic Blocks (CLBs) surrounded by vertical and horizontal programmable routing channels. In the basic model proposed here, the routing channels are not explicitly modelled, as they have no influence on the proposed placement model. Indeed, the latter assumes that a task can fit in a rectangular area on the array as long as there is enough contiguous free space to contain it.

A simplified model of a homogeneous reconfigurable hardware device is mapped in figure 4.10-(a). Even if DPRHWs like FPGAs are becoming more heterogeneous today, this simple model is still widely used to outline scheduling and placement problems. In general, the programmable logic blocks are organized in W columns and H rows, where W and H are respectively the width and the height of the reconfigurable device. As it is assumed that the reconfigurable hardware device enables partial reconfiguration, Afpga = W × H expresses the number of independently reconfigurable units (CLBs) on the reconfigurable array. Afpga also reflects the size or area of the array. As shown in figures 4.10-(a) and (b), an X–Y coordinate system is used to physically locate each CLB or any other resource on the reconfigurable array.

A simplified model of a heterogeneous reconfigurable hardware device is mapped in figure 4.10-(b). Unlike in figure 4.10-(a), some pre-built or pre-instantiated dedicated blocks are already at fixed positions on the chip. As discussed in Chapter 2, the heterogeneous reconfigurable array is the current dominant trend in FPGA architecture. However, the homogeneous model presented above can easily be extended to represent a heterogeneous array. Indeed, as it is always assumed that a hardware task is relocatable anywhere on the homogeneous reconfigurable array as long as there is a rectangular portion that accommodates it, the following two assumptions can be made:

(i) a heterogeneous array is a homogeneous array that has already accommodated some static and position-constrained tasks. For example, the heterogeneous array in figure 4.10-(b) embeds some hardwired blocks (softcore/hardcore processor, BRAM memory, DSP blocks) that are considered as permanent and position-constrained tasks (T1, T2 and T3, top left of the array). Only the remaining areas are still available for placing other tasks.

(ii) a heterogeneous array is an array where each task has one or a few possible placement positions, depending on the match between the kind of resources needed by the task and the position of such resources on the array. This case is shown in figure 4.10-(b), where dedicated resources lie in the bottom right of the array. The BRAM and DSP blocks may be assigned to specific tasks that mainly need these resources to be efficiently implemented. Chapter 3, section 3.8.5 presented related work (e.g. Koester et al., 2005, 2006) that improves the placement quality by taking the hardware heterogeneity into account in order to optimize resource utilization. In their model, each task has a few possible placement positions on the array, and the heterogeneity of the array consists of two kinds of resources: configurable logic blocks and memory blocks.

Figure 4.10: Simple models of homogeneous and heterogeneous reconfigurable arrays
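A direct way to encode the homogeneous area model is an occupancy grid over the W × H array, as in the hedged C++ sketch below; following assumption (i), a heterogeneous device can then be modelled by pre-marking the cells of the fixed blocks as occupied.

    #include <vector>

    // Minimal occupancy-grid model of a W x H reconfigurable array.
    // Cell (x, y) is true when it is used by a placed (or static) task.
    class FpgaArray {
    public:
        FpgaArray(int W, int H) : W_(W), H_(H), used_(W * H, false) {}

        // True if the w x h rectangle whose bottom-left corner is (x, y)
        // lies inside the array and is entirely free (2D placement test).
        bool fits(int x, int y, int w, int h) const {
            if (x < 0 || y < 0 || x + w > W_ || y + h > H_) return false;
            for (int j = y; j < y + h; ++j)
                for (int i = x; i < x + w; ++i)
                    if (used_[j * W_ + i]) return false;
            return true;
        }

        // Mark (occupy = true) or free (occupy = false) a rectangle,
        // e.g. a placed task, or a hardwired block of a heterogeneous array.
        void mark(int x, int y, int w, int h, bool occupy) {
            for (int j = y; j < y + h; ++j)
                for (int i = x; i < x + w; ++i)
                    used_[j * W_ + i] = occupy;
        }

    private:
        int W_, H_;
        std::vector<bool> used_;
    };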
In this thesis, the time overhead due to the configuration of a hardware task on the reconfigurable hardware device is assumed to be negligible. Examples of full reconfiguration of FPGAs are pictured in figure 4.6.

4.3.3 Scheduler Model

As this thesis considers the online clairvoyant paradigm, the scheduling algorithm is not allowed to use information about the future. Consequently, at time t, the scheduler is not aware of any information (ai, ei, di, wi, hi, etc.) on any task arriving at a time ai > t. As depicted in figure 4.11, once a task Ti = (ai, ei, di, wi, hi) of the application Γk is released in the system, the scheduler takes it into account and schedules it in conjunction with the placer. Using a given assignment policy, the scheduler decides which tasks are to be run on the reconfigurable array. The scheduler acts as a task manager and may maintain several lists of tasks. It requests the placer to find a position (xi, yi) that is free at the current time or in the future, depending on whether the scheduling policy is looking-ahead-based or not. Tasks that cannot meet their deadline are rejected. A position (xi, yi) and a starting time si are then assigned, if possible, to each arriving task Ti ∈ Γk, in such a way that task Ti does not overlap both spatially and temporally with any other task in the system. The placer acts as a resources manager: it keeps track of the state of the reconfigurable area, optimizes its use thanks to management heuristics, and provides the scheduler with the best available areas.

Equations 4.5 and 4.6 express the conditions for scheduling tasks on the reconfigurable array. These conditions allow many tasks to run either on a space-sharing basis or on a time-sharing basis. The placer is responsible for verifying the space-sharing conditions (e.g. (xi + wi) ≤ xj for the 1D placement). Equation 4.5 applies to 1D placement, while equation 4.6 applies to 2D placement.

With 1D placement:

∀Tj ∈ Γk, Tj ≠ Ti :  [(xi + wi) ≤ xj] ∨ [xi ≥ (xj + wj)] ∨ [(si + ei) ≤ sj] ∨ [si ≥ (sj + ej)]    (4.5)

With 2D placement:

∀Tj ∈ Γk, Tj ≠ Ti :  [(xi + wi) ≤ xj] ∨ [xi ≥ (xj + wj)] ∨ [(yi + hi) ≤ yj] ∨ [yi ≥ (yj + hj)] ∨ [(si + ei) ≤ sj] ∨ [si ≥ (sj + ej)]    (4.6)

where ai, ei, di, wi and hi are respectively the arrival time, execution time, deadline, width and height of task Ti.

Figure 4.11: The global simulation model

As discussed earlier in Chapter 2, section 2.7.3 and illustrated in figures 2.18 and 2.19, this thesis considers a scalable and distributed multi-RTOS architecture shaped by the application requirements and the system constraints. As this work focuses on the management of the reconfigurable part of the system, the scheduler is devoted to that part, beside other schedulers managing the other PEs, under the supervision of a global OS. Hence, the scheduler responds (successfully or not) to the requests of the global scheduler. This assumes that tasks in the system may have numerous alternative implementations (software, hardware, etc.) with various costs, performance levels and QoS. In this thesis, the scheduler implements a non-preemptive scheduling policy that does not enable task migration.
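The non-overlap conditions of equations 4.5 and 4.6 translate directly into code. The following C++ sketch, built around an illustrative Placement record that is not taken from the thesis, returns true when a candidate placement conflicts with an already scheduled task.

    // A task instance that has been assigned a position and a start time.
    struct Placement {
        int    x, y;   // bottom-left corner (x_i, y_i)
        int    w, h;   // width and height
        double s, e;   // start time s_i and execution time e_i
    };

    // Equation 4.6: two placements are compatible when they are disjoint in x,
    // in y, or in time; overlap occurs only when none of the disjunctions holds.
    bool overlaps2D(const Placement& a, const Placement& b) {
        bool disjointX = (a.x + a.w <= b.x) || (a.x >= b.x + b.w);
        bool disjointY = (a.y + a.h <= b.y) || (a.y >= b.y + b.h);
        bool disjointT = (a.s + a.e <= b.s) || (a.s >= b.s + b.e);
        return !(disjointX || disjointY || disjointT);
    }

    // Equation 4.5: the 1D variant simply ignores the y dimension.
    bool overlaps1D(const Placement& a, const Placement& b) {
        bool disjointX = (a.x + a.w <= b.x) || (a.x >= b.x + b.w);
        bool disjointT = (a.s + a.e <= b.s) || (a.s >= b.s + b.e);
        return !(disjointX || disjointT);
    }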
4.3.4 Placer Model

As stated above, the placer responds to placement requests sent by the scheduler. As depicted in figure 4.11, the placer and the scheduler interact constantly, the former acting as the resources manager for the latter. The placer model is detailed in figure 4.12. On the one hand, as shown in the left branch of the figure, the placer partitions and manages the reconfigurable array through a data structure which keeps the state of the array. For this, it uses various splitting, merging and defragmentation strategies.

Figure 4.12: The placer model and its different functional parts

On the other hand, as depicted by the right branch of figure 4.12, the placer allocates and deallocates areas to tasks and updates the data structure accordingly. To do this, the placer searches for available areas. It then uses the different fitting strategies (1D, 2D, BF, FF, etc.) that were presented in Chapter 3, section 3.8.4. Figures 3.17, page 114 and 3.18, page 115 depict examples of fitting strategies. The data structure may be a simple list of areas (non-overlapping or overlapping), a binary tree (e.g. Bazargan et al., 2000) or a hash matrix (e.g. Walder et al., 2003).

The quest for a quality placement, especially in an online real-time context, may be to:

(i) find the best data structure that stores information on the free spaces available on the reconfigurable array and that:
- eases the search for places to fit new tasks,
- eases the update of the structure after adding or deleting a task;

(ii) find meaningful metrics that assess the placement quality and the fragmentation of the reconfigurable array and that are easy to calculate.

In addition, if possible, the placer may take future task parameter distributions into account (if known, e.g. width, height, etc.) in order to improve task placement. Fully achieving such a quest is rarely possible and requires a trade-off between the points mentioned above. For example, the hash matrix proposed by Walder et al. (2003) stores free areas and allows the placer to index them quickly, with constant time complexity. But such an easy search comes at the cost of a heavy matrix update process at each task placement and removal.
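As a concrete instance of such a data structure, the following C++ sketch keeps a simple list of non-overlapping free rectangles with a first-fit search and a guillotine split; it is a minimal illustration of the trade-off discussed above (cheap search and insertion, but no merging or defragmentation), not the placer implemented in the thesis.

    #include <optional>
    #include <vector>

    struct Rect { int x, y, w, h; };  // a free rectangular area

    class FreeRectList {
    public:
        FreeRectList(int W, int H) { free_.push_back({0, 0, W, H}); }

        // First-fit search: return the first free rectangle that can contain
        // a w x h task, splitting the remainder into (up to) two rectangles.
        std::optional<Rect> allocate(int w, int h) {
            for (std::size_t i = 0; i < free_.size(); ++i) {
                Rect r = free_[i];
                if (r.w >= w && r.h >= h) {
                    free_.erase(free_.begin() + i);
                    // Guillotine split of the leftover area (no merging).
                    if (r.w > w) free_.push_back({r.x + w, r.y, r.w - w, h});
                    if (r.h > h) free_.push_back({r.x, r.y + h, r.w, r.h - h});
                    return Rect{r.x, r.y, w, h};
                }
            }
            return std::nullopt;  // no free rectangle fits the task
        }

        // Releasing an area simply returns it to the list; a real placer
        // would merge adjacent rectangles here to fight fragmentation.
        void release(const Rect& r) { free_.push_back(r); }

    private:
        std::vector<Rect> free_;
    };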
4.4 Metrics

The metrics rate the performance of the scheduling and placement of an application consisting of k tasks. They are almost similar to scheduling metrics on a microprocessor, with some specificities due to the underlying placement.

4.4.1 Reconfigurable Hardware Resources Metrics

The total amount of resources available on a reconfigurable array corresponds to its area Afpga = W × H, where W and H are respectively the width and the height of the reconfigurable array. The total amount of resources provided by the device within a given time interval spanning from time t1 to time t2 is given by equation 4.7:

Afpga(Δt) = W × H × Δt    (4.7)

where Δt = t2 − t1.

4.4.2 Tasks Metrics

This thesis mainly considers the problem of scheduling a set of aperiodic real-time tasks on the reconfigurable array. Hardware task metrics are derived from periodic task metrics. They are defined as follows:

i). Time utilization factor (U(t)Ti)

The time utilization factor of a periodic task Ti is the ratio between the task execution time and its period, as shown in equation 4.8. It reflects the fraction of time spent by the task on the reconfigurable array within the period of the task:

U(t)Ti = ei / Pi    (4.8)

where ei and Pi are respectively the computation time and the period of Ti. In the case of an aperiodic task, the time utilization factor of Ti is defined as

U(t)Ti = ei / Di    (4.9)

where Di is the relative deadline of Ti. However, the time utilization factor does not inform on the amount of resources actually needed by the task.

ii). Absolute computational load (ResTi)

The absolute computational load of a hardware task Ti is defined as the total amount of resources required to complete an instance of the task. It is given by equation 4.10, where wi, hi, ei and Ai denote respectively the width, the height, the execution time and the area requirement of Ti:

ResTi = wi × hi × ei = Ai × ei    (4.10)

iii). Relative computational load (resTi)

The relative computational load of a task Ti is its computational load relative to its relative deadline Di for an aperiodic task (to its period or minimum inter-release time Pi for a periodic or a sporadic task, respectively). It reflects both the fraction of the task's active time actually spent on the reconfigurable array and the area requirement of the task. The relative computational load resTi corresponds to the system utilization (U(s)Ti) of a task Ti in monoprocessor scheduling, and is given by:

U(s)Ti = resTi = U(t)Ti × Ai = ResTi / Di = (ei / Di) × wi × hi   for aperiodic tasks
U(s)Ti = resTi = U(t)Ti × Ai = ResTi / Pi = (ei / Pi) × wi × hi   for periodic and sporadic tasks

4.4.3 Application Metrics

Let Apk or Γk = [T1, T2, ..., Tk; td] be an application, or a set of k tasks, that must run to completion before its absolute deadline td. Let us define:

i). Time utilization factor U(t)Γk

The time utilization factor of the complete set of aperiodic tasks Γk is expressed as follows:

U(t)Γk = Σ_{i=1..k} ei / Di    (4.11)

Obviously, the equation becomes U(t)Γk = Σ_{i=1..k} ei / Pi if the tasks are periodic.

ii). System utilization U(s)Γk

The system utilization of the complete set of k aperiodic tasks Γk takes into account the amount of resources used by the set, as expressed in equation 4.12:

U(s)Γk = Σ_{i=1..k} (1 / Di) × wi × hi × ei = Σ_{i=1..k} U(t)Ti × Ai    (4.12)

iii). Current time utilization factor

Let Γ(t) be the partial set of tasks of Γk that have already been released at the current time t and that respect the following condition:

Γ(t) = {Ti ∈ Γk : ai ≤ t < di}, where di = ai + Di    (4.13)

The current time utilization factor of the set at current time t is then

U(t)Γ(t) = Σ_{Ti ∈ Γ(t)} U(t)Ti = Σ_{Ti ∈ Γ(t)} ei / Di    (4.14)

The current time utilization factor only characterizes the task set; it gives no indication of the load of the system or of the number of processing resources. Therefore, multiprocessor systems define a system utilization factor U(s)Γ(t) at a given time t respecting condition 4.13, as follows:

U(s)Γ(t) = U(t)Γ(t) / m    (4.15)

where m is the number of microprocessors. This definition may be transposed to the case of an m-slot partitioned reconfigurable array where any slot may fit any task, as discussed in section 3.6.4 and depicted in figure 3.8, page 92.

iv). Absolute application (computational) load ResΓk

ResΓk expresses the total amount of resources required to complete Γk and is given by equation 4.16:

ResΓk = Σ_{i=1..k} wi × hi × ei = Σ_{i=1..k} Ai × ei = Σ_{i=1..k} ResTi    (4.16)

where wi, hi, ei, Ai and ResTi denote respectively the width, the height, the execution time, the area requirement and the total amount of resources of task Ti.
v). Relative application (computational) load (resΓk)

Also referred to as the relative amount of resources required to complete Γk on a given reconfigurable array of size Afpga = W × H, it is given by equation 4.17:

resΓk = (1 / (W × H × tD)) × Σ_{i=1..k} wi × hi × ei = ResΓk / (Afpga × tD)    (4.17)

where ResΓk = Σ_{i=1..k} ResTi is the above-mentioned absolute application load of the application Γk, and tD the relative deadline of Γk.

4.4.4 Scheduling Metrics

These metrics are related to the application and the processing resources. They are available after a scheduling run and reflect its quality. Some of them were described in section 3.3.4 as examples of objective functions that are very common in scheduling problems.

i). Utilization ratio of the reconfigurable array (Ufpga(%))

The average utilization ratio of a reconfigurable array (FPGA) of size Afpga = W × H, on which an application Γk = [T1, T2, ..., Tk; tD] has been scheduled, is given by equation 4.18:

Ufpga(%) = Σ_{i=1..k} wi × hi × ei / (W × H × mkΓk) = Σ_{i=1..k} ResTi / (W × H × mkΓk)    (4.18)

where mkΓk is the makespan.

ii). Tasks rejection ratio (RjΓk(%))

RjΓk(%) is the ratio of instances of tasks Ti ∈ Γk that have failed to be placed on the reconfigurable array:

RjΓk(%) = (njΓk / k) × 100%    (4.19)

where njΓk is the number of jobs rejected among the k jobs of the application Γk.

iii). Makespan (mkΓk)

The makespan spans from the release time of the first task of the application to the time the last task ends. It is also denoted as the length of the schedule and corresponds to the real duration of the application. The makespan is one of the most common objective functions: the smaller the makespan, the better the scheduling.

iv). Flow time fti, average flow time ftav and total flow time fttot

Also known as the response time and expressed in equation 4.20, the flow time of a job Ji is defined as the difference between its completion time fi and its arrival time ai:

fti = fi − ai    (4.20)

The total flow time fttot (resp. the average flow time ftav) is given by equation 4.21:

fttot = Σ_{i=1..k} fti   and   ftav = fttot / k    (4.21)

where fttot is the sum of the flow times of the k jobs in the job sequence Γk, and ftav is that sum averaged over k.

v). Waiting time (wti)

The waiting time spans from the release time ai of task Ti to the starting time si assigned by the scheduling algorithm:

wti = si − ai    (4.22)

The average waiting time of a sequence of na jobs, where na is the number of jobs accepted among the k jobs of Γk, is the total waiting time averaged over na: wtav = (1/na) × Σ_{i=1..na} wti.

vi). Rejection delay (Rdi)

This metric represents the time that flows from the release of a task Ti to its rejection. It is meaningful in online real-time scheduling, where more than one implementation alternative may be considered. Indeed, rejecting a reconfigurable hardware task too late may prevent the system from running it on a different computing resource (e.g. a sequential processor), which may be prejudicial. This metric is proposed as a quality metric that rewards scheduling strategies minimizing the rejection delay. For a rejected task Ti, the rejection delay is expressed as follows:

Rdi = trej − ai    (4.23)

where trej is the rejection time of task Ti, as given in equation 3.11, page 94, and ai its release time.
vii). Differential quality metric (URqm)

This metric is proposed as a new quality metric that rewards a good behaviour of the scheduling/placement algorithm both in terms of chip utilization ratio and in terms of task rejection ratio. The metric is especially meaningful in reconfigurable hardware scheduling. It is meant to ease the comparison between two algorithms that do not significantly differ either in chip utilization ratio or in task rejection ratio. The reason is that, no matter which area management strategy is used, the reconfigurable array is subject to a fragmentation problem that bounds its average utilization ratio. URqm is expressed by the following equation:

URqm = 2 × [α × Ufpga − (1 − α) × RjΓk],   α ∈ ]0, 1]    (4.24)

where Ufpga is the utilization ratio, RjΓk the task rejection ratio and α a weighting coefficient that reflects which of the two metrics is predominant. As the chip utilization ratio is to be maximized and the task rejection ratio to be minimized, the higher the difference between these two metrics, the better the scheduling. For the sake of simplicity, α = 0.5 in this thesis, meaning that the utilization ratio and the task rejection ratio are given the same importance. The metric URqm is meant to be positive; otherwise the scheduler behaves very poorly. For example, for α = 0.5 the equation above becomes:

URqm = Ufpga − RjΓk

However, even if Ufpga and RjΓk highly depend on the computational load of the application or task set to schedule, α = 0.5 is not really a fair value, because RjΓk ≈ 0% is more likely to be achieved than Ufpga ≈ 100%, owing to the fragmentation of the reconfigurable array. Choosing α ∈ ]1/2, 1] makes the utilization ratio predominant in URqm. A negative value of URqm indicates that the scheduler behaves poorly, as it achieves a high task rejection ratio combined with a low chip utilization ratio.

viii). Scheduling runtime overhead

The cumulative algorithm execution time, or cumulative scheduling algorithm runtime overhead, represents the total amount of time spent by the scheduling algorithm to schedule and place all the tasks of an application or task set. It is given by equation 4.25 below:

Cumuloverhead = (1/m) × Σ_{i=1..m} Σ_{j=1..n} RuntimeOverhead(i, j)    (4.25)

where m is the number of task sets, n the number of invocations of the scheduler while scheduling each task set, and RuntimeOverhead(i, j) the scheduling algorithm runtime overhead on the i-th task set during the j-th call of the scheduler. As the runtime overhead during a single call of the scheduler is quite small, the cumulative value is more meaningful. The value is averaged over the number of task sets in order to reflect the portion of time devoted to the scheduling algorithm along a task set. Let us recall that the placement algorithm may be called one or many times at each scheduler invocation.

4.4.5 Feasible Schedule

A task set Γk is feasibly scheduled on a reconfigurable array if the following condition is met:

∀Ti ∈ Γk : fi ≤ ai + Di

An application cannot be successfully scheduled on an FPGA if its relative computational load resΓk exceeds 1. Hence, a necessary but not sufficient condition for scheduling an application Γk = [T1, T2, ..., Tk; tD] on the reconfigurable device Afpga is that the total amount of resources needed by Γk (which is Σ_{i=1..k} wi × hi × ei) must not exceed the total amount of resources provided by Afpga during the relative deadline tD of Γk (which is Afpga(tD) = Afpga × tD). This is clearly expressed by the following condition, deduced from equation 4.17 above:

resΓk ≤ 1 ⇒ Σ_{i=1..k} wi × hi × ei ≤ W × H × tD

Because of fragmentation, the relative application load of a schedulable application is in general much closer to 0.5 than to 1. Let Γk be a set of k aperiodically-arriving real-time tasks; the tasks in Γk can be assumed tD-periodic, tD being the absolute deadline of Γk. This assumption relies on the fact that, within the time interval spanning from 0 to tD, each task is supposedly released once.
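Most of the metrics above reduce to simple arithmetic over the task set. The hedged C++ sketch below, reusing the hypothetical HardwareTask structure of section 4.3.1, evaluates the task loads of equations 4.9 and 4.10, the application loads of equations 4.16 and 4.17, the quality metric of equation 4.24 and the necessary feasibility condition of this section.

    #include <cassert>
    #include <vector>

    // Time utilization factor of an aperiodic task (eq. 4.9): e_i / D_i.
    double timeUtilization(const HardwareTask& t) { return t.e / t.D; }

    // Absolute computational load of a task (eq. 4.10): w_i * h_i * e_i.
    double absoluteLoad(const HardwareTask& t) { return t.w * t.h * t.e; }

    // Absolute application load (eq. 4.16): sum of the per-task loads.
    double applicationLoad(const std::vector<HardwareTask>& app) {
        double res = 0.0;
        for (const auto& t : app) res += absoluteLoad(t);
        return res;
    }

    // Relative application load (eq. 4.17): the absolute load normalized by
    // the resources a W x H device offers during the relative deadline tD.
    double relativeLoad(const std::vector<HardwareTask>& app,
                        int W, int H, double tD) {
        return applicationLoad(app) / (static_cast<double>(W) * H * tD);
    }

    // Differential quality metric (eq. 4.24); uFpga and rejRatio are given
    // as fractions in [0, 1], and alpha weights their relative importance.
    double urqm(double uFpga, double rejRatio, double alpha = 0.5) {
        assert(alpha > 0.0 && alpha <= 1.0);
        return 2.0 * (alpha * uFpga - (1.0 - alpha) * rejRatio);
    }

    // Necessary (but not sufficient) feasibility condition (section 4.4.5):
    // the relative application load must not exceed 1.
    bool mayBeFeasible(const std::vector<HardwareTask>& app,
                       int W, int H, double tD) {
        return relativeLoad(app, W, H, tD) <= 1.0;
    }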
4.5 Global Simulation Model and Compatibility with the OVeRSoC Design Methodology

A functional representation of the scheduling and placement simulation model is presented in figure 4.11. This simulation model encompasses the basic components of a reconfigurable system. It may be used at two stages:
- at runtime, meaning that the simulation model may be used to assess scheduling and placement algorithms through the aforementioned metrics;
- at design time, where the model may be introduced into a design methodology in order to refine the hardware/software partitioning.

To achieve the latter purpose, the global simulation model is C++ based, so as to ensure full compatibility with both the OVeRSoC design methodology discussed in Miramond et al. (2009a) and its enabling environment presented in Miramond et al. (2009b).

4.5.1 A UML Overview of the Global Simulation Model

The global simulation model of hardware task scheduling on the reconfigurable part of an RSoC was depicted earlier in figure 4.11, page 156. A UML representation of this simulation model is proposed in figure 4.13. The basic system consists of the scheduler, the placer and the set of tasks, and instantiates each of these elements. The basic system first creates the application as a list of n tasks. It then creates the scheduler and the placer. It acts as the simulation engine and generates the time basis for the whole system. Tasks are then released into the system through a FIFO list denoted arriving_tasks_fifo, according to their release times. Tasks are hosted in the list at their release time ai and popped out before time ai + 1. As the other elements of the system thus learn about tasks piece by piece, this corresponds to the online clairvoyant paradigm. The finishing_tasks_fifo achieves the same functionality by hosting finishing tasks at their completion time fi; the tasks in the finishing_tasks_fifo are popped out before time fi + 1.

The basic system invokes the scheduler every time at least one task pops into the arriving_tasks_fifo or the finishing_tasks_fifo. As this thesis does not consider task preemption, these two events are the only ones that trigger a rescheduling. Once the three main entities are built, the simulation is run to completion. The basic system assesses the various scheduling metrics, either directly or through the scheduler and the placer. The scheduler manages the tasks through as many task lists (tasks_queue) as necessary, depending on its policy. A hardware task (task_mod) may feature several variants of different sizes with the corresponding computation times. The placer generates the different data structures needed to reflect the state of the reconfigurable array. Hence, it may instantiate an FPGA_matrix (as an array of cells) and a list of free rectangles. Finally, the scheduler measures the metrics related to tasks (e.g. the task rejection ratio) while the placer evaluates those related to the reconfigurable array (e.g. chip utilization ratio, fragmentation, etc.).
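The behaviour of the basic system can be summarized by a small event-driven loop. The sketch below is a simplified, hypothetical rendering of that engine (the FIFO names mirror the UML diagram, while the actual thesis model is richer and SystemC-compatible); it reuses the HardwareTask structure introduced earlier.

    #include <algorithm>
    #include <deque>
    #include <vector>

    // Minimal sketch of the simulation engine of figure 4.13. Tasks enter
    // arriving_tasks_fifo at their release time; completions are recorded in
    // finishing_tasks_fifo; the scheduler runs only when one of the two FIFOs
    // is non-empty, the only rescheduling events in this non-preemptive model.
    void runSimulation(std::vector<HardwareTask> app, double horizon) {
        std::sort(app.begin(), app.end(),
                  [](const HardwareTask& a, const HardwareTask& b) { return a.a < b.a; });
        std::deque<HardwareTask> arrivingTasksFifo;
        std::deque<double>       finishingTasksFifo;  // completion times f_i
        std::size_t next = 0;
        for (double t = 0.0; t <= horizon; t += 1.0) {
            while (next < app.size() && app[next].a <= t)
                arrivingTasksFifo.push_back(app[next++]);      // hosted at a_i
            bool completion = false;
            while (!finishingTasksFifo.empty() && finishingTasksFifo.front() <= t) {
                finishingTasksFifo.pop_front();                // popped before f_i + 1
                completion = true;
            }
            if (!arrivingTasksFifo.empty() || completion) {
                // Invoke the scheduler/placer here (omitted in this sketch):
                // it consumes arrivingTasksFifo and pushes the finish times of
                // placed tasks into finishingTasksFifo, kept sorted.
                arrivingTasksFifo.clear();
            }
        }
    }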
Figure 4.13: A UML overview of the global simulation model of the DPRHW-OS for a reconfigurable platform

4.5.2 The Importance of Using a C++ Based Simulation Model

As stated in the first two chapters of the thesis, the primary objective of this work was to propose scheduling and placement strategies suited to the online real-time scheduling of hardware tasks on dynamically reconfigurable hardware devices. However, this work can also feed any Reconfigurable SoC design methodology with the information, models and metrics learned while pursuing that primary objective. This second aspect of the work comes within the scope of the classical design philosophy that aims to overcome as many design challenges as possible at compile time rather than at runtime.

An RSoC design methodology denoted as the OVeRSoC methodology was briefly introduced in Chapter 2, section 2.7.3, where it was described as OS-centric or OS-based; it is depicted in figure 2.19, page 60. The design flow is described in detail in Miramond et al. (2009a) and uses the DOGME tool (Miramond et al., 2009b), its dedicated front end for designers. The methodology relies on a system-level simulation that is performed prior to and alongside the design steps, and that is fed by the application and the system constraints.

The second aspect of this thesis was thus to ensure full compatibility with the OVeRSoC design methodology and its tools. An object-oriented approach relying on the C++ programming language was therefore adopted. All the components of the UML diagram in figure 4.13 are C++ objects, fully compatible with the SystemC language, OVeRSoC being a SystemC-based methodology. The main advantage of a SystemC-based approach is its ability to adapt to almost all abstraction levels (from system level down to implementation level) while providing hardware/software co-simulation and verification. This allows the designer to rely on the same models to validate different parts of the system throughout the design process.

In order to deal with scheduling and placement issues, this work relied on the model pictured in figure 2.18, page 59. In this model, an application is to be mapped onto a multiprocessor platform. The system consists of resource consumers (tasks or applications) and processing elements (processors, memories, etc.), separated by an intermediate OS layer that acts as a resources manager. The OS layer hides the details of the platform from the application layer and assigns the resources to the different tasks of the application in such a way that the application runs to completion and achieves its goals. The compatibility between any design methodology and the C++ based UML model proposed in this thesis may allow a designer to tune, through simulations, both the architecture and the RTOS that suits its management. As illustrated by the UML diagram in figure 4.13, various models of the reconfigurable array, accurate down to the cell level, are provided. These models do not express any communication, but may easily express the neighbourhood between cells and therefore rectangular-shaped modules. The refinement of the array model depends on the placement strategy used. The model may dynamically map the DRA 1 of figure 2.19.
The same goes for the basic system of figure 4.13, which features the scheduler, the placer and the hardware tasks, and which may map the DPRHW-OS 2 devoted to the DRA part.

1 Dynamically Reconfigurable Architecture, also denoted as DPRHW in this thesis.
2 Dynamically and Partially Reconfigurable Hardware Device - Operating System.

4.6 Conclusion of the Chapter

This chapter started with the proposed methodology, which results from the literature review presented in Chapter 3. The methodology first consisted in making accurate measurements on some existing scheduling and placement algorithms. In doing so, a framework for finding the algorithms suitable for scheduling online real-time hardware tasks on DPRHWs was established. The chapter also defined models of real-time applications, which consist, in this thesis, of sets of aperiodic real-time tasks. The scheduler and placer models were also presented. Afterwards, different scheduling metrics were defined, and some of them were customized to make them meaningful for DPRHW scheduling. The end of the chapter presented the global simulation model and its UML representation. The compatibility of this global simulation model with C++ and SystemC based methodologies for RSoC design was emphasized, the OVeRSoC methodology being taken as an example. The next chapter proposes scheduling and placement algorithms that result from the above methodology and that suit the online real-time scheduling of hardware tasks on dynamically and partially reconfigurable hardware devices (DPRHWs).

Chapter 5

Proposed Algorithms for Online Real-Time Scheduling & Placement

5.1 Introduction

This chapter presents and discusses different scheduling and placement algorithms that rely on the related work discussed in Chapter 3 and on the models detailed in the previous chapter. The scheduling algorithms are presented according to their two main families: looking-ahead scheduling and without-looking-ahead scheduling.

In the without-looking-ahead scheduling approach, the algorithms are essentially priority-driven, the priority of each task being based on its geometric and temporal parameters. Hence, in addition to various algorithms based on temporal parameters, such as EDF, LLF, etc., other priority-driven algorithms are proposed that are based either on the geometric parameters of hardware tasks alone, or on both their geometric and temporal parameters.

As online real-time applications are targeted, the scheduling algorithms are combined with appropriate placement strategies in order to provide low runtime overheads. Hence, multi-shape scheduling algorithms that improve task placement opportunities without significantly increasing the runtime overheads of the algorithms are also proposed. The idea behind the multi-shape approach is to provide, at design time, more than one task parameter combination for each task in the system. The multi-shape scheduling algorithms are also priority-driven, the main policy being to give the highest priority to the normal version of the task.

In the looking-ahead scheduling approach, as runtime overheads tend to be higher, the approach is combined with low-complexity placement strategies, as justified in the previous chapter. Examples of such placement strategies are 1D placement, partitioned placement and multi-shape task placement.
5.2 Tasks Parameters Based Global Scheduling

The tasks parameters-based scheduling algorithms are priority-driven scheduling algorithms. A priority-driven scheduler assigns a priority to each job (or task) in the application; the priority is based on the parameters of the task. The scheduler keeps a list of ready tasks sorted by their priorities, and the available processing resources are allocated to the highest-priority tasks. In priority-driven real-time scheduling for microprocessors, a task's priority is usually based on timing constraints: it is calculated using the temporal parameters of the tasks (e.g. the deadline for EDF, the laxity for LLF, etc.). However, in the case of a hardware task, which features both temporal and geometric parameters, there are more opportunities for combining these parameters in order to derive further priority-driven algorithms. Hence, this section presents other priority-driven algorithms that are based either on the geometric parameters of tasks, or on a combination of their geometric and temporal parameters. In this thesis, these scheduling schemes are denoted as parameters-based scheduling.

The priority ℘i of each task Ti is calculated using its parameters; the different parameters-based scheduling algorithms differ from one another in the way ℘i is calculated. At each scheduling time tsch, the algorithm requests the placer to place the task with the highest priority, i.e. the task heading the waiting (ready) queue. If the placer fails to place it, the algorithm tries to place the next task in the list, and so on, as long as the reconfigurable array is not full. Obviously, this may lead to a situation where a lower-priority task is running on the reconfigurable array while a higher-priority task is waiting. Along the way, the tasks that can no longer respect their deadline constraints are removed from the waiting list. The scheduling algorithm is said to be work-conserving, as no area of the array is kept idle if it could fit a ready task, no matter what that task's priority is.

A generic pseudo code of the parameters-based scheduling algorithm is shown in table 5.1. It gives no details of the placement strategy used: the scheduling algorithm may be combined with placement strategies of various complexities in order to achieve a given overall complexity. As tasks are released, they are inserted in a waiting (ready) list W and sorted according to their priority ℘i, in such a way that:

W = [T1, T2, T3, ..., Tn−1, Tn] ⇒ ℘1 > ℘2 > ℘3 > ... > ℘n−1 > ℘n    (5.1)

for the n tasks currently in the list.

A pseudo code of the parameters-based algorithm

Initialization:
    tsch ← CurrentTime; F ← FinishingTasks(R, tsch), F ⊆ R;
    A ← ArrivingTasks(tsch); Afree ← free areas; P ← 0; flag ← 0;
    W ← WaitingTasks(tsch), sorted according to the considered parameter ℘;
    R ← RunningTasks(tsch), sorted by increasing finishing time.

Schedule_Param_Based(W, R, Afree):
 1.  for all Ti ∈ F do                    % dealing with finishing tasks,
 2.      F ← F \ Ti                       % if there are any at time tsch
 3.      update(R, Afree)
 4.      flag ← 1
 5.  end for all Ti ∈ F
 6.  if (flag)                            % if any task has ended at time tsch
 7.      for all Ti ∈ A                   % updating the waiting list W;
 8.          W ← W ∪ Ti                   % W is kept sorted ℘-wise
 9.      end for all Ti ∈ A
10.      for all Ti ∈ W, Afree ≠ 0        % attempt to place waiting tasks
11.          if (P = place(Ti, Afree))    % if placement is successful
12.              R ← R ∪ Ti               % updating the running list R
13.              W ← W \ Ti               % updating the waiting list W
14.          end if
15.      end for all Ti ∈ W
16.  else                                 % if no task has finished at tsch
17.      A ← sort(A, ℘)
18.      for all Ti ∈ A
19.          if (P = place(Ti, Afree))    % attempt to place just-arrived tasks
20.              R ← R ∪ Ti               % if successful, update R
21.          else
22.              W ← W ∪ Ti               % otherwise, add to W if worthwhile
23.      end for all Ti ∈ A
24.  end if (flag)

Table 5.1: A pseudo code of the tasks parameters-based scheduling algorithm
As this is a without-looking-ahead scheduling approach, a task that fails to be placed is kept in the waiting queue W for further attempts, as long as it can still meet its deadline. Hence, any rejection occurs at a time trej, as expressed earlier in equation 3.11, page 94. The rejected task is removed from the waiting list during its update (table 5.1, line 7). The tasks in the running list (R) are sorted according to their increasing finishing times.

In a uniprocessor system, the microprocessor is assigned to the task with the highest priority, which is T1 in the example above. However, as scheduling hardware tasks on a reconfigurable hardware device is much more similar to multiprocessor scheduling, several tasks may concurrently run on the device. Hence, the reconfigurable array is allocated to the m tasks that it may accommodate concurrently. A task of lower priority may be running while another task of higher priority is waiting because there is not enough contiguous free space to place it.

The scheduler is invoked each time tsch a new task is released or a running task completes. In the latter case, the algorithm takes into account the area(s) just freed by the terminated task(s). Consequently, it updates the running task list R and the list of free areas Afree accordingly (lines 1 to 5). The completion of one or several tasks (marked by a task completion flag set to 1) frees new areas on the array. Hence, if there are newly freed areas, the newly released tasks are first inserted into the waiting list (lines 6 to 9) with respect to their priority. The scheduler then tries to place as many tasks as possible from the waiting list (lines 10 to 15). However, if no task completion has occurred at scheduling time tsch (marked by a task completion flag remaining at 0), no new areas have been freed on the array. Therefore, as the array has not changed and cannot fit any task from the waiting list, the scheduler attempts to place only the newly arrived tasks (lines 16 to 23). In case of failure, the tasks are inserted into the waiting list for a further attempt (line 22).

As the only difference among these priority-driven scheduling algorithms is the method for calculating the task priorities, the following sections present some of these algorithms with the corresponding priority formulas.

5.2.1 Temporal parameters based scheduling (Basic, EDF, LLF, etc.)

1. Basic scheduling

In the basic scheduling algorithm, the scheduler maintains two lists of tasks: the waiting list (W) and the running list (R). In the latter, tasks are sorted according to increasing completion times, while the waiting list is sorted according to increasing arrival times, or equivalently decreasing priority. The priority of each task Ti is given by ℘i = 1/ai, ai being the release time of Ti. Therefore, the n ready tasks currently in the waiting list W are sorted on a first-come-first-served basis. Equation 5.1 above becomes:

W = [T1, T2, T3, ..., Tn−1, Tn] ⇒ a1 < a2 < a3 < ... < an−1 < an

where ai is the release time of task Ti.
The equation suggests that, as tasks are released, the basic scheduling algorithm tends to start them as soon as possible. Consequently, the waiting time of tasks is minimized. The pseudo code of the basic scheduling algorithm is detailed in table 5.2. The task lists W and R are maintained sorted as previously described. Two events invoke the scheduler: task termination and task release. When one or more tasks terminate at a given time tsch (without any task being released), one or more areas are consequently freed (lines 1 to 5). Tasks in the waiting list are then prioritized. The scheduler, through the placer, attempts to place them, beginning with the task heading the list. The whole list is scanned and all the tasks that may fit on the array are placed, as long as the array is not full (lines 6 to 13). Every time tsch one or more tasks are released, if no task finishes at tsch, the scheduler directly attempts to place the newly arrived task(s) on the array. In case of failure, the task(s) are added at the rear of the waiting list. When both events arise simultaneously (task release and task completion), the newly released task(s) are first inserted into the waiting list before any placement attempt. The simulation results of this simple scheduling scheme are presented in Chapter 6, section 6.3.1 and compared with those of other scheduling algorithms.

A pseudo code of the basic scheduling algorithm

Initialization:
    tsch ← CurrentTime; F ← FinishingTasks(R, tsch), F ⊆ R;
    A ← ArrivingTasks(tsch); Afree ← free areas; P ← 0; flag ← 0;
    W ← WaitingTasks(tsch), sorted on a first-come-first-served basis;
    R ← RunningTasks(tsch), sorted by increasing finishing time.

Schedule_Basic(W, R, Afree):
 1.  for all Ti ∈ F do                    % dealing with finishing tasks
 2.      F ← F \ Ti
 3.      update(R, Afree)
 4.      flag ← 1
 5.  end for all Ti ∈ F
 6.  if (flag)                            % at least one task has ended at time tsch
 7.      for all Ti ∈ W, Afree ≠ 0        % dealing with waiting tasks first
 8.          if (P = place(Ti, Afree))    % if placement is successful
 9.              R ← R ∪ Ti               % R is updated and kept sorted
10.              W ← W \ Ti
11.          end if
12.      end for all Ti ∈ W
13.  end if (flag)
14.  for all Ti ∈ A, Afree ≠ 0            % dealing with arriving tasks
15.      if (P = place(Ti, Afree))
16.          R ← R ∪ Ti
17.      else W ← W ∪ Ti                  % W is updated and kept sorted ℘-wise
18.      end if
19.  end for all Ti ∈ A

Table 5.2: A pseudo code of the basic scheduling algorithm

2. Earliest Deadline First (EDF)

As formerly defined in section 3.5.3, page 81, EDF scheduling assigns the highest priority to the task with the closest absolute deadline. Therefore, the n tasks currently in the waiting list W are sorted according to their absolute deadlines. Equation 5.1 becomes:

W = [T1, T2, T3, ..., Tn−1, Tn] ⇒ D1 < D2 < D3 < ... < Dn−1 < Dn

where T1 is the heading task and Di the absolute deadline of task Ti. As discussed earlier in section 3.7.1, two variants of global EDF scheduling for reconfigurable hardware devices (denoted EDF-First-k-Fit and EDF-Next-Fit respectively), along with their schedulability analysis, were proposed by Danne (2006). In this thesis, the global EDF scheduling is similar to the aforementioned EDF-Next-Fit, but it is applied to online, aperiodic and non-preemptive tasks, and it uses a 2D placement strategy. The simulation results are shown and discussed in Chapter 6, section 6.3.1.
3. Least Laxity First (LLF)

LLF scheduling is quite similar to EDF scheduling. It assigns the highest priority to the task with the smallest laxity. Therefore, the n tasks that are currently ready are sorted in the waiting list W according to their laxity, and equation 5.1 becomes:

W = [T1, T2, T3, ..., Tn−1, Tn] ⇒ l1 < l2 < l3 < ... < ln−1 < ln

where T1 is the heading task and li the laxity of task Ti. LLF tends to prioritize the tasks that are closest to missing their deadline. The simulation results of LLF scheduling are shown and discussed in Chapter 6, section 6.3.1.

5.2.2 Geometric parameters based scheduling (BSF, SSF, etc.)

1. Biggest Size First (BSF)

In the BSF scheduling policy, the bigger the size of a task, the higher its priority. The priority of each task Ti is given by:

℘i = wi × hi

In the case of equal-size tasks, the task with the higher aspect ratio is assigned the higher priority. Hence, let Ti and Tj be two tasks:

wi × hi = wj × hj ⇒ ℘i > ℘j if hi/wi ≥ hj/wj, and ℘i < ℘j if hi/wi < hj/wj    (5.2)

2. Smallest Size First (SSF)

In opposition to BSF, SSF gives priority to smaller tasks. The priority of each task Ti is then given by:

℘i = 1 / (wi × hi)

In the case of two tasks Ti and Tj of equal size, the task with the higher aspect ratio is assigned the higher priority, as expressed in equation 5.2 above. Intuitively, BSF should provide a better reconfigurable array utilization ratio than SSF, as it places bigger tasks first. Contrary to BSF, SSF increases the array fragmentation by placing smaller tasks on available areas that could fit bigger tasks. The simulation results of BSF and SSF scheduling are shown and discussed in Chapter 6, section 6.3.1.

5.2.3 Combining Geometric and Temporal Parameters for Scheduling

In this scheduling policy, the priority of each task is calculated using both the geometric and the temporal parameters of the task. Classified Stuffing (Chen and Hsiung, 2005) is one example of such a scheduling algorithm. In Classified Stuffing (CS), a task is placed on either the leftmost or the rightmost side of the reconfigurable array depending on its space utilization rate, the latter being the ratio between the width and the execution time of the task. CS is an improvement of the normal 1D stuffing algorithm (Steiger et al., 2004). Both algorithms use a 1D placement, which means that the height of the task is meaningless and is not taken into account in the priority assignment. Section 3.7.2 in Chapter 3 provides more details on the CS algorithm and on the normal 1D stuffing algorithm. The coming paragraph presents the computational load based scheduling algorithm, which combines geometric and temporal parameters and uses a 2D placement.

Computational load based scheduling

This algorithm is based on the absolute computational load of the task. The priority of each task is given by:

℘i = wi × hi × ei

which corresponds to the total amount of resources needed to complete the task. Tasks are sorted according to decreasing computational loads in a ready task list W, as shown in the following expression:

W = [T1, T2, T3, ..., Tn−1, Tn] ⇒ w1 × h1 × e1 > w2 × h2 × e2 > ... > wn × hn × en

Hence, the algorithm schedules resource-greedy tasks first. As in the other priority-driven scheduling policies presented above, when several tasks share the same priority, the task with the highest aspect ratio is assigned the highest priority. This algorithm is likely to achieve a higher reconfigurable array utilization ratio, as the tasks that use more resources are placed first.
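Since the policies of sections 5.2.1 to 5.2.3 differ only in how the priority ℘i is computed, they can share a single scheduler skeleton parameterized by a priority function. The following C++ sketch illustrates this with hypothetical helpers over the HardwareTask structure of Chapter 4; the laxity formula assumes the task has not started yet at time t.

    #include <functional>

    // Each policy maps a task (at current time t) to a priority; a larger
    // value means a higher priority, so the waiting list W is kept sorted
    // by decreasing priority, as in equation 5.1.
    using Priority = std::function<double(const HardwareTask&, double /*t*/)>;

    // Basic: first come, first served (priority 1/a_i).
    Priority basic = [](const HardwareTask& T, double) { return 1.0 / T.a; };

    // EDF: closest absolute deadline d_i = a_i + D_i first.
    Priority edf = [](const HardwareTask& T, double) { return 1.0 / (T.a + T.D); };

    // LLF: smallest laxity first; l_i = d_i - t - e_i for a non-started task.
    Priority llf = [](const HardwareTask& T, double t) {
        return -((T.a + T.D) - t - T.e);
    };

    // BSF / SSF: biggest or smallest area first (section 5.2.2).
    Priority bsf = [](const HardwareTask& T, double) { return double(T.w * T.h); };
    Priority ssf = [](const HardwareTask& T, double) { return 1.0 / (T.w * T.h); };

    // Computational load based (section 5.2.3): w_i * h_i * e_i.
    Priority load = [](const HardwareTask& T, double) { return T.w * T.h * T.e; };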
5.3 Slots-based Scheduling

Slots-based scheduling relies on the partitioned scheduling presented in section 3.6.4. The reconfigurable array is partitioned into slots, as shown in Chapter 3, figure 3.8. The number of slots and the size of each slot may be either predetermined or dynamically changed depending on the size of the arriving tasks. This results in the several slots-based scheduling algorithms presented below.

5.3.1 n × 1D variable size slots scheduling

In n × 1D variable size slots scheduling, the reconfigurable array is 1D-partitioned, n being the maximum number of partitions. The size of the partitions depends on the size of the arriving tasks. Hence, the size of the first partition is fixed by the size of the first task. A simple 1D placer is used in each slot. The placer keeps a list of partitions sorted according to a given criterion. For example, in load-balanced 1D variable slots scheduling, the slots are sorted according to increasing loads, and any new job is scheduled on the least loaded slot. This thesis refers to this scheduling as 1D variable size slots scheduling.

Figure 5.1 depicts three variants of 1D variable size slots scheduling. This scheduling approach relies on the simplicity of the 1D placement used in each slot. The resulting internal fragmentation is drastically reduced compared to a traditional (non-partitioned) 1D placement. Let Γ6 be a set of six tasks to be scheduled online on the reconfigurable array using 1D variable size slots scheduling, as mapped in figure 5.1. The tasks in Γ6 are sorted according to their increasing release times ai, as expressed below, and are placed on the reconfigurable array as they are released.

Figure 5.1: 1D-like partitioned scheduling

Γ6 = [T1, T2, T3, T4, T5, T6] ⇒ a1 < a2 < a3 < a4 < a5 < a6

The size of the reconfigurable array is Afpga = W × H, where W and H are respectively the width and the height of the device. T1 is the first task released and is placed at the bottom left of the reconfigurable array. A first slot denoted W1 is created, whose width is equal to the width of T1. At this stage, there are two slots on the device: W1 and W2 = W − W1, not detailed in the figure. From there, different fitting strategies may be used: First Fit, Next Fit and Best Fit.

1. The First Fit (FF) strategy places the next task in the first available slot that can fit it. As shown on the left of figure 5.1, when tasks T2 and then T3 arrive, they are placed in the first available slot, W1. T4 is then released and cannot fit in W1. Therefore, a second slot W2 is created, whose width is equal to the width of T4. There are now three slots on the reconfigurable array: W1, W2 and W3 = W − W1 − W2. The next task T5 is released and placed in the first slot that can fit it, which is slot W1. Task T6 arrives last and is placed in the first slot that can accommodate it, slot W2. The third slot (hatched) remains free.

2. The Next Fit (NF) strategy places the current task in the next available area that may fit it, as mapped in the middle of figure 5.1. Next Fit schedules tasks T1, T2 and then T3 similarly to FF described above: the first three tasks are placed in the first slot W1, as the latter is wide enough to fit them. When T4 arrives, the scheduler checks the next available area that can accommodate the task.
As the first slot cannot accommodate it, a slot of the same width as T4 is generated and T4 is placed there. At this stage, there are three slots on the array and the fitting solutions of FF and NF are identical. Contrary to the FF strategy, when T5 arrives, NF assigns it the second slot W2 instead of the first one. Indeed, NF checks the area that directly follows the area where the last task (here T4) was placed. T6 is then released and placed in the next available area, which lies in the third slot.

3. The Best Fit (BF) strategy always selects the area that fits best and that reduces internal fragmentation. The algorithm selects, among all possible fitting areas, the area whose size is closest to the task size. This chosen area may be either in an existing slot or in a newly generated slot whose width is equal to the width of the task. An example of a BF placement is pictured on the far right of figure 5.1. T1 and T2 are placed similarly to the FF and NF placements described above. However, contrary to FF and NF, BF places T3 in a newly generated slot W2 that fits it best and that minimizes the internal fragmentation. Indeed, the FF and NF placements, depicted respectively on the far left and in the centre of figure 5.1, fit T3 and T5 in the first slot, inducing a loss of space.

After placing the six tasks using each of the three fitting strategies, there are two types of remaining areas:
- the remaining free areas (hatched areas), which are available and can accommodate new tasks;
- the internally fragmented areas (white areas), which are wasted as long as the neighbouring tasks that brought them about have not completed (e.g. the areas next to T3, T4 or T5 in the FF placement are lost as long as these tasks do not occupy the entire width of the slot accommodating them).

Figure 5.1 shows that the BF fitting strategy produces more free areas and less internal fragmentation than FF and NF. The remaining available free area represents 17% for the FF strategy, 28% for the NF strategy and 36% for the BF strategy. The latter strategy increases the reconfigurable device utilization ratio accordingly. A pseudo code of the 1D variable slots scheduling algorithm is presented in table 5.3 below. Lines 1 to 5 deal with task termination. If any has occurred, lines 6 to 17 merge empty slots, update the slots and attempt to place the waiting tasks, if there are any. Finally, lines 18 to 24 deal with arriving tasks. The placements (lines 11 and 19) are done using one of the fitting strategies above.

A pseudo code of the 1D variable slots scheduling algorithm

Initialization:
    tsch ← CurrentTime; F ← FinishingTasks(R, tsch), F ⊆ R;
    A ← ArrivingTasks(tsch); Afree ← free areas; P ← 0; flag ← 0;
    W ← WaitingTasks(tsch), sorted on a first-come-first-served basis;
    R ← RunningTasks(tsch), sorted by increasing finishing time.

Schedule_1D_Var_Slots_Based(W, R, Afree):
 1.  for all Ti ∈ F do                    % dealing with finishing tasks
 2.      F ← F \ Ti
 3.      update(Ti, Afree → sloti)
 4.      flag ← 1
 5.  end for all Ti ∈ F
 6.  if (flag)                            % if at least one task has ended at tsch
 7.      for all sloti ∈ Afree do         % merging empty slots, if any exist
 8.          if (sloti → empty) merge_slots_if_worthy()
 9.      end for all sloti ∈ Afree
10.      for all Ti ∈ W, Afree ≠ 0        % dealing with waiting tasks first
11.          if (P = place(Ti, Afree))    % if placement is successful
12.              R ← R ∪ Ti               % R is updated and kept sorted
13.              W ← W \ Ti
14.              update(Ti, Afree → sloti)   % updating the slot
15.          end if
16.      end for all Ti ∈ W
17.  end if (flag)
18.  for all Ti ∈ A, Afree ≠ 0            % dealing with arriving tasks
19.      if (P = place(Ti, Afree))
20.          R ← R ∪ Ti
21.          update(Ti, Afree → sloti)
22.      else W ← W ∪ Ti                  % W is updated and kept sorted ℘-wise
23.      end if
24.  end for all Ti ∈ A

Table 5.3: A pseudo code of the 1D variable slots scheduling algorithm
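A compact way to render the slot management of table 5.3 in code is sketched below in C++; this hedged illustration implements only the first-fit variant over variable-width slots, ignores slot merging, and omits the time dimension (tasks completing and freeing slot space).

    #include <optional>
    #include <vector>

    // A vertical slot of the array: a fixed width and a fill level (the 1D
    // placement stacks tasks in a slot, so only the occupied height matters).
    struct Slot { int width; int usedHeight; };

    class VariableSlots1D {
    public:
        VariableSlots1D(int W, int H) : W_(W), H_(H) {}

        // First fit: place a w x h task in the first slot that is wide and
        // empty enough; otherwise open a new slot of width w if the free
        // strip of the array allows it. Returns the chosen slot index.
        std::optional<int> placeFirstFit(int w, int h) {
            for (std::size_t i = 0; i < slots_.size(); ++i)
                if (slots_[i].width >= w && slots_[i].usedHeight + h <= H_) {
                    slots_[i].usedHeight += h;
                    return static_cast<int>(i);
                }
            if (freeWidth() >= w) {
                slots_.push_back({w, h});
                return static_cast<int>(slots_.size() - 1);
            }
            return std::nullopt;   // task must wait (or be rejected later)
        }

    private:
        int freeWidth() const {
            int used = 0;
            for (const auto& s : slots_) used += s.width;
            return W_ - used;
        }
        int W_, H_;
        std::vector<Slot> slots_;
    };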
In the coming section, the n × 1D variable size slots scheduling is combined with the looking-ahead scheduling approach.

5.3.2 1D variable slots looking-ahead scheduling

Chapter 3 presented the looking-ahead scheduling along with its advantages and drawbacks. Its main advantage is the ability for the scheduler to know, as soon as a task arrives, whether it may fit on the reconfigurable array at the current time or later on. However, this rapid decision comes at the cost of the numerous placement and area management operations required to mimic future states of the reconfigurable array. This cost may be even higher if optimal area management strategies (e.g. MERs-based) are used. The methodology presented in Chapter 4 first made accurate measurements of the runtime overheads of MERs-based area management on a real embedded processor. Based on these measurements, different trade-offs between the scheduling scheme and the underlying placement strategy were suggested. Hereinafter is an algorithm that results from the combination of a looking-ahead scheduling algorithm and a 1D slots-based placement strategy. Other possible combinations are not discussed.

1D variable size slots horizon (1D-VSSH)

1D variable size slots horizon (1D-VSSH) scheduling combines the advantages of looking-ahead scheduling (especially horizon scheduling, Steiger et al., 2004) with the simplicity of the 1D variable size slots scheduling/placement presented in the former section. The 1D horizon scheduling (Steiger et al., 2004) is used in each slot. Following the same principle, a stuffing scheduling (Steiger et al., 2004) or an improved stuffing (e.g. Chen and Hsiung, 2005) may also be used. Figure 5.2 illustrates a case where a set of six tasks Γ6 = [T1, T2, T3, T4, T5, T6] is scheduled on a 10 × 6 reconfigurable array using 1D variable size slots horizon scheduling. The parameters of the tasks are detailed in table 5.4. Each task is subject to a deadline constraint. The number and the size of the slots are adjusted at runtime depending on the size of the released tasks.

Table 5.4: Task parameters for 1D variable size slots looking-ahead scheduling (tD = max(d) = 18)

    Task                       T1   T2   T3   T4   T5   T6
    a : arrival time            1    1    3    3    5    8
    e : execution time          8    7   10    6    5    4
    d : deadline               10   10   17   17   17   18
    latest starting time        2    3    7   11   12   14
    w : width of the task       3    5    6    3    2    7
    h : height of the task      5    4    2    3    3    3

Figure 5.2: 1D improved horizon scheduling algorithm, also denoted as 1D variable size slots horizon (1D-VSSH)

The horizon scheduling algorithm schedules a task either on the currently available areas, or prospects future states of the array in order to see whether any area may accommodate the task in the future and run it to completion without missing its deadline. At time t = 1, T1 and T2 are released and placed on the reconfigurable array (figure 5.2(a)). The width of the first slot W1 is fixed by the width of T1. T2 is placed in the second slot W2. W2 and W3 are floating-size slots, as they may still be resized; the size of W1, however, is bounded as long as W2 hosts at least one task. At time t = 3, T3 and T4 are released (figure 5.2(b)).
W2 is resized in order to fit T3. At the current time, there is no room to fit task T4. However, T4 may start at time t = 11 at the latest without missing its deadline. As task T1 will complete at t = 9, the area that will be freed may fit T4. T4 is planned to start at time t = 10 at the position currently occupied by T1. Therefore, T1 and T4 overlap on the figure. At time t = 5, T5 arrives and cannot fit in any currently available area. Its latest starting time is t = 12, otherwise it will violate its time constraint. Consequently, slot W1 may accommodate T5 in addition to T4 at time t = 10, after the completion of T1. T5 is planned as shown in figure 5.2(c). At time t = 8, T6 arrives. Following the same principle, T6 is planned to start at time t = 12, after T2 completes. However, slots W2 and W3 are resized in order to fit T6.

The relative application (computational) load of the tasks set Γ6 is given by

$\rho_{\Gamma_6} = \frac{\sum_{i=1}^{6} w_i h_i e_i}{W \cdot H \cdot t_D} = \frac{3{\cdot}5{\cdot}8 + 5{\cdot}4{\cdot}7 + 6{\cdot}2{\cdot}10 + 3{\cdot}3{\cdot}6 + 2{\cdot}3{\cdot}5 + 7{\cdot}3{\cdot}4}{10 \cdot 6 \cdot 18} = 50.7\%$

It reflects the amount of resources required by Γ6, relative to the total amount of resources available on the reconfigurable array during tD time units, tD being the absolute deadline of Γ6. The utilization ratio of the reconfigurable array resulting from the current scheduling policy (1D variable slots horizon looking-ahead) on Γ6 is given by:

$U_{fpga}(\%) = \frac{\sum_{i=1}^{6} w_i h_i e_i}{W \cdot H \cdot mk} = \frac{3{\cdot}5{\cdot}8 + 5{\cdot}4{\cdot}7 + 6{\cdot}2{\cdot}10 + 3{\cdot}3{\cdot}6 + 2{\cdot}3{\cdot}5 + 7{\cdot}3{\cdot}4}{10 \cdot 6 \cdot 16} = 57.1\%$

mk is the makespan, or schedule length. It corresponds to the latest finishing time, which is equal to 16 in the above case. W and H are respectively the width and the height of the reconfigurable array. The simulation results for the 1D variable size slots horizon looking-ahead scheduling are shown and explained in Chapter 6, section 6.5.2.
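The admission test at the heart of this scheme can be summarized in a few lines of C++. The sketch below only illustrates the principle; the Slot and Task types and the per-slot horizon field are assumptions made for the example, and the actual algorithm must additionally resize slots and handle several planned tasks per slot.

#include <algorithm>
#include <vector>

struct Slot { int width; int horizon; };  // horizon: time at which the slot becomes free
struct Task { int width, arrival, exec, deadline; };

// Looking-ahead admission test: return the earliest feasible start time of
// the task in some slot, or -1 if the task must be rejected immediately.
int planStart(const std::vector<Slot>& slots, const Task& t) {
    int latestStart = t.deadline - t.exec;        // latest start meeting the deadline
    int best = -1;
    for (const Slot& s : slots) {
        if (s.width < t.width) continue;          // slot too narrow
        int start = std::max(t.arrival, s.horizon);
        if (start <= latestStart && (best < 0 || start < best))
            best = start;                         // keep the earliest feasible start
    }
    return best;
}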
5.3.3 1D variable slots scheduling with minimum makespan

The makespan is one of the most used objective functions. Indeed, most of the scheduling costs are highly tied to the amount of time spent by a given user (or application) on the computing resources. Consequently, the makespan must be reduced. However, as shown in Chapter 3 section 3.4.3, there is no optimal solution in online scheduling, and performance analysis is done through average-case analysis or worst-case analysis (e.g. competitive analysis). The flow time of each job may affect the makespan. Hence, the reduction of the total flow time may be considered. As expressed in equation 4.20 page 162, the flow time of a job is the time spanning from its release to its completion. Minimizing the total flow time of a job sequence Γn is equivalent to minimizing the average flow time, which has also been expressed in equation 4.21, page 162.

In multiprocessor scheduling, one way of minimizing the total flow time is to give higher priority to jobs with the longest execution times. The reason is that the sooner the jobs with long processing times start, the sooner they complete. This is even more important in online scheduling. Jobs with longer execution times should not have to wait too long before starting, as waiting would likely increase their flow time, and therefore the makespan. An example is shown on the left of figure 5.3, where a late start of task T21 drastically lengthens the makespan. From a multiprocessor scheduling perspective, figure 5.3 depicts the worst case and the optimal case of a list scheduling of n tasks on m identical processors. The case is then transposed into the reconfigurable hardware scheduling domain.

Figure 5.3: List scheduling vs optimal scheduling of n tasks on m identical processors; ei is the execution time of task Ti

Let Γn = {T1, T2, ..., Tn} be a list of n hardware tasks of widths w1 = w2 = ... = wn = 1 and processing times e1 = e2 = ... = e(n−1) = 1 and en = m. Let us assume that the jobs in Γn are presented one by one to the scheduling algorithm. As soon as a job Ti is scheduled, the next job Ti+1 in the list is available for scheduling. This may correspond to the "jobs arrive over list" or "jobs arrive over time" online paradigms. Let FPGAm be an m-columns reconfigurable hardware device, where m is the width of the device, and let n = m(m−1) + 1. In order to schedule Γn on FPGAm, let us assume that either the scheduler uses a 1D placer, or the height of each hardware task in Γn spans the entire height of FPGAm. Therefore, FPGAm can concurrently fit m jobs.

The list scheduling algorithm schedules any new job on the least loaded processor (here, the least loaded column). As depicted on the left of figure 5.3, the first job T1 is placed in the first column c1 at time 0, the second job T2 in the second column c2, and so on. The resulting makespan is mk_on = 2m − 1. Indeed, as the scheduler learns about the tasks one by one, it cannot properly plan the arrival of tasks with longer execution times like T21. Consequently, at the end, the final load of each column of the reconfigurable array is very unbalanced, leading to a higher makespan and a lower average utilization ratio. This special case on the left of figure 5.3 has been chosen to be a worst-case scenario. The right of figure 5.3 shows how optimal an offline scheduling of the same tasks set Γn on the same reconfigurable device FPGAm would have been. Indeed, with a priori knowledge of the tasks in Γn, they can be optimally dispatched in the different columns of the device, as shown on the right of figure 5.3. The resulting makespan is optimal and lowered to mk_opt = m.

As stated in Chapter 3 and expressed by equation 5.3, an online scheduling algorithm (e.g. list scheduling here) is said to be c-competitive if, for any input instance Γn, the objective value (e.g. the makespan mk_on in figure 5.3-left) produced by the algorithm on Γn is at most c times that obtained with the optimal offline scheduling, as shown in figure 5.3-right. This is expressed by

$mk_{on} \leq c \cdot mk_{opt}$  (5.3)

c is known as the competitive ratio of the online scheduling algorithm, and is expressed as follows:

$c = \frac{mk_{on}}{mk_{opt}} = 2 - \frac{1}{m}$  (5.4)

In the special case of figure 5.3 where wi = 1, if we assume that each hardware task in Γn spans the entire height of the reconfigurable device, the utilization ratios in online scheduling and in offline optimal scheduling are respectively

$U_{on}(\%) = \frac{\sum_{i=1}^{n} e_i}{m \cdot mk_{on}}$ and $U_{opt}(\%) = \frac{\sum_{i=1}^{n} e_i}{m \cdot mk_{opt}}$, where $U_{on}(\%) \leq U_{opt}(\%)$.

Therefore, equation 5.3 becomes:

$mk_{on} \leq \left(2 - \frac{1}{m}\right) \cdot mk_{opt}$  (5.5)

As emphasized in equation 5.5, equation 5.4 guarantees that the makespan of the online algorithm will never be beyond $(2 - \frac{1}{m}) \cdot mk_{opt}$. The latter equation is also known as the competitive ratio of Graham's online list scheduling problem (Graham et al., 1979), where a sequence of jobs has to be scheduled on m identical parallel processors in such a way that the makespan is minimized.
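This worst case is easy to reproduce numerically. The sketch below (a self-contained illustration, not part of the thesis experiments) builds the instance of m(m−1) unit jobs followed by one job of length m, runs list scheduling over m columns, and confirms the makespans 2m − 1 and m, i.e. the ratio 2 − 1/m of equation 5.4.

#include <algorithm>
#include <iostream>
#include <vector>

// List scheduling: each incoming job goes to the currently least loaded column.
int listSchedule(const std::vector<int>& jobs, int m) {
    std::vector<int> load(m, 0);
    for (int e : jobs)
        *std::min_element(load.begin(), load.end()) += e;
    return *std::max_element(load.begin(), load.end());   // makespan
}

int main() {
    const int m = 5;                         // columns of the device (processors)
    std::vector<int> jobs(m * (m - 1), 1);   // m(m-1) unit jobs first...
    jobs.push_back(m);                       // ...then the long job of length m
    int mkOn  = listSchedule(jobs, m);       // online makespan: 2m - 1 = 9
    int mkOpt = m;                           // offline optimum: long job alone on a column
    std::cout << "online " << mkOn << ", optimal " << mkOpt
              << ", ratio " << double(mkOn) / mkOpt << '\n';   // 1.8 = 2 - 1/m
}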
Table 5.5 below gives a pseudo code of the 1D variable slots scheduling with minimum makespan. The free areas are organized in clusters, and tasks with long processing times are given priority and placed in a dedicated cluster Clong, so that they start as early as possible.

A pseudo code of the 1D variable slots with minimum makespan scheduling algorithm

Initialization:
  Afree ← free areas; Area P ← ∅; flag ← 0; tsch ← CurrentTime;
  F ← FinishingTasks(R, tsch), F ⊆ R;
  A ← ArrivingTasks(tsch); R ← RunningTasks(tsch);
  W ← WaitingTasks(tsch), the list is sorted in decreasing processing time;
  Afree = {C1, Clong}, C1 = 1st cluster, Clong = 2nd cluster for tasks with long processing time;
  Temp ← ∅, temporary list of tasks to be placed, sorted in decreasing processing time;
  elong ← α · emax, long processing time threshold, e.g. α = 0.5;

Schedule_1D_slots(W, R, Afree);
1.  ∀ Ti ∈ F, do                          % dealing with finishing tasks
2.    F ← F − Ti
3.    update(R, Afree → Ci)               % updating the corresponding cluster
4.    flag ← 1
5.  end ∀ Ti ∈ F
6.  if (flag)                             % if at least one task has ended at time tsch
7.    Temp ← W ∪ A                        % Temp contains W and A
8.    ∀ Ci ∈ Afree, Ci ≠ Clong, do        % merging empty clusters, if any
9.      if (Ci → empty) merge_clusters_if_worthy()
10.   end ∀ Ci ∈ Afree
11. else Temp ← W                         % Temp contains only W
12. end if (flag)
13. ∀ Ti ∈ Temp, Afree ≠ ∅                % dealing with waiting tasks
14.   if (Ti → exec_time ≥ elong) P = place(Ti, Clong)
15.   if (not P) P = place(Ti, Ci≠long)
16.   if (P)                              % if placement successful
17.     R ← R ∪ Ti                        % R is updated and kept sorted
18.     Temp ← Temp − Ti
19.     update_clusters(R, Afree → Ci)
20.   end if (P)
21. end ∀ Ti ∈ Temp
22. W ← W ∪ Temp                          % W is updated and kept sorted, Temp emptied

Table 5.5: A pseudo code of the 1D variable slots with minimum makespan algorithm
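The core of this policy (lines 13 to 15 of Table 5.5) can be sketched as follows in C++. The Task and Cluster types, the place() helper and the simplified width bookkeeping are assumptions made for the illustration, not the thesis implementation.

#include <algorithm>
#include <vector>

struct Task    { int id, width, exec; };
struct Cluster { int freeWidth; };                 // simplified 1D cluster state

// Hypothetical 1D placement: succeeds if enough columns remain in the cluster.
bool place(Cluster& c, const Task& t) {
    if (c.freeWidth < t.width) return false;
    c.freeWidth -= t.width;
    return true;
}

// Long tasks first, routed to the dedicated cluster, as in Table 5.5.
void scheduleRound(std::vector<Task>& waiting, Cluster& normal, Cluster& clong,
                   int emax, double alpha = 0.5) {
    const double elong = alpha * emax;             // long processing time threshold
    std::sort(waiting.begin(), waiting.end(),      // decreasing processing time
              [](const Task& a, const Task& b) { return a.exec > b.exec; });
    for (auto it = waiting.begin(); it != waiting.end();) {
        bool placed = (it->exec >= elong) && place(clong, *it);
        if (!placed) placed = place(normal, *it);  // otherwise try the other cluster
        it = placed ? waiting.erase(it) : ++it;    // unplaced tasks keep waiting
    }
}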
5.4 Placement Strategies for 2D Looking-Ahead Scheduling

In Chapter 4, MERs-based optimal area management strategies have been used to refine the methodology and to serve as a comparison reference. Following the same methodology, non-optimal area management and placement strategies were adopted, combined on one hand with looking-ahead scheduling algorithms, and on the other hand with multi-shape tasks scheduling. In most of the designed scheduling algorithms, the placer manages the areas mainly through a binary search tree, as described in Bazargan et al. (2000) and detailed in Chapter 3, section 3.8.3. The time complexity for finding a given node in the tree is O(n) in the worst case, n being the number of tasks running on the FPGA. The complexity drops to O(log2(n)) if the tree is balanced. By default, a 2D placer is used, unless a 1D placer is explicitly mentioned in the name of the scheduling algorithm.

In some cases, the hash matrix presented by Walder et al. (2003) was used. The tree and the hash matrix (if used) are updated at each task insertion or deletion. The matrix stores the free areas and allows the placer to find a feasible placement in constant time complexity O(1). However, updating the matrix requires a potential scan of w × h entries, where w and h are respectively the width and the height of the area inserted in or deleted from the matrix. Walder et al. (2003) show that the number of real scans is one order of magnitude lower than that. Whether the matrix is used or not, a scheduling algorithm is comparable to another only if both use the same placement strategy. This ensures that any improvement will be to the scheduler's credit. In this chapter, the placement strategies are not detailed any further, as they were formerly described in Chapter 3 and are detailed in the papers of the authors cited above. In the coming section, a ternary tree structure suitable for looking-ahead scheduling is proposed.

5.4.1 A Ternary Tree structure for Looking-Ahead Scheduling

This section presents a ternary search tree structure enabling the looking-ahead scheduling of real-time tasks on partially reconfigurable FPGAs. The structure suits this scheduling approach, as it provides a good overview of the present and future states of the reconfigurable array. As previously stated, looking-ahead scheduling schedules each task once, at its arrival, and immediately rejects or accepts it. The accepted task may start instantaneously or later and still meet its deadline. This rapid scheduling decision makes looking-ahead scheduling of vital interest for online real-time systems. Indeed, an immediate task rejection gives the OS the opportunity to find alternative implementation resources other than the reconfigurable array.

Through the literature review in Chapter 3 and the methodology in Chapter 4, it was also stated that the reconfigurable array utilization ratio, the task rejection ratio and the scheduling algorithm complexity highly depend on the underlying placement strategies used. Consequently, using a looking-ahead scheduling is worthwhile in an online real-time context only when combined with low complexity free area management strategies. In this section, the looking-ahead scheduling is combined with a low complexity 2D placement strategy in order to lower the overall scheduling/placement runtime overheads. This placement strategy relies on the binary search tree described in Chapter 3 section 3.8.3. The tree was introduced by Bazargan et al. (2000) and improved by Walder et al. (2003). It stores the state of the reconfigurable array along the scheduling process. In this thesis, a ternary tree suitable for looking-ahead scheduling was derived from the binary tree.

Managing the tree: example of two horizon scheduling algorithms

The ternary tree structure is a variant of the binary tree. Storing the states of the array in the latter structure makes merging and splitting operations more intuitive, as detailed earlier in figure 3.26 page 128. The same principle is applied to the ternary tree. The time complexity for finding a node in the structure is bounded by O(n), n being the number of tasks currently placed in the array. Figure 5.4 pictures the splitting and merging processes that result from the scheduling of a set of tasks on the reconfigurable array. Let Γ6 = {T1, T2, T3, T4, T5, T6} be a set of online real-time hardware tasks to schedule. The parameters of Γ6 are shown in table 5.6. As shown in figure 5.4 (i), (ii) and (iii), when a task is placed in an area, the area is divided into three rectangles according to a vertical, horizontal or overlapping split. In the binary tree presented in Bazargan et al. (2000) and Walder et al. (2003), each node represents an area which can generate up to two children nodes when a task is placed; a third node is generated only in some special cases. In the ternary tree proposed in this thesis and depicted in figure 5.4, a third area node of the size of the placed task ((a) and (a′)) is systematically generated. Each node stores the size and the time availability of an area.
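As a data-structure sketch (illustrative only; the field names and the particular vertical-split geometry are assumptions made for the example), a node of such a ternary tree could be declared as follows in C++, with the systematic third child holding the area of the placed task itself.

#include <memory>

// One node of the ternary tree: a rectangular area plus its time availability.
struct AreaNode {
    int width = 0, height = 0;
    int availableAt = 0;                 // time at which the area becomes free
    std::unique_ptr<AreaNode> laying;    // first remainder ("laying" child)
    std::unique_ptr<AreaNode> standing;  // second remainder ("standing" child)
    std::unique_ptr<AreaNode> taskArea;  // third child: the placed task's own area
};

std::unique_ptr<AreaNode> makeNode(int w, int h, int avail) {
    auto n = std::make_unique<AreaNode>();
    n->width = w; n->height = h; n->availableAt = avail;
    return n;
}

// Vertical split when a task of size w x h, finishing at time f, is placed
// in the bottom-left corner of the parent area: three children are generated.
void splitOnPlacement(AreaNode& parent, int w, int h, int f) {
    parent.taskArea = makeNode(w, h, f);                          // the task itself
    parent.standing = makeNode(parent.width - w, parent.height,   // right-hand remainder
                               parent.availableAt);
    parent.laying   = makeNode(w, parent.height - h,              // area above the task
                               parent.availableAt);
}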
One example of a set of tasks to be scheduled is given in table 5.6. The scheduling algorithm learns about the tasks as they are released. According to figure 5.4, the entire resources of the FPGA of size 10x6 are available at the beginning. At time t = 1, tasks T1 and T2 are released in the system. They are immediately placed, as there is enough room to fit them (see a and b), and children nodes are generated accordingly (see a′ and b′).

Table 5.6: Example of tasks parameters for horizon-SFAF and horizon-EAAF scheduling algorithms (tD = max(d) = 18)

  Tasks parameters         T1   T2   T3   T4   T5   T6
  a : arrival time          1    1    3    3    5    8
  e : execution time        8    7   10    6    5    4
  d : deadline             10   10   17   17   17   18
  latest starting time      2    3    7   11   12   14
  w : width of the task     3    5    6    3    2    7
  h : height of the task    5    4    2    3    3    3

The information in the root node (b′) shows that the whole FPGA will be available at time 18, while the child node 7x6 hosts the area 7x6 occupied during the time interval [1, 8]. At time t = 3, T3 arrives and cannot fit in any currently available area (nodes 3x1, 5x2 and 2x6 on (b′)). However, T3 can fit in node 7x6 and still meet its deadline. Thus, T3 is planned to start at time t = 8. At time t = 5, T4 arrives. There are two scheduling options:

1. horizon-EAAF: this fitting approach assigns the Earliest Available Area First to T4. The node that is available sooner than any other node capable of accommodating task T4 is the node 6x4, available at time t = 8. This fitting strategy may correspond to first fit, from a temporal point of view.

2. horizon-SFAF: this approach assigns the Smallest Fitting Area First to T4. The smallest node among all the nodes that can fit the task is the node 3x5, available at time t = 9. The node corresponds to the area that fits best, similarly to the best fit fitting strategy.

Intuitively, horizon-SFAF should outperform horizon-EAAF in terms of reconfigurable chip utilization ratio, as it relies on a best fit approach. For the sake of simplicity, these two variants of the horizon scheduling algorithms are sometimes denoted as SFAF and EAAF respectively. The simulation results of the horizon-SFAF and horizon-EAAF scheduling using the ternary tree are presented and discussed in Chapter 6, section 6.5.1. The two variants are compared with two tasks parameters based scheduling algorithms (EDF scheduling and the Basic scheduling).
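The two fitting rules reduce to two different comparisons over the candidate nodes of the tree. The sketch below illustrates this; the flattened Candidate view of the tree nodes and the field names are assumptions made for the example.

#include <optional>
#include <vector>

struct Candidate { int width, height, availableAt; };
struct Task      { int width, height, latestStart; };

// Select the node hosting the task: EAAF keeps the earliest available node,
// SFAF keeps the smallest fitting one; both discard nodes that would make
// the task miss its deadline.
std::optional<Candidate> select(const std::vector<Candidate>& nodes,
                                const Task& t, bool smallestFirst) {
    std::optional<Candidate> best;
    for (const Candidate& c : nodes) {
        if (c.width < t.width || c.height < t.height) continue;  // does not fit
        if (c.availableAt > t.latestStart) continue;             // deadline miss
        bool better = !best ||
            (smallestFirst ? c.width * c.height < best->width * best->height
                           : c.availableAt < best->availableAt);
        if (better) best = c;
    }
    return best;
}

On the example of table 5.6, with T4 of size 3x3 and latest starting time 11, this selection returns the node 6x4 (available at t = 8) for EAAF and the node 3x5 (available at t = 9) for SFAF, as described above.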
[Figure 5.4 pictures the array and its ternary tree at t = 1 (T1 and T2 arrive and are placed; a, a′, b, b′), at t = 3 (T3 is planned at node 7x6 and the tree is updated (c, c′), while the areas 5x2 and 2x6 are lost during the time interval [3, 8], the horizon) and at t = 5 (T4 is fitted at node 6x4 at t = 8 by EAAF, or at node 3x5 at t = 9 by SFAF; d, d′). The figure also illustrates the three splitting cases, (i) horizontal split, (ii) vertical split and (iii) overlapping split, each remainder area being tagged as laying (l) or standing (s).]

Figure 5.4: Ternary tree structure: splitting and updating processes

5.5 Multi-shape based Tasks Scheduling

Multi-shape based scheduling is a scheduling algorithm that assumes that each hardware task may have more than one version, the versions differing from each other by their shape, size and/or processing time. Such tasks are denoted as multi-shape tasks, and the corresponding scheduling/placement algorithms as multi-shape based. When designing a hardware task, there are mainly two stages at which the designer can influence its shape and its size:

1. Prior to the synthesis phase, by using various arithmetic implementation techniques. The techniques range from bit-serial to fully parallel, and therefore provide a range of options and compromises between the size of each task and its temporal characteristics. Distributed Arithmetic is a well-known example of such a technique. A task may have several execution times (or throughputs) depending on whether it has been implemented serially, semi-parallel or fully parallel. Variants of the same task differ from each other by their size and the corresponding execution time. One example is depicted in figure 5.5, where task T3 is a multi-shape task with a normal variant and some smaller variants (smaller_standing and smaller_laying) with a longer execution time.

2. After the synthesis phase, where Place & Route tools may be used to constrain a designed module in a rectangular region. This relies on the module-based design for partial reconfiguration described earlier in Chapter 2 section 2.5.8. Such a region-constrained Place & Route may slightly change the temporal characteristics of the module or task (e.g. the highest operation frequency deduced from the longest path in the module layout). However, it is assumed in this thesis that the execution time of a task remains roughly the same for roughly the same amount of configurable resources. Figure 5.5 depicts an example where task T3 is a multi-shape task with a normal variant and some variants denoted as same_standing and same_laying, which use the same amount of configurable resources for the same execution time, but are of different width and height.

Therefore, designing multi-shape tasks comes at the cost of an extra effort at design time. In addition, the memory requirements for storing the module bitstreams increase linearly with the number of variants per task. As additional versions of hardware tasks tend to be smaller versions that minimize the total amount of configurable resources, the memory requirement is bounded by O(n), where n is the number of variants per task.

Figure 5.5: Multi-shape tasks provide more fitting opportunities (e.g. T3 provides 5 variants).
5.5.1 Raison d'être for multi-shape tasks

As stated in the previous chapter, a way of improving the scheduling/placement quality without increasing the algorithm complexity and runtime overhead is to generate hardware tasks that feature more than one rectangular shape. Multi-shape based scheduling is suitable for applications with hardware tasks that have more than one implementation version on the reconfigurable device, as depicted in figure 5.5. The top of the figure depicts an example of a multi-shape task with 5 versions. The bottom of the figure depicts a placement scenario of such tasks. The normal version of task T3 could not fit on the reconfigurable array in any of the three cases illustrated in the bottom of figure 5.5. Indeed, after placing any of the two versions of tasks T1 and T2, the remaining area is not big enough to fit the normal version of T3. (a), (b) and (c) map three placement alternatives. In the example, T3 provides versions that use less resources (smaller versions) in addition to versions that use the same amount of resources (same size versions).

Let Ti = {Ti1, Ti2, ..., Tin} be an n-versions multi-shape hardware task, where Tij is the jth version of task Ti. The ratio in equation 5.6 gives a simplified relation between the size and the execution time of the n different versions of task Ti:

$\forall T_{ij}, T_{ik} \in T_i, \quad \frac{A_{ij}}{A_{ik}} = \frac{w_{ij} \cdot h_{ij}}{w_{ik} \cdot h_{ik}} = \frac{e_{ik}}{e_{ij}}$  (5.6)

where $A_{ij} = w_{ij} \cdot h_{ij}$ is the size of the jth version of task Ti (resp. $e_{ij}$ is the execution time of the jth version of task Ti). For example, task T3 in figure 5.5 has a normal version and two half-sized versions (smaller laying and smaller standing) with an execution time that is twice as long.

Scheduling multi-shape hardware tasks does not really increase the algorithm complexity and runtime overhead. The more versions per hardware task, the higher the probability of fitting tasks on the reconfigurable array. Thus, the task rejection ratio and the reconfigurable array utilization ratio are improved. Consequently, the philosophy behind multi-shape is to partially shift the complexity of the runtime scheduling and placement from online time to offline or design time. Therefore, the extra effort at design time consists of generating as many versions (bitstreams) of each hardware task as possible.

There are at least two more reasons for using multi-shape tasks:

From a power consumption perspective: providing various implementation versions of a hardware task is meaningful in power-aware systems. Indeed, the power consumption of a hardware task is quite influenced by its size, its operation frequency and the type of logic resources that implement it. For example, a pure CLBs-based module may consume more power than a module that mainly uses dedicated ASIC blocks (e.g. DSP blocks, hard core processors, etc.). A system can switch from a normal state to a power-aware state just by swapping task versions in and out of the system, depending on their power consumption. For example, the system may provide a high QoS in its normal state at the cost of power consumption, and may switch into a state that sacrifices the QoS for the sake of energy saving.

From a cryptography and encryption perspective: multi-shape hardware tasks may also be recommended in digital systems that fear being spied on at the physical implementation level by techniques like differential power analysis. By tracking the energy consumed by a mathematically secured digital system, a differential power analysis may collect enough information to break its encryption. However, the system may be more difficult to track if different variants of each task are dynamically swapped in and out of the reconfigurable array, or relocated at runtime.
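Equation 5.6 directly yields the execution time of any additional version from the normal one. The sketch below (an illustration with assumed types and integer sizes, not the thesis tooling) derives half-sized variants whose execution time is doubled accordingly, as for the smaller variants of T3 in figure 5.5.

#include <vector>

// One implementation version of a hardware task: a w x h rectangle plus the
// execution time e of the corresponding bitstream.
struct Version { int w, h, e; };

// Per equation 5.6, A_ij / A_ik = e_ik / e_ij: halving the area of a version
// doubles its execution time (e.g. a bit-serial datapath instead of a
// parallel one).
std::vector<Version> withSmallerVariants(const Version& normal) {
    std::vector<Version> v{normal};
    if (normal.w % 2 == 0)   // smaller standing variant: half width, same height
        v.push_back({normal.w / 2, normal.h, 2 * normal.e});
    if (normal.h % 2 == 0)   // smaller laying variant: same width, half height
        v.push_back({normal.w, normal.h / 2, 2 * normal.e});
    return v;
}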
Figure 5.6: Flow chart of the multi-shape algorithm that selects the task version to be placed.

5.5.2 The multi-shape basic algorithm

The multi-shape algorithm schedules a list of ready tasks that is sorted according to their arrival times. Hence, the algorithm differs from the basic scheduling algorithm presented earlier only by the fact that each hardware task provides more than one rectangular shape. Tasks are kept in the queue as long as they can still meet their deadline. At each time tsch, the scheduler is invoked only in the event of task(s) termination (tf) and/or task(s) arrival (tr), and proceeds as follows:

- if (tsch = tf), the algorithm scans the queue from the head, attempting to place tasks as far as possible, and removing (rejecting) tasks which can no longer meet their deadline;
- if (tsch = tr), the algorithm attempts to place the arriving task(s). If the attempt fails, the task(s) is inserted at the back of the ready tasks queue.

The algorithm chooses a version of the elected task among its existing versions, in the order of priority mapped in figure 5.6(b). The latter figure shows that, among two versions of different sizes, the algorithm prioritizes the smaller size version. Prioritizing smaller versions over same size versions tends to reduce as much as possible the total amount of configurable resources used by each task. However, in the case of identical size versions, the algorithm prioritizes the version with the highest aspect ratio. The flow chart of figure 5.6(a) details the algorithm. While finding a location P for a multi-shape task, the algorithm first checks whether the normal version may fit on the reconfigurable array. In case of failure, it checks the smaller versions, provided that the corresponding task processing time will not lead to a completion time that violates the deadline constraint. Other versions that have the same size as the normal version are only checked if no smaller version has been successfully placed. If no version of the task can fit at the scheduling time tsch, the task is added to or kept in the waiting list for a further placement attempt, if it can still meet its deadline.

Distributed Arithmetic as an enabling technique

Distributed Arithmetic (DA) is a computation algorithm that uses memory instead of multipliers to perform sums of products where one of the operands remains constant. The algorithm is denoted as multiplier-less (see section 7.9.1, Appendix 7.9). Equation 5.7 expresses the output of a FIR filter. It is a good example of a sum of products, as processing one output sample Y(n) requires the accumulation of N product terms.

$Y(n) = \sum_{l=0}^{N-1} H_l \cdot X_l(n) = \sum_{l=0}^{N-1} H_l \cdot X_l$  (5.7)

H0, H1, ..., H(N−1) are N constant and time-invariant filter coefficients that are computed beforehand. N is the filter length. At each time n, the output response Y(n) is a function of the N last input samples X0, X1, ..., X(N−1) only. Therefore, n may be implicit, as shown in the final equation. The output requires 2N − 1 arithmetic operations (N multiplications and N − 1 additions). Different techniques, ranging from pure serial implementation to fully parallel, may be used for the FPGA implementation of the filter. Such a convolution is very common in DSP functions, and is suitable for DA implementation.
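As a concrete illustration of the DA principle, here is a minimal C++ sketch under simplifying assumptions (N = 4 taps, 8-bit unsigned samples, integer coefficients): the N multiplications are replaced by a 2^N-entry look-up table of partial coefficient sums, addressed one bit-plane of the inputs at a time, which is the kind of structure an FPGA implementation would store in memory.

#include <array>
#include <cstdint>

constexpr int N = 4;                          // filter length (number of taps)

// LUT entry k holds the sum of the coefficients H_l whose bit is set in k.
constexpr std::array<int, 1 << N> makeLut(const std::array<int, N>& H) {
    std::array<int, 1 << N> lut{};
    for (int k = 0; k < (1 << N); ++k)
        for (int l = 0; l < N; ++l)
            if (k & (1 << l)) lut[k] += H[l];
    return lut;
}

// Bit-serial DA: for each bit position b of the inputs, gather the b-th bit
// of every sample into a LUT address, then shift-accumulate. No multipliers.
int firDA(const std::array<std::uint8_t, N>& x,
          const std::array<int, 1 << N>& lut) {
    int y = 0;
    for (int b = 7; b >= 0; --b) {            // from MSB to LSB of the samples
        int addr = 0;
        for (int l = 0; l < N; ++l)
            addr |= ((x[l] >> b) & 1) << l;
        y = (y << 1) + lut[addr];             // shift-accumulate the partial sums
    }
    return y;                                 // equals sum_l H[l] * x[l]
}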
Appendix E, page 261, gives a detailed example of the filter, designed as a multi-shape hardware task using the DA algorithm. Different trade-offs between the configurable resources and the filter throughput are obtained. In the past, multiplier-less techniques were very useful, as multipliers consumed a lot of configurable resources. Fortunately, nowadays, FPGAs embed numerous hardwired high performance DSP blocks. The simulation results for multi-shape tasks scheduling are presented and discussed in Chapter 6, section 6.4.

5.6 Conclusion of the Chapter

This chapter has discussed different scheduling algorithms and some placement aspects of these algorithms. The chapter mainly focused on scheduling through two approaches: the looking-ahead scheduling and the without-looking-ahead scheduling approach. On one hand, a family of without-looking-ahead algorithms, denoted as tasks parameters based, has been studied. These algorithms are based on geometric and/or temporal parameters of the tasks. In addition, the multi-shape scheduling algorithm was proposed. The algorithm assumes that each task may be provided with more than one shape or size at its design time. On the other hand, looking-ahead scheduling algorithms were combined with low complexity placement strategies (e.g. 1D partitioned) in order to provide scheduling solutions that are likely to have low runtime overheads. Furthermore, a ternary tree that eases the area management for looking-ahead horizon scheduling was proposed and investigated.

This chapter relied on the methodology presented in Chapter 4. It also relied on some intuitive assumptions about the improvements and performance that can be achieved by the above proposed scheduling/placement strategies. The next chapter is devoted to the simulation and experiment results that will assess, validate or invalidate the studies above.

Chapter 6

Simulation Results of the Algorithms Proposed to Solve Online Real-Time Scheduling Issues

6.1 Introduction

This chapter presents and explains the simulation results of the experiments conducted in order to assess and compare the online scheduling algorithms presented in the previous chapter. Figure 6.1 depicts the global simulation environment. The input instances are built first. These inputs (e.g. the parameters of the tasks in the application) are generated by probability distributions. They are then submitted to the different scheduling algorithms and their placement strategies. The output data are collected in the form of the performance metrics presented earlier in Chapter 4. Afterwards, a quantitative analysis is performed on these output data (figure 6.1), according to the metrics. This analysis is highlighted through meaningful diagrams. The simulation results are analyzed, classified and discussed with respect to the input instances.
In this thesis, the analysis mainly relies on an average-case analysis, as described in Chapter 3, section 3.4.3. This analysis considers the average performance of the scheduling algorithms or heuristics over all or a range of the input instances. However, worst-case analysis (e.g. competitive analysis) remains the widely used methodology for guaranteeing the performance of online scheduling algorithms on uniprocessor or multiprocessor systems (see Chapter 3, section 3.4.3). Thus, in this thesis, the competitive analysis was introduced and transposed to some cases of online scheduling on reconfigurable hardware, but not more than that. Since a competitive analysis can only trap the worst-case behaviour of online algorithms, numerous experimental studies (e.g. Albers and Schröder, 2002) have shown that, on real world jobs, these algorithms quite often outperform the well-known competitive ratio c = 2 − 1/m of Graham et al. (1979) for m-processors scheduling. Therefore, the following average-case analyses are valid.

Figure 6.1: Summarizing the scheduling problem as defined in this thesis.

6.2 Building the Inputs and the Testing Environment

For experiment purposes, it is assumed that the inputs consist of sets of tasks and the reconfigurable array. A tasks parameters generator has been implemented. It allows a user to generate the tasks parameters following two probability distributions: the uniform distribution and the Gaussian distribution. However, in order to use more realistic tasks parameters, a great range of values was covered on one hand, and on the other hand, the size and timing characteristics of real life hardware IPs were used. In the following sections, some common IPs for DSP applications are characterized, their size is estimated, and the tasks sets are generated accordingly.

6.2.1 Hardware Tasks Characterization

With the aim of remaining close to reality, especially in terms of size, a census was taken of the IPs available in the XILINX COREGEN¹ IPs library, in addition to some others fairly well documented on the Internet. In order to find realistic ranges for the tasks parameters, it would have been necessary to develop a synthesis protocol which synthesizes many versions of each IP for FPGA implementation. Hence, versions of the same IP would differ from each other by prioritizing one option over another, or by combining optional characteristics of the IP, such as the accuracy, the datapath width, the security and encryption, the throughput, the use of synchronization signals or not, etc. These combinations and options would have produced a tremendous number of possible syntheses per IP. Therefore, a uniform distribution was used instead of a Gaussian one, as it best reflects the case. Each IP was synthesized twice, once in its lightest configuration, and once in its fullest configuration. Hence, it was assumed that, statistically, the values of the tasks sizes were uniformly distributed between these two limits. Table 6.1 gives a framework of the minimum and maximum sizes of IPs in terms of the number of slices required. The IPs are grouped according to their application domain. A more detailed table of existing IPs and their size can be found in Appendix D, table 7.7, page 260.

¹ Xilinx CORE Generator System accelerates design time by providing access to highly parametrized Intellectual Properties (IP) for Xilinx FPGAs and is included in the ISE Design Suite. CORE Generator provides a catalog of architecture specific, domain-specific (embedded, connectivity and DSP), and market specific IP (Automotive, Consumer, Military/Aerospace, Communications, Broadcast etc.). These user-customizable IP functions range in complexity from commonly used functions, such as memories and FIFOs, to system-level building blocks, such as filters and transforms. Using these IP blocks can save days to months of design time. The highly optimized IP allows FPGA designers to focus efforts on building designs quicker while helping bring products to market faster.
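As an aside, a minimal sketch of such a tasks parameters generator is given below. It is an illustration only, not the thesis implementation: it draws parameters uniformly from the default intervals that section 6.2.3 will list (laxity class B is assumed here), and the Task fields and helper names are assumptions made for the example.

#include <algorithm>
#include <array>
#include <cmath>
#include <cstddef>
#include <random>

struct Task { int w, h, e, laxity; };

// Draw one aperiodic task with uniformly distributed parameters.
Task drawTask(std::mt19937& rng) {
    static const std::array<double, 9> ratios =
        {1/5., 1/4., 1/3., 1/2., 1., 2., 3., 4., 5.};    // aspect ratio ar = h/w
    std::uniform_int_distribution<int> size(50, 1500);   // S in [50, 1500] CLBs
    std::uniform_int_distribution<std::size_t> pick(0, ratios.size() - 1);
    std::uniform_int_distribution<int> exec(5, 100);     // e in [5, 100]
    std::uniform_int_distribution<int> lax(11, 50);      // laxity class B
    double S = size(rng), ar = ratios[pick(rng)];
    int w = std::max(1, (int)std::lround(std::sqrt(S / ar)));  // S = w*h, ar = h/w
    int h = std::max(1, (int)std::lround(ar * w));
    return {w, h, exec(rng), lax(rng)};
}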
Table 6.1: Approximate sizes of most common IPs (hardware tasks).

  IPs                            Size on Virtex2pro FPGA   Estimated size on Virtex5 FPGA
  Communication IPs              ∈ [1000, 3000] slices     ∈ [350, 1500] slices
  Floating Point operations IPs  ∈ [100, 500] slices       ∈ [45, 300] slices
  CORDIC algorithm               ∈ [100, 600] slices       ∈ [50, 270] slices
  FFT                            ≈ 2000 slices             ≈ 950 slices
  Video processing IPs           ∈ [3000, 9000] slices

6.2.2 Estimating the Size of Tasks

Table 6.1 reports on the number of slices used, but does not provide any information on the number of slices per column (height) and per line (width). Therefore it is necessary to estimate the width and the height of the IPs or tasks, according to the rectangular-shaped task model adopted in this thesis. Let Smax (resp. Smin) be the maximum (resp. minimum) number of slices required for a given IP application or set of n tasks Γn. One can estimate the height and width parameters by a mean m and a standard deviation σ, so that in the extreme cases (which correspond to the extreme values of the uniform law), one reaches the limits of the number of slices. This is expressed as follows:

$(m + \sigma)^2 = S_{max}$ and $(m - \sigma)^2 = S_{min}$  (6.1)

The solution of these equations gives an estimation of the parameters m and σ:

$m = \frac{\sqrt{S_{max}} + \sqrt{S_{min}}}{2}$ and $\sigma = \frac{\sqrt{S_{max}} - \sqrt{S_{min}}}{2}$  (6.2)

For the random generation of the size of hardware tasks, it is possible to proceed either according to a specific application domain or in a more generic way. In the first case, the targeted application domain may allow the designers to evaluate Smin and Smax and therefore to estimate the parameters m and σ; m and σ may even feed free tasks graph generators like TGFF². In the second case, the parameters are simply randomly generated following the uniform distribution.

² TGFF stands for Tasks Graph For Free. TGFF is an open source software created in 1998 by R.P. Dick and D.L. Rhodes in order to facilitate tasks graph generation for scheduling problems analysis.

6.2.3 Final Inputs Values for Experiments

The size of Xilinx's FPGA XCV1000 is used as the reference size of the reconfigurable array. The corresponding width and height of the array are respectively W = 96 and H = 64. Thus, there are 6144 CLBs available on the reconfigurable device. This size has formerly been used in a few research papers. However, nowadays FPGAs are getting denser and may integrate 10 times more CLBs than Xilinx's XCV1000. The sizes S of the generated tasks were chosen accordingly, in such a way that the device may accommodate about 4 tasks of the maximum size Smax at a time. In most of the conducted experiments, the parameters of 100 sets of 50 aperiodic real-time tasks were randomly generated following the uniform distribution in the intervals listed below:

- Size S ∈ [50, 1500] CLBs.
- Aspect ratio ar = h/w ∈ {1/5, 1/4, 1/3, 1/2, 1, 2, 3, 4, 5}.
- Processing time e ∈ [5, 100].
- Laxity: class A ⇒ l ∈ [1, 10], class B ⇒ l ∈ [11, 50] and class C ⇒ l ∈ [51, 100].
- Relative application computational load, given by equation 4.17 page 161 and usually denoted in this chapter as the application load Appload; Appload = 50%.

The above parameters are used by default for simulations, unless new values are explicitly indicated.

6.2.4 The Running Environment

Apart from the simulations that were conducted on the Microblaze embedded processor for accurate timing measurements, the scheduling algorithms were simulated on a laptop computer. The latter hosts an Intel Pentium dual-core processor T2330, running at 1.6 GHz and featuring a 1MB L2 cache memory.

6.3 Tasks Parameters Based Scheduling

The simulation results of the tasks parameters based scheduling are presented in this section. Figures 6.2 and 6.3 depict the simulation results of the Basic, EDF, SSF and BSF scheduling algorithms that use single-version tasks. The placer manages non-overlapping vertically split areas through a binary tree, as described in Bazargan et al. (2000). Consequently, any improvement of one algorithm over another will be thanks to the scheduling scheme. The experiment results are expressed in terms of laxity classes.

6.3.1 Chip Utilization Ratio and Tasks Rejection Ratio

From the reconfigurable array utilization ratio and tasks rejection ratio perspective, the simulation results shown in figure 6.2 are as follows:

- For tasks of laxity class A, the chip utilization is ≈30%, no matter which scheduling algorithm is used. The main reason is that once a task is released, its maximum waiting time is equal to its laxity. Consequently, for a laxity class l ∈ [1, 10], the scheduling algorithms do not have enough time to schedule the tasks. As the underlying placement strategy is similar, the resulting chip utilization ratios are nearly similar. The tasks rejection ratio is also ≈22%, independently of the scheduling algorithm. Finally, the BSF algorithm slightly outperforms the other algorithms in terms of utilization ratio, while SSF behaves worse than the others.

- For tasks of laxity class B, the chip utilization rises to ≈33% while the tasks rejection ratio drops to ≈12%. Indeed, thanks to a higher laxity, there is more time for finding a fitting solution for the tasks. Once again, BSF produces the best utilization ratio, and SSF the worst tasks rejection ratio.

- For tasks of laxity class C, the chip utilization ratio does not significantly increase; BSF outperforms the others with ≈34%. However, the tasks rejection ratio decreases to around 5%.

As shown in figure 6.2 and commented above, none of the tasks parameters based scheduling algorithms significantly outplays another. However, significant improvements are gained in terms of laxity instead, especially on the tasks rejection ratio. One may conclude that these results are highly tied to the underlying placement strategy used here, which is the above mentioned binary tree based placer of Bazargan et al. (2000).

The diagram at the bottom of figure 6.2 depicts the differential quality metric URqm, defined as the difference between the utilization ratio and the tasks rejection ratio. The metric gives an overall performance figure that takes both the utilization ratio and the tasks rejection ratio into account, as expressed earlier in equation 4.24, page 163. A high chip utilization ratio increases URqm, while a high tasks rejection ratio decreases it. Hence, the higher the differential quality metric URqm, the better the scheduling. From the URqm perspective, BSF remains the best parameters based scheduling, while SSF is the worst. Intuitively, as bigger tasks are placed first, the utilization ratio is higher. URqm may allow the designer to choose between two algorithms that share similarities either in terms of tasks rejection ratio, or in terms of chip utilization ratio.

6.3.2 Runtime Overhead

Figure 6.3 depicts the runtime overhead of the different tasks parameters based scheduling algorithms. The top left of the figure shows the average time the scheduling function has taken to run every time it has been invoked. These time overheads are globally around 40 to 47 µs. It can be noticed that, for example in laxity class A, the average time overhead slightly increases from one algorithm to another. This increase comes from the increasing difficulty of keeping the list of ready tasks sorted according to the considered criterion. Keeping the list of tasks sorted according to their release time is easier. Identically, keeping the tasks in the list sorted according to their deadline or their laxity is somewhat related to their release time. Conversely, sorting the released tasks according to their size is completely time independent. Therefore BSF and SSF lead to a more time consuming process.

Figure 6.2: Utilization ratio (top), rejection ratio (middle) and quality metrics (bottom): comparative results for EDF, LLF, SSF and BSF scheduling algorithms.

Figure 6.3: Scheduling runtime overhead, number of scheduling calls and cumulative scheduler runtime overheads: comparative results on EDF, LLF, SSF and BSF scheduling algorithms.

The average number of invocations may also affect the global time spent in running the scheduling algorithm. The top-right of the figure illustrates the average number of calls to the scheduler. Recall that the scheduler is invoked whenever a task completes or arrives. Figure 6.3-top-right shows that the higher the laxity, the higher the number of calls. The reason is that a longer laxity gives the scheduler more opportunities to place each task, and therefore increases the number of calls. The average number of calls is around 64 to 71.

Surprisingly, as shown in the top-left of figure 6.3, the average runtime overheads of the scheduling algorithms are higher for tasks of laxity class B. Even when considering the cumulative runtime overhead, which takes the number of invocations into account, the trend is confirmed. The diagram at the bottom of figure 6.3 depicts the case. The cumulative runtime overhead corresponds to the average time spent in the scheduling function every time it is invoked, times the number of calls to the scheduler. Hence, the third diagram results from the multiplication of the two diagrams above. Put differently, the cumulative sum of the time spent in the scheduling algorithm throughout the simulation is calculated, in order to obtain a bigger and therefore more expressive value. As not all the tasks are accepted in the present case, the makespan is meaningless here. Indeed, an algorithm which rejects too many tasks is likely to have a shorter makespan. Therefore, the makespans of two scheduling algorithms are comparable only if both have accepted exactly the same tasks.

6.3.3 Conclusion on parameters based scheduling

Tasks parameters scheduling algorithms are simple to implement, do not have a high runtime overhead, and are based on one parameter of the task. The experiment results show that these algorithms are not far from each other in terms of scheduling and placement quality (e.g. runtime overhead, reconfigurable array utilization ratio and task rejection ratio). As they use the same placement strategy, the similarities in chip utilization ratio and tasks rejection ratio suggest that a significant improvement of the scheduling may not come from a single parameter of the task. In all likelihood, it may be interesting either to combine temporal and geometric characteristics in order to build novel algorithms, or to propose area management approaches that do not increase the overall algorithm complexity and runtime overhead, but improve the array utilization and tasks rejection ratios. Hereinafter are the experiment results for the multi-shape tasks based scheduling algorithms.

6.4 Multi-shape Tasks Based Scheduling

This section presents the simulation results of the multi-shape based scheduling. The algorithm has been previously studied in Chapter 5, section 5.5 and described in figure 5.6, page 194.

6.4.1 Multi-shape Tasks

Relying on equation 5.6 page 193, one or several versions of each task were produced. It is assumed that each task has at least one version, which is its normal version. The normal version of a task is the version that results from a random generation of the tasks parameters with a given probability distribution, as discussed earlier. A version is denoted as standing (Std) or laying (Lay) depending on its aspect ratio compared to the normal version, as depicted in figure 6.4. A version is denoted as smaller (Sml) or same (Sm) depending on its size compared to the normal size version, as also depicted in figure 6.4. Hence, while a same size version uses the same amount of configurable resources as the normal size, the smaller size version uses only half this amount.

A standing version will tend to increase its aspect ratio by doubling its height in comparison with the normal version's height. However, this height cannot exceed the height of the reconfigurable array. Therefore, the width of the standing version is adjusted accordingly, in order to obtain a size either identical (same) to the size of the normal version, or smaller. Conversely, a laying version will tend to be twice as wide as the normal version, as long as the width does not exceed the width of the reconfigurable array. The height of the laying version is adjusted accordingly, in order to generate either a same size task or a smaller size task. As pictured in figure 6.4, several variants of multi-shape hardware tasks were generated. Various combinations of these variants can provide variants of the multi-shape algorithm. Each combination is also characterized by the number of versions per hardware task. Below are listed examples of possible combinations that were used in this thesis:

1. Sm_Std(2), where (2) represents the number of versions or variants per task. In this scenario, illustrated in figure 6.4, each hardware task has a normal version and an additional version denoted as Sm_Std, which stands for same size standing version. The same standing version uses the same amount of resources as the normal version, but with a higher aspect ratio, as described above.

2. Sm_Lay(2); this scenario is similar to the former. However, the second version has the same size as the normal version, but with half the aspect ratio if possible. Here, Sm_Lay stands for same size laying version.

3. Sml_Std(2); a normal version plus a smaller size standing version (Sml stands for smaller size). The latter version shares the same height with the normal version, but with half its width.

4. Sml_Std_Lay(3); a normal version plus a smaller standing and a smaller laying version.

5. Sm_Std_Lay(3); in this scenario, depicted in figure 6.4, each task has 3 variants: the same standing and the same laying in addition to the normal variant.

6. Sm_Sml_Std_Lay(5); this case is shown in figure 6.4, where each task has 5 variants: the same standing, the same laying, the smaller standing and the smaller laying in addition to the normal variant.

7. Shuffle(1...5); this case reflects a more realistic scenario. Indeed, in real life applications it is quite difficult to have more than one variant for each task. Hence, in the present case, it is assumed that the number of versions per task is ∈ [1, 5]. The number and the type of versions are randomly generated for each task, following the uniform distribution.

Figure 6.4: Different combinations of multi-shape tasks for variants of the multi-shape scheduling algorithm

The variants of the multi-shape scheduling algorithm above are compared to the Basic scheduling algorithm, denoted as Basic(1) and described in Chapter 5. The latter algorithm schedules only the normal version of the tasks. Both algorithms use the same aforementioned binary tree based placement strategy. Furthermore, both rely on the basic scheduling scheme, where the ready tasks list is sorted according to the tasks' release times. Therefore, comparing the multi-shape scheduling algorithms with the basic scheduling brings out the improvements gained by the use of multi-shape tasks. Simulations have been conducted on 100 sets of 50 tasks; hence, each final simulation result is an average over 100 simulation results.

6.4.2 Chip Utilization Ratio and Tasks Rejection Ratio

Figure 6.5 shows the simulation results for the chip utilization ratio. The figure compares the variants of the multi-shape scheduling algorithms with the basic scheduling algorithm that uses only single-version tasks. The utilization ratios are also classified according to the different laxity classes. Hence, it can be observed from the graph that:

For tasks of laxity Class A

The reconfigurable array utilization ratio is around 30% (see figure 6.5) for the basic scheduling algorithm and four multi-shape scheduling algorithms, namely Sml_Std(2), Sml_Lay(2), Sml_Std_Lay(3) and Sm_Lay(2). The first three multi-shape algorithms use additional smaller version tasks, while the fourth (Sm_Lay(2)) uses an additional same size laying version. The two main reasons for these similarities in the results are listed and analyzed as follows:

Figure 6.5: Multi-shape scheduling algorithms: simulation results of the utilization ratio, comparison with the Basic scheduling.

(i) In algorithms where each task provides a normal version and additional smaller size versions, the latter versions require a longer execution time on the reconfigurable array to complete. Hence, tasks may rarely meet their deadlines unless their laxities are very long, which is not the case here, where laxity ∈ [1, 10] for tasks of laxity class A. Sml_Std(i), Sml_Lay(i) and Sml_Std_Lay(3) are such multi-shape algorithms, i being the total number of task versions.

(ii) In algorithms like Sm_Lay(i), Sml_Lay(i), etc., where each task provides a normal version
and additional laying versions, the latter versions are likely to be rejected if the area splitting is done vertically. Indeed, vertically (resp. horizontally) split free areas are likely to produce more standing than laying areas (resp. more laying than standing areas). Therefore, standing (resp. laying) areas are more suitable for accommodating standing tasks, whose aspect ratio is ar > 1 (resp. laying tasks, with ar < 1). These results clearly bring out the influence of the underlying placement strategy on the overall quality of the scheduling algorithm. This can also be observed in the results of Sm_Std(2) and Sm_Std_Lay(3), which are similar, making the use of a third task version in Sm_Std_Lay(3) useless, as it does not bring any improvement.

Figure 6.6: Multi-shape scheduling algorithms: simulation results of the tasks rejection ratio, comparison with the Basic scheduling.

Figure 6.6 depicts the simulation results of the tasks rejection ratios. The tasks rejection ratios are almost similar (≈21.5%) for the basic scheduling and the aforementioned four multi-shape scheduling algorithms. The two reasons for these similarities in the results are the same as listed and described above (task versions with longer execution times, or an unsuitable free area splitting strategy).

Conversely, the multi-shape scheduling algorithms that use same size standing versions as additional task versions significantly improve the chip utilization and the tasks rejection ratio for tasks of laxity class A. Sm_Std(2), Sm_Std_Lay(3) and Sm_Sml_Std_Lay(5) are these algorithms. Their simulation results do not differ from each other, as shown in figures 6.5 and 6.6. It can be observed from the graph of figure 6.5 that these 3 multi-shape algorithms raise the chip utilization ratio from ≈30% to ≈37% when compared to the basic scheduling or to any other multi-shape algorithm that does not use a same size standing version as an additional version of the tasks. Figure 6.6 also shows a significant reduction in tasks rejection ratios, which decrease from ≈21.5% to ≈13.5%. Regardless of the number of task versions used, the Sm_Std(2), Sm_Std_Lay(3) and Sm_Sml_Std_Lay(5) scheduling algorithms provide roughly similar results in terms of chip utilization ratio and tasks rejection ratio. This similarity indicates that the improvements are exclusively brought by the use of the same size standing version, in addition to the normal version. Therefore, the smaller size versions and same size laying versions that are provided to the Sm_Std_Lay(3) and Sm_Sml_Std_Lay(5) algorithms are useless here, as they bear the two drawbacks listed and described above (a very long processing time and/or an unsuitable area splitting strategy).

For tasks of laxity Class B

The improvements are better, and follow the same trend as the results for the laxity class A above. Hence, using a laying version (same or smaller) does not improve the results, because of the above-mentioned splitting strategy. Once again, the best results are obtained while using a same size standing version as the additional version. This can be observed in figures 6.5 and 6.6, where the Sm_Std(2) algorithm increases the chip utilization ratio from ≈33% to ≈40.6% and decreases the tasks rejection ratio from ≈12% to ≈6% when compared with the basic scheduling algorithm. This corresponds to an improvement of 50% and 23% for the tasks rejection ratio and the chip utilization ratio respectively.
In other words, the Sm_Std(2) algorithm, which uses only two versions per task (normal plus same standing), rejects 50% fewer tasks and loses 23% less reconfigurable resources, compared with the basic scheduling. As the laxity class B is bigger here (laxity ∈ [11, 50]), the tasks deadlines are longer accordingly. Therefore, there is more opportunity for placing the smaller versions of tasks, even if they require longer processing times. Tasks with a smaller but standing version slightly improve their rejection ratio, thanks to their laxity and to their aspect ratio. For example, the Sml_Std(2) algorithm records a decrease (≈20%) in tasks rejection ratio compared with the basic scheduling, for a roughly similar chip utilization ratio (≈32%).

The impact of the area management strategy on the multi-shape scheduling algorithms is even more obvious here. For example, the basic scheduling algorithm that uses single-version tasks outperforms the Sml_Lay(2) algorithm in terms of chip utilization ratio and tasks rejection ratio. The reason is that, as the laxity class B is longer, the Sml_Lay(2) algorithm may fit the smaller version of some tasks on the reconfigurable array instead of the normal version. But these smaller versions require longer processing times. Therefore, on one hand, placing small size tasks lowers the chip utilization ratio³ and leads to a more fragmented chip. On the other hand, as the smaller versions of tasks require more time to process on the reconfigurable array, they prevent the scheduler from placing newly arriving or waiting tasks. Furthermore, as the free areas are vertically split in the present case, they are mostly standing areas, and are less suitable for accommodating laying tasks.

For tasks of laxity Class C

Apart from some special observations, the comments made above on the simulation results for tasks sets of laxity class B are also valid for tasks sets of laxity class C. As depicted by figures 6.5 and 6.6, there are major improvements in chip utilization ratio for the multi-shape scheduling algorithms that are provided with same size standing and/or smaller size standing versions of the hardware tasks. For example, compared to the basic scheduling algorithm, the Sm_Std(2) algorithm decreases the tasks rejection ratio by 75% and increases the utilization ratio by 28%. Hence, as the Sm_Std(2) algorithm achieves a 1.33% tasks rejection ratio, it accepts almost all the tasks. These improvements are gained only by using two versions per task. One can observe that, regardless of the number of versions per task, no other multi-shape algorithm outperforms the Sm_Std(2) algorithm. As previously stated, improving the scheduling through a multi-shape tasks approach is not a matter of the number of versions or shapes per task, but essentially a matter of the size and aspect ratio of the additional version(s).

Figure 6.7 illustrates, on a single page, an overview of the simulation results of the chip utilization ratio and the tasks rejection ratio, along with the resulting differential quality metric URqm. The two first graphs of figure 6.7 replicate the utilization ratio and the tasks rejection ratio graphs of figures 6.5 and 6.6 respectively, while the bottom graph represents the differential quality metric URqm. The latter metric is simple, but allows us to have a global and concurrent interpretation of both the chip utilization ratio and the tasks rejection ratio.
The URqm confirms that Sm_Std(2) is the best multi-shape algorithm, thanks to the size and the shape of its additional version. As shown by the graph, it is not really worth using more than two versions per task, provided the additional version takes the underlying placement strategy into account.

Figure 6.7: Utilization ratio, task rejection ratio and differential quality metric URqm (with a weighting parameter of 0.5): comparative results for basic scheduling and multi-shape scheduling algorithms.

The Shuffle(1...5) scheduling algorithm was described as bearing a more realistic case, where hardware tasks can randomly have from 0 up to 4 additional versions per task. Zooming in on the simulation results of the Shuffle(1...5) algorithm therefore gives a more realistic prediction of the improvements achievable by a multi-shape algorithm. The Shuffle(1...5) algorithm reflects a scenario where, prior to design, all available versions of the hardware tasks or IPs required by the application are collected; each task may therefore come with one or several versions. According to the results graphs depicted in figure 6.7, Shuffle(1...5) decreases the task rejection ratio by 16% for tasks of laxity class A, 27% for laxity class B, and 37% for laxity class C compared to the basic scheduling. At the same time, the algorithm improves the chip utilization ratio by 10% for laxity classes A and B, and 7% for laxity class C. It is noticeable that increasing the chip utilization ratio is more difficult than improving the task rejection ratio. Indeed, a very high (near 100%) chip utilization ratio is rarely achievable because of the intrinsic fragmentation problem.

6.4.3 Makespan and Runtime Overheads

Comparing algorithms through their makespan is meaningful only if they reject (almost) no tasks. In that case, an algorithm outperforms another in terms of makespan only if it takes less time to run the same set of tasks to completion; an algorithm that rejects many tasks is otherwise likely to show an artificially short makespan. The simulation results for tasks of laxity class C depicted in figure 6.7 and commented on above come close to such a case. Indeed, according to these results, the task rejection ratio decreases to 5.33% for basic scheduling and to 1.33% for Sm_Std(2). Such low task rejection ratios make a comparison of the scheduling algorithms from the makespan perspective meaningful. The makespan is therefore analyzed only on the simulation results for tasks of laxity class C.

Figure 6.8 compares the scheduling algorithms through their makespan. Once again, these results show that the multi-shape algorithms that use same-size task versions achieve a better makespan than basic scheduling. For example, Sm_Std(2) requires an average of 190 time units to run a task set to completion, while basic scheduling requires 204 time units. Conversely, the highest makespans are recorded by the Sml_Std(2) and Sml_Std_Lay(3) algorithms, which use smaller-size task versions; these versions lengthen the makespan because of their longer execution times. It can also be observed that using additional laying task versions does not improve the results, as their aspect ratios do not suit vertically split free areas. Therefore, Sm_Std_Lay(3) provides the same result as Sm_Std(2), and Sml_Std_Lay(3) the same as Sml_Std(2), regardless of the number of versions per task.

Figure 6.8: The average makespan: comparative results for multi-shape and basic scheduling algorithms.
Detailed simulation results related to the runtime overheads are presented in the three figures on page 215. The top graph of figure 6.9 depicts the average runtime overheads of the different multi-shape algorithms compared with the basic scheduling algorithm. For tasks of laxity class A, the runtime overheads range from 37 µs to 47.5 µs, the basic scheduling algorithm achieving the lowest one. Intuitively, the runtime overhead is longer for multi-shape scheduling algorithms: it takes more time to search for a version that can fit on the reconfigurable array. Therefore, the more versions per task, the longer the runtime overhead. However, as shown by the results for laxity class A, the rule is not that simple. The runtime overhead of each scheduler call also depends on the success or failure of the current placement attempt. Each successful placement triggers operations that can be time consuming. In the present case, the placer uses the hash matrix presented in Walder et al. (2003) and briefly discussed in section 3.8.5, page 118. The matrix requires a long update at each task placement on, or withdrawal from, the reconfigurable array. Hence, the more successful the task placements, the more matrix updates are required, and these updates are time greedy. More generally, any data structure that holds the state of the reconfigurable array (e.g. a binary tree) needs to be updated at each task placement or withdrawal.

Figure 6.9: Multi-shape scheduling algorithms: simulation results of the scheduling runtime overhead, with basic scheduling as the reference.

The impact of the update process on the runtime overhead can be seen in the results (top graph of figure 6.9) of the multi-shape algorithms that use at least a same-size standing version per task. It was noticed earlier that these algorithms (e.g. Sm_Std(2), Sm_Std_Lay(3) and Sm_Sml_Std_Lay(5)) reject fewer tasks in general, which means that they successfully place more tasks. Therefore, for tasks of laxity class A, as the number of updates grows, their runtime overheads are higher (see the top graph of figure 6.9).

However, even though the Sml_Std_Lay(3) algorithm rejects more tasks than the three above-mentioned algorithms, it also records a high runtime overhead (see the top of figure 6.9). The reason for this is the time spent by the algorithm in trying to place the smaller-size version of the task at hand. This effect is even more visible in the results for tasks of laxity classes B and C. The two highest runtime overheads are due to the multi-shape scheduling algorithms whose extra task versions are exclusively smaller-size versions. In the latter case, the algorithms always determine whether the smaller-size version of a task can be used without violating its deadline. Recall that smaller-size versions of tasks require longer execution times.
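As a rough sketch of that feasibility test (the names and the handling of reconfiguration time are illustrative, not the thesis's actual interface), the check amounts to:

    // Can this (smaller, hence slower) version still meet the task's
    // deadline if placed now? Reconfiguration time is folded into
    // exec_time for simplicity in this sketch.
    struct TaskVersion {
        int width, height;   // footprint on the reconfigurable array
        double exec_time;    // longer for smaller-size versions
    };

    bool meets_deadline(double now, const TaskVersion& v, double deadline) {
        return now + v.exec_time <= deadline;
    }

The test itself is cheap; the cost comes from running it, together with a placement attempt, for every additional version of every task.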
This extra effort in verifying deadline violations significantly increases the runtime overheads of the Sml_Std(2) and Sml_Std_Lay(3) scheduling algorithms, as highlighted in the top graph of figure 6.9, especially for tasks of laxity classes B and C.

At this point, based on the analysis of the results in the top graph of figure 6.9, two main factors that influence the runtime overheads of the scheduling algorithms, and that confirm the preliminary observations made in Chapters 3 and 4, can be identified:

(i). the search for an accommodating area for a task: it can be observed in the above results that the more versions per task, the longer the runtime. This is especially the case when the additional versions are smaller-size versions, which require an extra effort to determine whether they would violate the deadline. This results in the highest runtime overheads for Sml_Std(2) and Sml_Std_Lay(3) for tasks of laxity classes B and C;

(ii). the update of the data structure after a task placement or removal: if time consuming, it penalizes the algorithms that achieve the best task rejection ratios. Sm_Std(2), Sm_Std_Lay(3) and Sm_Sml_Std_Lay(5) are good examples, especially in laxity class A.

The graph in the middle of figure 6.9 depicts the average number of calls to the scheduling function. Intuitively, as the scheduler is invoked at each task arrival and each task ending, the algorithms that successfully place more tasks induce more scheduler invocations. This intuition is confirmed by the simulation results for tasks of laxity class A. Indeed, as shown in the graph, the multi-shape algorithms that provide the better task rejection ratios (Sm_Std(2), Sm_Std_Lay(3) and Sm_Sml_Std_Lay(5)) are those with the higher numbers of scheduler invocations. The graph also shows that the relationship between the task rejection ratio and the number of calls to the scheduler extends to tasks of laxity classes B and C. In the latter cases, one can observe two things. On one hand, regardless of the scheduling algorithm, there is a general increase in the number of calls to the scheduler, due to the decrease in the task rejection ratios. On the other hand, contrary to laxity class A, the multi-shape algorithms that use smaller-size task versions exclusively as additional versions also record a high number of scheduler calls. For example, in the case of the Sml_Std(2) and Sml_Std_Lay(3) algorithms, the number of calls to the scheduler is among the highest for laxity class B, and especially for laxity class C. At first sight, these results seem to contradict the primary intuition, as the Sml_Std(2) and Sml_Std_Lay(3) algorithms reject more tasks than the three algorithms first mentioned. However, they do not invalidate the intuition: the bigger the laxity, the more attempts the scheduler makes to place the ready or waiting tasks, and the number of scheduler invocations increases accordingly.

The graph at the bottom of figure 6.9 gives an accurate picture of the runtime overhead of each scheduling algorithm. It accumulates the total amount of time devoted to the scheduling algorithm itself by the microprocessor running it, and thus amounts to the product of the first two graphs (the average runtime overhead and the average number of calls). Considering this cumulative sum of the runtime overheads does not invert the trends commented on above.
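In symbols (with ad hoc notation):

\[
T_{\text{cum}} \;\approx\; \bar{t}_{\text{call}} \times N_{\text{calls}},
\]

where \(\bar{t}_{\text{call}}\) is the average runtime overhead per scheduler call (top graph) and \(N_{\text{calls}}\) the average number of scheduler invocations (middle graph).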
To summarize, the lower the task rejection ratio, the higher the cumulative runtime overheads. When compared with the basic scheduling algorithm, a multi-shape algorithm like Sm_Std(2) decreases the task rejection ratio from 21.5% to 13.5% and raises the reconfigurable array utilization ratio from 30% to 37%, at the cost of a higher runtime overhead, which increases from 39 µs to 46 µs. Especially for laxity classes B and C, one can observe that the two multi-shape algorithms whose additional task versions are exclusively smaller versions (Sml_Std(2) and Sml_Std_Lay(3)) record the highest cumulative runtime overheads; these overheads culminate at 62% and 60% respectively for laxity class A. This observation is important to point out, because these algorithms have also been shown earlier to behave poorly in terms of chip utilization ratio and task rejection ratio.

Once again, the Sm_Std(2) scheduling algorithm provides the best trade-off when considering all the metrics assessed above, together with the additional effort required to generate two versions of identical size per hardware task at design time. Indeed, the algorithm assumes that each task has two versions, the second being of identical size to the normal version, but with a rectangular shape that suits the underlying area partitioning strategy used by the scheduler. Regarding Shuffle(1...5), the algorithm assumes a more realistic scenario where the number and the sizes of the versions are randomly generated. Shuffle(1...5) brings a significant improvement, which demonstrates the usefulness of the multi-shape approach.

6.4.4 Conclusion on multi-shape scheduling

The above results show that the multi-shape task approach can significantly improve the scheduling and placement quality. This improvement can be obtained with as few as two versions per task, provided the size and the aspect ratio of the second version are chosen to match the area management strategy in use. Thus, the simulation results have shown that the multi-shape approach is not only about the number of versions per task, but above all a question of trade-off between the number of versions per task, the area partitioning strategy, and the task laxity. For example, generating smaller-size extra version(s) for low-laxity tasks is not recommended, as it does not bring any improvement. However, same-size standing versions are recommended when the free areas are vertically split, since vertically split areas fit standing tasks best. All the improvements brought by multi-shape scheduling algorithms come at the cost of acceptable runtime overheads, as shown by the results. Therefore, it is worth using multi-shape scheduling in an online real-time context, even if additional effort is required at design time to generate several versions per hardware task.

6.5 Horizon Looking-Ahead Scheduling Algorithms

Hereinafter are presented preliminary simulation results of the looking-ahead scheduling algorithms that were presented in the previous chapter. EAAF and SFAF are the first two algorithms. They use the ternary tree structure proposed in Chapter 5, section 5.4.1, as an enabling structure for looking-ahead scheduling; both therefore rely on a 2D placement strategy. The third algorithm is the 1D variable slots looking-ahead scheduling, also described in the previous chapter, section 5.3.2.
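Before turning to the results, a rough C++ sketch of what a node of such a ternary tree might carry is given below. The field names and the exact meaning of the three children are assumptions of this sketch; the authoritative definition of the structure is in Chapter 5, section 5.4.1.

    // Sketch of a ternary-tree node for 2D looking-ahead area management:
    // each node describes a rectangular region of the array together with
    // the time from which it is (or becomes) free, so that present and
    // future states of the array can be kept in the same structure.
    struct AreaNode {
        int x, y, w, h;           // rectangular region of the array
        double free_from;         // earliest time at which the region is free
        AreaNode* child[3] {};    // sub-regions created when a task is placed here
    };

The key point, whatever the exact layout, is that time information is stored alongside the area information, which is what enables a horizon scheduler to reserve a currently occupied region for a task starting later.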
6.5.1 Horizon Looking-Ahead Scheduling using a Ternary Tree

A ternary tree suited to the management of the reconfigurable array for looking-ahead scheduling algorithms has been discussed in the previous chapter, section 5.4.1. Two variants of horizon looking-ahead scheduling, denoted horizon-EAAF and horizon-SFAF, were detailed there. These algorithms are compared with two without-looking-ahead scheduling algorithms: the basic scheduling and the EDF scheduling, both described in Chapter 5. Basic scheduling tries to place the tasks on a first-come-first-served basis; tasks in the ready list are sorted accordingly.

The simulations have been conducted with 20 sets of 50 aperiodic tasks with arbitrary arrival times. A vertical split partitioning strategy has also been used, as shown in figure 5.4, page 190. The other parameters were uniformly distributed within the following intervals:

- FPGA size: width = 96 and height = 64.
- Task size ∈ [50; 1500] CLBs.
- Task aspect ratio ∈ [1/5; 5].
- Laxity ∈ [11; 50], which corresponds to laxity class B.
- Execution time ∈ [5; 100].

The results are grouped in table 6.2.

Algorithms      Uav(%)  Umax(%)  Rjav(%)  Rdav  Scav  mkav
Basic           30.6    62       27.2     34    64    28.2
EDF             31.6    60.8     23.8     31.8  66    33
horizon-EAAF    30.5    55.5     24.6     0     60    33
horizon-SFAF    32.2    52.5     23.5     0     63    33

Uav: average FPGA utilization ratio. Umax: maximum FPGA utilization ratio = max(Uav). Rjav: task rejection ratio. Rdav: task rejection delay = trej - ai, where trej is the task rejection time and ai the task arrival time. Scav: number of scheduler calls. mkav: makespan.

Table 6.2: Simulation results for looking-ahead scheduling using a ternary tree: comparison with basic scheduling and EDF scheduling.

Chip average utilization ratio Uav and task rejection ratio Rjav

One can observe that the average utilization ratio Uav is around 30%, no matter which scheduling algorithm (looking-ahead or not) is used. This result has already been found earlier for without-looking-ahead scheduling, when presenting the simulation results of the task-parameter-based scheduling algorithms. It can be concluded that the chip utilization ratio depends highly on the placement strategy used; the multi-shape approach presented a way of improving Uav. The worth of horizon looking-ahead scheduling algorithms therefore does not lie in the reconfigurable array utilization ratio. Indeed, horizon (and even stuffing) scheduling accepts or rejects each task as soon as it is released. As a task cannot be rescheduled once accepted and planned, it can prevent the scheduling algorithm from finding a better schedule for the future. Consequently, a looking-ahead scheduling cannot significantly outperform a without-looking-ahead scheduling that uses the same underlying placement strategy. The same similarities can be observed on the average task rejection ratios Rjav, for the same reasons. However, SFAF slightly outperforms the other algorithms in terms of placement quality (Uav and Rjav). This is emphasized by the differential quality metric depicted in figure 6.10.

Figure 6.10: Differential quality metrics for horizon-EAAF, horizon-SFAF, Basic and EDF scheduling algorithms.

Figure 6.11: Rejection delay for horizon-EAAF, horizon-SFAF, Basic and EDF scheduling algorithms.
Rejection delay Rdav

The real contribution of looking-ahead scheduling algorithms lies in the rejection delay Rdav. As shown in the table, the horizon-EAAF and horizon-SFAF algorithms immediately reject the tasks that cannot fit in the reconfigurable array (Rdav = 0), while the without-looking-ahead schedulings (Basic and EDF) reject them with a delay equal to their laxity (after about 30 time units). Remember that a real-time task rejected that late prevents the scheduler from finding any other resource that could execute the task. The graph in figure 6.11 highlights the rejection delay.

Conclusion

The above results have shown which improvements are really brought by a looking-ahead scheduling approach. As only the horizon looking-ahead scheduling algorithms were used here, these results could be further improved by a stuffing approach. A stuffing looking-ahead scheduling schedules tasks even on the currently unused parts of reserved areas. For example, as indicated in figure 5.4 (c'), page 190, area nodes 5X2 and 2X6 can be used by other tasks in the time interval [3; 8] without affecting the reservation made at node 7X6 for task T3, which is to start at time t = 8. This is what the stuffing algorithms do.

The runtime overheads of the horizon-EAAF and horizon-SFAF scheduling algorithms have not been measured, contrary to the algorithms previously studied. As stated in Chapter 4, looking-ahead schedulings are far more time consuming than without-looking-ahead schedulings. However, on one hand, the placement strategy used in the present case is simple, as it relies on the ternary tree proposed in this thesis; on the other hand, one can observe in table 6.2 that the number of scheduler invocations is lower for the looking-ahead scheduling algorithms.

6.5.2 1D Variable Slots Looking-Ahead Scheduling

This section presents the simulation results of a 1D variable-size slots based looking-ahead scheduling, the 1D-VSSH. Details on the 1D variable-size slots horizon (1D-VSSH) scheduling algorithm can be found in Chapter 5, section 5.3.2. The simulation parameters are identical to those used for the algorithm discussed in the previous section, in terms of size of the reconfigurable array, size of tasks, number of task sets, arrival times, execution times, etc. However, the simulations were conducted only on tasks of laxity class B, where the laxity l ∈ [11; 50]. In 1D variable-size slots looking-ahead scheduling, the number of slots depends on the size of the arriving tasks; in the following simulations, however, the maximum number of slots was limited to nmax = 6. In addition, as the width of a slot is determined by the width of the first task placed in the slot, a width ratio wr = task_width / slot_width has been defined. Therefore, when placing the first task in a slot that is empty and wider than the task, an extra slot is generated if and only if:

    slot_width >= FPGA_width / nmax  and  wr <= 1/4        (6.3)

The 1D variable-size slots horizon scheduling (1D-VSSH) is compared to the 1D horizon scheduling and the 2D horizon scheduling in terms of task rejection ratio and reconfigurable array utilization ratio. These results are enough to draw the major conclusions on 1D variable-size slots horizon scheduling. As discussed in the previous chapter when presenting the algorithm, combining a low-complexity (e.g. 1D-like) placement strategy with a looking-ahead scheduling approach is beneficial in terms of runtime overheads, especially when it does not increase the task rejection ratio and does not decrease the reconfigurable array utilization ratio.
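As a minimal sketch of that test (assuming the inequality directions recovered in equation (6.3) above; the names are illustrative):

    // Decide whether placing the first task in an empty slot should split
    // off an extra slot, per the condition reconstructed as (6.3).
    bool generates_extra_slot(double slot_width, double task_width,
                              double fpga_width, int n_max) {
        double wr = task_width / slot_width;   // width ratio defined above
        return slot_width >= fpga_width / n_max && wr <= 0.25;
    }

In words: the slot must be at least as wide as the minimum width implied by the slot cap nmax, and the incoming task must occupy at most a quarter of the slot's width for a split to be worthwhile.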
Figure 6.12 depicts the simulation results of the task rejection ratio, the utilization ratio and the differential quality metric.

Figure 6.12: Task rejection ratio, reconfigurable array utilization ratio and differential quality metric for the proposed 1D variable slots horizon scheduling, compared to the 1D and 2D horizon scheduling from Steiger et al. (2004).

One can first observe that using a pure 1D placement strategy gives the worst results seen so far with tasks of laxity class B. The 1D horizon scheduling rejects more than 50% of the tasks (51.2% exactly), where the 2D horizon scheduling rejects 30%. The reconfigurable array utilization ratio does not exceed 22%, and the differential quality metric is deeply negative (around -30%). The differential quality metric was defined earlier as a way of expressing, in a single figure, a placement metric that takes into account both the task rejection ratio and the reconfigurable array utilization ratio. This negative differential quality metric shows that a simple 1D placement strategy leads to a poor placement quality. All the task-parameter-based scheduling algorithms presented earlier would lead to broadly similar results if they used a 1D placement strategy. Indeed, when discussing the previous results, special attention was paid to the fact that, when they use the same placement algorithm, a looking-ahead scheduling approach does not perform significantly better than a task-parameter-based scheduling algorithm in terms of task rejection ratio and array utilization ratio. However, a looking-ahead approach takes rapid scheduling decisions, a feature especially suitable for online real-time scheduling.

Interesting results are obtained when a 1D placement is combined with a slots-based area management. For example, the 2D horizon scheduling and the 1D variable slots horizon (1D-VSSH) scheduling achieve a broadly similar placement quality: they reject respectively 30% and 28.6% of the tasks, and both occupy 30% of the reconfigurable array. Furthermore, the 1D variable slots horizon is the better scheduling algorithm, as its differential quality metric is higher. This result points out how beneficial the use of a slots-based placer is within a looking-ahead scheduling strategy. The runtime overhead simulations were not performed for this algorithm.

6.6 Conclusion of the Chapter

This chapter has presented the results of the numerous simulations that were conducted. It first presented the simulation environment along with the parameters of the task sets. The output data (metrics) resulting from the simulations were then collected, analyzed and compared across the different scheduling algorithms. The results showed the great influence of the underlying placement strategy on scheduling algorithms. Thus, whenever possible, two scheduling algorithms were compared only if they relied on the same placement strategy. Regardless of the considered metric, the suitability of the multi-shape scheduling approach for online real-time scheduling has been clearly established.
Regarding the looking-ahead scheduling approach, the results also showed that when the looking-ahead scheduling relies on a 1D placement strategy combined with a 1D slots-based area management, it provides a broadly similar, if not better, performance in terms of task rejection ratio and array utilization ratio, compared to a looking-ahead scheduling that uses a 2D area management. In addition, the so-called 1D variable slots horizon looking-ahead scheduling may have a far lower runtime overhead.

Additional simulation results are presented at the end of the thesis, in Appendix B (page 248), Appendix C (page 256) and Appendix D (page 260). In these simulation results, as many scheduling algorithms as possible are displayed on a single graph, giving more opportunity to compare their performance. Hence, such a graph can be a rich source of information on the behavior of scheduling and placement algorithms, along with their influence on the different scheduling and placement metrics that are meaningful for real-time multitasking on reconfigurable hardware devices. The next chapter concludes the thesis and discusses future work.

Chapter 7

Conclusion and Future Work

7.1 Discussions

Embedded electronic devices have become part of everyday life. From DVB (Digital Video Broadcasting) to hand-held devices, the technology is becoming ubiquitous, and the requirements of embedded applications in terms of computational power are skyrocketing. Moreover, these requirements come on top of stringent constraints such as cost, size, power consumption, high data rates, shorter life cycles, etc. Nowadays, thanks to improvements in semiconductor technology, entire systems can be compressed onto a sliver of silicon smaller than a penny. The so-called System-on-a-Chip has become the solution, as it combines the key features mentioned above. However, on one hand, this advance in semiconductor technology is very costly, resulting in increasingly high non-recurring engineering (NRE) costs. This thesis has discussed how using dynamically reconfigurable hardware devices can lower the NRE. On the other hand, SoC complexity constantly increases, as SoCs integrate heterogeneous components (including reconfigurable parts) in order to provide the required computational power. Thus, as SoC complexity increases, the design process needs to undergo significant transformation.

The original objective of the thesis was to address the problem of scheduling online real-time hardware tasks on the partially and dynamically reconfigurable part of a SoC, and subsequently to build a library of scheduling/placement algorithms for an RTOS-driven Reconfigurable-SoC design space exploration. Combining these two objectives makes this thesis unique, as the investigated issues intervene at two stages:

(1) After the design stage (at runtime), where scheduling and placement algorithms that enable online real-time scheduling of hardware tasks on the reconfigurable part of a SoC were investigated. These online real-time scheduling and placement algorithms were required to provide a reasonable trade-off between their time complexity and their performance in terms of chip utilization ratio, task rejection ratio, etc.

(2) Prior to and during the design stage (offline), where scheduling and placement algorithms dedicated to the reconfigurable part of a Reconfigurable SoC are required while exploring the design space of the SoC.
This corresponds to the part of this work which consisted of implementing, assessing and classifying as many scheduling and placement algorithms as possible, and subsequently integrating them into a library. For example, in the design methodology presented in Chapter 2 and depicted in figure 2.19 (page 60), using such a library of algorithms at system level can help to perform a more accurate partitioning of the application, and thus to refine the architecture of the Reconfigurable SoC with respect to the system and application constraints.

7.2 Key Contributions

7.2.1 Algorithms for Online Real-time Scheduling/Placement on DPRHWs

Scheduling algorithms can be classified into two families, namely looking-ahead scheduling and without-looking-ahead scheduling. The two families differ in the way the availability of areas on the reconfigurable array is expressed, and in the way the algorithms check these available areas and assign them to tasks. To put it simply, a without-looking-ahead scheduling keeps a task as long as it can still meet its deadline, and attempts to place it; conversely, a looking-ahead scheduling decides, as a task arrives, whether it is accepted or rejected. Therefore, the rejection delay is equal to zero in the latter case. Depending on the family, the scheduling algorithm requires different amounts of placement operations.

1. A looking-ahead scheduling algorithm considers the present and future states of the reconfigurable array, to detect the areas that are currently free or that will be free in the future. Thus, the scheduler knows whether a task can fit on the array now or later. For example, an area occupied at the present time can be reserved to accommodate a task that starts later. The significant advantage of the looking-ahead approach over the without-looking-ahead approach is its ability to take very fast scheduling decisions, which is very useful for scheduling online real-time tasks. However, this speediness comes at the cost of numerous area management operations, which mimic the impact of future task endings and startings on the reconfigurable array, in order to properly schedule arriving tasks. Consequently, the placement algorithms must be of low complexity to make the looking-ahead scheduling approach affordable in terms of runtime overhead. The contribution of the thesis on online looking-ahead scheduling is as follows:

(i). The study in Chapter 4, section 4.2.2, demonstrated through cycle-accurate runtime overhead measurements that MERs-based placement strategies are very time consuming and, therefore, are not suited to online real-time looking-ahead scheduling. To the best of our knowledge, such cycle-accurate timing measurements on MERs-based algorithms had not been conducted before on an embedded processor.

(ii). The simulation results detailed and discussed in Chapter 6, section 6.5.1, have shown that when they rely on the same placement strategy, looking-ahead and without-looking-ahead scheduling algorithms provide broadly similar performance in terms of task rejection ratio and reconfigurable chip utilization ratio. This finding suggests that looking-ahead scheduling is a worthy approach mainly in cases where rapid scheduling decisions are required (e.g. online real-time scheduling on heterogeneous platforms that provide more than one implementation alternative). Indeed, equation 3.13 (page 99) shows that with looking-ahead scheduling, a task that cannot fit on the reconfigurable array is rejected immediately, allowing for alternative implementation solutions.
(iii). A new metric, denoted the differential quality metric URqm, was introduced. This metric better reflects the difference between two scheduling algorithms that seem at first glance identical in terms of utilization ratio and task rejection ratio.

(iv). A ternary tree structure was proposed and developed. The tree was inspired by the binary tree structure presented in Bazargan et al. (2000) and improved in Walder et al. (2003). The originality of the proposed ternary tree structure is its suitability for keeping the information on the present and future states of the reconfigurable array. Thus, the tree eases a 2D horizon looking-ahead scheduling approach by providing good visibility of the future states of the array. Additionally, two variants of 2D horizon looking-ahead scheduling that use the tree were proposed, denoted the horizon-SFAF and horizon-EAAF scheduling algorithms. As shown by the simulation results in section 6.5.1, page 219, the horizon-SFAF scheduling algorithm performs better than the horizon-EAAF scheduling and the EDF scheduling in terms of task rejection ratio and chip utilization ratio. The differential quality metric confirms this result.

(v). The 1D variable-size slots horizon (1D-VSSH) scheduling algorithm has been proposed and studied. The algorithm combines the main advantage of looking-ahead scheduling (its very short task rejection delay, as expressed in equation 3.13, page 99) with the simplicity of a 1D placement strategy. The simulation results discussed in detail in section 6.5.2 show that, by dynamically partitioning the reconfigurable array into slots and by applying a 1D placement in each slot, there is no loss in terms of task rejection ratio and chip utilization ratio compared to a 2D placement. Furthermore, the 1D variable-size slots horizon (1D-VSSH) is similar if not slightly better in performance. This result is important, since it suggests that in a looking-ahead scheduling approach, one can avoid the use of a time-consuming 2D placement strategy without sacrificing placement quality.

2. A without-looking-ahead scheduling algorithm places tasks only on currently available areas. Thus, when a fitting place is found, the task is placed immediately and starts its execution. The contribution of the thesis on online without-looking-ahead scheduling is as follows:

(i). Several task-parameter-based scheduling algorithms were developed. Their simulation results have shown that the performance of a scheduling algorithm, in terms of task rejection ratio and reconfigurable array utilization ratio, highly depends on the quality of the underlying placement strategy. This finding was highlighted by comparing the simulation results of these scheduling algorithms when they used the same placement strategy. According to these results, discussed in the previous chapter (section 6.3), the algorithms do not significantly differ from each other in terms of utilization ratio and task rejection ratio. However, the BSF (biggest size first) algorithm is slightly better, particularly in terms of the differential quality metric, a more sensitive metric.
(ii). The multi-shape scheduling approach was one of the key proposals of this work for solving the issue of scheduling online real-time tasks on partially and dynamically reconfigurable hardware devices. The ability of the proposed approach to significantly improve the placement quality, when only two versions per task are provided and when the area partitioning strategy suits the aspect ratio of the tasks, has been proven. The multi-shape scheduling algorithms provide far better results than any of the other algorithms, without significantly increasing the runtime overheads.

7.2.2 Scheduling/Placement algorithms library for RTOS-driven design space exploration

In addition to proposing new online real-time scheduling algorithms for reconfigurable hardware devices, an important part of this thesis was devoted to the study, implementation and evaluation of existing algorithms from related work. The scheduling algorithms and placement heuristics that have been implemented within the framework of this work are listed in the appendix of this thesis, in table 7.3 (page 256), table 7.4 (page 257), table 7.5 (page 258) and table 7.6 (page 259). The simulation results are also gathered in Appendix B. Gathering the results of so many scheduling algorithms on the same graphs allows the designer to point out the advantages and drawbacks of each algorithm or class of algorithms, and to find suitable trade-offs between the metrics of the algorithms. The relevant metrics are the utilization ratio, the task rejection ratio, the differential utilization metric, the algorithm runtime overheads, the makespan, etc. As discussed in Chapter 2, and shown in figure 2.19, these algorithms can be used to refine the dynamically reconfigurable part of the SoC during the system-level simulation. The algorithms were purposely written in C++ in order to ensure full compatibility with any C++/SystemC based SoC design methodology. They are based on various models of hardware tasks, reconfigurable array, scheduler and placer that are reusable and refinable.

7.3 Hypothesis and Limitations

The original hypotheses of the thesis can be summarized as follows (they are gathered in the task-model sketch after this list):

- Hardware tasks are rectangularly shaped and relocatable, but cannot be rotated.
- The reconfigurable hardware device is partially and dynamically reconfigurable. A hardware task fits in a rectangular space of the reconfigurable device as long as there are enough contiguous free resources to accommodate it.
- The online scheduling paradigm used in this thesis corresponds to the online clairvoyant paradigm.
- The scheduling is said to be real-time because every task in the system is submitted to a deadline constraint. A hardware task that cannot fit on the reconfigurable array and meet its deadline is rejected. However, rejecting a task does not lead to the failure of the application, because it is assumed that there are other implementation alternatives for the task.
- Aperiodic task systems are used. They are likely to better reflect a very dynamic system where the release time of each job is totally unknown beforehand. Task parameters are statistically independent.
- It is assumed that there is a communication medium that allows the tasks to communicate independently of their location on the chip. The thesis does not deal with inter-task communication issues.
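Taken together, these assumptions amount to a task model along the following lines (a rough sketch; the names and units are illustrative, not the thesis's actual classes):

    // Hardware task model implied by the assumptions above: a rectangular,
    // relocatable, non-rotatable footprint, an a-priori unknown release
    // time (aperiodic arrivals), and a hard deadline whose violation leads
    // to rejection rather than application failure.
    struct HwTask {
        int width, height;    // rectangular footprint in CLBs (no rotation)
        double arrival;       // release time, known only upon arrival
        double exec_time;     // execution time once configured
        double deadline;      // absolute deadline; infeasible => rejected
    };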
The aforementioned assumptions and limitations are widely accepted by the research community in Reconfigurable Computing when studying the problem of scheduling and placing hardware tasks on DPRHWs. These assumptions rely on technological advances in devices such as FPGAs, and on research that addresses the other problems of the field (e.g. communication, task relocation, task migration, etc.).

Though many optimal and non-optimal scheduling and placement algorithms have been proposed in related work, their suitability for online real-time scheduling had not been established, especially in an embedded environment. The foundation of this thesis was a methodology which first assessed the limitations of some area management approaches in such an environment. Accordingly, a few scheduling and placement methods have been proposed and discussed, and combinations of scheduling and placement strategies have been suggested. The proposed solutions have been proven to improve the scheduling and placement quality without significantly modifying the algorithms' complexity and runtime overheads. These results confirm the primary hypothesis that resulted from the methodology proposed in this thesis.

7.4 Future Work

Several scheduling algorithms have been studied in Chapter 5. Most of these algorithms have been implemented and simulated, and the simulation results discussed. However, a few have been only partially implemented or partially simulated, and some others have not been implemented.

In this thesis, the n × 1D variable-size slots scheduling relies on an area management strategy that has been used only with a horizon looking-ahead scheduling algorithm. In Chapter 5, three variants of the algorithm, namely First Fit (FF), Next Fit (NF) and Best Fit (BF), have been discussed. Only the BF variant has been combined with the horizon looking-ahead scheduling and simulated. No experiment has been carried out with a without-looking-ahead scheduling. An extension of this research could be to replace the horizon schedulings by stuffing schedulings, in order to assess the improvements that can be obtained.

Multi-shape scheduling algorithms have been proven to improve the scheduling and placement quality. Looking-ahead scheduling algorithms have been proven to be suitable for online real-time scheduling provided they rely on a placer of acceptable complexity. The two scheduling approaches have been assessed separately; combining them should further improve the scheduling quality. On one hand, the multi-shape scheduling would increase the chip utilization ratio and decrease the task rejection ratio, while on the other hand the looking-ahead approach would allow the scheduler to take fast scheduling decisions, which is very useful in an online real-time context.

In this thesis, the complexity of designing multi-shape hardware tasks has not been studied in depth. However, the assumptions made above on hardware tasks are the same for multi-shape hardware tasks (relocatability, etc.). The simulation results have shown that using two versions per task is sufficient to significantly improve the placement quality. Therefore, it can be roughly assumed that the multi-shape scheduling approach requires about twice as much memory as the single-version task scheduling approach, to store the bitstreams.
The multi-shape scheduling algorithms have been proposed in Chapter 5 as a solution for improving the scheduling and placement quality, in terms of task rejection ratio, chip utilization ratio and scheduling algorithm runtime overheads. The same approach could be applied to power-aware embedded systems. The power consumption of a hardware task consists of two main parts: the part due to the power and the time required to configure the task on the reconfigurable array, and the part due to the power dissipated by the hardware task as it runs to completion. In both cases, the power depends on the size of the task, in addition to other parameters. As multi-shape task scheduling assumes that each hardware task can be instantiated in more than one size and/or shape, a power-aware multi-shape scheduling approach would always choose the instantiation that minimizes the power consumption of the chip.

A further possible extension of this work is to consider the preemption of hardware tasks. Competitive analysis was presented in Chapter 3 as the best tool for assessing online scheduling algorithms. The analysis was discussed for software task scheduling on monoprocessor and multiprocessor systems, with the aim of transposing it to hardware task scheduling on reconfigurable devices. This transposition is far from obvious, as scheduling hardware tasks on a reconfigurable array is more complicated than multiprocessor scheduling. Thus, the competitive analysis approach has been used only in the study of the 1D variable slots scheduling with minimum makespan in Chapter 5, section 5.3.3. As stated at the beginning of Chapter 6, for the sake of consistency, an average-case analysis has been used instead while carrying out the experiments. Using competitive analysis in the online scheduling of hardware tasks on dynamically and partially reconfigurable hardware devices remains a niche.

Owing to lack of time, the implementation of a real-life application on an embedded RSoC following the OS-driven RSoC design methodology discussed in Chapter 2 and Chapter 4 has not been carried out in this thesis. The so-called OveRSoC methodology has been proven suitable for exploring the design space of an MPSoC (Miramond et al., 2009a). A case study implementing a mobile robotic vision application on an MPSoC was presented in Verdier et al. (2008). However, the latter case targeted an MPSoC without a dynamically reconfigurable hardware part. The next step in this work could be to include the dynamically reconfigurable hardware device models, along with the placer model, the hardware task models and the scheduling/placement strategies provided by this work, into the methodology. Hence, as depicted in figure 2.19 (page 60), these models and algorithms would be taken into account by the architecture and RTOS services exploration strategy while refining the final architecture for the case study.

Publications

This thesis will be summarized and submitted as a journal paper.

Refereed Conference Papers

1. G. Wassi, G. Lawday, M-E-A. Benkhelifa, F. Verdier, "Online Real-time Scheduling of Multi-shape Hardware Tasks on Partially Reconfigurable FPGAs", in Proceedings of the 22nd National Symposium of the Research Group on Signal and Images Processing, GRETSI 2009, 8-11 September 2009, Dijon, France.

Non-Refereed Workshop/Conference Papers

1. G. Wassi, M-E-A. Benkhelifa, F. Verdier, G. Lawday, "Tree Structure for Online Real-time Scheduling on Partially Reconfigurable FPGAs", in Proceedings of the 4th National Symposium of the Research Group on System-on-Chip & System-in-Package, GDR SoC/SiP 2010, Paris, France.
Lawday, "Tree Structure for Online Real- time Scheduling on Partially Reconfigurable FPGAs", In the proceedings of 4th National Symposium of Research Group on System-on-Chip & System-in-Package, GDR SoC/SiP 010, 2010, Paris, France. 2. Guy Wassi, Geoff Lawday, Amine Benkhelifa, Francois Verdier, "Online Scheduling and Placement of Real-Time Hardware Tasks on FPGAs", In the proceedings of 3rd National Symposium of Research Group on System-on-Chip & System-in-Package, GDR SoC/SiP 009, 2009, Orsay, France. 234 Bibliography Leon Adam. Choosing the right architecture for real-time signal processing designs. White paper spra879, Texas Instruments, Nov 2002. A. Ahmadinia, C. Bobda, and J. Teich. A dynamic scheduling and placement algorithm for reconfigurable hardware. Architecture of Computing Systems (ARCS), pages 125?139, 2004. Elias Ahmed and Jonathan Rose. The effect of lut and cluster size on deep-submicron fpga perfor- mance and density. In FPGA '00: Proceedings of the 2000 ACM/SIGDA eighth international symposium on Field programmable gate arrays, pages 3?12, New York, NY, USA, 2000. ACM. Susanne Albers and Bianca Schr?der. An experimental study of online scheduling algorithms. ACM Journal of Experimental Algorithmics, 7:154, 2002. Altera. www.altera.com. URL http://www.altera.com. Altera. Introducing innovations at 28 nm to move beyond moore's law. Tech- nical report, Altera Corp., 2010a. URL http://www.altera.com/literature/wp/ wp-01125-stxv-28nm-innovation.pdf. Altera. Altera unveils innovations for 28-nm fpgas. Technical report, Altera Corp., 2010b. URL http://www.altera.com/corporate/newsroom/releases/2010/products-/ nr-innovating-at-28-nm.html. P. M. Athanas and H. F. Silverman. Processor reconfiguration through instruction set metamor- phosis. IEEE Computer, 3(26):11?18, 1993. 235 Bibliography I. Bandara and C. Hudson. Detection and tracking of eye blink to identify driver fatigue and napping. In HCI 2006: Engage!, The 20th BCS HCI Group conference in co-operation with ACM. London, UK., 11-15 September 2006. V. Baumgarte, G. Ehlers, F. May, A. Naeckel, M. Vorbach, and M. Weinhardt. PACT XPP?a self-reconfigurable data processing architecture. Journal of Supercomputing, 26(2):167?184, 2003. K. Bazargan, R. Kastner, and M. Sarrafzadeh. Fast template placement for reconfigurable comput- ing systems. In IEEE Design and Test for Computers, volume 17, pages 68?83, Los Alamitos, CA, USA, 2000. IEEE Computer Society Press. John Blyler. Navigating the silicon jungle: FPGA or ASIC ? Chip Design Magazine, June release, 2005. Ivo Bolsens. The future of FPGAs. Victorian Microelectronics Designers Network. Event, Mel- bourne, pages 21?30, Jan 27 2005. Celoxica. Handel-c language. Technical report, Celoxica, 2000. URL www.celoxica.com. K. B. Chehida and M. Auguin. HW / SW Partitioning Approach For Reconfigurable System Design. CASES 2002, 2002. Yuan-Hsiu Chen and Pao-Ann Hsiung. Hardware task scheduling and placement in operating systems for dynamically reconfigurable soc. In EUC, pages 489?498, 2005. E. G. Coffman, M. R. Garey, and D. S. Johnson. Approximation algorithms for bin packing: A survey. In PWS Publishing Company, editor, In D. Hochbaum, Approximation algorithms, 1997. K. Compton, Z. Li, J. Cooley, S. Knol, and S. Hauck. Configuration relocation and defragmenta- tion for run-time reconfigurable computing. IEEE Trans. Very Large Scale Integration (VLSI) Systems, 10(3):209?220, 2002. Jin Cui and Qingxu Deng. 
Jin Cui and Qingxu Deng. An efficient algorithm for online management of 2D area of partially reconfigurable FPGAs. In Proc. Design, Automation and Test in Europe (DATE), pages 129-134, 2007.

K. Danne. Real-Time Multitasking in Embedded Systems Based on Reconfigurable Hardware. PhD thesis, University of Paderborn, Germany, September 2006.

K. Danne and M. Platzner. Partitioned scheduling of periodic real-time tasks onto reconfigurable hardware. In International Parallel and Distributed Processing Symposium (IPDPS'06), Reconfigurable Architecture Workshop (RAW'06), 2006a.

K. Danne and M. Platzner. A heuristic approach to schedule periodic real-time tasks on reconfigurable hardware. In International Conference on Field Programmable Logic and Applications, pages 24-26, 2005.

K. Danne and M. Platzner. An EDF schedulability test for periodic tasks on reconfigurable hardware devices. In Proceedings of the 2006 ACM SIGPLAN/SIGBED Conference on Language, Compilers, and Tool Support for Embedded Systems, pages 93-102. ACM, New York, NY, USA, 2006b.

K. Danne, R. Muhlenbernd, and M. Platzner. Executing hardware tasks on dynamically reconfigurable devices under real-time conditions. In Field Programmable Logic and Applications, 2006. FPL'06. International Conference on, pages 1-6, 2006.

André DeHon. The density advantage of configurable computing. IEEE Computer Society, 33:41-49, 2000.

C. Ebeling, D. C. Cronquist, and P. Franklin. RaPiD: reconfigurable pipelined datapath. Lecture Notes in Computer Science 1142 - Field Programmable Logic: Smart Applications, New Paradigms and Compilers, pages 126-135, 1996.

Jeff Edmonds. Scheduling in the dark. Theoretical Computer Science, 235(1):109-141, 2000.

Fujitsu. White paper: Fujitsu develops new SoC design methodology based on UML and C. Technical report, 2002.

A. Gatherer, S. Sriram, F. Moerman, C. Sengupta, and K. Brown. Cost effective software radio for CDMA systems. In Walter H. W. Tuttlebee, editor, Software Defined Radio: Baseband Technologies for 3G Handsets and Basestations. John Wiley & Sons, England, 2004.

M. Gokhale and J. Stone. NAPA C: compiling for a hybrid RISC/FPGA architecture. In 6th IEEE Symposium on FPGAs for Custom Computing Machines (FCCM'98), 15-17 April 1998, Napa Valley, CA, USA, pages 126 ff. IEEE Computer Society, 1998.

M. B. Gokhale and P. S. Graham. Reconfigurable Computing: Accelerating Computation with Field-Programmable Gate Arrays. Springer, 1st edition, 2006.

Maya B. Gokhale, Janice M. Stone, Jeff Arnold, and Mirek Kalinowski. Stream-oriented FPGA computing in the Streams-C high level language. In Proceedings of the 2000 IEEE Symposium on Field-Programmable Custom Computing Machines, pages 49 ff., Washington, DC, USA, 2000. IEEE Computer Society. URL http://portal.acm.org/citation.cfm?id=795659.795916.

Seth Copen Goldstein, Herman Schmit, Mihai Budiu, Srihari Cadambi, Matt Moe, and R. Reed Taylor. PipeRench: a reconfigurable architecture and compiler. IEEE Computer, 33:70-77, 2000. URL http://www.cs.cmu.edu/~seth/papers/goldstein-ieee00.pdf.

R. L. Graham, E. L. Lawler, J. K. Lenstra, and A. H. G. Rinnooy Kan. Optimization and approximation in deterministic sequencing and scheduling: a survey. In P. L. Hammer, E. L. Johnson, and B. H. Korte, editors, Discrete Optimization II, volume 5 of Annals of Discrete Mathematics, pages 287-326. Elsevier, 1979. URL http://www.sciencedirect.com/science/article/B8G56-4SD21YG-M/2/4b302b1ea464cf17986f7e4642be86a1.
M. Handa and R. Vemuri. Area fragmentation in reconfigurable operating systems. In International Conference on Engineering of Reconfigurable Systems and Architectures (ERSA), pages 77-83, 2004a.

M. Handa and R. Vemuri. An integrated online scheduling and placement methodology. In Field Programmable Logic and Applications (FPL'04), International Conference on, pages 444-453, 2004b.

Manish Handa and Ranga Vemuri. An efficient algorithm for finding empty space for online FPGA placement. In Proceedings of the 41st Annual Design Automation Conference, DAC '04, pages 960-965, New York, NY, USA, 2004c. ACM. URL http://doi.acm.org/10.1145/996566.996820.

R. Hartenstein. Coarse grain reconfigurable architecture (embedded tutorial). In ASP-DAC '01: Proceedings of the 2001 Asia and South Pacific Design Automation Conference, pages 564-570, New York, NY, USA, 2001a. ACM.

R. Hartenstein. A decade of reconfigurable computing: a visionary retrospective. In DATE '01: Proceedings of the Conference on Design, Automation and Test in Europe, pages 642-649, Piscataway, NJ, USA, 2001b. IEEE Press.

Paul Heysters, Gerard Smit, and Egbert Molenkamp. A flexible and energy-efficient coarse-grained reconfigurable architecture for mobile systems. J. Supercomput., 26(3):283-308, 2003.

ImpulseC. ImpulseC language. Technical report, Impulse Accelerated Technologies, www.impulseaccelerated.com. URL http://www.impulseaccelerated.com.

J. L. Tripp, P. A. Jackson, and B. L. Hutchings. Sea Cucumber: a synthesizing compiler for FPGAs. In Proceedings of the 12th International Conference on Field Programmable Logic and Applications, 2002.

Ahmed A. Jerraya. Long term trends for embedded system design. In DSD '04: Proceedings of the Digital System Design, EUROMICRO Systems, pages 20-26, Washington, DC, USA, 2004. IEEE Computer Society.

Chi-Jui Chou, Satish Mohanakrishnan, and Joseph B. Evans. FPGA implementation of digital filters. In Proceedings of the Fourth International Conference on Signal Processing Applications and Technology, pages 80-88, 1993.

A. Kaplan, M. Sarrafzadeh, and R. Kastner. A survey of hardware/software system partitioning. Technical report, 2003.

D. Koch and J. Torresen. Advances and trends in dynamic partial run-time reconfiguration. Department of Informatics, University of Oslo, Norway, 2010.

Markus Koester, Mario Porrmann, and Heiko Kalte. Task placement for heterogeneous reconfigurable architectures. In FPT, pages 43-50, 2005.

Markus Koester, Heiko Kalte, and Mario Porrmann. Relocation and defragmentation for heterogeneous reconfigurable systems. In ERSA, pages 70-76, 2006.

Ian Kuon and Jonathan Rose. Measuring the gap between FPGAs and ASICs. In Proceedings of the 2006 ACM/SIGDA 14th International Symposium on Field Programmable Gate Arrays, FPGA '06, pages 21-30, New York, NY, USA, 2006. ACM. ISBN 1-59593-292-5. doi: http://doi.acm.org/10.1145/1117201.1117205. URL http://doi.acm.org/10.1145/1117201.1117205.

C. L. Liu and James W. Layland. Scheduling algorithms for multiprogramming in a hard-real-time environment. J. ACM, 20:46-61, January 1973. URL http://doi.acm.org/10.1145/321738.321743.

Carey Douglass Locke. Best-effort decision-making for real-time scheduling. PhD thesis, Pittsburgh, PA, USA, 1986. AAI8702895.

Y. Lu, T. Marconi, G. N. Gaydadjiev, K. L. M. Bertels, and R. J. Meeuws. A self-adaptive online task placement algorithm for partially reconfigurable systems. In Proceedings of the 22nd Annual International Parallel & Distributed Processing Symposium (IPDPS 2008) - RAW 2008, page 8, April 2008.
P. Mahr, S. Christgau, C. Haubelt, and C. Bobda. Integrated temporal planning, module selection and placement of tasks for dynamic networks-on-chip. In Proceedings of the 25th IEEE International Parallel & Distributed Processing Symposium, 2011.

T. Marconi, Y. Lu, K. L. M. Bertels, and G. N. Gaydadjiev. Online hardware task scheduling and placement algorithm on partially reconfigurable devices. In Proceedings of the International Workshop on Applied Reconfigurable Computing (ARC), pages 306-311, March 2008.

G. Martin. The history of the SoC revolution. In G. Martin and H. Chang, editors, Winning the SoC Revolution. Kluwer Academic Publishers, USA, 2003.

MathWorks. www.mathworks.com; www.mathworks.com/products/slhdlcoder.

Bingfeng Mei, Serge Vernalde, Diederik Verkest, Hugo De Man, and Rudy Lauwereins. ADRES: an architecture with tightly coupled VLIW processor and coarse-grained reconfigurable matrix. Pages 61-70, 2003.

Uwe Meyer-Bäse, Divya Sunkara, Encarnacion Castillo, and Antonio Garcia. Custom instruction set NIOS-based OFDM processor for FPGAs. Volume 6248, page 62480O. SPIE, 2006.

J.-Y. Mignolet, V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Infrastructure for design and management of relocatable tasks in a heterogeneous reconfigurable system-on-chip. In Proceedings of Design, Automation and Test in Europe 2003 (DATE 03), IEEE Computer Society, pages 986-991, March 2003.

B. Miramond, E. Huck, F. Verdier, Mohamed El Amine Benkhelifa, B. Granado, M. Aichouch, J.-C. Prévotet, D. Chillet, S. Pillement, Th. Lefebvre, and Yaset Oliva. OveRSoC: a framework for the exploration of RTOS for RSoC platforms. International Journal on Reconfigurable Computing, 2009(450607):1-18, Dec 2009a. URL http://publi-etis.ensea.fr/2009/MHVBGAPCPLO09.

B. Miramond, F. Verdier, and M. Aichouch. DOGME: Distributed Operating system Graphical Modeling Environment, 2009b.

J. Noguera and R. M. Badia. Multitasking on reconfigurable architectures: microarchitecture support and dynamic scheduling. 3(2):385-406, 2004.

V. Nollet, P. Coene, D. Verkest, S. Vernalde, and R. Lauwereins. Designing an operating system for a heterogeneous reconfigurable SoC. In Reconfigurable Architectures Workshop (RAW), Proceedings of the International Parallel and Distributed Processing Symposium, Paris, March 2003.

V. Nollet, P. Avasare, D. Verkest, and H. Corporaal. Exploiting hierarchical configuration to improve run-time MPSoC task assignment. In ERSA'06, pages 49-55, 2006.

O. Diessel, H. ElGindy, M. Middendorf, H. Schmeck, and B. Schmidt. Dynamic scheduling of tasks on partially reconfigurable FPGAs. In IEE Proc. on Computers and Digital Techniques, pages 181-188, 2000.

R. J. Petersen. An Assessment of the Suitability of Reconfigurable Systems for Digital Signal Processing. Master's thesis, Brigham Young University, 1995.

Polis. A codesign environment. University of California Berkeley, Center for Electronic System Design, http://embedded.eecs.berkeley.edu/Respep/Research.

Ptolemy. An environment for modelling, simulation, and design of concurrent real-time embedded systems. University of California Berkeley, Center for Electronic System Design, http://embedded.eecs.berkeley.edu/Respep/Research.

Jan M. Rabaey. Wireless beyond the third generation: facing the energy challenge. In ISLPED '01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pages 1-3, New York, NY, USA, 2001. ACM.
In ISLPED '01: Proceedings of the 2001 International Symposium on Low Power Electronics and Design, pages 1–3, New York, NY, USA, 2001. ACM.

Rajeev Motwani, Steven Phillips, and Eric Torng. Non-clairvoyant scheduling. Theoretical Computer Science, 1994.

S. Roman, H. Mecha, D. Mozos, and J. Septién. Partition-based dynamic 2D HW multitasking management. In Proceedings of the 9th EUROMICRO Conference on Digital System Design (DSD'06), 2006.

Jonathan Rose, Abbas El Gamal, and Albert Sangiovanni-Vincentelli. Architecture of field-programmable gate arrays: The effect of logic block functionality on area efficiency. IEEE Journal of Solid-State Circuits, 25:1217–1225, 1990.

P. Schaumont and I. Verbauwhede. ThumbPod puts security under your thumb. In Xilinx Xcell Journal, 2003.

E. Schüler and L. Tan. XPP – An Enabling Technology for SDR Handsets. In Walter H. W. Tuttlebee, editor, Software Defined Radio: Baseband Technologies for 3G Handsets and Basestations. John Wiley & Sons, England, 2004.

Jiří Sgall. On-line scheduling - a survey. In Online Algorithms: The State of the Art, Lecture Notes in Computer Science 1442, 1998.

Satwant Singh, Jonathan Rose, Paul Chow, and David Lewis. The effect of logic block architecture on FPGA performance. IEEE Journal of Solid-State Circuits, 27:281–287, 1992.

D. Sleator and R. E. Tarjan. Amortized efficiency of list update and paging rules. Communications of the ACM, 28:202–208, 1985.

Lodewijk T. Smit, Gerard J.M. Smit, Johann L. Hurink, Hajo Broersma, Daniël Paulusma, and Pascal T. Wolkotte. Run-time mapping of applications to a heterogeneous reconfigurable tiled system on chip architecture. In IEEE International Conference on Field-Programmable Technology, FPT, pages 421–424. IEEE, 2004. URL http://doc.utwente.nl/48501/.

John A. Stankovic. Misconceptions about real-time computing. IEEE Computer, 21(10):10–19, 1988.

C. Steiger, H. Walder, and M. Platzner. Operating systems for reconfigurable embedded platforms: Online scheduling of real-time tasks. IEEE Transactions on Computers, 53(11):1393–1407, 2004.

SystemC. SystemC language. Technical report, www.systemc.org.

SystemVerilog. SystemVerilog, a unified hardware description and verification language (HDVL) standard. Technical report, www.systemverilog.org.

J. Tabero, J. Septién, H. Mecha, and D. Mozos. A low fragmentation heuristic for task placement in 2D RTR HW management. In International Conference on Field Programmable Logic and Applications (FPL'04), pages 241–250, 2004.

J. Tabero, J. Septién, H. Mecha, and D. Mozos. Task placement heuristic based on 3D-adjacency and look-ahead in reconfigurable systems. In Asia and South Pacific Design Automation Conference (ASP-DAC), pages 396–401, 2006.

Michael Bedford Taylor, Jason Kim, Jason Miller, David Wentzlaff, Fae Ghodrat, Ben Greenwald, Henry Hoffman, Paul Johnson, Jae-Wook Lee, Walter Lee, Albert Ma, Arvind Saraf, Mark Seneski, Nathan Shnidman, Volker Strumpen, Matthew Frank, Saman Amarasinghe, and Anant Agarwal. The RAW microprocessor: A computational fabric for software circuits and general-purpose programs, 2002.

Tensilica. www.tensilica.com.

R. Tessier and W. Burleson. Reconfigurable computing for digital signal processing: A survey. Journal of VLSI Signal Processing, 3(28):7–27, 2001.

Jan C. Van Der Veen, Sándor P. Fekete, Mateusz Majer, Ali Ahmadinia, Christophe Bobda, Frank Hannig, and Jürgen Teich.
Defragmenting the module layout of a partially reconfigurable device. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA), Las Vegas, pages 92–101. CSREA Press, 2005.

F. Verdier, B. Miramond, M. Maillard, E. Huck, and Th. Lefebvre. Using high-level RTOS models for HW/SW embedded architecture exploration: Case study on mobile robotic vision. EURASIP Journal on Embedded Systems, Special issue on Design and Architectures for Signal and Image Processing, 2008. URL http://publi-etis.ensea.fr/2008/VMMHL08.

A.S. Vincentelli and Grant Martin. Platform-based design and software design methodology for embedded systems. IEEE Design and Test of Computers, 18:23–33, 2001.

H. Walder and M. Platzner. Non-preemptive multitasking on FPGAs: Task placement and footprint transform. In Proceedings of the International Conference on Engineering of Reconfigurable Systems and Algorithms (ERSA'02), pages 24–30, 2002.

H. Walder, C. Steiger, and M. Platzner. Fast online task placement on FPGAs: free space partitioning and 2D-hashing. In Parallel and Distributed Processing Symposium, Proceedings International, page 8, 2003.

Xilinx. www.xilinx.com.

Xilinx. Two flows for partial reconfiguration: Module based or difference based. Technical report, Xilinx Inc., 2004. URL www.xilinx.com.

Xilinx. Distributed Arithmetic FIR Filter V9.0. Xilinx Inc., San Jose, California, April 2005. URL http://www.xilinx.com/ipcenter/catalog/logicore/docs/da_fir.pdf.

Xilinx. Company reports.

Appendix

7.5 Appendix A : Table classifying related work on scheduling and placement strategies

Bazargan et al. (2000), KAMER (MERs-based).
Strengths: Optimal placement (2D).
Drawbacks: Very high complexity O(n^2), slow.
Comments: Static, offline placement. Used as a comparison reference. Without-looking-ahead scheduling.

Bazargan et al. (2000), Nonoverlapping.
Strengths: Faster than KAMER, complexity O(n).
Drawbacks: Quality lost (fragmentation, tasks rejection).
Comments: Static, online; BF, FF, BL. Partial reconfiguration (higher fragmentation and tasks rejection). Without-looking-ahead scheduling.

Walder et al. (2003), OTF & EOTF.
Strengths: Hash matrix; O(1) complexity for area search.
Drawbacks: Very long update process, up to O(w.h), where w and h are the width and height of the array.
Comments: OTF and Enhanced OTF partitioning; enhanced version of Bazargan et al. (2000)'s partitioner. Static, online/offline placement (BF, FF, WF); up to 70% better than Bazargan et al. (2000) (nonoverlapping).

Ahmadinia et al. (2004).
Strengths: Clusters based; O(n) complexity; less fragmentation.
Drawbacks: 1D placement; not really fragmentation based.
Comments: Enhanced version of Bazargan et al. (2000)'s partitioner. Run-time overhead 15 to 20% better than KAMER, for a roughly similar tasks rejection ratio. Dynamic placement, equal size slots. Prototyped example.

Roman et al. (2006).
Strengths: O(1) complexity; outperforms the First Fit algorithm.
Drawbacks: Less efficient for heterogeneous (equal size) task sets.
Comments: Slots with different sizes, run-time size adjustment. Queue for scheduling.
Table 7.1: Related work on scheduling and placement strategies (homogeneous reconfigurable array model)

Koester et al. (2005), SUPFit & RUPFit algorithms.
Strengths: Both have better 1D placement than Best-Fit.
Drawbacks: RUPFit has a high run-time complexity.
Comments: SUPFit rejects fewer tasks; RUPFit has a better utilization ratio and the lowest relative task rejections. Priority feature on tasks. Heterogeneity (static vs configurable cells). BestFit used for benchmarking. Prototyped example.

Steiger et al. (2004), Horizon (1D), Stuffing (1D and 2D).
Strengths: Tasks rejection ratio of 10% for the 2D Stuffing; models similar to ours.
Drawbacks: No details on the placement strategies used.
Comments: Online scheduling & placement of real-time tasks. Tasks relocatability. Better scheduling performance with standing tasks. Prototyping of the 1D model using a Xilinx FPGA and the MicroBlaze as host CPU.

Schaumont and Verbauwhede (2003).
Strengths: Faster reconfiguration; Java security architecture.
Drawbacks: No run-time tasks assignment feature.
Comments: The ThumbPod architecture; hierarchical configuration concept used from a design-time point of view. Run-time reconfiguration. Easy design process and programming.

Nollet et al. (2006).
Strengths: Hierarchical configuration; MPSoC approach; task assignment scheme.
Comments: Hierarchical configuration concept in an MPSoC; Processing Elements allocation. Deals with communicating tasks (Network-on-Chip).

Smit et al. (2004).
Strengths: Run-time tasks assignment; tasks priority feature; heterogeneous area model taken into account.
Drawbacks: Scarcity of the resources is not used to adjust the run-time tasks assignment algorithm.

Note: these last two works are more about task assignment in an MPSoC than about hardware tasks placement; however, the configuration hierarchy they introduce can be applied to hardware tasks placement heuristics.

Table 7.2: Related work on scheduling and placement strategies (heterogeneous reconfigurable array model)

7.6 Appendix B : Additional Simulation Results

Here are presented the global results of the without-looking-ahead scheduling algorithms. The results are aggregated in order to provide a global overview. The results for scheduling algorithms that use tasks with a single version are put beside those for the multi-shape scheduling algorithms in order to ease the comparison.

Figure 7.1: The tasks rejection ratio for parameters based scheduling and multi-shape tasks based scheduling.

Figure 7.2: The reconfigurable array utilization ratio for parameters based scheduling and multi-shape tasks based scheduling.
Figure 7.3: The runtime overheads of without-looking-ahead scheduling algorithms (parameters based scheduling and multi-shape tasks based scheduling).

Figure 7.4: The number of scheduler invocations for without-looking-ahead scheduling algorithms (parameters based scheduling and multi-shape tasks based scheduling).

Figure 7.5: The cumulative runtime overheads for without-looking-ahead scheduling algorithms (parameters based scheduling and multi-shape tasks based scheduling).

Figure 7.6: Tasks parameters based scheduling algorithms vs multi-shape algorithm: simulation on a large number of tasks (10 sets of 5000 tasks).

7.7 Appendix C : Tables of algorithms and data structures implemented

Table 7.3: Scheduling algorithms implemented.

Table 7.4: List of placement algorithms implemented.

Table 7.5: List of placement structures implemented (1): The areas partitioning (existing works are cited and those from us are highlighted).

Table 7.6: List of placement structures implemented (2): Finding fitting areas and merging free areas (existing works are cited and those from us are highlighted).

7.8 Appendix D : Size of IPs from the Xilinx core generator

Application    Parameters                                    Slices on Virtex2pro   Projection on Virtex5
Bus            CAN bus                                       569 to 885             247 to 385
               Flexray                                       3089 to 3500           1350 to 1520
               MOST                                          2306                   1000
               Optic Fiber                                   5000                   2173
               PCI master/target                             500 to 1000            217 to 434
               PCI target                                    500                    217
               USB 2                                         2000                   869
               Ethernet 10/100 Mbits                         1000 to 2000           434 to 869
Floating       addition/subtraction                          470                    204
calculation    multiplication                                424                    184
               division                                      867                    377
               square root                                   524                    228
               comparison                                    45                     20
DSP            COMPLEX MUL 32 bits                           645                    280
               CORDIC arctan                                 582                    253
               cosine 16 bits                                622                    270
               square root 16 bits                           113                    49
               square root 32 bits                           755                    328
               FIR32                                         136                    59
               multiply/accumulate 32 bits                   215                    93
               FFT                                           1541                   670
               FFT 32 points - 16 bits, hardwired multipliers  1550                 673
               FFT 32 points - 16 bits, multipliers (LUTs)   2088                   907
               FFT, linear wrt number of points              400 to 2500            174 to 1086
Modulations    cosine, 1 line                                17                     7
               cosine, 4 lines                               81                     35

Table 7.7: A few IPs for Virtex2pro FPGAs from the Xilinx Core Generator System (1).
(1) The Xilinx Core Generator System provides a library of user-customizable IPs (or hardware tasks) for Xilinx FPGAs.

7.9 Appendix E : DA Implementation of a Multi-shape Hardware Task : the FIR Filter

A FIR (Finite Impulse Response) filter is commonly used in Digital Signal Processing applications to implement low-pass, band-pass and high-pass filters and other convolution functions. Traditionally, digital filtering algorithms were most commonly implemented using DSP processors for low-rate applications (e.g. audio) and ASICs for higher rates.
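Before the formal treatment, the following is a minimal C++ sketch of the direct-form FIR convolution that equation 7.1 below formalizes. It is our own illustrative model (the class name, the 16-bit fixed-point sample width and the 32-bit accumulator are assumptions), not the hardware core discussed in this appendix.

    #include <array>
    #include <cstdint>

    // Direct-form FIR: y(n) = sum_{l=0}^{N-1} H_l * X_l(n), where X_l(n)
    // is the input sample delayed by l cycles (see equation 7.1 below).
    template <std::size_t N>
    class FirDirectForm {
    public:
        explicit FirDirectForm(const std::array<std::int16_t, N>& coeffs)
            : h_(coeffs) { delay_.fill(0); }

        // Push one input sample and return one output sample.
        std::int32_t step(std::int16_t x) {
            // Shift the delay line: X_{N-1} <- X_{N-2} <- ... <- X_0 <- x.
            for (std::size_t l = N - 1; l > 0; --l) delay_[l] = delay_[l - 1];
            delay_[0] = x;
            // N multiplications and N-1 additions per output sample.
            std::int32_t y = 0;
            for (std::size_t l = 0; l < N; ++l)
                y += std::int32_t(h_[l]) * delay_[l];
            return y;
        }

    private:
        std::array<std::int16_t, N> h_;      // constant coefficients H_0..H_{N-1}
        std::array<std::int16_t, N> delay_;  // the N most recent input samples
    };

A 4-tap moving-average filter, for instance, would be declared as FirDirectForm<4> fir({{1, 1, 1, 1}}); each call to fir.step(x) then consumes one sample and produces one output.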
The dataflow representation of the direct structure of an N-order FIR filter is illustrated in figure 2.4, chapter 2, page 30, and the corresponding equation is given by:

    Y(n) = \sum_{l=0}^{N-1} H_l \, X_l(n) = \sum_{l=0}^{N-1} H_l \, X_l        (7.1)

H_0, H_1, ..., H_{N-1} are N constant and time-invariant filter coefficients that are computed beforehand; N is the filter length. At each time n, the output response Y(n) is a function of the N last input samples X_0, X_1, ..., X_{N-1} only; therefore n may be left implicit, as in the final form of the equation. Each output requires 2N - 1 arithmetic operations (N multiplications and N - 1 additions).

Different techniques, ranging from purely serial to fully parallel, may be used for the FPGA implementation of the filter. A fully parallel implementation maps all the functional blocks depicted in the dataflow representation of figure 2.4, page 30. As one output sample is delivered at every clock cycle, this implementation provides the highest throughput, but at the cost of a larger logic resource consumption. In a serial implementation, input samples are conveyed serially through the filter and computed one by one; such an implementation is denoted as bit-serial. As the same hardware is reused to process the bits one by one, this approach saves hardware resources but requires many clock cycles to compute one output sample (figure 7.7).

The digit-serial architecture is another alternative that combines the bit-serial and fully parallel implementations: a W-bit data word is processed in units of N-bit digits in P clock cycles, where P = W/N. The digit-serial approach is a good trade-off between the lower throughput of a bit-serial implementation and the higher hardware resource consumption of a bit-parallel one. A range of trade-offs may thus be found between the throughput of the filter and the amount of configurable resources required.

7.9.1 Distributed Arithmetic as an enabling technique

Distributed Arithmetic (DA) is a computation algorithm that uses memory instead of multipliers to perform sums of products in which one of the operands remains constant. Hence, DA is well suited to implementing sums of products similar to the FIR equation 7.1 above, and more generally any convolution where processing one output sample requires the accumulation of N product terms.

Assume the input samples and coefficients are two's complement signed numbers, so that a binary number X_l is given by:

    X_l = -X_{l,B-1} \cdot 2^{B-1} + \sum_{b=0}^{B-2} X_{l,b} \cdot 2^{b}
        = -X_{l,B-1} \cdot 2^{B} + \sum_{b=0}^{B-1} X_{l,b} \cdot 2^{b}        (7.2)

where X_{l,B-1} is the sign bit of X_l, X_{l,0} its least significant bit, and X_{l,b}, b \neq B-1, the bit at position b of X_l. If scaled by a factor S = 1/2^{B-1}, the resulting two's complement scaled number representation is as follows:

    X_l = -X_{l,B-1} + \sum_{b=0}^{B-2} X_{l,b} \cdot 2^{b-(B-1)}        (7.3)

The scaling operation maximizes the signal-to-noise ratio (SNR). Equation 7.3 can be rewritten by substituting b with (B - 1 - b):

    X_l = -X_{l,B-1} + \sum_{b=1}^{B-1} X_{l,B-1-b} \cdot 2^{-b}, \quad \text{with } |X_l| \le 1        (7.4)

The scaled representation of equation 7.4 applied to the FIR filter equation gives:

    Y = \sum_{l=0}^{N-1} H_l X_l
      = \sum_{l=0}^{N-1} H_l \left[ -X_{l,B-1} + \sum_{b=1}^{B-1} X_{l,B-1-b} \, 2^{-b} \right]
    \Rightarrow Y = -\left[ \sum_{l=0}^{N-1} H_l X_{l,B-1} \right] + \sum_{b=1}^{B-1} \left[ \sum_{l=0}^{N-1} H_l X_{l,B-1-b} \right] 2^{-b}        (7.5)

Equation 7.5 is commonly used to express the scaled output Y of a FIR filter; |Y| \le 1 for scaled inputs and coefficients.
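As a quick numeric check of the scaled representation (our own example, with B = 4): the two's complement word 1011 encodes -5, so scaling by S = 1/2^{B-1} = 1/8 should give -5/8. Equation 7.4 indeed yields:

    X_l = -X_{l,3} + X_{l,2} \, 2^{-1} + X_{l,1} \, 2^{-2} + X_{l,0} \, 2^{-3}
        = -1 + 0 + \tfrac{1}{4} + \tfrac{1}{8} = -0.625 = -\tfrac{5}{8}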
Developing Y in equation 7.5 leads to equation 7.6 below, which is easier to memorize and to manipulate. It consists of N x B product terms, which may be implemented without multipliers (the so-called multiplier-less implementation): each product term H_l \cdot X_{l,b} is a binary AND operation between a single bit X_{l,b} of the input sample X_l and its corresponding constant coefficient H_l.

    Y = -[H_0 X_{0,B-1} + H_1 X_{1,B-1} + \dots + H_{N-1} X_{N-1,B-1}]
        + [H_0 X_{0,B-2} + H_1 X_{1,B-2} + \dots + H_{N-1} X_{N-1,B-2}] \cdot 2^{-1}
        + [H_0 X_{0,B-3} + H_1 X_{1,B-3} + \dots + H_{N-1} X_{N-1,B-3}] \cdot 2^{-2}
        + \dots
        + [H_0 X_{0,1} + H_1 X_{1,1} + \dots + H_{N-1} X_{N-1,1}] \cdot 2^{-(B-2)}
        + [H_0 X_{0,0} + H_1 X_{1,0} + \dots + H_{N-1} X_{N-1,0}] \cdot 2^{-(B-1)}        (7.6)

As the coefficients H_0, H_1, ..., H_{N-1} are known beforehand, all possible results of each partial product are also known beforehand:

    H_l \cdot X_{l,b} = H_l \text{ if } X_{l,b} = 1, \quad 0 \text{ if } X_{l,b} = 0

These possible results may be pre-stored in a look-up table and addressed by the bits of X_l. Hence, the bits X_{0,i}, X_{1,i}, ..., X_{N-1,i} of the N input samples X_0, X_1, ..., X_{N-1} are used to address small LUTs in which the partial product terms are stored. In addition, a power-of-two scale factor makes multiplication and division simple to realize with shift registers: multiplying (resp. dividing) a multiplicand by a power-of-two number 2^i is equivalent to shifting the binary point i positions to the right (resp. to the left). Furthermore, as the numbers are two's complement signed, the same adders may implement both addition and subtraction. It is relatively easy to remove this gain factor at the output of the filter simply by shifting the binary point back.

7.9.2 Implementing Y using LUT-based DA

Figure 7.7 shows an N-order FIR filter implementation using LUT-based serial DA. Input samples are sent as a one-bit-serial stream (via a "time skew buffer") into N shift registers. At each clock cycle, the LUT is addressed by the i-th bit of each of the N input samples, and the corresponding partial product stored in the LUT is applied to the scaling accumulator, which automatically sums the partial products and scales the output Y. Let T_c be the time to compute one output sample:

    T_c = B \cdot t_{sm}

where B is the bit width of the input samples X_l and t_{sm} the time required by the scaling accumulator to perform a scaled summation.

Figure 7.7: Serial Distributed Arithmetic

Thanks to its simple architecture, serial DA may operate at a high clock frequency and can thus achieve the same throughput as a parallel implementation, although with a greater latency. One drawback of this architecture, however, is the rapid growth of the required memory with the filter length: an N-order filter requires a memory of 2^N words to store the partial products (2). For example, a 16-coefficient filter requires a LUT of 2^{16} \approx 65.5 \times 10^3 words.

(2) Only half of it, \tfrac{1}{2} \cdot 2^N = 2^{N-1} words, is required for a linear-phase FIR filter.
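The mechanism of equations 7.5 and 7.6 can be mimicked in software. Below is a minimal C++ sketch of the single-LUT serial DA of figure 7.7, under our own assumptions (plain integer arithmetic, illustrative names, unscaled output so Y = \sum_l H_l X_l exactly); it reproduces the arithmetic of the hardware, not its bit-serial structure or timing.

    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <vector>

    // LUT-based serial DA for an N-tap FIR with constant coefficients.
    template <std::size_t N>
    class SerialDaFir {
    public:
        explicit SerialDaFir(const std::array<std::int32_t, N>& h)
            : lut_(std::size_t{1} << N, 0) {
            // Precompute all 2^N partial sums: lut_[a] holds the sum of
            // H_l over the taps l whose bit is set in the address a.
            for (std::size_t a = 0; a < lut_.size(); ++a)
                for (std::size_t l = 0; l < N; ++l)
                    if (a & (std::size_t{1} << l)) lut_[a] += h[l];
        }

        // One output from the N most recent B-bit two's complement samples.
        std::int64_t output(const std::array<std::int16_t, N>& x,
                            unsigned B = 16) const {
            std::int64_t y = 0;
            for (unsigned b = 0; b < B; ++b) {
                // Address the LUT with bit b of every input sample
                // (one "column" of bits, as in figure 7.7).
                std::size_t addr = 0;
                for (std::size_t l = 0; l < N; ++l)
                    if ((std::uint16_t(x[l]) >> b) & 1u)
                        addr |= std::size_t{1} << l;
                // Shift-accumulate; the sign-bit column carries negative
                // weight in two's complement (cf. equation 7.5).
                std::int64_t term = std::int64_t(lut_[addr]) << b;
                y += (b == B - 1) ? -term : term;
            }
            return y;
        }

    private:
        std::vector<std::int32_t> lut_;  // 2^N words of partial products
    };

Splitting the address bits across several smaller tables, as described next, leaves each per-bit sum unchanged but replaces one 2^N-word lookup by n lookups of 2^{N/n} words combined by an adder.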
Fortunately, as shown in figure 7.8, one may use n LUTs of size 2^{N/n} instead of a single 2^N-word LUT to implement an N-order filter: since the partial products can be computed independently, several LUTs may be used in parallel.

Figure 7.8: Serial-Parallel Distributed Arithmetic

In the example of figure 7.8, two LUTs of 2^4 words are used instead of one of 2^8 words to implement an 8-order FIR filter. This way, one saves 2^8 - 2 \cdot 2^4 = 224 words of LUT storage at the cost of one extra adder. All these implementation trade-offs lead to variant task sizes with variant execution times, as illustrated earlier in figure 5.5, page 192.

7.9.3 Throughput vs reconfigurable resources trade-off

Figure 7.9 depicts examples of the trade-off between reconfigurable resource utilization and throughput in DA-based implementations of a FIR filter. On one hand, the fully parallel implementation provides the highest throughput at the cost of the largest resource utilization (3072 slices); one output sample is delivered every clock cycle. On the other hand, the various multi-bit serial DA implementations require more than one clock cycle to compute and deliver each output sample, but use fewer resources. As the multi-bit serial technique uses several serial units, it provides a trade-off between the high resource utilization of the fully parallel implementation and the low throughput of the fully serial one. The figure illustrates that the more clock cycles are required to compute one output sample, the fewer slices are used. The throughput of the filter, and therefore its processing time, depends on the amount of configurable resources used (Xilinx, 2005).

Number of Clock Cycles per Output Sample   Slice Count   Filter Sample Rate (MHz)
1                                          3072          150
2                                          1571          75
3                                          994           50
4                                          802           37.5

Figure 7.9: Example of resource utilization for different DA implementations of a FIR filter
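As a sanity check (our own reading of the table, assuming the 150 MHz figure of the first row is the filter clock frequency), the sample-rate column is simply the clock frequency divided by the number of clock cycles C per output sample:

    f_{sample} = \frac{f_{clk}}{C}: \quad \frac{150}{1} = 150, \quad \frac{150}{2} = 75, \quad \frac{150}{3} = 50, \quad \frac{150}{4} = 37.5 \ \text{MHz}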