Dynamically adaptable NoC router architecture for multiple pixel streams applications
Nicolas Ngan, Eva Dokladalova, Mohamed Akil

HAL Id: hal-00789363
https://hal-upec-upem.archives-ouvertes.fr/hal-00789363
Submitted on 18 Feb 2013


Nicolas Ngan<sup>1,2</sup>, Eva Dokladalova<sup>1</sup> and Mohamed Akil<sup>1</sup>
<sup>1</sup>Laboratoire Informatique Gaspard Monge, Equipe A3SI, Unite Mixte CNRS-UMLV-ESIEE (UMR 8049), France
<sup>2</sup>Sagem, Groupe SAFRAN, Argenteuil, France
Emails: nicolas.ngan@sagem.com; {e.dokladalova, akilm}@esiee.fr

Abstract—Modern computing systems for vision have to support advanced image applications. They involve several heterogeneous pixel streams and they have to respect hard timing and area constraints. To face those challenges, an adaptable ring-based interconnection network-on-chip (NoC) has been recently proposed. This NoC is based on a new router architecture, with a dynamically adaptable internal datapath, which allows handling of multiple parallel pixel streams. An original datapath adaptation control is proposed by combining instructions and pixel data to be processed in a single packet. Timing performance and area occupation are evaluated on an FPGA prototype.

Index Terms—network-on-chip, router, FPGA, adaptable, image processing

I. INTRODUCTION

Modern embedded vision systems need high-performance computing platforms that support increasingly advanced functionalities and complex applications. Those applications usually involve multiple pixel streams from several heterogeneous image sensors such as low-light, infrared or colour sensors. Examples include panoramic view, picture-in-picture, stereovision [1] and image fusion applications [2]. These applications require both spatial and temporal management of pixel streams. They also have to meet variable real-time constraints, from 25 to 50 frames per second, at high-definition image resolutions of up to 1080p (1920 × 1080 pixels).

Despite numerous dedicated architectures for different image processing problems [3], [4], there are few systems considering efficient processing and adaptability solutions for multiple pixel streams applications [5], [6]. In particular, performance bottlenecks remain on interconnections to manage multiple streams and efficient parallel data accesses.

In the last decade, several NoC-based solutions [7] have been proposed for image processing applications [8]. The NoC has been studied at both the functional and architectural levels [9] in order to bring more flexibility to communication between processing elements (PEs). However, those solutions remain very complex in the context of multiple pixel streams management and application switching.

Recently, we have proposed a new adaptable ring-based interconnection network-on-chip for a multi-sensor embedded vision system, called Multi Data Flow Ring (MDFR) [10].

To fully exploit that NoC, the transmission of datapath control in the router proposed in [11] had to be improved. The aim is to avoid a centralized controller with dedicated command links to every router in the NoC, which limits the number of PEs.

In this paper, we present a new NoC router architecture which features an original router datapath adaptation control. Datapath adaptations are specified by instructions combined with pixel data in a single packet.

The paper is organized as follows. Section II is a presentation of the NoC principle. Then, the dynamic datapath control is described in Section III. The router architecture proposition is presented in Section IV. Finally, timing and area performance are evaluated in Section V for an FPGA prototype.

II. DYNAMICALLY ADAPTABLE NOC FOR A MULTIPLE IMAGE SENSORS SYSTEM

Figure 1 presents an example use of our proposed NoC architecture, with three different image sensors at the input and two image displays at the output of the processing system.

This figure shows input image sensors and output displays connected to the NoC through master nodes (M), which represent interfaces to the outside of the embedded computing system. A master can thus acquire an input pixel stream or send a pixel stream to a connected display. Inside the network, pixel streams are processed by PEs, which are integrated in the network through slave routers (E) positioned between master nodes. Let us consider that each PE is dataflow-oriented and can process pixel streams with internal pipelined computing structures.

The main role of a slave router (E) is to analyse input pixel data packets and to redirect those packets towards its PE, depending on running image processing applications.

This paper is focused on an architecture proposal for the slave router (E) called Data Flow Router (DFR).

Fig. 1. Principle of using the proposed NoC architecture

A. Data Flow Router operating modes

In order to switch between applications involving multiple pixel streams, a DFR has to dynamically adapt its internal datapaths between input and output ports, while avoiding pixel data collisions.

Thus, we define four operating modes in a DFR: (a) **Forward** (FWD), (b) **Single Stream** (SSP), (c) **Single Stream & Forward** (SSF) and (d) **Multi Stream** (MS).

Figure 2 illustrates examples for a DFR equipped with two unidirectional input ports \(e_1, e_2\) and two unidirectional output ports \(s_1, s_2\). This figure highlights, in red, the datapath defined for each operating mode.

**Forward** (FWD) mode, illustrated in figure 2(a), redirects an input data packet towards an output port without any modification. This mode is activated when the PE is busy (FWD auto) or cannot perform the operation required by the application (FWD check).

**Single Stream** (SSP) mode, illustrated in figure 2(b), is used when the input packet can be processed by the PE. In this mode, the input packet data are transferred to the PE before being sent out of the DFR.

In **Single Stream & Forward** (SSF) mode, illustrated in figure 2(c), the input packet is duplicated. One copy is sent directly out of the DFR without any modification, and the second one is transferred to the PE. A second output port is selected for sending out the resulting output data from the PE.

In **Multi Stream** (MS) mode, illustrated in figure 2(d), the PE is able to compute \(N\) input pixel streams in parallel \((N = 2\) in our example) to produce one processed pixel stream in a DFR output port.

These operating modes make it possible to process several pixel streams sequentially or in parallel, along several DFRs, between two master nodes. Note that MS mode is particularly useful for PEs that combine pixel streams in parallel, as in image fusion or picture-in-picture operations. The DFR is thus better suited to pipelined PE processing than a generic NoC router such as Hermes [12].
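As an illustration of how the four modes relate to the PE state, the following Python sketch selects a mode for one incoming packet. It is not the DFR control logic: the function name `select_mode` and its inputs `duplicate` and `n_streams` are hypothetical stand-ins for information that, in the proposed NoC, comes from the instructions carried in the packet header.

```python
from enum import Enum


class Mode(Enum):
    FWD_AUTO = "Forward (auto)"
    FWD_CHECK = "Forward (check)"
    SSP = "Single Stream"
    SSF = "Single Stream & Forward"
    MS = "Multi Stream"


def select_mode(pe_busy, pe_supports_op, duplicate=False, n_streams=1):
    """Pick an operating mode for one input packet (hypothetical policy)."""
    if pe_busy:
        # The PE is busy: forward the packet unchanged.
        return Mode.FWD_AUTO
    if not pe_supports_op:
        # The PE cannot perform the requested operation: forward unchanged.
        return Mode.FWD_CHECK
    if n_streams > 1:
        # The PE combines several input streams into one output stream.
        return Mode.MS
    if duplicate:
        # One copy is forwarded as-is, the other goes through the PE.
        return Mode.SSF
    return Mode.SSP
```

For example, `select_mode(pe_busy=False, pe_supports_op=True, n_streams=2)` returns `Mode.MS`, which corresponds to the two-input case of figure 2(d).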

III. DYNAMIC DATAPATH ADAPTATION

In our NoC, DFR datapath adaptation control is specified as a group of instructions. These instructions are derived from the image applications to be implemented, by setting a sequence of operations on pixel data. According to this sequence, DFRs dynamically adapt their internal datapaths to the different operating modes.

We propose an original transmission of DFR datapath control that combines, in a single packet, the datapath instructions and the associated pixel data to be processed, as illustrated in figure 3.

Fig. 3. Communication of packets containing instructions and pixel data

In this figure, a master node (M) transmits two data packets, containing network addresses \([N]\), instructions \([I]\) and data \([D]\), to be decoded in DFRs. In our NoC, \([N]\) and \([I]\) are specified in packet headers, associated with the pixel data \([D]\) to be processed.

A. Packet header structure

As illustrated in figure 4, the packet header structure is divided into three main parts: a first part, \(S_p\) bits wide, indicating the **packet size**; a second part, \(S_i\) bits wide, reserved for instructions; and a last part, \(S_a\) bits wide, describing image attributes. The header is delimited by a **start** and an **end** flit, each \(S_t\) bits wide.

Thus, the width \(S_H\) of the complete header, in bits, is the sum of all parts, as defined in equation 1.

\[
S_H = (2 \times S_t) + S_p + S_i + S_a
\]

The packet size part specifies the number of pixels in the packet by giving the image block resolution as a pixel width $X$ and a pixel height $Y$. In a multiple sensors system, it is critical to be able to differentiate images in the NoC. Thus, the attributes part contains image characteristics such as the image sensor source $id$, an image timestamp $ts$ and the last image operation $pe$ applied to the pixel data. It also contains the network addresses $[N]$ of the source and destination master nodes.

B. Instructions

As detailed in figure 4, an instruction is organized in four fields. (1) $[\text{INST NUMBER}]$ identifies the instruction number in order to ensure consistency between operations on pixels. (2) $[\text{PE OPCODE}]$ gives the image operation, identified by a defined number, to apply to the pixel data in the packet. (3) $[\text{NB PASSES}]$ specifies the number of iterations required for the image operation given in $[\text{PE OPCODE}]$. (4) $[\text{TAG}]$ indicates the computing sequence of image operations in a group of instructions (sequential or parallel execution).

The combination of several instructions in the same header determines the DFR operating modes, in order to build different PE pipeline structures in the NoC.

IV. DATA FLOW ROUTER ARCHITECTURE

The global DFR architecture is presented in figure 5.

The DFR architecture is organized in three layers: a stream switching layer which is a central layer dedicated to switching input packets; a PE adaptation layer, representing the interface between the DFR and the PE; and a control layer, containing the global DFR control.

The stream switching layer consists of several independent units, called Stream Switches, which adapt the datapaths of input pixel streams to the connected processing unit or to neighbour routers. There are as many Stream Switch units as there are input pixel streams to manage in a DFR.

This layer interacts with the control layer containing the global controller, called DFR Controller. The main function of this unit is to control the whole internal DFR datapath adaptation and to arbitrate requests from Stream Switches to access the PE.

Depending on the arbitration decision and on the operating modes activated in the Stream Switches, the controller commands the PE adaptation layer to change the direction of the input pixel streams. The PE adaptation layer contains selection units implemented as a crossbar (XBAR PE) and demux units (DEMUX PE).

V. HARDWARE PROTOTYPING

To validate the proposed architecture and evaluate its performance in time and area, the DFR has been implemented on an FPGA (Altera Stratix III EP3SL150).

A. Header implementation

Table I presents the size of each part of the header. We define a 6-flit header, equivalent to 192 bits, with 2 flits dedicated to a group of four 16-bit instructions.

<table>
<thead>
<tr>
<th>Header part</th>
<th>$S_t$</th>
<th>$S_p$</th>
<th>$S_i$</th>
<th>$S_a$</th>
<th>$S_H$ (total)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Size (bits)</td>
<td>32</td>
<td>32</td>
<td>64</td>
<td>32</td>
<td>192</td>
</tr>
</tbody>
</table>

TABLE I
HEADER SIZE PARTITION
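
As a consistency check, equation 1 can be evaluated with the sizes listed in Table I. This assumes that the start and end flits are each one 32-bit flit, and that the 32-, 64- and 32-bit parts correspond to \(S_p\), \(S_i\) and \(S_a\) respectively (the 64-bit part matching the two flits reserved for instructions):

\[
S_H = (2 \times 32) + 32 + 64 + 32 = 192 \text{ bits} = 6 \times 32 \text{ bits}
\]

Under these assumptions, the total agrees with the 6-flit, 192-bit header stated above.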

As described in the previous section, a 16-bit instruction is divided into the four fields presented in table II.

<table>
<thead>
<tr>
<th>Field</th>
<th>Size</th>
<th>Description</th>
</tr>
</thead>
<tbody>
<tr>
<td>INST NUMBER</td>
<td>4 bits</td>
<td>Instruction line number</td>
</tr>
<tr>
<td>PE OPCODE</td>
<td>6 bits</td>
<td>Image operation identification</td>
</tr>
<tr>
<td>NB PASS</td>
<td>4 bits</td>
<td>Number of iterations</td>
</tr>
<tr>
<td>TAG</td>
<td>2 bits</td>
<td>Sequential/Parallel execution</td>
</tr>
</tbody>
</table>

TABLE II
16-BITS INSTRUCTION STRUCTURE

With this partitioning, an application can be described by 16 instruction lines with a maximum of 64 distinct types of image operation. Each image operation can be applied on data with a maximum of 16 iterations.
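
The field widths of table II can be illustrated with a short Python sketch that packs and unpacks a 16-bit instruction and groups four instructions into two 32-bit flits. The bit ordering (INST NUMBER in the most significant bits, TAG in the least significant bits), the order of instructions within a flit, and the function names are assumptions made for illustration; the text does not specify them. How a count of 16 iterations maps onto the 4-bit field (for example, by storing the count minus one) is also not specified, so the sketch stores raw field values.

```python
# Field names and widths follow table II. The bit order is an assumption.
FIELDS = (("inst_number", 4), ("pe_opcode", 6), ("nb_pass", 4), ("tag", 2))


def pack_instruction(inst_number, pe_opcode, nb_pass, tag):
    """Pack the four fields into one 16-bit word, first field in the MSBs."""
    values = {
        "inst_number": inst_number,
        "pe_opcode": pe_opcode,
        "nb_pass": nb_pass,
        "tag": tag,
    }
    word = 0
    for name, width in FIELDS:
        value = values[name]
        if not 0 <= value < (1 << width):
            raise ValueError(f"{name}={value} does not fit in {width} bits")
        word = (word << width) | value
    return word


def unpack_instruction(word):
    """Split a 16-bit word back into its four fields."""
    fields = {}
    shift = 16
    for name, width in FIELDS:
        shift -= width
        fields[name] = (word >> shift) & ((1 << width) - 1)
    return fields


def pack_group(instructions):
    """Pack four instructions (dicts of fields) into two 32-bit flits."""
    if len(instructions) != 4:
        raise ValueError("a group holds exactly four instructions")
    words = [pack_instruction(**fields) for fields in instructions]
    return [(words[0] << 16) | words[1], (words[2] << 16) | words[3]]
```

With this layout, `pack_instruction(1, 5, 3, 2)` yields `0x114E`, and `unpack_instruction(0x114E)` recovers the four field values.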

B. DFR area evaluation

Table III presents the FPGA area occupation on a Stratix III EP3SL150 for a DFR with a 6-flit packet header structure.

<table>
<thead>
<tr>
<th>Unit</th>
<th>LEs</th>
<th>Registers</th>
<th>Mem (bits)</th>
<th>% FPGA</th>
</tr>
</thead>
<tbody>
<tr>
<td>Stream Switch</td>
<td>294</td>
<td>246</td>
<td>100</td>
</tr>
<tr>
<td>DFR Controller</td>
<td>226</td>
<td>130</td>
<td>0</td>
</tr>
<tr>
<td>PE Adapt</td>
<td>21</td>
<td>10</td>
<td>224</td>
</tr>
<tr>
<td>DFR (total)</td>
<td>1154</td>
<td>1738</td>
<td>624</td>
</tr>
</tbody>
</table>

TABLE III
AREA : DATA FLOW ROUTER (EP3SL150)

It shows the number of Logic Elements (LEs), the number of registers, the on-chip memory (Mem) size required and the area occupation ratio (%) in the targeted FPGA.

For this prototype, the number of Stream Switches is set to $k=4$, with a bus size set to $n=32$ bits. A 32-bit bus is sufficient for transmitting any pixel granularity from an 8-bit greyscale to a 24-bit RGB image.

In this configuration, a complete DFR occupies less than 2% of the FPGA and reaches a maximum frequency of 234 MHz.
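
To put these figures in perspective, the following Python sketch estimates the raw capacity of one 32-bit link at 234 MHz against a 1080p stream at 50 frames per second. It rests on simplifying assumptions that are not measurements from the prototype: one pixel per 32-bit flit, one flit per clock cycle, and no header or blanking overhead. The function names are illustrative.

```python
def link_bandwidth_bits_per_s(bus_bits, freq_hz):
    """Raw link capacity, assuming one flit is transferred per clock cycle."""
    return bus_bits * freq_hz


def pixel_rate(width, height, fps):
    """Pixels per second of one video stream."""
    return width * height * fps


def max_streams(freq_hz, width, height, fps, flits_per_pixel=1):
    """Number of whole streams one link could carry under the assumptions above."""
    return freq_hz // (pixel_rate(width, height, fps) * flits_per_pixel)


link = link_bandwidth_bits_per_s(32, 234_000_000)  # 7_488_000_000 bit/s
rate = pixel_rate(1920, 1080, 50)                  # 103_680_000 pixels/s
streams = max_streams(234_000_000, 1920, 1080, 50)
```

Under these assumptions, a 1080p stream at 50 frames per second needs about 103.68 million flits per second, so a single link clocked at 234 MHz has headroom for two such streams.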

C. DFR adaptation latency evaluation

This section evaluates the total latency \( L_R \), in cycles, between the input of the first flit and its output from a DFR. Because it depends on the DFR operating mode and on the PE latency \( L_{PG} \), \( L_R \) is not a fixed value. Let \( \delta_h \) be the latency in cycles between the last header flit and the first data flit, and \( \delta_T \) the latency between two input packets to be combined in a PE.

Figure 6 shows a timing diagram of the transmission of two input packets, each with a 2-flit header (H0-H1), in a DFR. Assuming that the PE is not busy, Stream Switch 0 (SS0) requests to process its input packet in Single Stream (SSP) mode and transmits the packet with a latency of \( L_R(SSP) \) cycles. While the PE processes that packet, packets in the other Stream Switches cannot be processed and are transferred directly out of the DFR, like the packet in SS1 in this example, with a latency of \( L_R(FWD) \) and without any header analysis.

Fig. 6. Timing diagram for Single Stream and Forward DFR operating modes

Table IV presents the measured latencies for outputting a packet in each operating mode, excluding the arbitration time, which is one clock cycle in this implementation. The table details the datapath adaptation latency \( L_a \) and the global interface latency \( L_i \) in cycles. The global interface latency \( L_i \) depends on the registers placed at the input and output ports of the DFR.

<table>
<thead>
<tr>
<th>Mode</th>
<th>\( L_a \) (cycles)</th>
<th>\( L_i \) (cycles)</th>
<th>\( L_R \) (cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>Forward (auto)</td>
<td>0</td>
<td>2</td>
<td>2</td>
</tr>
<tr>
<td>Forward (check)</td>
<td>5</td>
<td>2</td>
<td>7</td>
</tr>
<tr>
<td>Single Stream</td>
<td>6</td>
<td>6</td>
<td>\( 12 + L_{PG} + \delta_h \)</td>
</tr>
<tr>
<td>SSF (packet 1)</td>
<td>5</td>
<td>2</td>
<td>7</td>
</tr>
<tr>
<td>SSF (packet 2)</td>
<td>6</td>
<td>6</td>
<td>\( 12 + L_{PG} + \delta_h \)</td>
</tr>
<tr>
<td>MS</td>
<td>\( 6 + \delta_T \)</td>
<td>6</td>
<td>\( 12 + L_{PG} + \delta_T \)</td>
</tr>
</tbody>
</table>

TABLE IV
DFR PACKET PROCESSING LATENCIES

In this table, Forward mode is split according to whether the packet must be analysed. In Forward (auto) mode, packets bypass the analysis and the DFR traversal latency is minimal, at 2 clock cycles. If the PE is available, the packet header must be analysed, which takes 5 additional clock cycles in Forward (check) mode.

In Single Stream mode, \( L_a \) is 6 clock cycles: 1 cycle to request the datapath modification and 5 cycles to decode the header. \( L_i \) is larger because of the PE access, and \( L_R \) includes the additional latency \( \delta_h \).

The Single Stream & Forward mode generates two packets. The first one, listed as packet 1 in table IV, is transmitted without any modification, and the second one, listed as packet 2, is processed by the PE. Note that, for packet 1, \( L_R \) is equal to that of Forward mode with header analysis.

Finally, the latency of Multi Stream mode depends on the timing difference \( \delta_T \), in cycles, between the arrivals of the packets to be combined in parallel by the PE.

From these values, we conclude that the proposed DFR architecture outputs a packet with a latency \( L_R \) of 2 to 12 clock cycles, excluding the PE latency. As our NoC is data-flow oriented with multiple PE pipelines, DFR latencies become negligible compared to the data size of packets representing frame blocks or complete frames. These pipelines make it possible to meet real-time image application constraints with a high bandwidth between PEs.
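
The entries of table IV can be written as a small Python function, which also makes it easy to compare the router latency with the size of a packet. The mode keys, the function names and the optional `arbitration` argument are illustrative choices, and the comparison assumes one pixel per flit and one flit per cycle, which the text does not state.

```python
def dfr_latency(mode, l_pe=0, delta_h=0, delta_t=0, arbitration=0):
    """Latency L_R in cycles for one packet, following table IV.

    l_pe stands for the PE latency L_PG; arbitration is the extra
    arbitration time that table IV leaves out (0 by default).
    """
    table = {
        "FWD_AUTO": 2,
        "FWD_CHECK": 7,
        "SSP": 12 + l_pe + delta_h,
        "SSF_PACKET1": 7,
        "SSF_PACKET2": 12 + l_pe + delta_h,
        "MS": 12 + l_pe + delta_t,
    }
    return table[mode] + arbitration


def overhead_ratio(latency_cycles, n_pixels):
    """Router latency relative to the cycles spent streaming the pixel data."""
    return latency_cycles / n_pixels


# Router-only latency in Single Stream mode against a complete 1080p frame.
ratio = overhead_ratio(dfr_latency("SSP"), 1920 * 1080)
```

For a complete 1080p frame (2,073,600 pixels), the 12-cycle Single Stream latency amounts to less than one hundred-thousandth of the streaming time under these assumptions.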

VI. CONCLUSION

In this paper, we have presented a dynamically adaptable NoC router, called Data Flow Router, which is able to manage several pixel streams in parallel. Using an original integration of instructions in packet headers, the DFR dynamically adapts its internal datapath in order to transmit pixel streams efficiently by building pipelines between PEs in the NoC. FPGA prototype evaluations of the proposed DFR architecture show an area occupation below 2% of the target device, a maximum frequency of 234 MHz and a router latency of 2 to 12 clock cycles. Future work will focus on DFR performance evaluation in a complete NoC for different image applications.

REFERENCES