A new FPGA accelerator based on circular buffer unit per orientation for a fast and optimised GLCM and Texture feature computation

Mohamed Amin Ben Atitallah, Rostom Kachouri, Hassene Mnif

To cite this version:
Mohamed Amin Ben Atitallah, Rostom Kachouri, Hassene Mnif. A new FPGA accelerator based on circular buffer unit per orientation for a fast and optimised GLCM and Texture feature computation. IEEE international conference on Design & Test of integrated micro & nano-Systems (DTS), Apr 2019, Gammarth, Tunisia. hal-02172154

HAL Id: hal-02172154
https://hal-upec-upem.archives-ouvertes.fr/hal-02172154
Submitted on 3 Jul 2019

Mohamed Amin Ben Atitallah
LETI (E.N.I.S.), University of Sfax, TUNISIA
National Engineering School of Gabes
(ENIG)
University of Gabes, TUNISIA
mohamed.amine@enis.in.rnu.tn

Rostom Kachouri
Gaspard Monge Computer Science Laboratory
ESIEE-Paris
University Paris-Est Marne-la-Vallée,
FRANCE
rostom.kachouri@esiee.fr

Hassene Mnif
LETI (E.N.I.S.), University of Sfax, TUNISIA
ENET.com Sfax, TUNISIA
hassene.mnif@enetcom.usf.tn

Abstract—This paper presents an FPGA accelerator based on one circular buffer unit per orientation for fast and optimized computation of the Gray Level Co-occurrence Matrix (GLCM) and four texture features. The four texture features, namely contrast, energy, dissimilarity and correlation, are computed on a Xilinx FPGA. Computing the GLCM and these texture features is complex and time-consuming, which motivates a dedicated accelerator. The architecture was implemented on a Xilinx Zc-702 using Vivado HLS. The obtained results are compared against related works. The synthesis results on FPGA show a significant gain (about 17%) in execution time compared to previous work.

Keywords—Image analysis applications; Parallel calculation; Haralick’s texture feature; Hardware/Software Implementation; Circular buffer unit; FPGA; Vivado_HLS; Optimization; Execution time

I. INTRODUCTION

Texture features are used for image classification. They capture information about the patterns that emerge in the texture of an image. They are computed by constructing a GLCM, which is computationally expensive. Once the GLCM has been constructed, the computation of the 14 Haralick texture features begins. These features include contrast, energy, dissimilarity and correlation, as well as a variety of entropy measures. Because this calculation is numerically intensive, it is the focus of our optimization.

The GLCM is an effective method for Haralick texture feature extraction, and it is widely used in texture analysis [1, 2], image retrieval [3], image classification [4], image segmentation [5], and image recognition [6].

However, the calculation of the GLCM and of the features consumes a lot of time, so methods that speed up these calculations are highly desirable. In order to improve the performance of the GLCM and texture feature algorithm, we propose a Hardware/Software (HW/SW) implementation on a Xilinx FPGA. In the proposed design, the GLCM is calculated in parallel and the four texture features are also calculated in parallel. We have implemented the proposed design on a Zc-702 FPGA. The synthesis results on FPGA show a significant gain (about 17%) in execution time compared to previous work.

This paper is organized as follows: Section II presents the GLCM and the Haralick texture features, Section III reviews some important related works, Section IV describes our HW/SW implementation, Section V reports the experimental results and, finally, Section VI concludes the paper.

II. GLCM AND FOUR HARALICK TEXTURE FEATURES

In image processing, a texture is a region of the image that appears coherent and homogeneous, that is, one that forms a whole for an observer. Texture analysis is a very important area of image processing and one of the main elements in interpreting the visual message. Filtering, likewise, aims to improve the visual quality of the image or to extract image attributes by modifying the gray level of a pixel according to the values of its neighbors. Among the different approaches to texture analysis, we mention the co-occurrence matrix, the texture spectrum, wavelets and mathematical morphology. In this paper, we are interested in co-occurrence matrices. They are analogous to two-dimensional histograms: they represent the number of occurrences of particular pairs of pixels in the image. The elements of the GLCM represent the probabilities of occurrence of the pair of gray levels (i,j) separated by a distance d along an orientation θ. The direction measures are selected as (1), (2), (3), and (4), which correspond to (0°, 45°, 90°, 135°) in Figure 1, respectively.
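The counting described above can be illustrated with a minimal Python sketch. It is only an illustration, not the authors' HLS code: the function names, the (row, column) offset convention for each angle and the choice of a non-symmetric count are assumptions made here.

```python
# Offsets (row, col) of the neighbour for d = 1.
# Assumed convention: rows grow downwards, so 90 degrees points to the row above.
OFFSETS = {0: (0, 1), 45: (-1, 1), 90: (-1, 0), 135: (-1, -1)}


def glcm(image, levels, angle, d=1):
    """Count pairs (i, j): gray level i at a pixel, j at its neighbour."""
    dr, dc = OFFSETS[angle]
    dr, dc = dr * d, dc * d
    rows, cols = len(image), len(image[0])
    counts = [[0] * levels for _ in range(levels)]
    for r in range(rows):
        for c in range(cols):
            rr, cc = r + dr, c + dc
            # Pairs whose neighbour falls outside the image are skipped.
            if 0 <= rr < rows and 0 <= cc < cols:
                counts[image[r][c]][image[rr][cc]] += 1
    return counts


def normalize(counts):
    """Turn occurrence counts into probabilities p(i, j)."""
    total = sum(sum(row) for row in counts)
    return [[v / total for v in row] for row in counts]


# Small 4-level test image, one GLCM per orientation.
image = [[0, 0, 1, 1],
         [0, 0, 1, 1],
         [0, 2, 2, 2],
         [2, 2, 3, 3]]
matrices = {angle: glcm(image, levels=4, angle=angle) for angle in OFFSETS}
```

Each of the four matrices is independent of the others, which is what makes a per-orientation parallel computation possible.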

The co-occurrence matrices contain the first-order space averages. Haralick proposed 14 descriptive texture features that can be calculated from these matrices. Among these 14 features, we are interested in only four statistics (Table I). Energy, or Angular Second Moment (ASM), expresses the regular character of the texture. In general, a high energy is observed when the image is very regular, that is to say when the high values of the GLCM are concentrated in a few places of the matrix. Contrast (CON) measures local variation: the more contrasted the texture, the larger this term. Correlation (COR) can be likened to a measure of the linear dependence of gray levels in the image. Dissimilarity (DISS) measures the dissimilarity between two gray levels i and j. These four selected features are well suited to texture feature extraction [9].

In fact, the arithmetic operations of texture features in HW have almost the same order of complexity in implementation.
With:
• \((k, l)\), the coordinates of a graylevel pixel \(i \in [0, n_{max}-1]\).
• \((m, n)\), those of the graylevel pixel \(j \in [0, n_{max}-1]\).

Fig. 1. Direction measure

### TABLE I. TEXTURE FEATURES

<table>
<thead>
<tr>
<th>Features</th>
<th>Formula</th>
</tr>
</thead>
<tbody>
<tr>
<td>Energy or ASM</td>
<td>\(f_1 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j)^2\)</td>
</tr>
<tr>
<td>Contrast</td>
<td>\(f_2 = \sum_{n=0}^{N_g-1} n^2 \left\{ \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} p(i, j) \right\}, \quad |i-j| = n\)</td>
</tr>
<tr>
<td>Correlation</td>
<td>\(f_3 = \frac{\sum_{i=1}^{N_g} \sum_{j=1}^{N_g} (i j)\, p(i,j) - \mu_x \mu_y}{\sigma_x \sigma_y}\)</td>
</tr>
<tr>
<td>Dissimilarity</td>
<td>\(f_4 = \sum_{i=1}^{N_g} \sum_{j=1}^{N_g} |i - j|\, p(i, j)\)</td>
</tr>
</tbody>
</table>
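The four features of Table I can be sketched in Python from their standard definitions. This is a floating-point reference sketch under stated assumptions, not the hardware datapath: the function name is hypothetical, `p` is assumed to be a normalized GLCM (entries summing to 1), and indices start at 0 rather than 1, which leaves all four values unchanged because they depend only on p(i, j) and on differences of indices.

```python
import math


def texture_features(p):
    """Return ASM, contrast, correlation and dissimilarity of a normalized GLCM."""
    n = len(p)
    idx = range(n)
    asm = sum(p[i][j] ** 2 for i in idx for j in idx)
    # Summing (i - j)^2 * p(i, j) equals grouping the pairs by n = |i - j|.
    contrast = sum((i - j) ** 2 * p[i][j] for i in idx for j in idx)
    dissimilarity = sum(abs(i - j) * p[i][j] for i in idx for j in idx)
    # Marginal distributions, their means and standard deviations.
    px = [sum(p[i][j] for j in idx) for i in idx]
    py = [sum(p[i][j] for i in idx) for j in idx]
    mu_x = sum(i * px[i] for i in idx)
    mu_y = sum(j * py[j] for j in idx)
    sd_x = math.sqrt(sum((i - mu_x) ** 2 * px[i] for i in idx))
    sd_y = math.sqrt(sum((j - mu_y) ** 2 * py[j] for j in idx))
    if sd_x * sd_y == 0:
        correlation = float("nan")  # undefined for a constant image
    else:
        cross = sum(i * j * p[i][j] for i in idx for j in idx)
        correlation = (cross - mu_x * mu_y) / (sd_x * sd_y)
    return {"asm": asm, "contrast": contrast,
            "correlation": correlation, "dissimilarity": dissimilarity}


uniform = texture_features([[0.25, 0.25], [0.25, 0.25]])
diagonal = texture_features([[0.5, 0.0], [0.0, 0.5]])
```

A GLCM concentrated on its diagonal gives zero contrast and dissimilarity and a correlation of 1, while a uniform GLCM gives a correlation of 0.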

III. RELATED WORKS

Many works have focused on speeding up the GLCM and texture feature computation on FPGAs.

Girisha et al. [10] proposed a HW implementation to calculate the GLCM for a single direction \((\theta=0^\circ)\) and a distance \((d=1)\). The proposed architecture is implemented on a Xilinx FPGA. The input image is 8x8 pixels, each pixel is represented on \((n=4)\) bits and \((N_g=8)\). The input image is stored in two memory blocks to determine the number of occurrences of particular pairs of pixels in the image.

This architecture can compute the GLCM for only a single angle.

Ali Reza et al. [11] proposed a hardware implementation to compute the GLCM for four angles \((\theta=0^\circ, 45^\circ, 90^\circ, 135^\circ)\) in parallel for an image size of 128x128 pixels. Each pixel of the image is represented on \((n=8)\) bits. The HW implementation of the GLCM was done on a Xilinx Virtex 5 platform. The size of the GLCM is \(N_g \times N_g\) (256x256) since each pixel is represented on \((n=8)\) bits, with distance \((d=1)\). This HW implementation computes the GLCM matrices in parallel and stores them in RAM blocks for the calculation of the texture features. The texture features are computed in a 16-bit integer format.

The outputs of this architecture are produced in an integer format, which generates inaccurate results. In addition, the size of the input image is fixed.

Amin et al. [12] proposed a HW/SW implementation to calculate four GLCM matrices (\(0^\circ\), \(45^\circ\), \(90^\circ\) and \(135^\circ\)) in parallel. The HW implementation of the GLCM was done on the Zedboard platform, based on the Zynq circuit. The size of the GLCM is \(N_g \times N_g\) (256x256) since each pixel is represented on \((n=8)\) bits, with \((d=1)\). The coefficients of the matrix are represented on \((m=16)\) bits.

The disadvantage of this architecture is that the image size is limited.

Dimitris et al. [13] presented a HW implementation to compute 16 GLCMs and four feature vectors using a single core. They chose \(N_g = 64\) gray levels for image sizes from \((512 \times 512)\) to \((2048 \times 2048)\). They used fixed-point instead of floating-point operations to compute four texture features, namely Angular Second Moment (ASM), correlation (Cor), Inverse Difference Moment (IDM) and entropy. Their architecture has 16 GLCM computation units. The size of each GLCM is \(N_g \times N_g\) (64x64) since each pixel is represented on \((n=6)\) bits, with distance \((d=1)\). This HW implementation can compute the GLCMs of 16 image blocks in parallel and store them in two RAM blocks for the calculation of the texture features. This architecture takes into account the overlap between the windows, but it does not take into account the overlap between the different image blocks, which produces inaccurate results.

The hardware implementation of Dimitris et al. is the best of the existing solutions, since its input image size is not limited, unlike the other architectures.

The goal of our research project is to solve the problems of the architecture proposed by Dimitris et al.

IV. HARDWARE ARCHITECTURE OF GLCM AND TEXTURE FEATURES

FPGAs, or reconfigurable chips, have continued to evolve since their creation, and they are now used in complete systems (Xilinx Zynq or Altera Stratix). Nevertheless, there are still many application fields from which they are, wrongly, absent [14]. For this reason, many programmers use High Level Synthesis (HLS) tools to increase FPGA design productivity. HLS tools make it possible to go from a high-level behavioural description of a system, written in C/C++, SystemC, etc., to synthesizable RTL code consisting of a data path and control logic [15].

A. Architectural design for GLCM and texture features

The architecture of the implemented hardware is shown in Figure 3. It was developed in C code compatible with the HLS tool. The SW part iteratively feeds the Xilinx platform with four vectors. Each pixel of the vectors is represented on (n=8) bits, so (Ng=256). Our proposed architecture reads the pixels of each vector, computes the GLCM of each vector and the respective texture features for the four directions and for a distance d=1, and stores them in memory banks.

Once the matrices are computed, we can proceed with the computation of the four texture features in parallel. Our system has two outputs: one presents the texture feature results and the other signals the end of the data transfer. The SW part is used for setting up and sending data via the DMA. It also performs the division operation to maintain the accuracy of the floating-point values of contrast, correlation, dissimilarity and ASM. The arrangement of the four vectors for the four angles (θ = 0°, 45°, 90° and 135°) is done in the software part. In our architecture we use Direct Memory Access (DMA) in order to accelerate the data transfer.
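One way to picture this split between integer accumulation in hardware and floating-point division in software is the Python sketch below. The exact partition used in the paper is not detailed, so the two functions, their names and the choice of accumulated sums are assumptions; correlation is left out to keep the example short.

```python
def hw_accumulate(counts):
    """Integer-only sums over GLCM counts (assumed hardware-side work)."""
    n = len(counts)
    total = sum_sq = sum_contrast = sum_diss = 0
    for i in range(n):
        for j in range(n):
            c = counts[i][j]
            total += c
            sum_sq += c * c
            sum_contrast += (i - j) ** 2 * c
            sum_diss += abs(i - j) * c
    return total, sum_sq, sum_contrast, sum_diss


def sw_finalize(total, sum_sq, sum_contrast, sum_diss):
    """Floating-point divisions, done once per feature (assumed software-side work)."""
    return {
        "asm": sum_sq / (total * total),   # sum of p^2 with p = c / total
        "contrast": sum_contrast / total,
        "dissimilarity": sum_diss / total,
    }


# Occurrence counts of a small 4-level GLCM, used only as sample input.
counts = [[2, 2, 1, 0],
          [0, 2, 0, 0],
          [0, 0, 3, 1],
          [0, 0, 0, 1]]
features = sw_finalize(*hw_accumulate(counts))
```

Deferring the division keeps every accumulated value an exact integer, so the only rounding happens in the final floating-point step.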

B. Circular buffer unit

In order to avoid memory saturation, four circular buffer units are used to reduce the external bandwidth requirements of our architecture. Figure 2 shows the pixels of the input block. The squares in the grid represent the pixels in the block. The colored pixels are stored in the circular buffers. Two outputs are presented at the end of each circular buffer, namely the central pixel (black background) and its neighboring pixel for the four directions (gray background). As shown in Figure 2, in every clock cycle the last pixel is removed from each circular buffer, the neighborhood is shifted by 1 pixel to the right and a new pixel is inserted into the buffer.
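The behaviour of one circular buffer per orientation can be modelled in software as follows. This is a simplified sketch, not the authors' design: it assumes the pixels arrive in raster order for an image of width `w` (at least 2), uses `collections.deque` with `maxlen` as a stand-in for the hardware buffer, and the function names and the per-angle buffer depths are assumptions derived from the d = 1 neighbour positions.

```python
from collections import deque


def lag_for(angle, w):
    """Raster-order distance between the two pixels of a pair (d = 1)."""
    return {0: 1, 45: w - 1, 90: w, 135: w + 1}[angle]


def stream_pairs(pixels, w, angle):
    """Yield (centre, neighbour) pairs from a raster-order pixel stream."""
    lag = lag_for(angle, w)
    buf = deque(maxlen=lag + 1)      # appending to a full deque drops the oldest pixel
    for k, new in enumerate(pixels):
        buf.append(new)
        if len(buf) <= lag:
            continue                 # buffer not yet filled
        old = buf[0]                 # pixel that arrived `lag` positions earlier
        c = k % w                    # column of the newest pixel
        if angle == 0:
            if c >= 1:
                yield old, new       # centre (r, c-1), neighbour (r, c)
        elif angle == 45:
            if c <= w - 2:
                yield new, old       # centre (r, c), neighbour (r-1, c+1)
        elif angle == 90:
            yield new, old           # centre (r, c), neighbour (r-1, c)
        elif c >= 1:
            yield new, old           # 135: centre (r, c), neighbour (r-1, c-1)


def glcm_from_stream(pixels, w, levels, angle):
    """Accumulate a GLCM while the pixels stream through the buffer."""
    counts = [[0] * levels for _ in range(levels)]
    for centre, neighbour in stream_pairs(pixels, w, angle):
        counts[centre][neighbour] += 1
    return counts


# 4x4 test image with 4 gray levels, flattened in raster order.
pixels = [0, 0, 1, 1,
          0, 0, 1, 1,
          0, 2, 2, 2,
          2, 2, 3, 3]
streamed = {a: glcm_from_stream(pixels, 4, 4, a) for a in (0, 45, 90, 135)}
```

In this model each pixel is read from the stream only once per orientation, and the buffer never holds more than one image row plus two pixels, which is the bandwidth saving that the circular buffers aim for.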

In order to improve the architectural performance, we exploited the optimization directives offered by Vivado HLS. First, in "Solution 2", we applied the RESOURCE directive to the GLCM matrices to implement them as BRAMs. The ALLOCATION directive was then applied to the texture feature operations, making it possible to share the resources used by the hardware block. In "Solution 3", we applied the UNROLL directive to the texture feature computation. Finally, in "Solution 4", we sought to improve the maximum throughput achieved by the hardware IP; for this reason, the PIPELINE directive was applied to the GLCM and texture feature computation. It should be noted that the synthesis of "Solution 1" is carried out without any optimization directive. From the results shown in Table 2, we note that Solution 4 has the lowest latency (number of cycles) of all the solutions. Indeed, Solution 4 allows a significant gain of about 56% in number of cycles compared to Solution 1. The following analyses are based on "Solution 4" since it offers the best compromise in number of cycles.

<table>
<thead>
<tr>
<th>Solutions</th>
<th>FPGA</th>
<th>DSP %</th>
<th>FF %</th>
<th>LUT %</th>
<th>BRAM %</th>
<th>Latency (cycles)</th>
</tr>
</thead>
<tbody>
<tr>
<td>#Solution 1</td>
<td>Zc-102</td>
<td>38</td>
<td>2</td>
<td>5</td>
<td>91</td>
<td>2557658</td>
</tr>
<tr>
<td>#Solution 2</td>
<td>Zc-102</td>
<td>38</td>
<td>3</td>
<td>7</td>
<td>91</td>
<td>1019949</td>
</tr>
<tr>
<td>#Solution 3</td>
<td>Zc-102</td>
<td>38</td>
<td>3</td>
<td>7</td>
<td>91</td>
<td>1170725</td>
</tr>
<tr>
<td>#Solution 4</td>
<td>Zc-102</td>
<td>38</td>
<td>3</td>
<td>9</td>
<td>91</td>
<td>232504</td>
</tr>
</tbody>
</table>


Fig. 3. Proposed HW/SW architecture of GLCM and texture feature computation
V. EXPERIMENTAL RESULTS

In this section, our goal is to validate the performance of our proposed Hardware/Software co-design.

The implementation of our architecture was done on a Zc-702 FPGA device using Vivado HLS. In addition to the programmable logic (PL), the Zynq FPGA family contains a dual-core ARM microprocessor system with its memory controllers and external peripherals, as shown in Figure 4. It is a hard-core processor that can operate at a maximum frequency of 700 MHz.

Fig. 4. Architecture of Zc-702 FPGA

Internally, the HLS tool combines three main steps (Figure 5). The first is compilation, which transforms the input code into an Intermediate Representation (IR) adapted to the needs of the tool. In the second step, the circuit in its intermediate representation undergoes transformations. Finally, we generate the bitstream to configure the HW part using the Vivado IDE.

Fig. 5. System design flow

In the context of SoC development, Xilinx provides the ZC 702 FPGA board. As shown in Figure 6, it consists of a PS (Processing System) part based on a dual-core ARM Cortex-A9 processor that communicates with a PL (Programmable Logic) part through an internal communication bus. In order to improve the architectural performance of the GLCM design, the four vectors are processed in the same hardware IP by exploiting four DMA ports in parallel. The first DMA is configured in read/write mode, while the other DMAs are used in write-only mode.

Fig. 6. Heterogeneous SoC design

Table 2 presents a comparative study between our proposed architecture and previous work.

According to Figure 7, we observe a reduction of about 17% in the execution time of our HW/SW solution compared to the solution of Dimitris et al. [13].
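As a rough consistency check on this figure, and assuming it refers to the 8-bit configuration (5.8 ms) against the 7 ms reported for [13] in the comparison table, the relative reductions for the two listed configurations work out as follows:

\[
\frac{7 - 5.8}{7} \approx 0.17, \qquad \frac{7 - 3.9}{7} \approx 0.44
\]

Under that assumption, the 6-bit configuration (3.9 ms), which matches the GLCM size used in [13], would correspond to a reduction of roughly 44%.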

Fig. 7. Comparison of execution time of our proposed HW / SW solution and the other work

As shown in Figure 8, we observe a reduction of about 94% in the execution time of the HW/SW solution compared to the SW solution. As a result, we can conclude that the HW/SW implementation is more efficient than the SW-only implementation.

Fig. 8. Comparison of the execution time of the SW and our proposed HW / SW solution

From these analyses, it is clear that the HW/SW solution offers a compromise between SW flexibility and HW performance.
VI. CONCLUSION

In conclusion, in this paper we proposed an FPGA accelerator based on one circular buffer unit per orientation for fast and optimized computation of the GLCM \((\theta = 0^\circ, 45^\circ, 90^\circ, 135^\circ)\) and four texture features. The synthesis results on FPGA show a significant gain (about 17\%) in execution time compared to previous work. To highlight the performance of our proposed architecture, we adopted a joint HW/SW design to estimate the execution time. These analyses showed a considerable gain (about 94\%) in execution time for the HW/SW solution compared to the SW solution.

Finally, in future work we plan to extract the texture features from video frames.

ACKNOWLEDGMENT

We thank our colleagues from ESIEE Paris, who provided insight and expertise that greatly assisted the research and improved the manuscript.

REFERENCES


<table>
<thead>
<tr>
<th>Proposed architectures</th>
<th>FPGA</th>
<th>Bits per pixel</th>
<th>GLCM size</th>
<th>Image size</th>
<th>HW/SW calculations</th>
<th>Freq (MHz)</th>
<th>Execution time</th>
</tr>
</thead>
<tbody>
<tr>
<td>Dimitris et al [13]</td>
<td>XCV2000E -6</td>
<td>6</td>
<td>64x64</td>
<td>176X144</td>
<td>GLCM+Cor+ASM+IDM+Entropy</td>
<td>100</td>
<td>7 ms</td>
</tr>
<tr>
<td>Our architecture</td>
<td>Zc-702</td>
<td>6</td>
<td>64x64</td>
<td>176X144</td>
<td>GLCM+Cont+ASM+Cor+DISS</td>
<td>100</td>
<td>3.9 ms</td>
</tr>
<tr>
<td>Our architecture</td>
<td>Zc-702</td>
<td>8</td>
<td>256x256</td>
<td>176X144</td>
<td>GLCM+Cont+ASM+Cor+DISS</td>
<td>100</td>
<td>5.8 ms</td>
</tr>
</tbody>
</table>