C. J. Xue, Z. Shao, and E. H.-M. Sha, Maximize parallelism minimize overhead for nested loops via loop striping, J. VLSI Sig. Proc. Syst, vol.47, issue.2, pp.153-167, 2007.

N. L. Passos and E. H.-M. Sha, Achieving full parallelism using multi-dimensional retiming, IEEE Trans. Par. Dist. Syst, vol.7, issue.5, pp.1150-1163, 1996.

M. Sheliga, N. L. Passos, and E. H.-M. Sha, Fully parallel hardware/software codesign for multidimensional DSP applications, Proceedings of the 4th International Workshop on Hardware/Software Co-Design (CODES'96), pp.18-25, 1996.
DOI : 10.1109/hcs.1996.492222

URL : http://www.cse.nd.edu/~esha/papers/nelson/iwhsc96.ps

C. J. Xue, E. H.-M. Sha, Z. Shao, and M. Qiu, Effective Loop Partitioning and Scheduling under Memory and Register Dual Constraints, Proceedings of the conference on Design, automation and test in Europe (DATE'08), Munich (Germany), pp.1202-1207, 2008.

Q. Zhuge, Z. Shao, B. Xiao, and E. H.-M. Sha, Design space minimization with timing and code size optimization for embedded DSP, Proceedings of the 1st IEEE/ACM/IFIP international conference on Hardware/software codesign & system synthesis, CODES+ISSS '03, pp.144-149, 2003.
DOI : 10.1145/944645.944685

N. L. Passos and E. H.-M. Sha, Scheduling of Uniform Multi-Dimensional Systems under Resource Constraints, IEEE Trans. VLSI Syst, vol.6, issue.4, pp.719-730, 1998.

Q. Zhuge, C. J. Xue, M. Qiu, J. Hu, and E. H.-M. Sha, Timing optimization via nest-loop pipelining considering code size, Microprocessors and Microsystems, vol.32, issue.7, pp.351-363, 2008.
DOI : 10.1016/j.micpro.2008.02.002

N. L. Passos, D. C. Defoe, R. J. Bailey, R. Halverson, and R. P. Simpson, Theoretical Constraints on Multi-Dimensional Retiming Design Techniques, Proc. of Visual Information Processing X, pp.238-245, 2001.

T. W. O'Neil, S. Tongsima, and E. H.-M. Sha, Extended retiming: Optimal scheduling via a graph-theoretical approach, IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP'99), pp.2001-2004, 1999.

C. E. Leiserson and J. B. Saxe, Retiming synchronous circuitry, Algorithmica, vol.6, issue.1-6, pp.5-35, 1991.

Y. Elloumi, M. Akil, and M. H. Bedoui, Timing and Code Size Optimization on Achieving Full Parallelism in Uniform Nested Loop, J. of comput, vol.3, issue.7, pp.68-77, 2011.

N. Maheshwari and S. Sapatnekar, Efficient retiming of large circuits, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.6, issue.1, pp.74-83, 1998.
DOI : 10.1109/92.661250

URL : http://www-cad.eecs.berkeley.edu/HomePages/wjiang/ee219b/sapatnekar_retime.pdf

H. Zhou, Deriving a new efficient algorithm for min-period retiming, Proceedings of the 2005 conference on Asia South Pacific design automation, ASP-DAC '05, pp.990-993, 2005.
DOI : 10.1145/1120725.1120774

J. Wang and H. Zhou, An efficient incremental algorithm for min-area retiming, Proceedings of the 45th annual conference on Design automation, DAC '08, pp.528-533, 2008.
DOI : 10.1145/1391469.1391603

URL : http://users.eecs.northwestern.edu/~haizhou/publications/dac08wang.pdf

N. Maheshwari and S. Sapatnekar, Efficient minarea retiming of large level-clocked circuits, Proceedings Design, Automation and Test in Europe, pp.840-847, 1998.
DOI : 10.1109/DATE.1998.655956

URL : http://www.ee.umn.edu/users/sachin/pubhtml/../PUBS/date98.pdf

H. Rong, Z. Tang, R. Govindarajan, A. Douillet, and G. Gao, Single-dimension software pipelining for multidimensional loops, ACM Transactions on Architecture and Code Optimization, vol.4, issue.1, pp.163-174, 2007.
DOI : 10.1145/1216544.1216550

URL : http://www.cgo.org/cgo2004/papers/13_86_rong_h.pdf

U. Bondhugula, M. Baskaran, S. Krishnamoorthy, J. Ramanujam, A. Rountev et al., Automatic Transformations for Communication-Minimized Parallelization and Locality Optimization in the Polyhedral Model, Lecture Notes in Computer Science, vol.4959, pp.132-146, 2008.
DOI : 10.1007/978-3-540-78791-4_9

URL : https://link.springer.com/content/pdf/10.1007%2F978-3-540-78791-4_9.pdf

A. Morvan, S. Derrien, and P. Quinton, Efficient nested loop pipelining in high level synthesis using polyhedral bubble insertion, 2011 International Conference on Field-Programmable Technology, pp.1-10, 2011.
DOI : 10.1109/FPT.2011.6132715

URL : https://hal.archives-ouvertes.fr/hal-00746434

K. Turkington, G. A. Constantinides, K. Masselos, and P. Y. K. Cheung, Outer Loop Pipelining for Application Specific Datapaths in FPGAs, IEEE Transactions on Very Large Scale Integration (VLSI) Systems, vol.16, issue.10, pp.1268-1280, 2008.
DOI : 10.1109/TVLSI.2008.2001744

K. Muthukumar and G. Doshi, Software Pipelining of Nested Loops, Lecture Notes in Computer Science, vol.2027, pp.165-181, 2001.
DOI : 10.1007/3-540-45306-7_12

URL : https://link.springer.com/content/pdf/10.1007%2F3-540-45306-7_12.pdf

M. Fellahi, A. Cohen, and S. Touati, Code-size conscious pipelining of imperfectly nested loops, Proceedings of the 2007 workshop on MEmory performance DEaling with Applications, systems and architecture, MEDEA '07, pp.49-55, 2007.
DOI : 10.1145/1327171.1327177

URL : https://hal.archives-ouvertes.fr/hal-00646688

M. A. Khan, Improving performance through deep value profiling and specialization with code transformation, J. Comp. Lang. Syst. Struct, vol.37, issue.4, pp.193-203, 2011.

M. Fellahi and A. Cohen, Software Pipelining in Nested Loops with Prolog-Epilog Merging, Lecture Notes in Comp. Sc, vol.18, issue.4, pp.80-94, 2009.
DOI : 10.1007/3-540-36579-6_2

URL : https://hal.archives-ouvertes.fr/inria-00445489

T. Grosser, A. Cohen, P. H. J. Kelly, J. Ramanujam, P. Sadayappan et al., Split tiling for GPUs, Proceedings of the 6th Workshop on General Purpose Processor Using Graphics Processing Units, GPGPU-6, pp.24-31, 2013.
DOI : 10.1145/2458523.2458526

URL : https://hal.archives-ouvertes.fr/hal-00786812

M. Liu, E. H.-M. Sha, Q. Zhuge, Y. He, and M. Qiu, Loop Distribution and Fusion with Timing and Code Size Optimization, Journal of Signal Processing Systems, vol.18, issue.2, pp.325-340, 2011.
DOI : 10.1177/1094342004038956

A. Qasem and K. Kennedy, Model-guided empirical tuning of loop fusion, International Journal of High Performance Systems Architecture, vol.1, issue.3, pp.183-198, 2008.
DOI : 10.1504/IJHPSA.2008.021798

D. Liu, Z. Shao, M. Wang, M. Guo, and J. Xue, Optimal loop parallelization for maximizing iteration-level parallelism, Proceedings of the 2009 international conference on Compilers, architecture, and synthesis for embedded systems, CASES '09, pp.67-76, 2009.
DOI : 10.1145/1629395.1629407

URL : http://www.cse.unsw.edu.au/~jingling/papers/cases09.pdf

Y. Lee and C. Chen, A two-level scheduling method: an effective parallelizing technique for uniform nested loops on a DSP multiprocessor, Journal of Systems and Software, vol.75, issue.1-2, pp.1-2, 2005.
DOI : 10.1016/j.jss.2003.02.001

C. J. Xue, J. Hu, Z. Shao, and E. H.-M. Sha, Iterational Retiming with Partitioning: Loop Scheduling with Complete Memory Latency Hiding, ACM Trans. Emb. Comp. Syst, vol.9, issue.3, article 22, 2010.

T. W. O'Neil and E. H.-M. Sha, Combining extended retiming and unfolding for rate-optimal graph transformation, J. VLSI Sign. Proc, vol.39, issue.3, pp.273-293, 2005.

L. Pouchet, C. Bastoul, A. Cohen, and J. Cavazos, Iterative optimization in the polyhedral model: Part II, multidimensional time, Proceedings of the 2008 ACM SIGPLAN conference on Programming language design and implementation (PLDI'08), pp.90-100, 2008.
DOI : 10.1109/cgo.2007.21

URL : https://hal.archives-ouvertes.fr/hal-01257273

J. Lai and A. Seznec, TEG: GPU Performance Estimation Using a Timing Model, 2011.
URL : https://hal.archives-ouvertes.fr/hal-00641726

Y. Elloumi, M. Akil, and M. H. Bedoui, Execution Time and Code Size Optimization Using Multidimensional Retiming and Loop Striping, 2013 Euromicro Conference on Digital System Design, pp.462-466, 2013.
DOI : 10.1109/DSD.2013.132

Y. Wang and Z. Liu, Exploring speculative procedure and loop level parallelism in SPLASH2, International Journal of High Performance Systems Architecture, vol.5, issue.2, pp.84-92, 2014.
DOI : 10.1504/IJHPSA.2014.061439

E. P. Ferlin, H. S. Lopes, C. R. Lima, and M. Perretto, PRADA: a high-performance reconfigurable parallel architecture based on the dataflow model, Int. J. of High Performance Systems Architecture, vol.3, issue.1, pp.41-55, 2011.