Optimization of recurrence computations on vector and parallel computers

Przemysław Stpiczyński

Abstract


The aim of this monograph is to present the author’s contribution to the fields of designing vector and parallel algorithms for solving problems based on linear recurrence computations and optimizing such software for modern vector and parallel computer architectures. In the first chapter, we give a concise overview of the fundamentals of of modern computer architectures, performance analysis and methods for vector parallel programming. We also define the problem of solving linear recurrence systems and present some basic results which can be found in the literature. Chapter 2 describes the use of the Level 1 BLAS operation AXPY as a key to efficient vectorization of m-th order linear recurrence systems with constant coefficients. Applying the Hockney-Jesshope model of vector computations, we present the performance analysis of the algorithm. We also consider the influence of memory bank conflicts. The theoretical analysis is supported by the experimental results collected on two vector supercomputers manufactured by Cray. The aim of Chapter 3 is to present a new efficient BLAS-based algorithm for solving linear recurrence systems with constant coefficients, which can be easily and efficiently implemented on shared memory machines. The algorithm is based on Level 3 and Level 2 BLAS routines GEMM, GEMV and TRMV, which are crucial for its efficiency. The results of experiments performed on a various shared memory computers are also presented and discussed. The can be efficiently implemented on high performance message passing parallel and vector computers and clusters of workstations. In Chapter 4 we analyze the performance of the algorithm using two well known models: BSP and Hockney-Jesshope. Finally, we present the results of experiments performed of a Cray X1 and two different clusters running under Linux. The aim of the chapters 5, 6 and 7 is to show that the introduced algorithms for solving linear recurrence systems can be applied for solving a number of problems which arise in scientific computing. In Chapter 5 we introduce a new BLAS-based algorithm for narrow-banded triangular Toeplitz matrix-vector multiplication and show how to evaluate linear recursive filters efficiently on distributed memory parallel computers. We apply the BSP model of parallel computing to predict the behavior of the algorithm and to find the optimal values of the method’s parameters. The results of experiments performed on a cluster of twelve dual-processor Itanium 2 computers and Cray X1 are also presented and discussed. The algorithm allows to utilize up to 30% of the peak performance of 24 Itanium processors, while a simple scalar algorithm can only utilize about 4% of the peak performance of a single processor. Next, we show that the performance of the algorithm for evaluating linear recursive filters can be increased by using new generalized data structures for dense matrices introduced by F. G. Gustavson. The results of experiments performed on Intel Itanium 2, Cray X1 and dual-processor Quad-Core Xeon are also presented and discussed. In Chapter 6 we introduce a new high performance divide and conquer algorithm for finding trigonometric sums which can be applied to improve the performance of the Talbot’s method for the numerical inversion of the Laplace Transform on modern computer architectures including shared memory parallel computers. We also show how to vectorize the first stage of the Talbot’s method, namely computing all coefficients of the trigonometric sums used by the method. Numerical tests show that the improved method gives the same accuracy as the standard algorithm and it allows to utilize parallel processors. In Chapter 7 we show hot to apply our algorithms for polynomial evaluation. Finally, Chapter 8 presents the new data distribution of triangular matrices that provides steady distribution of blocks among processes and reduces memory wasting in comparison with the standard block-cyclic data layout that is used in the ScaLAPACK library for dense matrix computations. The new algorithm for solving triangular systems of linear equations is also introduced. The results of experiments performed on a cluster of Itanium 2 processors and Cray X1 show that in some cases, the new method is faster than corresponding PBLAS routines PSTRSV and PSTRSM.

Keywords


recursive computing; computing capacity; BLAS; optimizing programs on computer; linear recursive filters; AXPY; GER

Full Text:

PDF (Polski)

References


Abate J., Valko P.: Multi-precision Laplace transform inversion, Int. J. Numer. Mech. Eng., 60, s. 797÷993, 2004.

Aho A.V., Sethi R., Ullman J.D.: Kompilatory: reguły, metody i narzędzia, WNT, Warszawa 2002.

Allen R., Kennedy K.: Optimizing compilers for modern architectures: a dependence-based approach, Morgan Kaufmann, 2001.

Anderson E., Bai Z., Bischof C., Demmel J., Dongarra J., Du Croz J., Greenbaum A., Hammarling S., McKenney A., Ostruchov S., Sorensen D.: LAPACK User’s Guide, SIAM, Philadelphia 1992.

Axelsson O., Eijkhout V.: A note on the vectorization of scalar recursions., Parallel Comput., 3, s. 73÷83, 1986.

Baboulin M., Giraud L., Gratton S., Langou J.: A distributed packed storage for large dense parallel incore calculations, Tech. Rep. TR/PA/05/30, CERFACS, 2005.

Baker A., Dennis J., Jessup E.R.: Toward memory-efficient linear solvers, Lecture Notes in Computer Science, 2565, s. 315÷238, 2003.

Bansal S.S., Vishal B., Gupta R.: Near optimal Cholesky factorization on orthogonal multiprocessors, Information Processing Letters, 84, s. 23÷30, 2002.

Bario R., Melendo B., Serrano S.: On the numerical evaluation of linear recurrences, J. Comput. Appl. Math., 150, s. 71÷86, 2003.

Bini D.A., Fiorentino G.: On the parallel evaluation of a sparse polynomial at a point., Numer. Algorithms, 20, s. 323÷329, 1999.

Bisseling R.H.: Parallel Scientific Computation. A Structured Approach Using BSP and MPI, Oxford University Press, 2004.

Blackford L., et al: ScaLAPACK user’s guide, SIAM, Philadelphia 1997.

Blelloch G., Chatterjee S., Zagha M.: Solving linear lecurrences with loop raking, Journal of Parallel and Distributed Computing, 25, s. 91÷97, 1995.

Buttari A., Dongarra J., Kurzak J., Langou J., Luszczek P., Tomov S.: The impact of multicore on math software, Lecture Notes in Computer Science, 4699, s. 1÷10, 2007.

Carlson D.A.: Solving linear recurrence systems on mesh-connected computers with multiple global buses, Journal of Parallel and Distributed Computing, 8, s. 89÷95, 1990.

Carlson D.A., Sugla B.: Time and Processor Efficient Parallel Algorithms for a Recurrence Equations and Related Problems, Proceedings of the 1984 International Conference on Parallel Processing (ICPP’84), s. 310÷314, IEEE, Bellaire, Michigan 1984.

Chandra R., Dagum L., Kohr D., Maydan D., McDonald J., Menon R.: Parallel Programming in OpenMP, Morgan Kaufmann Publishers, San Francisco 2001.

Chaudron M.R., van Duin A.C.: The formal derivation of parallel triangular system solvers using a coordination-based design method., Parallel Comput., 24, s. 1023-1046, 1998.

Chen S.C., Kuck D.J.: Time and parallel processor bounds for linear recurrence systems., IEEE Trans. Comput., 24, s. 701÷717, 1975.

Chen S.C., Kuck D.J., Sameh A.H.: Practical parallel band triangular system solvers, ACM Trans. on Math. Soft., 4, s. 270÷277, 1978.

Choi J., Dongarra J., Ostrouchov S., Petitet A., Walker D., Whaley R.: LAPACK Working Note 100: a proposal for a set of parallel basic linear algebra subprograms, http://www.netlib.org/lapack/lawns, 1995.

Consortium UPC: UPC Language Specifications, 2005.

Cormen T., Leiserson C., Rivest R.: Introduction to Algorithms, MIT Press, 1994.

Daydé M.J., Duff I.S.: The RISC BLAS: a blocked implementation of level 3 BLAS for RISC processors, ACM Trans. Math. Soft., 25, s. 316÷340, 1999.

Daydé M.J., Duff I.S., Petitet A.: A parallel block implementation of Level-3 BLAS for MIMD vector processors, ACM Trans. Math. Soft., 20, s. 178÷193, 1994.

D’Azevedo E.F., Dongarra J.J.: LAPACK Working Notes 135: packed storage extension for ScaLAPACK, 1998.

de Rosa M.A., Giunta G., Rizzardi M.: Parallel Talbot’s algorithm for distributed memory machines, Parallel Computing, 21, s. 783÷801, 1995.

Dongarra J.: Performance of Various Computer Using Standard Linear Algebra Software, http://www.netlib.org/benchmark/performance.ps.

Dongarra J., Bunsch J., Moler C., Stewart G.: LINPACK User’s Guide, SIAM, Philadel-phia 1979.

Dongarra J., DuCroz J., Duff I., Hammarling S.: A set of level 3 basic linear algebra subprograms, ACM Trans. Math. Soft., 16, s. 1÷17, 1990.

Dongarra J., DuCroz J., Hammarling S., Hanson R.: An extended set of Fortran basic linear algebra subprograms, ACM Trans. Math. Soft., 14, s. 1÷17, 1988.

Dongarra J., Duff I., Sorensen D., Van der Vorst H.: Solving Linear Systems on Vector and Shared Memory Computers, SIAM, Philadelphia 1991.

Dongarra J., Duff I., Sorensen D., Van der Vorst H.: Numerical Linear Algebra for High Performance Computers, SIAM, Philadelphia 1998.

Dongarra J., Gustavson F., Karp A.: Implementing linear algebra algorithms for dense matrices on a vector pipeline machine, SIAM Rev., 26, s. 91÷112, 1984.

Dongarra J., Hammarling S., Sorensen D.: Block reduction of matrices to condensed form for eigenvalue computations, J. Comp. Appl. Math, 27, s. 215÷227, 1989.

Dongarra J., Johnsson L.: Solving banded systems on parallel processor, Parallel Computing, 5, s. 219÷246, 1987.

Dongarra J., et al: PVM: A User’s Guide and Tutorial for Networked Parallel Computing, MIT Press, Cambridge 1994.

Dongarra J.J., Whaley R.C.: LAPACK Working Note 94: A User’s Guide to the BLACS V1.1, http://www.netlib.org/blacs, 1997.

Elmroth E., Gustavson F., Jonsson I., Kagström B.: Recursive blocked algorithms and hybrid data structures for dense matrix library software, SIAM Rev., 46, s. 3÷45, 2004.

et al. J.D., ed.: The Sourcebook of Parallel Computing, Morgan Kaufmann Publishers, 2003.

Flynn M.: Some computer organizations and their effectiveness, IEEE Trans. Comput.,C–21, s. 94, 1972.

Gajski D.D.: Processor Arrays for Computing Linear Recurrence Systems, Proceedings of the 1978 International Conference on Parallel Processing (ICPP’78), s. 246÷256, IEEE, Bellaire, Michigan 1978.

Garbow B., Boyle J., Dongarra J., Moler C.: Matrix eigensystems routines – EISPACK guide extension, Lecture Notes in Computer Science, Springer-Verlag, New York 1977.

Grama A., Gupta A., Karypis G., Kumar V.: An Introduction to Parallel Computing: Design and Analysis of Algorithms, Addison Wesley, 2003.

Greenberg A., R.E.Lander, Paterson M., Galil Z.: Efficient parallel algorithms for linear recurrence computation, Inf. Proc. Letters, 15, s. 31÷35, 1982.

Gropp W., Lusk E., Skjellum A.: A High-Performance, Portable Implementation of the MPI Message Passing Interface Standard, http://www.unix.mcs.anl.gov/mpi/mpich1/papers/mpicharticle/paper.html.

Gustavson F.G.: New generalized data structures for matrices lead to a variety of high performance algorithms, Lect. Notes Comput. Sci., 2328, s. 418÷436, 2002.

Gustavson F.G.: High-performance linear algebra algorithms using new generalized data structures for matrices, IBM J. Res. Dev., 47, s. 31÷56, 2003.

Gustavson F.G.: The Relevance of New data structure approaches for dense linear algebra in the new multi-core / many core environments, Lecture Notes in Computer Science, 4967, s. 618÷621, 2008.

Gustavson F.G., Karlsson L., Kagstrom B.: Three algorithms for Cholesky factorization on distributed memory using packed storage, Lecture Notes in Computer Science, 4699, s. 550÷559, 2007.

Heath M., Romine C.: Parallel solution of triangular systems on distributed memory multiprocessors, SIAM J. Sci. Statist. Comput., 9, s. 558÷588, 1988.

Higham N.J.: Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia 1996.

Hill J., Donaldson S., Skillicorn D.: Portability of Performance with the BSPlib Communications Library, Programming Models for Massively Parallel Computers, (MPPM’97), IEEE Computer Society Press, London 1997.

Hockney R., Jesshope C.: Parallel Computers: Architecture, Programming and Algorithms, Adam Hilger Ltd., Bristol 1981.

Horvitz G., Bisseling R.H.: Designing a BSP Version of ScaLAPACK, B. Hendrickson, et al, eds., Proceedings Ninth SIAM Conference on Parallel Processing for Scientific Computing, SIAM, Philadelphia 1999.

Hyafil L., Kung H.: The complexity of parallel evaluation of linear recurrences., J. Assoc. Comput. Mach., 24, s. 513÷521, 1977.

Intel Corporation: Intel Pentium 4 and Intel Xeon Processor Optimization Reference Manual, 2002.

Intel Corporation: IA-32 Intel Architecture Software Developer’s Manual. Volume 1: Basic Architecture, 2003.

Intel Corporation: Intel 64 and IA-32 Architectures Optimization Reference Manual, 2007.

Intel Corporation: Intel 64 and IA-32 Architectures Software Developer’s Manual. Volume 1: Basic Architecture, 2008.

Jamieson L.H., Gannon D.B., Douglass R.J.: The Characteristics of Parallel Algorithms, MIT Press, 1987.

Johnsson S.: Solving narrow banded systems on ensemble architectures., ACM Trans. Math. Softw., 11, s. 271÷288, 1985.

Kagström B., Ling P., Loan C.V.: GEMM-based level 3 BLAS: high-performance model implementations and performance evaluation benchmark, ACM Trans. Math. Soft., 24, s. 268÷302, 1998.

Kim H.S., Yoon Y.H., Han D.S.: Parallel processing of first order linear recurrence on SMP machines, The Journal of Supercomputing, 27, s. 295÷310, 2004.

Kincaid D., Chenay W.: Analiza numeryczna, WNT, Warszawa 2006.

Kitowski J.: Współczesne systemy komputerowe, CCNS, Kraków, 2000.

Kowarschik M., Weiss C.: An overview of cache optimization techniques and cache-aware numerical algorithms, Lecture Notes in Computer Science, 2625, s. 213÷232, 2003.

Kozielski S., Szczerbinski Z.: Komputery równoległe: architektura, elementy programowania, WNT, Warszawa 1994.

Kozniewska I.: Rownania rekurencyjne, PWN, Warszawa 1972.

Kuszmaul B.C., Henry D.S., Loh G.H.: A comparison of asymptotically scalable super-scalar processors, Theory of Computing Systems, 35, s. 129÷150, 2002.

Lacoursiere C.: A Parallel block iterative method for interactive contacting rigid multibody simulations on multicore PCs, Lecture Notes in Computer Science, 4699, s. 956÷965,2007.

Larid M.: A Comparison of Three Current Superscalar Designs, Computer Architecture News (CAN), 20, s. 14÷21, 1992.

Larriba-Pey J.L., Navarro J.J., Jorba A., Roig O.: Review of general and Toeplitz vector bidiagonal solvers., Parallel Comput., 22, s. 1091÷1125, 1996.

Lawson C., Hanson R., Kincaid D., Krogh F.: Basic linear algebra subprograms for Fortran usage, ACM Trans. Math. Soft., 5, s. 308÷329, 1979.

Lee J.b., Sung W., Moon S.M.: An Enhanced Two-level Adaptive Multiple Branch Prediction for Superscalar Processors., Lengauer, Christian (ed.) et al., Euro-par ’97 parallel processing. 3rd international Euro-Par conference, Passau, Germany, August 26-29, 1997. Proceedings. Berlin: Springer. Lect. Notes Comput. Sci. 1300, 1053-1060 , 1997.

Li G., Coleman T.F.: A new method for solving triangular systems on distributed-memory message-passing multiprocessors., SIAM J. Sci. Stat. Comput., 10, s. 382÷396, 1989.

Li L., Hu J., Nakamura T.: A simple parallel algorithm for polynomial evaluation., SIAM J. Sci. Comput., 17, s. 260-262, 1996.

Lu M., Qiao X., Chen G.: A parallel algorithm for evaluating general linear recurrence equations., Circuits Syst. Signal Process., 15, s. 481÷504, 1996.

McCurdy C.W., Stevens R., Simon H., Kramer W., Bailey D., Johnston W., Catlett C., Lusk R., Morgan T., Meza J., Banda M., Leighton J., Hules J.: Creating Science-driven Computer Architecture: A New Path to Scientific Leadership, 2007, URL http://www.osti.gov/servlets/purl/806195-zzoRAy/native/

Modi J.: Parallel Algorithms and Matrix Computation, Oxford University Press, Oxford, 1988.

Munro I., Paterson M.: Optimal algorithms for parallel polynomial evaluation, J. Comput. System Sci., 7, s. 189-198, 1973.

Murli A., Rizzardi M.: Algorithm 682: Talbot’s method for the Laplace inversion problem, ACM Trans. Math. Soft., 16, s. 158÷168, 1990.

Netwok Computer Services Inc.: The AHPCRC Cray X1 Primer, http://www.ahpcrc.org/publications/Primer.pdf.

Ortega J.M.: Introduction to Parallel and Vector Solution of Linear Systems, Springer, 1988.

Pacheco P.: Parallel Programming with MPI, Morgan Kaufmann, San Frncisco, 1996.

Pan X.: An improved recursive doubling algorithm for the parallel solution of linear recurrence R., Trans. Nanjing Univ. Aeronaut. Astronaut., 12, s. 218÷220, 1995.

Paprzycki M., Cyphers C.: Gaussian elimination on Cray Y-MP, CHPC Newsletter, 6, s. 77÷82, 1991.

Paprzycki M., Cyphers C.: Multiplying matrices on the cray - practical considerations, CHPC Newsletter, 6, s. 43÷47, 1991.

Paprzycki M., Stpiczyńnski P.: Solving linear recurrence systems on the Cray Y-MP, Lecture Notes in Computer Science, 879, s. 416÷424, 1994.

Paprzycki M., Stpiczński P.: Solving Linear Recurrence Systems on Parallel Computers, R. Kalia, P. Vashishta, eds., Toward Teraflop Computing and New Grand Challenge Applications – Proceedings of the Conference Mardi Gras’94, Baton Rouge, February 1994, s. 379÷384, Nova Science Publishers, New York 1995.

Paprzycki M., Stpiczyński P.: Parallel solution of linear recurrence systems, Z. Angew. Math. Mech., 76, s. 5÷8, 1996.

Paprzycki M., Stpiczy´ski P.: A Brief Introduction to Parallel Computing, E. Kontoghiorghes, ed., Parallel Computing and Statistics, s. 3÷42, Taylor & Francis, 2006.

Parhami B.: Introduction to Parallel Processing: Algorithms and Architectures, Plenum Press, 1999.

Parhi K.K., Messerschmitt D.: Pipeline interleaving and parallelism in recursive digital filters, part ii: Pipelined incremental block filtering, IEEE Trans. Acoust.,Speech Signal Processing, ASSP-37, s. 1118÷1135, 1989.

Quinn M.J.: Parallel Programming in C with MPI and OpenMP, McGraw-Hill Education, 2003.

Rahman N.: Algorithms for hardware caches and TLB, Lecture Notes in Computer Science, 2625, s. 171÷192, 2003.

Robelly J.P., G.Cichon, Seidel H., Fettweis G.: Implementation of Recursive Digital Filters into Vector SIMD DSP Architectures, Proc. International Conference on Acoustics, Speech and Signal Processing. ICASSP 2004. Montreal, Canada, 17.-21. May 2004.

Romine C., Ortega J.: Parallel solutions of triangular systems of equations, Parallel Comput., 6, s. 109÷114, 1988.

Sameh A.H., Brent R.P.: Solving triangular systems on a parallel computer., SIAM J. Numer. Anal., 14, s. 1101÷1113, 1977.

Schendel U.: Introduction to Numerical Methods for Parallel Computers, Ellis Horwood Limited, New York 1984.

Smith S.W.: The Scientist and Engineer’s Guide to Digital Signal Processing, California Technical Publishing, San Diego 1997.

Stewart G.: Introduction to Matrix Computations, Academic Press, New York 1973.

Stoer J., Bulirsh R.: Introduction to Numerical Analysis, Springer, New York 1993.

Stpiczynski P.: Parallel algorithms for solving linear recurrence systems, Lecture Notes in Computer Science, 634, s. 343÷348, 1992.

Stpiczynski P.: Parallel Cholesky factorization on orthogonal multiprocessors, Parallel Computing, 18, s. 213÷219, 1992.

Stpiczynski P.: Efficient data-parallel algorithms for computing trigonometric sums, Ann. Univ. Mariae Curie-Sklodowska Sect. A, 56, s. 85÷96, 2002.

Stpiczynski P.: Fast Parallel Algorithms for Computing Trigonometric Sums, M. Tudruj, A. Jordan, eds., Proceedings of PARELEC 2002, International Conference on Parallel Computing in Electrical Engineering, s. 299÷304, IEEE Computer Society Press, 2002.

Stpiczynski P.: A new message passing algorithm for solving linear recurrence systems, Lecture Notes in Computer Science, 2328, s. 466÷473, 2002.

Stpiczynski P.: Fast parallel algorithm for polynomial evaluation, Parallel Algorithms and Applications, 18, s. 209÷216, 2003.

Stpiczynski P.: Fast solver for Toeplitz bidiagonal systems of linear equations, Ann. Univ. Mariae Curie-Sklodowska Informatica, 1, s. 81÷87, 2003.

Stpiczynski P.: Numerical Evaluation of Linear Recurrences on High Performance Computers and Clusters of Workstations, Proceedings of PARELEC 2004, International Conference on Parallel Computing in Electrical Engineering, s. 200÷205, IEEE Computer Society Press, 2004.

Stpiczynski P.: Numerical Evaluation of Linear Recurrences on Various Parallel Computers, M. Kovacova, ed., Proceedings of Aplimat 2004, 3rd International Conference,Bratislava, Slovakia, February 4–6, 2004, s. 889÷894, Technical Univerity of Bratislava, Bratysława 2004.

Stpiczynski P.: Solving linear recurrence rystems using level 2 and 3 BLAS routines, Lecture Notes in Computer Science, 3019, s. 1059÷1066, 2004.

Stpiczynski P.: Evaluating recursive filters on distributed memory parallel computers, Comm. Numer. Meth. Engng., 22, s. 1087÷1095, 2006.

Stpiczynski P.: A Note on the numerical inversion of the Laplace transform, Lecture Notes in Computer Science, 3911, s. 551÷558, 2006.

Stpiczynski P.: New data distribution for solving triangular systems on distributed memory machines, Lecture Notes in Computer Science, 4699, s. 589÷597, 2007.

Stpiczynski P.: Evaluating linear recursive filters using novel data formats for dense matrices, Lecture Notes in Computer Science, 4967, s. 688÷697, 2008.

Stpiczynski P., Paprzycki M.: Parallel Algorithms for Finding Trigonometric Sums, D. Bailey, et al, eds., Parallel Processing for Scientific Computing – Proceedings of the Sixth SIAM Conference on Parallel Computing, San Francisco, February 1995, s. 291÷292, SIAM, Philadelphia 1995.

Stpiczynski P., Paprzycki M.: Fully Vectorized Solver for Linear Recurrence Systems with Constant Coefficients, Proceedings of VECPAR 2000 – 4th International Meeting on Vector and Parallel Processing, Porto, June 2000, s. 541÷551, Facultade de Engerharia do Universidade do Porto, 2000.

Stpiczynski P., Paprzycki M.: Numerical Software for Solving Dense Linear Algebra Problems on High Performance Computers, M. Kovacova, ed., Proceedings of Aplimat 2005, 4th International Conference, Bratislava, Slovakia, February 2005, s. 207÷218, Technical Univerity of Bratislava, Bratysława 2005.

Talbot A.: The accurate numerical inversion of Laplace transforms, J. Inst. Maths. Applics., 23, s. 97÷120, 1979.

van de Geijn R.A., Overfelt J.: Advanced Linear Algebra Object Manipulation, R.A. van de Geijn, ed., Using PLAPACK: Parallel Linear Algebra Package, Scientific and Engineering Computing, s. 42÷57, MIT Press, Cambridge 1997.

van der Vorst H., Dekker K.: Vectorization of linear recurrence relations., SIAM J. Sci. Stat. Comput., 10, s. 27÷35, 1989.

Wang H.: A parallel method for tridiagonal equations., ACM Trans. Math. Softw., 7, s. 170÷183, 1981.

Wang H., Nicolau A.: Speedup of Banded Linear Recurrences in the Presence of Resource Constraints, 6th ACM International Conference on Supercomputing (6th ICS’92), s. 466÷477, Washington 1992.

Wang H., Nicolau A., Keung S., Siu K.Y.: Computing programs containing band linear recurrences on vector supercomputers, IEEE Trans. Parallel Distrib. Systems, 7, s. 769÷782, 1996.

Whaley R.C., Petitet A., Dongarra J.J.: Automated empirical optimizations of software and the ATLAS project, Parallel Computing, 27, s. 3÷35, 2001.

Wolfe M.: High Performance Compilers for Parallel Computing, Addison–Wesley, 1996.

Yoon Y., Kim H.S., Han D.S., Youn H.Y.: Parallel Processing of Recurrence Operations in SMP Machines, H.R. Arabnia, ed., Proceedings of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2000, June 24-29, 2000, Las Vegas, CSREA Press, 2000.

Zima H.: Supercompilers for Parallel and Vector Computers, ACM Press, 1990.




DOI: http://dx.doi.org/10.21936/si2008_v29.n3B.521