Altivec
From Applepedia
What is Altivec?
The Velocity Engine, embodied in the G4 and G5 processors, expands the current PowerPC architecture through addition of a 128-bit vector execution unit that operates concurrently with existing integer and floating-point units. This provides for highly parallel operations, allowing for simultaneous execution of up to 16 operations in a single clock cycle. This new approach expands the processor's capabilities to concurrently address high-bandwidth data processing (such as streaming video) and the algorithmically intensive computations which today are handled off-chip by other devices, such as graphics, audio, and modem functions. The AltiVec instruction set allows operation on multiple bits within the 128-bit wide registers. This combination of new instructions, operation in parallel on multiple bits, and wider registers provides speed enhancements of up to 30x on operations that are common in media processing.[1]
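To make the "up to 16 operations" figure concrete: a 128-bit register holds sixteen 8-bit elements, and one vector instruction applies the same operation to all of them. The following is a minimal portable C sketch of that idea, not real AltiVec code. The type `vec_u8x16` and the functions `add_saturate_u8` and `brighten` are hypothetical names made up for illustration, and the inner loop only models in scalar code what the vector unit would do in a single instruction.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { LANES = 16 };  /* 128 bits / 8 bits per element = 16 lanes */

/* Hypothetical stand-in for one 128-bit vector register of 16 bytes. */
typedef struct { uint8_t lane[LANES]; } vec_u8x16;

/* Saturating add across all 16 lanes: sums clamp at 255 instead of
   wrapping around. On real hardware the AltiVec C interface exposes a
   saturating add as vec_adds; it is not used here so that the sketch
   compiles on any machine. */
vec_u8x16 add_saturate_u8(vec_u8x16 a, vec_u8x16 b)
{
    vec_u8x16 r;
    for (int i = 0; i < LANES; ++i) {
        unsigned sum = (unsigned)a.lane[i] + (unsigned)b.lane[i];
        r.lane[i] = (uint8_t)(sum > 255u ? 255u : sum);
    }
    return r;
}

/* Brighten an 8-bit pixel buffer one 16-byte block at a time.
   Assumption: n is a multiple of 16 (a real routine would also
   handle the leftover tail). */
void brighten(uint8_t *pixels, size_t n, uint8_t amount)
{
    assert(n % LANES == 0);

    vec_u8x16 delta;
    memset(delta.lane, amount, LANES);  /* same value in every lane */

    for (size_t off = 0; off < n; off += LANES) {
        vec_u8x16 v;
        memcpy(v.lane, pixels + off, LANES);   /* "load" 16 bytes   */
        v = add_saturate_u8(v, delta);         /* one vector op     */
        memcpy(pixels + off, v.lane, LANES);   /* "store" 16 bytes  */
    }
}
```

Calling `brighten(buffer, 32, 100)` on a 32-byte buffer would process it as two 16-lane blocks. Saturation is the reason this kind of operation suits media data: an overflowing pixel stays white rather than wrapping to black.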
Links
Introduction
- Altivec by Ian Ollman - Excellent introduction to Altivec and when, why, and how to use it.
- Freescale Altivec Fact Sheet
Apple's Documents
- Altivec Address Alignment
- The Caches
- Understanding SIMD
- Data Handling and Data Formats
- Performance Issues: Memory Usage
- Throughput and Latency
- Code Optimization
- Performance Measurement
- Software Pipelining
- Algorithms
Tutorials
- Altivec Tutorial by Ian Ollman
- Altivec Hands-On Tutorial (tgz)
- Altivec Tutorial by Craig Lund
- Programming the Velocity Engine COMING SOON: Needs Hosting
- Practical Altivec Strategies COMING SOON: Needs Hosting
Reference
- Altivec Programming Environments Manual: This describes the machine and assembly level stuff
- Altivec Programmer's Interface Manual: This describes the C interface
Papers
- Paper on a Supercomputer-style FFT library for Apple G4
- Paper on vector implementation of multiprecision arithmetic
- Paper on vector implementation of color-image wavelet transform
- Paper on fast matrix algebra on Apple G4
- Vectorized Machine Vision Algorithms Using Altivec: Needs hosting
- A Performance Evaluation Of Multimedia Kernels Using AltiVec Streaming SIMD Extensions: Needs Hosting
- Exploiting SIMD Parallelism in DSP and Multimedia Algorithms Using the AltiVec Technology: Needs Hosting
- Some Fast Speech Processing Algorithms Using Altivec Technology: Needs hosting
- Ultra-Fast Matrix Multiplication: An Empirical Analysis of Highly Optimized Vector Algorithms: Needs Hosting
- Paper on accelerating TCP/IP
- Paper on accelerating cryptography
- Parallelized QuickSort with Optimal Speedup: Needs hosting
- Paper on vectorizing insertion sort and extended merge sort
- Paper on vectorizing hashing algorithms
- Paper on vectorizing the Adler32 hashing algorithm
- AltiVec Extensions to the Portable Expression Template: Needs hosting
Articles
- Unrolling AltiVec, Part 1: Introducing the PowerPC SIMD unit
- Unrolling AltiVec, Part 2: Optimize code for SIMD processing
- Unrolling AltiVec, Part 3: Down and dirty loop optimization: Check the date!
- From the stacks: TCP/IP checksum vectorization using AltiVec, Part 1
- From the stacks: TCP/IP checksum vectorization using AltiVec, Part 2
High Level APIs Already In OS X
Using Apple's high-level Accelerate.framework APIs is recommended to ensure portability to OS X on x86, as Altivec instructions will not execute there, not even under Rosetta!
- vDSP: A collection of digital signal processing functions such as FFTs, convolutions and squares. It uses the vector engine when available and certain criteria are met on G4 equipped computers, and it uses the scalar unit on G3 equipped computers. vDSP also provides basic vector and matrix operations. vDSP is x86 compatible.
- vImage: Image processing functions, fast as hell.
- BLAS: Basic Linear Algebra Subprograms with vector, matrix, and scalar functions.
- LAPACK: LAPACK provides routines for solving systems of simultaneous linear equations, least-squares solutions of linear systems of equations, eigenvalue problems, and singular value problems. The associated matrix factorizations (LU, Cholesky, QR, SVD, Schur, generalized Schur) are also provided, as are related computations such as reordering of the Schur factorizations and estimating condition numbers. Dense and banded matrices are handled, but not general sparse matrices. In all areas, similar functionality is provided for real and complex matrices, in both single and double precision.
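As a rough picture of what "solving systems of simultaneous linear equations" via an LU-style factorization involves, here is a small scalar reference sketch in plain C. It is only an illustration and does not call Accelerate. The function name `solve_dense` and its interface are hypothetical, and it uses textbook Gaussian elimination with partial pivoting rather than whatever the framework does internally. In a shipping application you would call the framework's LAPACK routines instead (the standard LAPACK driver for this job is named `sgesv`/`dgesv`), which are far better tuned.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value helper, kept local so the sketch needs no libm. */
static double magnitude(double x) { return x < 0.0 ? -x : x; }

/* Solve A x = b for a dense n-by-n matrix stored row-major in a[].
   Both a[] and b[] are overwritten; on success b[] holds x and the
   function returns 0. Returns -1 if a zero pivot shows the matrix
   to be singular. (Hypothetical interface, for illustration only.) */
int solve_dense(double *a, double *b, size_t n)
{
    assert(a != NULL && b != NULL);

    for (size_t k = 0; k < n; ++k) {
        /* Partial pivoting: pick the largest entry in column k. */
        size_t p = k;
        for (size_t i = k + 1; i < n; ++i)
            if (magnitude(a[i * n + k]) > magnitude(a[p * n + k]))
                p = i;
        if (a[p * n + k] == 0.0)
            return -1;

        if (p != k) {  /* swap rows k and p of A and of b */
            for (size_t j = 0; j < n; ++j) {
                double t = a[k * n + j];
                a[k * n + j] = a[p * n + j];
                a[p * n + j] = t;
            }
            double t = b[k]; b[k] = b[p]; b[p] = t;
        }

        /* Eliminate column k from every row below the pivot. */
        for (size_t i = k + 1; i < n; ++i) {
            double f = a[i * n + k] / a[k * n + k];
            for (size_t j = k; j < n; ++j)
                a[i * n + j] -= f * a[k * n + j];
            b[i] -= f * b[k];
        }
    }

    /* Back substitution on the resulting upper-triangular system. */
    for (size_t i = n; i-- > 0; ) {
        double s = b[i];
        for (size_t j = i + 1; j < n; ++j)
            s -= a[i * n + j] * b[j];
        b[i] = s / a[i * n + i];
    }
    return 0;
}
```

For the system 2x + y = 3, x + 3y = 5, passing `a = {2, 1, 1, 3}` and `b = {3, 5}` leaves approximately 0.8 and 1.4 in `b`. The inner elimination loop is the kind of regular, data-parallel work that the vectorized libraries above are built to speed up.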
Other Libraries
- LibMotoVec Freescale's Altivec libraries
- LibFreeVec (In development) Vectorized replacements for a small number of standard C library functions
- zlib-altivec Vectorized replacement for zlib, 25% speed increase
- macstl (In development) Vectorized replacement for C++ STL
Examples and Samples
- Velocity Engine FFT
- Velocity Engine Wavelet
- Xcode Programming Examples
- vDSP Examples
- vBLAS Samples
- Basic Vectorized Algorithms
- Vector Math Library: Apple provides source for vMathLib for educational purposes. If you intend to use this code in your application, it is suggested that you use the routines in vecLib.framework instead. They are likely to be more optimized and more correct. The Accelerate.framework versions also include x86 implementations.
- Basic Algebraic Operations: Apple provides source for vBasicOps for educational purposes. If you intend to use this code in your application, it is suggested that you use the routines in vecLib.framework instead. They are likely to be more optimized.
- Vector Big Number Library: Apple provides source for the vBigNum library for educational purposes. If you intend to use this code in your application, it is suggested that you use the routines in vecLib.framework instead. They are likely to be more optimized.
- Complex finite impulse response filter
- Real FIR
- Real delayed least mean squared FIR
- Altivec and Fixed-Point FFT
- Altivec and Floating Point FFT
- RGB to YCbCr Color Translation
- GSM Soft-Decision Viterbi Decoder
- GSM Convolutional Encoder
- Vectorized Common Math Subroutines
- Sum of Absolute Differences kernel
- 2D DCT
- 2D Inverse DCT
- Quantization
- Dequantization

