[bitc-dev] Bitc and Simd

orthochronous orthochronous at gmail.com
Sat Aug 14 10:50:53 PDT 2010

On Sat, Aug 14, 2010 at 5:06 PM, Ben Kloosterman <bklooste at gmail.com> wrote:
> Eg
> for ( xmm i = 0 ; i <  loopCount ; i = i + 1)
>       RunLoopVariableDependentSIMDAlgorithm(i) ;
> Or this
> //pointers/data must be 16 byte aligned
> int blockMemCopy(void *destination, void *source, int32 size)
> {
>   xmm *dest = (xmm*)&destination;
>   xmm *sour = (xmm*)&source;
>   int c;
>   for(c=0;c< (size <<2) ;c++)
>      *dest++ = *sour++;
>    return c>>2 ;
> }

Just a quick comment: on ARM chips the NEON unit deliberately runs 5
cycles behind the main scalar pipeline. As such, you are heavily advised
against using SIMD instructions unless you're actually using the full
SIMD capabilities (ideally using the main pipeline just to do control
flow), since otherwise you incur notable penalties moving data between
the unit and the main pipeline. Additionally, the NEON unit on ARM uses
only the L2 cache, so the L1 cache has to be explicitly made coherent
with L2 before any of that data is accessed from the main part of the
CPU.


This is a reasonable design for multimedia, where most of the time the
scalar and SIMD data-sets don't overlap. (I'm interested in ARM as
well as Intel because both of these chips turn up in smartphones,
tablets and netbooks.)
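
To make the "stay inside the SIMD unit" advice concrete, here is a rough
sketch in C using the NEON intrinsics from arm_neon.h. The function name
sum_floats is made up for illustration, and the scalar fallback is only
there so the code still builds on non-ARM machines. The accumulator stays
in a NEON register for the whole loop, the scalar side only drives the
loop counter, and the result crosses back to the scalar pipeline once at
the end.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper: sum n floats.
 * On NEON builds the running total lives in a vector register, so the
 * scalar pipeline only handles control flow inside the loop. */
float sum_floats(const float *data, size_t n)
{
    size_t i = 0;
    float total = 0.0f;

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    float32x4_t acc = vdupq_n_f32(0.0f);

    /* Four floats per iteration, no NEON-to-scalar moves in the loop. */
    for (; i + 4 <= n; i += 4)
        acc = vaddq_f32(acc, vld1q_f32(data + i));

    /* Reduce the four lanes to one value, then do the single transfer
     * from the NEON unit back to the scalar side. */
    float32x2_t pair = vadd_f32(vget_low_f32(acc), vget_high_f32(acc));
    pair = vpadd_f32(pair, pair);
    total = vget_lane_f32(pair, 0);
#endif

    /* Leftover elements (or everything, when NEON is not available). */
    for (; i < n; i++)
        total += data[i];

    return total;
}
```

The pattern to avoid is the opposite one: pulling a lane out with
vget_lane_f32 on every iteration to make a scalar decision, which pays
the transfer penalty each time round the loop.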

David Steven Tweed
