[bitc-dev] Bitc and Simd

Ben Kloosterman bklooste at gmail.com
Sat Aug 14 18:35:08 PDT 2010

Thanks for pointing this out. I also note that on NEON, SIMD is limited to 64 bit,
while mentally I'm focusing on the move to 256 bit this year on x86.

Note that the loop code specifically allows SIMD registers to be kept separate from
the GP registers (and is cleaner and easier than using all the intrinsics), and the
copy code is not going to be affected by a 5 cycle stall.
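
For reference, a minimal runnable C sketch of the aligned block copy from the
quoted message. It assumes GCC or Clang vector extensions. The `vec128` type and
the `block_mem_copy` name are illustrative stand-ins for the `xmm` type and
`blockMemCopy`, and the size is assumed to count 32 bit words, which the
original does not state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* 128-bit vector of four 32-bit lanes (GCC/Clang vector extension).
   Illustrative stand-in for the hypothetical `xmm` type. */
typedef int32_t vec128 __attribute__((vector_size(16), may_alias));

/* Copies `words` 32-bit words, rounded down to whole 16-byte blocks.
   Both pointers must be 16-byte aligned. Returns the words copied. */
size_t block_mem_copy(void *destination, const void *source, size_t words)
{
    assert(((uintptr_t)destination & 15u) == 0);
    assert(((uintptr_t)source & 15u) == 0);

    vec128 *dest = (vec128 *)destination;
    const vec128 *sour = (const vec128 *)source;
    size_t blocks = words / 4;   /* four 32-bit words per 128-bit block */

    for (size_t c = 0; c < blocks; c++)
        dest[c] = sour[c];       /* whole-vector assignment, no scalar lane access */

    return blocks * 4;
}
```

The loop counter stays in a GP register and only the data moves through the
vector type, so nothing crosses between the two register files inside the loop.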

It's also worth noting that the techniques I commented on, which use the 128 bit
registers more like normal registers to allow more work to be done in 1
cycle, are not really worth it for 64 bit on x86, since you can use the 64 bit GP
registers.

You are right about the HW. The use of SIMD registers as GPs would depend on the HW,
as some HW cannot do these GP ops on SIMD registers (e.g. equality), and
hence the compiler would produce an error.

In theory (in phase 3) you could just communicate the intent and the
compiler would decide, but we are a long way off from this.

I suppose logically we have registers here with limitations, e.g. for 64 bit
you have:

GP 64 bit reg
Pure SIMD reg
64 bit SIMD reg with full width GP functions

Rather than use different unions, I would think a compiler error is better,
forcing intrinsics on the appropriate platforms when needed.
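
A rough C sketch of that idea, using equality as the example: each supported
target supplies whole-register equality through its own intrinsics, and any
other target fails at compile time. The `simd128`, `simd128_load` and
`simd128_equal` names are made up for illustration. The intrinsics are the
standard SSE2 and AArch64 NEON ones. How BitC itself would express this is not
decided here.

```c
#include <stdint.h>

#if defined(__SSE2__)
#include <emmintrin.h>
typedef __m128i simd128;

static inline simd128 simd128_load(const int32_t *p)
{
    return _mm_loadu_si128((const __m128i *)p);   /* unaligned 128-bit load */
}

/* SSE2 compares lane by lane, so the lane mask is moved to a GP register
   and checked there: 16 mask bits set means every byte matched. */
static inline int simd128_equal(simd128 a, simd128 b)
{
    return _mm_movemask_epi8(_mm_cmpeq_epi32(a, b)) == 0xFFFF;
}

#elif defined(__aarch64__)
#include <arm_neon.h>
typedef int32x4_t simd128;

static inline simd128 simd128_load(const int32_t *p)
{
    return vld1q_s32(p);
}

/* All lanes are equal if the minimum across the compare mask is all ones. */
static inline int simd128_equal(simd128 a, simd128 b)
{
    return vminvq_u32(vceqq_s32(a, b)) == 0xFFFFFFFFu;
}

#else
#error "no whole-register equality for simd128 on this target; use intrinsics"
#endif
```

On both supported targets the comparison result has to leave the SIMD unit to
become a scalar truth value, which is exactly the transfer cost discussed for
NEON in the quoted message below.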


 >-----Original Message-----
 >From: orthochronous [mailto:orthochronous at gmail.com]
 >Sent: Sunday, August 15, 2010 1:51 AM
 >To: bklooste at gmail.com; Discussions about the BitC language
 >Subject: Re: [bitc-dev] Bitc and Simd
 >On Sat, Aug 14, 2010 at 5:06 PM, Ben Kloosterman <bklooste at gmail.com>
 >> Eg
 >> for ( xmm i = 0 ; i <  loopCount ; i = i + 1)
 >>       RunLoopVariableDependentSIMDAlgorithm(i) ;
 >> Or this
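 >> //pointers/data must be 16 byte aligned; size is assumed to count 32 bit words
 >> int blockMemCopy(void *destination, void *source, int32 size)
 >> {
 >>   xmm *dest = (xmm*)destination;  // cast the pointer itself, not its address
 >>   xmm *sour = (xmm*)source;
 >>   int c;
 >>   for(c=0;c< (size >>2) ;c++)     // four 32 bit words per 128 bit register
 >>      *dest++ = *sour++;
 >>    return c<<2 ;                  // number of words copied
 >> }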
 >Just a quick comment: on ARM chips the NEON unit is deliberately run 5
 >cycles behind the main scalar pipeline. As such, it is heavily advised
 >against using SIMD instructions unless you're actually using the full
 >SIMD capabilities (ideally using the main pipeline just to do control
 >flow) since otherwise you incur notable penalties moving data in both
 >directions between the unit and the main pipeline. Additionally the
 >NEON unit on ARM uses only the L2 cache, requiring explicitly making
 >the L1 cache coherent with L2 before accessing any of the data in the
 >main part of the CPU:
 >This is a reasonable design for multimedia, where most of the time the
 >scalar and SIMD data-sets don't overlap. (I'm interested in ARM as
 >well as Intel because both of these chips turn up in smartphones,
 >tablets and netbooks.)
 >David Steven Tweed
