Understanding Structured Buffer Performance

In a case like the one described below, a Structure-of-Arrays layout might serve the GPU better than an Array-of-Structures layout, since it may work around the inefficient packing and obviate the need for padding.
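As a rough sketch of what a Structure-of-Arrays layout could look like, assume per-element data made of a float4 position and a float radius (the same fields as the Foo structure used in this article). The buffer names and the helper function are hypothetical, chosen only for illustration:

```hlsl
// Hypothetical Structure-of-Arrays layout: one buffer per field
// instead of one buffer of structs.
StructuredBuffer<float4> FooPositions; // stride of 16 bytes
StructuredBuffer<float>  FooRadii;     // stride of 4 bytes

// Hypothetical helper: tests whether point p lies inside element i.
bool IsInsideFoo(uint i, float3 p)
{
    float4 position = FooPositions[i]; // always starts on a 16-byte multiple
    float  radius   = FooRadii[i];
    float3 d = p - position.xyz;
    return dot(d, d) <= radius * radius;
}
```

Each buffer is tightly packed with no wasted bytes, at the cost of binding and indexing two resources instead of one. Whether this beats a padded structure likely depends on the access pattern, so it is worth measuring.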

Structured Buffers are by definition tightly packed. This means that the following code generates a buffer with a stride of 20 bytes:

struct Foo
{
    float4 Position;
    float  Radius;
};
StructuredBuffer<Foo> FooBuf;

That may not seem terrible, but it has performance implications for your code that may not be immediately obvious. Because the stride is not a multiple of 128 bits (16 bytes), the Position element often spans cache lines, which makes the structure more expensive to read. A single inefficient read from a structure is unlikely to hurt your performance much, but the cost can add up quickly. A shader iterating over a list of complex lights, with more than 100 bytes of data per light, can be a serious pothole. In fact, we recently found prerelease code where whole-frame performance was penalized by over 5% by just such a difference.
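One possible way to restore a 128-bit-friendly stride is to pad the structure by hand. The following is a minimal sketch, not a definitive fix; the names FooPadded, Padding, and FooPaddedBuf are hypothetical, and it assumes the buffer itself starts on a 16-byte boundary:

```hlsl
// Hypothetical padded variant of Foo: the stride grows from 20 to 32 bytes,
// so every element starts on a multiple of 16 bytes.
struct FooPadded
{
    float4 Position; // bytes 0..15
    float  Radius;   // bytes 16..19
    float3 Padding;  // bytes 20..31, unused (assumption: contents are ignored)
};
StructuredBuffer<FooPadded> FooPaddedBuf;
```

The trade-off is memory: each element takes 32 bytes instead of 20, or 60% more. The CPU-side structure and the stride given at buffer creation must match the padded layout as well.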
