
Struct Padding in GCC

In general, tight data structures with compact representations are good for low latency: they ensure that as much data as possible fits in the caches closest to the CPU, that is, the Level 1 data cache (L1d) and Level 2 (L2) on modern CPUs. But sometimes the compiler pads your structs to satisfy the hardware's data alignment requirements, resulting in struct sizes that are not what you'd expect. To illustrate that, let's examine a short program that we'll compile with g++.

#include <iostream>
using namespace std;

// structure padding...
struct foo1 {
    char a;     /* 1 byte  */
                /* 3 bytes padding added here */
    int b;      /* 4 bytes */
    char c;     /* 1 byte  */
                /* 3 bytes padding added here */
};

struct foo2 {
    char a1;    /* 1 byte  */
    char a2;    /* 1 byte  */
    char a3;    /* 1 byte  */
    char a4;    /* 1 byte  */
    int b;      /* 4 bytes */
    char c1;    /* 1 byte  */
    char c2;    /* 1 byte  */
    char c3;    /* 1 byte  */
    char c4;    /* 1 byte  */
};

struct foo3 {
    char a;     /* 1 byte  */
                /* 1 byte padding added here */
    short b;    /* 2 bytes */
    int c;      /* 4 bytes */
};

// reorder the structure members by decreasing alignment
struct foo4 {
    int a;      /* 4 bytes */
    short b;    /* 2 bytes */
    char c;     /* 1 byte  */
                /* 1 byte padding added here */
};

// structure packing... tells g++ not to add padding and not
// to make assumptions about alignment of accesses to members
struct foo1p {
    char a;     /* 1 byte  */
    int b;      /* 4 bytes */
    char c;     /* 1 byte  */
} __attribute__((__packed__));

int main() {

    // calculate the sizes of the structs
    const int sizefoo1 = sizeof(foo1);
    const int sizefoo2 = sizeof(foo2);
    const int sizefoo3 = sizeof(foo3);
    const int sizefoo4 = sizeof(foo4);

    // try again on the packed variant
    const int sizefoo1p = sizeof(foo1p);

    // output the size results
    cout << "foo1 has size = " << sizefoo1 << endl;
    cout << "foo2 has size = " << sizefoo2 << endl;
    cout << "foo3 has size = " << sizefoo3 << endl;
    cout << "foo4 has size = " << sizefoo4 << endl;
    cout << "foo1p has size = " << sizefoo1p << endl;

    return 0;
}

Running this program, after compiling with g++ and the -O0 option, we get...

foo1 has size = 12
foo2 has size = 12
foo3 has size = 8
foo4 has size = 8
foo1p has size = 6

Note that the foo1 struct would at first seem to need 6 bytes, but it actually takes double that: 12 bytes. Why? The answer is struct padding. The compiler is trying to keep the CPU happy by ensuring that each member field of the struct is stored at an offset such that offset mod data_size = 0, i.e. data-size aligned.

Since the second member of foo1 is an int, which has a size of 4 bytes, its address needs to be aligned to a 4-byte boundary. Since the first member of foo1 was a 1-byte char, the compiler has little choice but to insert 3 bytes of padding after the "char a" member. The last member field in foo1 is also a char, which can be aligned to any byte, so you might be wondering why there are 3 bytes of padding after it. The reason is that, on x64 architectures, g++ pads the structure so that its total size is a multiple of the alignment of its most strictly aligned member; that way every element of an array of these structs starts on a properly aligned boundary. In this case the largest member is 4 bytes, so the 9 bytes of members plus internal padding are padded with 3 more bytes to make a total of 12 bytes, which is 4-byte aligned.

In the second example note that foo2 has no padding whatsoever (expected size = actual size = 12 bytes) because every member field naturally falls at an offset that is a multiple of its data size.

In the next example, foo3, the char a sits at offset 0, but the short b needs 1 byte of padding before it so that its offset is a multiple of its 2-byte size; the int c then lands naturally at offset 4, for a total of 8 bytes.

A clever trick to minimise or avoid the padding is to order the fields by decreasing alignment, as shown in foo4.

Lastly we consider a case, foo1p, where we have explicitly disabled padding. Notice that the foo1p structure has __attribute__((__packed__)) specified. This is a g++ extension that tells the compiler NOT to pad the members of the structure. Such extensions are implementation-specific, so your mileage may vary depending on the compiler and platform.

Clearly compiler writers are not stupid, and they add padding for a reason! Whilst you can fiddle with the padding by re-ordering member fields or by using the above-mentioned g++ extension to suppress it, the trade-off you are making is improved space compression versus possible performance issues from unaligned memory access, plus reduced portability of the code, as some platforms will even throw exceptions for unaligned access. Also note that if a member field falls across a cache-line boundary, the processor has little choice but to do at least 2 reads to get the data rather than one, so you need to be careful: that read operation costs many more CPU cycles and potential data stalls than the equivalent aligned read. Here's an article that tries to measure it in a semi-scientific fashion.

So if fields can be re-arranged to save space without spanning cache lines too much, there can still be a benefit in doing this, especially when you have a large array/vector of the structs and memory efficiency is paramount.


Google Code Jam 2010: Snapper Chain

Code Jam 2010 started with a qualification round that was harder than in previous years. The first problem, Snapper Chain, involves a series of N switches arranged in sequence that toggle when you snap your fingers, if they are receiving power. The problem asks you to indicate whether a light plugged into the end of the chain of N switches will be on after K snaps. A brute-force evaluation of the state of the light would have a run time of ~O(NK). This isn't feasible since the upper limits on N and K are 30 and 10^8 respectively; that would take far too long to evaluate. Intuitively you know there must be some sort of pattern, so in an effort to get clarity on it, it's best to list the states for some snaps. So let's start with N=5 and see what we find...

K State
0 00000
1 10000
2 01000
3 11000
4 00100
5 10100
6 01100
7 11100
8 00010
9 10010
10 01010
11 11010
12 00110
13 10110
14 01110
15 11110
16 00001
17 10001
18 01001
19 11001
20 00101
21 10101
22 01101
23 11101
24 00011
25 10011
26 01011
27 11011
28 00111
29 10111
30 01111
31 11111

Now it becomes clear how things work. The first switch toggles with every snap, so it's immediately obvious that after an even number of snaps the light cannot be on. Conversely, if N=1 then the light would be on for all odd values of K. If N=2, then we can see that the light would be on when K = 3, 7, 11, 15, 19, 23, 27, 31. That is a pattern with a common increment of 4. If N=3, the light would be on when K = 7, 15, 23, 31 - common increments of 8. And for N=4, the light is on when K = 15, 31 - an increment of 16. It seems that the increments are powers of two. In other words, for a given N and K we can hypothesize that the light is only on when K satisfies the following formula:

K = 2^0 + 2^1 + ... + 2^(N-1) + c·2^N

Here c is any non-negative integer. In other words, the light first comes on after a certain number of snaps and then comes back on after every cycle of 2^N more snaps. Since N << K the summation term here is not a major concern. It is in fact the sum of a geometric progression, so the summation could be replaced with a simpler term, but since N is fairly small (at most 30) I didn't bother to optimise it further. I did add memoization to remember the partial sums so we only ever need to calculate the sum of 30 items once. Here's some C# code to solve the problem:

using System;
using System.Collections.Generic;
using System.IO;

namespace GoogleCodeJam2010.Qualification.SnapperChain
{
    class Program
    {
        static void Main()
        {
            const string basePath = @"";
            var infile = new StreamReader(basePath + "large.txt");
            var outfile = new StreamWriter(basePath + "output.txt");

            // termSum[i] = 2^(i+1) - 1, the first K at which switches 0..i
            // are all on; cycle[i] = 2^(i+1), the period of the repeat
            var termSum = new Dictionary<int, ulong>();
            var cycle = new ulong[31];
            ulong sum = 0;
            ulong current = 1;
            for (int i = 0; i < 31; i++)
            {
                sum = sum + current;
                termSum[i] = sum;
                current = current * 2;
                cycle[i] = current;
            }

            int t = Int32.Parse(infile.ReadLine());
            for (int caseNo = 1; caseNo <= t; caseNo++)
            {
                var data = infile.ReadLine().Split(' ');
                int n = Int32.Parse(data[0]);
                ulong k = UInt64.Parse(data[1]);

                string answer = "OFF";

                // light is on iff K = termSum[n-1] + c * cycle[n-1]
                if ((k - termSum[n - 1]) % cycle[n - 1] == 0)
                    answer = "ON";

                outfile.WriteLine("Case #{0}: {1}", caseNo, answer);
            }
            infile.Close();
            outfile.Close();
        }
    }
}
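Since the summation is a geometric series equal to 2^N - 1, the whole test collapses to a bit trick: the light is on exactly when the low N bits of K are all ones. A sketch of that closed form in C++ (lightOn is an illustrative name, not from the contest code):

```cpp
#include <cstdint>

// 2^0 + 2^1 + ... + 2^(N-1) = 2^N - 1, so K = (2^N - 1) + c * 2^N for some
// c >= 0 exactly when K's low N bits are all set.
bool lightOn(int n, uint64_t k) {
    uint64_t mask = (1ULL << n) - 1;   // low n bits set
    return (k & mask) == mask;
}
```

For the N=5 table above, lightOn(5, 31) is true and lightOn(5, 30) is false, matching the last two rows.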


Inside Chrome’s V8 JavaScript Engine

Recently I've been digging through the C++ source code for the V8 JavaScript engine inside Google Chrome - in my mind, the best browser on the block by a long shot! Reviewing the internals has been a good refresher on efficient C++ programming, though it is hard to adjust back to K&R indenting after using Allman indenting for so long with C# (I wonder how Pythonistas react to all those curly braces when jumping back to C/C++?). The design is very clean, and the use of hidden classes for fast property access and compilation to machine code (yes, MACHINE code) are just some of the secrets to its speed. Of course, with all these hidden classes floating around, it was imperative they created an efficient garbage collector. I haven't yet been through the GC code so I won't elaborate; however, according to Google, V8 uses a "stop-the-world, generational, accurate garbage collector". Looking forward to that! It should be interesting to contrast this approach with the fairly innovative concurrent GC architecture recently unveiled by several Microsoft researchers working on Haskell.

Here's the V8 Tech Lead, Lars Bak from Google, giving a very quick (5 mins) overview of the salient features of V8. He's Danish so trust me - it is English he's speaking:



If you suffer from the same "how does it work" curiosity as I do, just point your favourite SVN client at:

svn checkout http://v8.googlecode.com/svn/trunk/

and you can read, and contribute to, the code. The in-line comments are fairly minimal because the code is largely self-documenting, but I did notice that the programming gurus at Google have been reading the same books I have, i.e. Hacker's Delight. The code extract below is from utils.cc. Which reminds me - I must write up a summary of the bitwise operations that come in handy more often than you'd think, but that's a topic for another post!


// Implementation is from "Hacker's Delight" by Henry S. Warren, Jr.,
// figure 3-3, page 48, where the function is called clp2.
uint32_t RoundUpToPowerOf2(uint32_t x) {
  x = x - 1;
  x = x | (x >> 1);
  x = x | (x >> 2);
  x = x | (x >> 4);
  x = x | (x >> 8);
  x = x | (x >> 16);
  return x + 1;
}
