Optimizing Path Normalization

One of my path functions is normalize. It removes (redundant) slashes and references to directory meta entries (current and parent directory).

A lot of other path functions use or rely on normalize, for example my version of dirname() is simply: return normalize(path + "/..");.

I was recently tasked with rewriting normalize to be more efficient and it proved to be a bit of a challenge, so I’ll share what I came up with.

Motivation / Background

In TextMate I index all bundles and store this index on disk. This is what gets loaded on startup, and in 1.x I stat() directories to see if part of it needs updating.

Starting with Leopard it is possible to use FSEvents rather than having to stat each directory (which is not fast when you have a gazillion bundles). The API of FSEvents is so that you tell what directory you want events for. It may seem like I should just watch ~/Library/Application Support/TextMate/Bundles but it is unfortunately not that simple, since this directory can contain symbolic links pointing outside the directory for which I would then not see events.

Both for this reason, to minimize storage, and to easily figure out what has been added/removed when getting a file system event, I store a lot of relative paths in the index.

I convert all the relative paths to absolute paths for the memory cache (created from the index) and with almost 10,000 bundle items, and join being called multiple times for each item (as we go from bundles directory → bundle → kind folder → actual item) the normalize function was responsible for 55% of the time spent creating this memory cache.

To be exact, it took a tenth of a second, which isn’t much, but in 2.0 editing a bundle does not edit any in-memory data structures, it only saves the new item to disk, which causes a file system event that triggers updating the index, which in turn regenerates the memory cache. So I wanted this to be as close to instant as possible.

Another solution would be to make the paths absolute only when needed. The reason I am presently not doing this has to do with code abstraction (not having the bundle query system know about the index).

The Problem: Resolving Parent References

The difficulty with optimizing normalize is the step which removes parent references. It sounds simple, i.e. each time we see ‘..’ then we backtrack one component in the path.

The complexity comes from allowing ‘..’ as the component preceding ‘..’. So we can’t backtrack one component, we need a stack of backtrack points. The pseudocode for the algorithm is as follows:

stack = [ ]

for component in path

   if component == ".."
         stack.pop
   else  stack.push(component)

return stack.join('/')

This code assumes that the first component of an absolute path is the empty string, e.g. /path/to/foo → [ "", "path", "to", "foo" ] and it doesn’t handle relative paths which (possibly indirectly) start with a reference to parent (e.g. ../foo).

Regardless, this algorithm is not going to work, the problem is that we split a path into an array of strings and use a dynamic stack to handle the backtracking. These things involve memory allocation which makes the code magnitudes slower than a potential in-place algorithm.

We could reduce the number of memory allocations by working with string indices, but it is still based around a dynamic stack and backtracking.

Avoiding the Stack

Turns out getting rid of the stack-based approach is fairly simple if we iterate the path right-to-left, that is, backwards.

If we move backwards then when we see ‘..’ we skip the next component (rather than previous). If next component is also ‘..’ then we skip the two next components, etc. So we can manage this with a simple counter instead of a stack. Here is the code from above, but rewritten to iterate the path backwards:

res  = [ ]
skip = 0

for component in path.reverse

   if component == ".."
      skip += 1
   else if skip > 0
      skip -= 1
   else
      res.push(component)

return res.reverse.join('/')

I kept it close to the stack-based approach, but the above can be converted to an in-place version which simply overwrites path, so we can write the above without any memory allocations at all.

The skip counter also tells us how many parent references we weren’t able to resolve, so we can add (what we didn’t handle in the stack-based version):

while skip > 0
   res.push("..")
   skip -= 1

Absolute versus Relative Paths

I mentioned that for absolute paths the first component should be the empty string, for the stack-based algorithm to work.

This isn’t ideal, for example an (ill formed) path like /foo/../../bar will be converted to bar, i.e. a relative path.

To avoid this, a simple approach is to check if the received path is absolute, if so, remove the first slash to make it relative. Our code then only needs to handle relative paths, and by re-adding the slash last, we ensure the result is absolute.

In the in-place version removing and adding that slash is much simpler than it sounds, at least if we use the calling conventions of the C++ standard template library and takes a first/last iterator as argument and return the new end-point of the sequence, since with such API we can write the entire function like this:

char* remove_parent_references (char* first, char* last)
{
   if(first != last && *first == '/')
      ++first;

   std::reverse(first, last);

   char* src = first;
   char* dst = first;

   size_t skip = 0;
   while(src != last)
   {
      char* from = src;
      while(src != last && (src == from || *src != '/'))
         ++src;

      if(is_parent_meta_entry(from, src))
         ++skip;
      else if(skip)
         --skip;
      else
         dst = std::copy(from, src, dst);
   }

   static char const parent_str[3] = { '/', '.', '.' };
   while(skip--)
      dst = std::copy(parent_str, parent_str + sizeof(parent_str), dst);

   std::reverse(first, dst);
   return first != dst && dst[-1] == '/' ? --dst : dst;
}

SIGPIPE 13