<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>SIGPIPE 13</title>
	<atom:link href="http://sigpipe.macromates.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://sigpipe.macromates.com</link>
	<description>Programming and using OS X</description>
	<lastBuildDate>Sat, 23 Jan 2010 18:02:24 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
			<item>
		<title>Build Automation Part 2</title>
		<link>http://sigpipe.macromates.com/2010/01/23/build-automation-part-2/</link>
		<comments>http://sigpipe.macromates.com/2010/01/23/build-automation-part-2/#comments</comments>
		<pubDate>Sat, 23 Jan 2010 18:00:36 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2010/01/23/build-automation-part-2/</guid>
		<description><![CDATA[This is part 2 of what I think will end up as four parts. This might be a bit of a rehash of the first part, but I skimmed lightly over why it actually is that I am so fond of make compared to most other build systems, so I will elaborate with some examples.

Part [...]]]></description>
			<content:encoded><![CDATA[<p>This is part 2 of what I think will end up as four parts. This might be a bit of a rehash of the <a href="http://sigpipe.macromates.com/2010/01/15/build-automation-part-1/">first part</a>, but I skimmed lightly over why it actually is that I am so fond of <code>make</code> compared to most other build systems, so I will elaborate with some examples.</p>

<p>Part 3 will be a general post about declarative systems, not directly related to build automation. Part 4 should be about auto-generating the make files (which is part of the motivation for writing about declarative systems first).</p>

<p><span id="more-41"></span></p>

<h2>Fundamentals</h2>

<p>The original “insight” of <code>make</code> is that whatever we want executed can be considered a goal and:</p>

<ol>
<li>Each goal is represented by exactly one file.</li>
<li>Each dependency of a goal is itself a goal.</li>
<li>A goal is outdated when the represented file does not exist or is older than at least one of its dependencies.</li>
<li>A goal can be brought up-to-date by one or more shell commands.</li>
</ol>

<p>This is all there is to it. By linking the goals (via dependencies) we get the aforementioned <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph">DAG</a>, and with this simple data structure we can model all our processes as long as the four criteria above are met, which they generally are, at least on unix where “everything is a file” :)</p>
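
<p>As a rough illustration of criteria 3 and 4 (a sketch only, not how <code>make</code> itself is implemented), the check can be written in a few lines of Python. The function names <code>is_outdated</code> and <code>update</code> are made up for this example:</p>

<pre><code>import os
import subprocess

def is_outdated(goal, dependencies):
    """Criterion 3: the file is missing or older than a dependency."""
    if not os.path.exists(goal):
        return True
    goal_mtime = os.path.getmtime(goal)
    return any(os.path.getmtime(dep) > goal_mtime for dep in dependencies)

def update(goal, dependencies, commands):
    """Criterion 4: run the shell commands only when the goal is outdated."""
    if not is_outdated(goal, dependencies):
        return False
    for command in commands:
        # check=True makes a failing command raise, aborting the update.
        subprocess.run(command, shell=True, check=True)
    return True
</code></pre>

<p>Calling <code>update</code> on a goal that is newer than all of its dependencies returns <code>False</code> without running anything, which corresponds to <code>make</code> reporting that a goal is up to date.</p>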

<h2>Extending the Graph</h2>

<p>One of the reasons I like to view the process as a directed graph is that it becomes easy to see how we need to “patch” it to add our own actions. Yes, I said patch, because we can actually do that, and quite easily, even if we can’t edit the original make file.</p>

<p>Imagine we are building <a href="http://wiki.videolan.org/Lunettes">Lunettes</a> (a new UI for the <a href="http://www.videolan.org/vlc/">VLC media player</a>) which depends on <a href="http://wiki.videolan.org/VLCKit">VLCKit</a>.</p>

<p>Considering the graph, there must be some goal of Lunettes that depends on VLCKit. In Makefile syntax this could simply be:</p>

<pre><code>APP_DST=Lunettes.app/Contents

$(APP_DST)/MacOS/Lunettes: $(APP_DST)/Frameworks/VLCKit.framework
</code></pre>

<p>This syntax establishes a connection (dependency) between the executable and the framework. Here I made it depend on the framework’s root directory; of course it should depend on the actual binary in the framework (but then my box will overflow).</p>

<p>What this means is that each time the framework is updated, the executable is considered out-of-date and as a result, will be relinked (with the updated framework).</p>

<h3>Unit Tests</h3>

<p>The reason I mentioned the above link between the application and its framework is that this is where we want to insert new nodes (goals) in the graph in case we want to add unit tests to the VLCKit framework.</p>

<p>So the scenario is this: We write a bunch of unit tests for the VLCKit framework and we want these to run every single time the framework is updated, not only when we feel like it, but at the same time, since we probably spend most time developing on the application itself, we do not want the tests to run each time we do a build.</p>

<p>What we do is mind-bogglingly simple: we introduce a file to represent the unit test goal and we touch this each time the test has been successfully run:</p>

<pre><code>vlckit_test: $(APP_DST)/Frameworks/VLCKit.framework
    if «run test»; then touch '$@'; else false; fi
</code></pre>

<p>We can now <code>make vlckit_test</code> to run the test, and if the test has been run (successfully) after the last build of the framework, then it will just tell us that the goal is up-to-date.</p>

<p>To avoid running this manually, we add the following to our make file:</p>

<pre><code>$(APP_DST)/MacOS/Lunettes: vlckit_test
</code></pre>

<p>Now our application depends on having successfully run the unit test for the used framework.</p>

<p>This is all done without touching any of the existing build files; we simply <strong>extend</strong> the build graph with our new actions.</p>

<p>And the result is IMO beautiful in the sense that the unit tests are only run when we actually change the framework, and failed unit tests will cause the entire build to fail.</p>

<p>As a reader exercise, go download the actual build files of the Lunettes / VLCKit project (much of it is in Xcode) and add something similar. What you will end up with is Xcode’s answer to the problem of extensibility: “custom shell script target” which will run every single time you re-build your target, regardless of whether or not there actually is a need for it.</p>

<p>This might be ok if you only have one thing that falls outside what the system was designed to handle, but when you have half a dozen of these…</p>

<h3>Build Numbers</h3>

<p>Another common build action these days is automated build numbers. Say we are going to do nightly builds of Lunettes and want to put the git revision into the <code>CFBundleVersion</code>.</p>

<p>You remember how everything is a file on unix? To my great delight, git conforms quite well to this paradigm and we can find the current revision as <code>.git/HEAD</code>, although this file contains a reference to the symbolic head which likely is <code>.git/refs/heads/master</code>.</p>

<p>For simplicity let us just assume we always stay on master (and we don’t create packs for the heads). The file is updated each time we make a commit, bumping its date, so all we need to do is have our <code>Info.plist</code> depend on <code>.git/refs/heads/master</code> and let the action to bring <code>Info.plist</code> up-to-date insert the current revision as value for the <code>CFBundleVersion</code> key.</p>

<p>Again make’s simple axiomatic system makes it a breeze to do this, and “do it right”, that is, do it in a way that limits computation to the theoretical minimum, rather than updating the <code>Info.plist</code> with every single build or requiring it to be manually updated.</p>

<h3>External Dependencies</h3>

<p>I have used Lunettes as example in this post so let me continue and link to the <a href="http://wiki.github.com/pdherbemont/Glasses/how-to-build">build instructions</a>.</p>

<p>Here you see several steps you have to do in order to get a successful build. Additionally, if you look in the <a href="http://github.com/pdherbemont/Glasses/tree/master/Frameworks/">frameworks directory of Lunettes</a> you’ll find that it deep-copied these from other projects.</p>

<p>Since every single person who wants to build this has to go through these steps, we should incorporate them in the build process, and it is actually quite simple (had this project been based on make files). For example, we need to clone and build the VLC project, which can be done using:</p>

<pre><code>vendor/vlc:
    git clone git://git.videolan.org/vlc.git '$@'
    $(MAKE) -sC '$@'
</code></pre>

<p>So if there is no <code>vendor/vlc</code> then we do a git clone and call <code>make</code> afterwards. In theory we can also include the make file from this project so that we can do fine-grained dependencies, but since this is not our project we do not have control over its make file and can’t fix any potential clashes, so it’s safer to simply call <code>make</code> recursively on the cloned project.</p>

<p>We need to set up a link between Lunettes and <code>vendor/vlc</code> so that the clone will actually be done (without having to <code>make vendor/vlc</code>), but that is just a single line in our make file.</p>

<h3>Other Actions</h3>

<p>If it isn’t clear by now, make files are what drive my own build process when I build TextMate. I run the build from TextMate itself, and the goal I ask to build is relaunching TextMate on a successful build.</p>

<p>This isn’t always desired, as I am actually using the application when it happens, so what I have done is rather simple and mimics the unit test injection shown above.</p>

<p>Let me start by quoting from my make file:</p>

<pre><code>$(APP_NAME)/run: ask_to_relaunch

ask_to_relaunch: $(APP_PATH)/Contents/MacOS/$(APP_NAME)
    @[[ $$("$$DIALOG" alert …|pl) = *"buttonClicked = 0"* ]]

.PHONY: ask_to_relaunch
</code></pre>

<p>This introduces a new goal (<code>ask_to_relaunch</code>). It is declared “phony” so it is not backed by a file on disk (and therefore always considered outdated). It depends on the actual application binary, so it will never be updated before the application has been fully built.</p>

<p>I use phony goals like <code>«app»/run</code>, <code>«app»/debug</code> and similar. When I build from within TextMate it is the <code>«app»/run</code> goal that I build, and I have set this to depend on my (phony) <code>ask_to_relaunch</code> goal.</p>

<p>As this goal is always outdated, it will run the (shell) command to bring it up-to-date. The shell command opens a dialog (via the <code>"$DIALOG" alert</code> system) which asks whether or not to relaunch. If the user cancels the dialog, the shell command will return a non-zero return code and <code>make</code> will treat that as having failed updating the <code>ask_to_relaunch</code> goal which in turn will cause the <code>«app»/run</code> goal to never be updated (have its (shell) commands executed), as one of its dependencies failed.</p>

<p>Simple yet effective.</p>

<h2>Conclusion</h2>

<p>This has just been a bunch of examples. What I hope to have shown is how simple the basic concept of make is, how easy it is to extend an existing build process, and how flexible make is in what it can actually do for us.</p>

<p>Of the many build systems I have looked at, I don’t see any which has this simple axiomatic definition and is still this versatile. A lot of build systems have been created because make files are ugly/complex/arcane/etc., and I agree with that sentiment. But it seems like many of the replacements are either systems hardcoded for specific purposes, which simplifies the boilerplate but makes them inflexible, or actual programming languages, which makes the build script only marginally better than a custom script. For example some, but not all, of the systems which take the “programming language route” lack the ability to execute tasks in parallel, which, with 16 cores and counting, is a pretty fatal design limitation.</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2010/01/23/build-automation-part-2/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Build Automation Part 1</title>
		<link>http://sigpipe.macromates.com/2010/01/15/build-automation-part-1/</link>
		<comments>http://sigpipe.macromates.com/2010/01/15/build-automation-part-1/#comments</comments>
		<pubDate>Fri, 15 Jan 2010 22:05:54 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2010/01/15/build-automation-part-1/</guid>
		<description><![CDATA[A blog post about Ant vs. Maven concludes that “the best build tool is the one you write yourself” and the Programmer Competency Matrix has “can setup a script to build the system” as requirement for reaching the higher levels in the “build automation” row.

I have looked at a lot of build systems myself, and [...]]]></description>
			<content:encoded><![CDATA[<p>A <a href="http://kent.spillner.org/blog/work/2009/11/14/java-build-tools.html">blog post about Ant vs. Maven</a> concludes that <em>“the best build tool is the one you write yourself”</em> and the <a href="http://www.indiangeek.net/wp-content/uploads/Programmer%20competency%20matrix.htm">Programmer Competency Matrix</a> has <em>“can setup a script to build the system”</em> as requirement for reaching the higher levels in the “build automation” row.</p>

<p>I have looked at a lot of build systems myself, and while I agree that the best build system is the one you create yourself I am also a big fan of <a href="http://www.gnu.org/software/make/manual/make.html"><code>make</code></a> and believe that the best approach is to use generated Makefiles.</p>

<p>This post is a “getting started with <code>make</code>”. I plan to follow up with a part 2 about how to handle auto-generated self-updating Makefiles.</p>

<p><span id="more-38"></span></p>

<h2>Concept</h2>

<p>The <a href="http://www.faqs.org/docs/artu/ch01s06.html">UNIX philosophy</a> is to have small tools (commands) which solve a well defined problem. These can then be combined to build more complex systems.</p>

<p>While each build process is different, the common denominator is that we should be able to represent our target(s) as nodes in a <a href="http://en.wikipedia.org/wiki/Directed_acyclic_graph" title="Directed Acyclic Graph">directed acyclic graph</a> where each node represents a file and each edge represents a dependency.</p>

<p>This is what a Makefile captures, i.e. a Makefile should be a <strong>declaration</strong> of the dependency graph, with an action per node to create it if the file it corresponds to on disk is missing or older than its dependencies, i.e. the nodes we can reach by following its (directed) edges.</p>

<p>By keeping the dependency information declarative we let <code>make</code> figure out which files are outdated and need to be rebuilt plus give it freedom to pick a strategy to rebuild files which may include running jobs in parallel.</p>

<h2>Example</h2>

<p>To give an example let us look at the <a href="http://github.com/andymatuschak/Sparkle/blob/master/generate_keys.rb"><code>generate_keys</code></a> script which is part of Sparkle and can generate a public and private key file.</p>

<p>The public key is extracted from the private key and the private key requires a DSA parameter file (we’ll ignore the <code>-genkey</code> flag to <code>dsaparam</code>).</p>

<p>So our (simple) graph looks like this:</p>

<pre><code>pubkey → privkey → dsa_parameters
</code></pre>
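
<p>To see the order this graph implies, here is a small Python sketch using the standard library’s <code>graphlib</code> module (Python 3.9 or later). It only illustrates the graph; <code>make</code> does its own scheduling, and the dictionary layout is an assumption of this example:</p>

<pre><code>from graphlib import TopologicalSorter

# Each goal maps to the set of goals it depends on.
graph = {
    "pubkey": {"privkey"},
    "privkey": {"dsa_parameters"},
    "dsa_parameters": set(),
}

# static_order() yields dependencies before the goals that need them.
build_order = list(TopologicalSorter(graph).static_order())
print(build_order)
</code></pre>

<p>For this chain the only valid order is <code>dsa_parameters</code>, then <code>privkey</code>, then <code>pubkey</code>.</p>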

<p>A Makefile “rule” is effectively one node in our graph and looks like:</p>

<pre><code>«goal»: «dependencies»
    «action»
</code></pre>

<p>Here <code>«goal»</code> is the node itself, that is, the file it represents. The <code>«dependencies»</code> are the nodes it depends on and <code>«action»</code> is the command(s) to execute to generate/update the node/file (interpreted by the shell).</p>

<p>Using the <a href="http://github.com/andymatuschak/Sparkle/blob/master/generate_keys.rb"><code>generate_keys</code></a> script as source our Makefile ends up like this:</p>

<pre><code>pubkey: privkey
    openssl dsa -in '$&lt;' -pubout -out '$@'

privkey: dsa_parameters
    openssl gendsa '$&lt;' -out '$@'

dsa_parameters:
    openssl dsaparam 2048 &lt; /dev/urandom -out '$@'
</code></pre>

<p>In the above I have used two variables. The variable <code>$@</code> expands to the goal (i.e. the file we are generating) and <code>$&lt;</code> expands to the first dependency.</p>

<p>If you save the above as <code>Makefile</code> and run <code>make</code> then it will generate 3 files: <code>pubkey</code>, <code>privkey</code>, and <code>dsa_parameters</code>. By default calling <code>make</code> without arguments will ensure the first goal in the Makefile is up to date. If you re-run <code>make</code> it should say:</p>

<pre><code>make: `pubkey' is up to date.
</code></pre>

<p>You can also run <code>make privkey</code> to ensure (only) <code>privkey</code> is up to date (which then won’t extract the public key).</p>

<h2>Intermediate Files</h2>

<p>The above Makefile reproduces the script except that we are not removing the temporary <code>dsa_parameters</code> file after having generated the keys. We can fix this by making <code>dsa_parameters</code> a dependency of the fake <code>.INTERMEDIATE</code> goal by adding this line:</p>

<pre><code>.INTERMEDIATE: dsa_parameters
</code></pre>

<p>If we now run <code>make</code> it will automatically remove the <code>dsa_parameters</code> file after it has been used.</p>

<p>We probably want to use our public key from C so let us add another goal (node) namely <code>pubkey.h</code>. This goal will create a C header from the <code>pubkey</code> file, so it will depend on it. This goal can be handled by adding the following rule:</p>

<pre><code>pubkey.h: pubkey
    { echo 'static char const* pubkey ='; \
      sed &lt; '$&lt;' -e $$'s/.*/\t"&amp;\\\\n"/'; \
      echo ';'; } &gt; '$@'
</code></pre>

<p>Perhaps not the nicest way to generate the <code>pubkey.h</code> file but what is nice about this is that whatever application needs to use this header can declare it as a dependency, and it will be generated when needed, including extracting the public key if not already done.</p>

<h2>Includes</h2>

<p>To keep things modular we can save our Makefile as <code>Makefile.keys</code> and include it from our main Makefile using:</p>

<pre><code>include Makefile.keys
</code></pre>

<p>If we go back to the Sparkle distribution there is also a <code>sign_update</code> script which signs an update using the private key.</p>

<p>We can add this as another goal to our Makefile, e.g. using:</p>

<pre><code>archive.sig: privkey archive.tbz
    openssl dgst -dss1 -sign privkey archive.tbz &gt; '$@'
</code></pre>

<p>Here the archive signature depends on both having a private key and an archive. The private key will be generated if not already there; for the archive we of course need to add another goal. The archive goal will depend on our actual binary, which will depend on its object files, which will depend on the sources (where one source is likely going to depend on <code>pubkey.h</code>).</p>

<h2>Phony Targets</h2>

<p>In addition we probably want to add another goal to construct an RSS feed (or similar) which includes the archive signature, and eventually we will want a deploy goal which will depend on the RSS feed and the archive. The action for this goal will likely be using <code>scp</code> to copy the files to the server and the goal itself will not be a file, i.e. when we run <code>make deploy</code> we do not expect an actual <code>deploy</code> file to be generated. While there is little harm in declaring a goal with actions that do not generate the file, we could risk getting a:</p>

<pre><code>make: `deploy' is up to date.
</code></pre>

<p>This happens if there actually is a <code>deploy</code> file which is newer than the dependencies of the <code>deploy</code> goal. To avoid this we make the fake goal named <code>.PHONY</code> depend on <code>deploy</code>, similar to what we did with the <code>.INTERMEDIATE</code> goal:</p>

<pre><code>.PHONY: deploy
</code></pre>

<h2>Closing Words</h2>

<p>This post is just a mild introduction to <code>make</code>. I have deliberately picked something that does not involve building C sources as the example to show that <code>make</code> is a versatile tool.</p>

<p>Whenever you have a set of actions that need to be run in a specific order then consider if a Makefile can capture the dependency graph.</p>

<p>When you do write a Makefile aim for having a rule only do one thing. For example imagine we are writing a manual and store each chapter as Markdown. Rather than do something like this:</p>

<pre><code>chapter.html: header.html chapter.mdown footer.html
    { cat header.html; \
      Markdown.pl &lt; chapter.mdown; \
      cat footer.html; } &gt; '$@'
</code></pre>

<p>We can instead do:</p>

<pre><code>chapter.html: header.html cache/chapter.html footer.html
    cat &gt; '$@' $^

cache/chapter.html: chapter.mdown
    Markdown.pl &lt; '$&lt;' &gt; '$@'
</code></pre>

<p>The new <code>$^</code> variable expands to all the dependencies.</p>

<p>There are a few reasons to favor this approach. In this concrete example we have the advantage of not needing to pipe all the chapters through <code>Markdown.pl</code> if we change the header or footer. But in general it just makes things more flexible, easier to re-use goals, faster to restart a failed build, it may improve the number of jobs that can run in parallel, etc.</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2010/01/15/build-automation-part-1/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Self-balancing Trees</title>
		<link>http://sigpipe.macromates.com/2009/08/22/self-balancing-trees/</link>
		<comments>http://sigpipe.macromates.com/2009/08/22/self-balancing-trees/#comments</comments>
		<pubDate>Sat, 22 Aug 2009 20:02:48 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/08/22/self-balancing-trees/</guid>
		<description><![CDATA[In a previous blog post I describe a data structure which requires the use of a self-balancing binary search tree.


Few need to implement their own self-balancing trees, but since two previous comments referred to AVL and red/black trees respectively, I should give a shout-out to Arne Andersson and his paper titled Binary Search Trees Made [...]]]></description>
			<content:encoded><![CDATA[<p>In a <a href="http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/">previous blog post</a> I describe a data structure which requires the use of a <a href="http://en.wikipedia.org/wiki/Self-balancing_binary_search_tree">self-balancing binary search tree</a>.</p>

<p><span id="more-37"></span>
Few need to implement their own self-balancing trees, but since two previous comments referred to AVL and red/black trees respectively, I should give a shout-out to <a href="http://user.it.uu.se/~arnea/">Arne Andersson</a> and his paper titled <a href="http://user.it.uu.se/~arnea/ps/simp.pdf">Binary Search Trees Made Simple</a> (PDF).</p>

<p>The paper introduces <a href="http://en.wikipedia.org/wiki/AA_tree">AA trees</a>, which are simple to implement, but the logic for when to skew/rotate is not clear from the paper. Julienne Walker filled that hole with a great <a href="http://www.eternallyconfuzzled.com/tuts/datastructures/jsw_tut_andersson.aspx">tutorial about AA trees</a> and how they (<a href="http://en.wikipedia.org/wiki/Red-black_tree#Analogy_to_B-trees_of_order_4">like red/black trees</a>) stem from <a href="http://www.cs.ucr.edu/cs14/cs14_06win/slides/2-3_trees_covered.pdf" title="PDF Showing insert/delete for 2-3 trees">B-trees of order 3</a> (PDF).</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/08/22/self-balancing-trees/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Cuckoo Hashing</title>
		<link>http://sigpipe.macromates.com/2009/08/18/cuckoo-hashing/</link>
		<comments>http://sigpipe.macromates.com/2009/08/18/cuckoo-hashing/#comments</comments>
		<pubDate>Tue, 18 Aug 2009 20:23:55 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/08/18/cuckoo-hashing/</guid>
		<description><![CDATA[The Achilles’ heel of hashing is collision: When we want to insert a new value into the hash table and the slot is already filled, we use a fallback strategy to find another slot, for example linear probing.

The fallback strategy can affect lookup time since we need to do the same probing when a lookup [...]]]></description>
			<content:encoded><![CDATA[<p>The Achilles’ heel of hashing is collision: When we want to insert a new value into the hash table and the slot is already filled, we use a fallback strategy to find another slot, for example <a href="http://en.wikipedia.org/wiki/Linear_probing">linear probing</a>.</p>

<p>The fallback strategy can affect lookup time since we need to do the same probing when a lookup results in an entry with the wrong key, turning the nice <em>O(1)</em> time complexity into (worst case) <em>O(n)</em>.</p>

<p><span id="more-36"></span>
Of course the <em>O(n)</em> time is pessimistic as we will rehash to a larger table size when we reach a certain threshold, though from a theoretical point of view an intriguing approach to handling collisions is <a href="http://en.wikipedia.org/wiki/Cuckoo_hashing">cuckoo hashing</a> which guarantees <em>O(1)</em> lookup time (insertion can still be worse).</p>

<p>Quoting the <a href="http://en.wikipedia.org/wiki/Cuckoo_hashing">Wikipedia page</a>:</p>

<blockquote>
  <p>The basic idea is to use two hash functions instead of only one. This provides two possible locations in the hash table for each key.</p>
  
  <p>When a new key is inserted, a greedy algorithm is used: The new key is inserted in one of its two possible locations, “kicking out”, that is, displacing, any key that might already reside in this location. This displaced key is then inserted in its alternative location, again kicking out any key that might reside there, until a vacant position is found, or the procedure enters an infinite loop. In the latter case, the hash table is rebuilt in-place using new hash functions.</p>
</blockquote>

<p>This means that no additional probing is required during lookup as an element will always be in one of the two slots given by the hash functions (if it is in the table).</p>
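
<p>The scheme quoted above can be sketched as a toy Python set. This is only an illustration under a few assumptions of mine: the class name <code>CuckooSet</code> is made up, the two hash functions are derived from Python’s built-in <code>hash()</code>, <code>None</code> cannot be stored, and on a suspected cycle the table is doubled and re-seeded rather than rebuilt in-place:</p>

<pre><code>class CuckooSet:
    """Toy cuckoo hash set: every key lives in one of two slots."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.seed = 0
        self.slots = [None] * capacity

    def _positions(self, key):
        # Two hash functions derived from hash(); chosen for brevity.
        first = hash((self.seed, 0, key)) % self.capacity
        second = hash((self.seed, 1, key)) % self.capacity
        return first, second

    def __contains__(self, key):
        # Lookup probes at most two slots, hence O(1) in the worst case.
        return any(self.slots[i] == key for i in self._positions(key))

    def add(self, key):
        if key in self:
            return
        index = self._positions(key)[0]
        for _ in range(self.capacity):
            # Place the key, kicking out whatever was stored there.
            key, self.slots[index] = self.slots[index], key
            if key is None:
                return
            # Move the displaced key to its alternative slot.
            first, second = self._positions(key)
            index = second if index == first else first
        # Too many displacements: assume a cycle and rebuild.
        self._rebuild(key)

    def _rebuild(self, pending):
        keys = [k for k in self.slots if k is not None] + [pending]
        self.capacity *= 2
        self.seed += 1
        self.slots = [None] * self.capacity
        for k in keys:
            self.add(k)
</code></pre>

<p>Note how <code>__contains__</code> never probes beyond the two candidate slots, while <code>add</code> is where the potentially expensive displacement and rebuilding happens.</p>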

<p>In practice, linear probing with proper thresholds and a good hash function may perform better (due to locality of reference), and insertion using cuckoo hashing can be more expensive (as we do more memory writes on collisions than linear probing). Still, I love the theoretical property of this collision strategy :)</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/08/18/cuckoo-hashing/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Maintaining a Layout</title>
		<link>http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/</link>
		<comments>http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/#comments</comments>
		<pubDate>Thu, 13 Aug 2009 15:22:48 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/</guid>
		<description><![CDATA[TextMate works with fixed-width fonts both because of the simplicity and because it is the immediate difference between a plain text editor and a word processor.

Though for version 2.0 I want it to do a richer layout, e.g. larger headings in markup languages, indented soft wrap, proper support for unicode, etc. So I had to [...]]]></description>
			<content:encoded><![CDATA[<p>TextMate works with fixed-width fonts both because of the simplicity and because it is the immediate difference between a plain text editor and a word processor.</p>

<p>Though for version 2.0 I want it to do a richer layout, e.g. larger headings in markup languages, indented soft wrap, proper support for unicode, etc. So I had to bite the bullet and figure out how to allow this with reasonable performance. This article explains the problem and the data structure I picked.</p>

<p><span id="more-35"></span></p>

<h2>Problem</h2>

<p>The main problem is variable height of lines. If we make a change to line 3,128 then what pixel position does that correspond to (for redrawing) when lines preceding it can have an arbitrary height?</p>

<h2>Naive Solution Methods</h2>

<p>Two solution methods exist with different characteristics:</p>

<ol>
<li>Use an array of lines where each line knows its <em>Y</em> position.</li>
<li>Use a linked list of lines and let each line know its height.</li>
</ol>

<p>If we pick the first we can find lines and line positions in constant time, but changing a line’s height or inserting/removing lines is linear in time.</p>

<p>Solution number two allows us to insert/remove lines in constant time (if we know where to insert) and also to update a line’s height in constant time. Finding the position of a line (or the nth line) is however linear in time.</p>

<h2>Pick Your Poison</h2>

<p>Both solution methods are unusable as explained above, but unlike the first one, the latter has potential. The reason is that the first solution method requires the entire data structure to be updated after each change, and the data structure is invalid if we do not perform this update. It is unlikely we will be able to improve on that (and updating is a common operation).</p>

<p>The latter solution method has expensive queries, but there is almost always a way to speed up a query, for example by using a cache.</p>

<h2>Optimizing Queries</h2>

<p>If we go with the second solution method our problem is now how to speed up queries in a linked list.</p>

<p>One way to improve it is by using an index, for example a <a href="http://en.wikipedia.org/wiki/Skip_lists">skip list</a>. We have two different query keys: line number and <em>Y</em> position. I.e. given a layout, we may want to ask for the node corresponding to line number 32, or we may want to ask which node is at <em>Y</em> position 27,423.</p>

<p>For now, let us only focus on the first type of queries, so query key is the line number. The problem is the same because if we create an index for the line number and insert a new line, our line numbers change, and the index becomes void.</p>

<p>Rather than work with a skip list, let us use a <a href="http://en.wikipedia.org/wiki/Binary_search_tree">binary search tree</a> and let me draw how a (perfectly balanced) one looks for a layout with 7 lines:</p>

<pre><code>     4
   /   \
  2     6
 / \   / \
1   3 5   7
</code></pre>

<p>So with this tree we can find the node for line 3 by first visiting nodes 4 and 2.</p>

<p>If we insert a new line after line 2 then it becomes:</p>

<pre><code>     4
   /   \
  2     6
 / \   / \
1   3 5   7
   /
  ×
</code></pre>

<p>After the insertion all line numbers (after line 2) have changed, so we need to update the keys in our search tree:</p>

<pre><code>     5
   /   \
  2     7
 / \   / \
1   4 6   8
   /
  3
</code></pre>

<p>So we are back to the original problem of having to update the entire data structure. But there is a way around this.</p>

<p>Instead of storing the actual line number in each node, we store the number of nodes in the sub-tree rooted at the node in question (this is known as an <a href="http://en.wikipedia.org/wiki/Order_statistic_tree">order statistic tree</a>), so the initial tree instead becomes:</p>

<pre><code>     7
   /   \
  3     3
 / \   / \
1   1 1   1
</code></pre>

<p>The way we use this tree is slightly different than when we had the actual line numbers.</p>

<p>We need to always start at the root and set: <code>skipped_lines = 0</code>.</p>

<p>Whenever we go right in the tree, we add the number of nodes in the left subtree plus one (for the node itself): <code>skipped_lines += node.left.count + 1</code>.</p>

<p>The line number for a node is then: <code>skipped_lines + node.left.count + 1</code>.</p>

<p>So the transformation has not made queries more expensive. Let’s now look at an insertion (again after line 2):</p>

<pre><code>     7
   /   \
  3     3
 / \   / \
1   1 1   1
   /
  ×
</code></pre>

<p>The only nodes which need updating are those on the path from the root down to the inserted node, i.e. we never update more nodes than the height of the tree, which is <em>log(n)</em> (assuming our binary search tree is self-balancing). The updated tree looks like this:</p>

<pre><code>     8
   /   \
  4     3
 / \   / \
1   2 1   1
   /
  1
</code></pre>
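
<p>The lookup and the insertion described above can be sketched in C++ roughly as follows. The <code>Node</code> type and the function names are made up for this illustration, and the tree is deliberately <em>not</em> rebalanced, so treat it as a sketch of the counting scheme only.</p>

```cpp
#include <cstddef>
#include <memory>

// Hypothetical node type: `count` is the number of nodes in the sub-tree
// rooted at this node (including the node itself).
struct Node
{
   std::size_t count = 1;
   std::unique_ptr<Node> left, right;
};

static std::size_t count_of (Node const* node)
{
   return node ? node->count : 0;
}

// Find the node for a 1-based line number, or nullptr if it is out of range.
Node* find_line (Node* node, std::size_t line)
{
   std::size_t skipped_lines = 0;
   while(node)
   {
      std::size_t number = skipped_lines + count_of(node->left.get()) + 1;
      if(line == number)
         return node;

      if(line < number)
         node = node->left.get();
      else
      {
         skipped_lines = number; // left sub-tree plus the node itself
         node = node->right.get();
      }
   }
   return nullptr;
}

// Insert a new node so that it becomes line number `line` (1-based).
// Only the counts on the path from the root are touched. No rebalancing.
Node* insert_line (std::unique_ptr<Node>& root, std::size_t line)
{
   std::unique_ptr<Node>* slot = &root;
   std::size_t skipped_lines = 0;
   while(*slot)
   {
      Node* node = slot->get();
      ++node->count; // the new node ends up somewhere below this one

      std::size_t number = skipped_lines + count_of(node->left.get()) + 1;
      if(line <= number)
         slot = &node->left;
      else
      {
         skipped_lines = number;
         slot = &node->right;
      }
   }
   *slot = std::make_unique<Node>();
   return slot->get();
}
```

<p>Inserting after line 2 is then <code>insert_line(root, 3)</code>, and what used to be line 3 is found with <code>find_line(root.get(), 4)</code> without any other node having been renumbered.</p>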

<h2>Back to Pixels</h2>

<p>We have solved the problem of storing <em>n</em> items in a binary search tree and finding items based on their index (line number) in <em>log(n)</em> time (which is different than regular binary search trees where each item has a fixed search key).</p>

<p>This is no different than using <em>Y</em> positions. Instead of storing the number of nodes in each sub-tree, we store the (pixel) height of the (sub)tree, which is <code>node.height = node.lineHeight + node.left.height + node.right.height</code>. When we traverse the tree (as loosely described above) we add <code>node.left.height + node.lineHeight</code> to <code>skipped_lines</code> whenever we go right, i.e. heights take the place of the counts and <code>node.lineHeight</code> takes the place of <code>1</code> (and we probably rename the variable to <code>skipped_pixels</code>).</p>
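
<p>As a rough sketch of the pixel variant, assuming integer pixel heights and a hypothetical <code>LineNode</code> type (neither is taken from any real code), a query for the line at a given <em>Y</em> position could look like this:</p>

```cpp
#include <memory>

// Hypothetical node type for the pixel variant.
struct LineNode
{
   int lineHeight = 0; // pixel height of this line alone
   int height = 0;     // pixel height of the whole sub-tree
   std::unique_ptr<LineNode> left, right;
};

static int height_of (LineNode const* node)
{
   return node ? node->height : 0;
}

// Recompute the cached sub-tree height from the line and its children.
void update_height (LineNode& node)
{
   node.height = node.lineHeight + height_of(node.left.get()) + height_of(node.right.get());
}

// Return the node covering pixel row `y` (0 is the top of the layout),
// or nullptr if `y` is outside the layout.
LineNode* find_at_y (LineNode* node, int y)
{
   int skipped_pixels = 0;
   while(node)
   {
      int top    = skipped_pixels + height_of(node->left.get());
      int bottom = top + node->lineHeight;

      if(y < top)
         node = node->left.get();
      else if(y < bottom)
         return node;
      else
      {
         skipped_pixels = bottom; // left sub-tree plus this line
         node = node->right.get();
      }
   }
   return nullptr;
}
```

<p>After a line changes height, calling <code>update_height</code> on the nodes from that line up to the root is enough to keep all queries correct.</p>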
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/08/13/maintaining-a-layout/feed/</wfw:commentRss>
		<slash:comments>4</slash:comments>
		</item>
		<item>
		<title>Blog Spam Filtering Ideas</title>
		<link>http://sigpipe.macromates.com/2009/08/11/blog-spam-filtering-ideas/</link>
		<comments>http://sigpipe.macromates.com/2009/08/11/blog-spam-filtering-ideas/#comments</comments>
		<pubDate>Tue, 11 Aug 2009 11:50:12 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/08/11/blog-spam-filtering-ideas/</guid>
		<description><![CDATA[I have previously detailed how I fight comment spam using a JavaScript challenge.

I host two blogs, a wiki, and a ticket system, all targets for spam, so I have since generalized the system by using mod_rewrite to redirect all POSTs without a cookie to a page which uses JavaScript to set this cookie and resubmit [...]]]></description>
			<content:encoded><![CDATA[<p>I have previously detailed <a href="http://sigpipe.macromates.com/2005/09/25/fighting-comment-spam/">how I fight comment spam</a> using a JavaScript challenge.</p>

<p>I host two blogs, a wiki, and a ticket system, all targets for spam, so I have since generalized the system by using <a href="http://httpd.apache.org/docs/2.2/mod/mod_rewrite.html"><code>mod_rewrite</code></a> to redirect all POSTs without a cookie to a page which uses JavaScript to set this cookie and resubmit the request (which is then no longer caught by <code>mod_rewrite</code> due to the cookie being set). This means “blocking” spam doesn’t require a plug-in written specifically for the particular web application.</p>

<p>Despite this JS challenge some spam still gets through, and that’s what this post is about.</p>

<p><span id="more-34"></span></p>

<h2>Spam Not Caught</h2>

<p>Until recently I deleted all spam that fooled the JS challenge, though I did go through the logs for some of it looking for patterns, and until recently it was hard to find one since:</p>

<ol>
<li>The log entries look like those of a human (and it might be one), e.g. loading all images and CSS and generally taking a few minutes from first hit to the POST.</li>
<li>The comment says <em>“thanks for this article”</em>, <em>“very interesting”</em>, or something along those lines. I.e. just one or two lines of praise (and often slight variations).</li>
<li>Even the URL provided (for author) can look very non-spammy (including the landing page).</li>
</ol>

<h2>Patterns</h2>

<p>Given the above, I started to consider other things that could be used for filtering and arrived at the following list:</p>

<ol>
<li>While the comment has no spammy content, it also has no actual content, so comparing it to legit content (the previous comments + post) might provide a score for how likely it is to be a valid comment.</li>
<li>The user agent generally includes Windows, yet all the cool kids are on Mac :)</li>
<li>IP is often from Central America, sometimes Eastern Europe.</li>
<li>URL referrer is often Google (searching for <em>“blog”</em> or similar).</li>
<li>Comment is often added to an old post. While legit comments are also added to old posts, these tend to be long.</li>
</ol>

<p>As said above, so far I have just thrown spam comments away, which means I don’t have a corpus to test the above on. Take it only as anecdotal.</p>

<p>I plan to archive all spam from this point on so that I can later experiment with filters based on the above list. Though my other project might not allow me to get around to this anytime soon, perhaps someone else will find inspiration in the above.</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/08/11/blog-spam-filtering-ideas/feed/</wfw:commentRss>
		<slash:comments>7</slash:comments>
		</item>
		<item>
		<title>Get OS Version From Scripts</title>
		<link>http://sigpipe.macromates.com/2009/08/01/get-os-version-from-scripts/</link>
		<comments>http://sigpipe.macromates.com/2009/08/01/get-os-version-from-scripts/#comments</comments>
		<pubDate>Sat, 01 Aug 2009 14:26:07 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/08/01/get-os-version-from-scripts/</guid>
		<description><![CDATA[It is sometimes useful to have a script check the OS version, for example the way to get the user’s full name was previously done using niutil but Apple removed that command in Leopard (it can now be done using dscl).


One BSD command to read OS properties is sysctl but prior to Leopard it didn’t [...]]]></description>
			<content:encoded><![CDATA[<p>It is sometimes useful to have a script check the OS version, for example the way to get the user’s full name was previously done using <code>niutil</code> but Apple removed that command in Leopard (it can now be done using <code>dscl</code>).</p>

<p><span id="more-32"></span>
One BSD command to read OS properties is <a href="x-man-page://8/sysctl"><code>sysctl</code></a> but prior to Leopard it didn’t offer <code>kern.osrelease</code> making it a bit awkward, as you had to keep track of how <code>kern.osversion</code> maps to a more humane number.</p>

<p>Lo and behold! Today I stumbled over <a href="x-man-page://1/sw_vers"><code>sw_vers</code></a> (<code>x-man-page</code> links fail when there is an underscore in the URL, so type <code>man sw_vers</code> in Terminal, <a href="rdar://7111174">rdar://7111174</a>).</p>

<p>The manual page has no history, but the date is 2003 and an example in the manual page has <code>10.2.4</code> as output, so it goes way back.</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/08/01/get-os-version-from-scripts/feed/</wfw:commentRss>
		<slash:comments>6</slash:comments>
		</item>
		<item>
		<title>Optimizing Path Normalization</title>
		<link>http://sigpipe.macromates.com/2009/08/01/optimizing-path-normalization/</link>
		<comments>http://sigpipe.macromates.com/2009/08/01/optimizing-path-normalization/#comments</comments>
		<pubDate>Sat, 01 Aug 2009 07:50:30 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/08/01/optimizing-path-normalization/</guid>
		<description><![CDATA[One of my path functions is normalize. It removes (redundant) slashes and references to directory meta entries (current and parent directory).

A lot of other path functions use or rely on normalize, for example my version of dirname() is simply: return normalize(path + "/..");.

I was recently tasked with rewriting normalize to be more efficient and it [...]]]></description>
			<content:encoded><![CDATA[<p>One of my path functions is <code>normalize</code>. It removes (redundant) slashes and references to directory meta entries (current and parent directory).</p>

<p>A lot of other path functions use or rely on <code>normalize</code>, for example my version of <a href="x-man-page://3/dirname"><code>dirname()</code></a> is simply: <code>return normalize(path + "/..");</code>.</p>

<p>I was recently tasked with rewriting <code>normalize</code> to be more efficient and it proved to be a bit of a challenge, so I’ll share what I came up with.</p>

<p><span id="more-30"></span></p>

<h2>Motivation / Background</h2>

<p>In TextMate I index all bundles and store this index on disk. This is what gets loaded on startup, and in 1.x I <a href="x-man-page://2/stat"><code>stat()</code></a> directories to see if part of it needs updating.</p>

<p>Starting with Leopard it is possible to use <a href="http://developer.apple.com/documentation/Darwin/Conceptual/FSEvents_ProgGuide/Introduction/Introduction.html">FSEvents</a> rather than having to <code>stat</code> each directory (which is not fast when you have a gazillion bundles). The API of FSEvents is such that you tell it which directory you want events for. It may seem like I should just watch <code>~/Library/Application Support/TextMate/Bundles</code> but it is unfortunately not that simple, since this directory can contain symbolic links pointing outside the directory, and I would not see events for their targets.</p>

<p>Both for this reason, to minimize storage, and to easily figure out what has been added/removed when getting a file system event, I store a lot of relative paths in the index.</p>

<p>I convert all the relative paths to absolute paths for the memory cache (created from the index) and with almost 10,000 bundle items, and <code>join</code> being called multiple times for each item (as we go from bundles directory → bundle → kind folder → actual item) the <code>normalize</code> function was responsible for 55% of the time spent creating this memory cache.</p>

<p>To be exact, it took a tenth of a second, which isn’t much, but in 2.0 editing a bundle does not edit any in-memory data structures, it only saves the new item to disk, which causes a file system event that triggers updating the index, which in turn regenerates the memory cache. So I wanted this to be as close to instant as possible.</p>

<p>Another solution would be to make the paths absolute only when needed. The reason I am presently not doing this has to do with code abstraction (not having the bundle query system know about the index).</p>

<h2>The Problem: Resolving Parent References</h2>

<p>The difficulty with optimizing <code>normalize</code> is the step which removes parent references. It sounds simple, i.e. each time we see ‘<code>..</code>’ then we backtrack one component in the path.</p>

<p>The complexity comes from allowing ‘<code>..</code>’ as the component preceding ‘<code>..</code>’. So we can’t backtrack one component, we need a stack of backtrack points. The pseudocode for the algorithm is as follows:</p>

<pre><code>stack = [ ]

for component in path

   if component == ".."
         stack.pop
   else  stack.push(component)

return stack.join('/')
</code></pre>

<p>This code assumes that the first component of an absolute path is the empty string, e.g. <code>/path/to/foo</code> → <code>[ "", "path", "to", "foo" ]</code> and it doesn’t handle relative paths which (possibly indirectly) start with a reference to parent (e.g. <code>../foo</code>).</p>
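
<p>For reference, a direct C++ transcription of the pseudocode could look like the sketch below. The helper names are made up, and unlike the pseudocode it guards the <code>pop</code> against an empty stack, so a leading <code>..</code> is silently dropped rather than crashing.</p>

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Split on '/', keeping empty components so that "/a" becomes { "", "a" }.
std::vector<std::string> split_path (std::string const& path)
{
   std::vector<std::string> res;
   std::string::size_type from = 0, to;
   while((to = path.find('/', from)) != std::string::npos)
   {
      res.push_back(path.substr(from, to - from));
      from = to + 1;
   }
   res.push_back(path.substr(from));
   return res;
}

std::string join_path (std::vector<std::string> const& components)
{
   std::string res;
   for(std::size_t i = 0; i < components.size(); ++i)
   {
      if(i != 0)
         res += '/';
      res += components[i];
   }
   return res;
}

// The stack-based algorithm: every component is a separately allocated
// string and the stack itself is a growing vector.
std::string resolve_with_stack (std::string const& path)
{
   std::vector<std::string> stack;
   for(std::string const& component : split_path(path))
   {
      if(component == "..")
      {
         if(!stack.empty()) // guard added for this sketch
            stack.pop_back();
      }
      else
      {
         stack.push_back(component);
      }
   }
   return join_path(stack);
}
```

<p>With this, <code>resolve_with_stack("/path/to/../foo")</code> gives <code>/path/foo</code>, while <code>resolve_with_stack("../foo")</code> gives just <code>foo</code>, which is the limitation mentioned above.</p>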

<p>Regardless, this algorithm is not going to work. The problem is that we split a path into an array of strings and use a dynamic stack to handle the backtracking. These things involve memory allocation, which makes the code orders of magnitude slower than a potential <a href="http://en.wikipedia.org/wiki/Inplace_algorithm">in-place algorithm</a>.</p>

<p>We could reduce the number of memory allocations by working with string indices, but it is still based around a dynamic stack and backtracking.</p>

<h2>Avoiding the Stack</h2>

<p>Turns out getting rid of the stack-based approach is fairly simple if we iterate the path right-to-left, that is, backwards.</p>

<p>If we move backwards then when we see ‘<code>..</code>’ we skip the <em>next</em> component (rather than the previous one). If the next component is also ‘<code>..</code>’ then we skip the two next components, etc. So we can manage this with a simple counter instead of a stack. Here is the code from above, but rewritten to iterate the path backwards:</p>

<pre><code>res  = [ ]
skip = 0

for component in path.reverse

   if component == ".."
      skip += 1
   else if skip &gt; 0
      skip -= 1
   else
      res.push(component)

return res.reverse.join('/')
</code></pre>

<p>I kept it close to the stack-based approach, but the above can be converted to an in-place version which simply overwrites <code>path</code>, so we can write the above without any memory allocations at all.</p>

<p>The <code>skip</code> counter also tells us how many parent references we weren’t able to resolve, so we can add (what we didn’t handle in the stack-based version):</p>

<pre><code>while skip &gt; 0
   res.push("..")
   skip -= 1
</code></pre>
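
<p>Put together, the backwards iteration with the counter might look like this in C++. This is still the allocating version, meant only to show the logic, and it assumes a relative path; the function name is made up.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

std::string resolve_backwards (std::string const& path)
{
   // Split the (relative) path into components.
   std::vector<std::string> components;
   std::string::size_type from = 0, to;
   while((to = path.find('/', from)) != std::string::npos)
   {
      components.push_back(path.substr(from, to - from));
      from = to + 1;
   }
   components.push_back(path.substr(from));

   // Walk right-to-left, with a counter in place of the stack.
   std::vector<std::string> res;
   std::size_t skip = 0;
   for(auto it = components.rbegin(); it != components.rend(); ++it)
   {
      if(*it == "..")
         ++skip;
      else if(skip > 0)
         --skip;
      else
         res.push_back(*it);
   }

   // Whatever is left in the counter could not be resolved.
   for(; skip > 0; --skip)
      res.push_back("..");

   std::reverse(res.begin(), res.end());

   std::string joined;
   for(std::size_t i = 0; i < res.size(); ++i)
      joined += (i ? "/" : "") + res[i];
   return joined;
}
```

<p>Here <code>resolve_backwards("a/../../b")</code> gives <code>../b</code> and <code>resolve_backwards("../foo")</code> is left as <code>../foo</code>.</p>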

<h2>Absolute versus Relative Paths</h2>

<p>I mentioned that for absolute paths the first component should be the empty string, for the stack-based algorithm to work.</p>

<p>This isn’t ideal, for example an (ill formed) path like <code>/foo/../../bar</code> will be converted to <code>bar</code>, i.e. a relative path.</p>

<p>To avoid this, a simple approach is to check if the received path is absolute, if so, remove the first slash to make it relative. Our code then only needs to handle relative paths, and by re-adding the slash last, we ensure the result is absolute.</p>

<p>In the in-place version, removing and adding that slash is much simpler than it sounds, at least if we use the calling conventions of the C++ standard template library, i.e. take a first/last iterator pair as arguments and return the new end-point of the sequence. With such an API we can write the entire function like this:</p>

<pre><code>char* remove_parent_references (char* first, char* last)
{
   if(first != last &amp;&amp; *first == '/')
      ++first;

   std::reverse(first, last);

   char* src = first;
   char* dst = first;

   size_t skip = 0;
   while(src != last)
   {
      char* from = src;
      while(src != last &amp;&amp; (src == from || *src != '/'))
         ++src;

      if(is_parent_meta_entry(from, src))
         ++skip;
      else if(skip)
         --skip;
      else
         dst = std::copy(from, src, dst);
   }

   static char const parent_str[3] = { '/', '.', '.' };
   while(skip--)
      dst = std::copy(parent_str, parent_str + sizeof(parent_str), dst);

   std::reverse(first, dst);
   return first != dst &amp;&amp; dst[-1] == '/' ? --dst : dst;
}
</code></pre>
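
<p>The function above relies on an <code>is_parent_meta_entry</code> helper that is not shown. A self-contained sketch, with a guess at that helper and a <code>std::string</code> wrapper (both hypothetical), might look like this. Two details are assumptions of the sketch rather than something the code above states: the copy is skipped when source and destination coincide (<code>std::copy</code> formally requires the destination to lie outside the source range), and the wrapper keeps one spare byte, because when the path ends in an unresolved <code>..</code> the re-added <code>/..</code> is one character longer than what it replaces, until the trailing slash is trimmed.</p>

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical helper: true for "..", with or without the slash that
// precedes it in the reversed string.
static bool is_parent_meta_entry (char const* first, char const* last)
{
   if(first != last && *first == '/')
      ++first;
   return last - first == 2 && first[0] == '.' && first[1] == '.';
}

char* remove_parent_references (char* first, char* last)
{
   if(first != last && *first == '/')
      ++first; // keep the root slash out of the way

   std::reverse(first, last);

   char* src = first;
   char* dst = first;

   std::size_t skip = 0;
   while(src != last)
   {
      // A component is an optional leading slash plus the name.
      char* from = src;
      while(src != last && (src == from || *src != '/'))
         ++src;

      if(is_parent_meta_entry(from, src))
         ++skip;
      else if(skip)
         --skip;
      else if(dst == from)
         dst = src; // already in place, nothing to copy
      else
         dst = std::copy(from, src, dst);
   }

   // Re-add the parent references we could not resolve.
   static char const parent_str[3] = { '/', '.', '.' };
   while(skip--)
      dst = std::copy(parent_str, parent_str + sizeof(parent_str), dst);

   std::reverse(first, dst);
   return first != dst && dst[-1] == '/' ? --dst : dst;
}

// Hypothetical convenience wrapper for testing the in-place function.
std::string without_parent_references (std::string const& path)
{
   std::string buf = path;
   buf.push_back('\0'); // one spare byte, see the note above
   char* first = &buf[0];
   char* end = remove_parent_references(first, first + path.size());
   return std::string(first, end);
}
```

<p>With the wrapper, <code>/foo/../../bar</code> becomes <code>/../bar</code>, i.e. it stays absolute, and <code>a/b/../c</code> becomes <code>a/c</code>.</p>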
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/08/01/optimizing-path-normalization/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Worker Thread Protocol</title>
		<link>http://sigpipe.macromates.com/2009/07/30/worker-thread-protocol/</link>
		<comments>http://sigpipe.macromates.com/2009/07/30/worker-thread-protocol/#comments</comments>
		<pubDate>Thu, 30 Jul 2009 04:27:00 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/07/30/worker-thread-protocol/</guid>
		<description><![CDATA[When two components are used together, let’s call them A and B, it is a good approach to figure out who is using whom, and if A is using B then B should not know about A and vice versa.

This rule of thumb lowers complexity and makes both refactoring and re-use of code easier.

One scenario [...]]]></description>
			<content:encoded><![CDATA[<p>When two components are used together, let’s call them A and B, it is a good approach to figure out who is using whom, and if A is using B then B should not know about A and vice versa.</p>

<p>This rule of thumb lowers complexity and makes both refactoring and re-use of code easier.</p>

<p>One scenario where it might be appealing to ignore this rule is when outsourcing computation to a worker thread, but here it is actually more important to stick with it.</p>

<p><span id="more-29"></span>
Let us say we want to search folders recursively and provide the user with status about where we are in the process.</p>

<p>To provide this status we can have the worker thread send a message back to the main thread to let it know which folder it is presently searching, but this breaks the rule! The main thread sets up the worker thread and will also terminate it, should the user abort the search, so the main thread clearly knows about the worker thread (and needs to). If the worker thread sends back messages, then it knows about the main thread.</p>

<h2>Synchronous Message Deadlock</h2>

<p>If we do synchronous message passing then this simple design can lead to a deadlock. E.g. if the worker thread sends back a status update and at the same time, the main thread sends a terminate message to the worker then both threads are stuck waiting for the other to acknowledge the message.</p>

<h2>Asynchronous Message Race Condition</h2>

<p>By using asynchronous message passing we avoid the deadlock but instead introduce potential race conditions. The main thread may send a terminate message to an already completed worker thread, because it hasn’t received a “did terminate” message yet, or the worker thread may send a status update to the main thread after the main thread sent a terminate message.</p>

<p>This may lead to messages sent to disposed objects or resources being leaked, it is not impossible to “get right” but it is definitely not a simple problem.</p>

<h2>The Solution: Polling</h2>

<p>While polling in general should be avoided, it fits this problem very well. Our search code will look something like the following (C++):</p>

<pre><code>class searcher
{
    std::atomic&lt;bool&gt; keep_running, done; // atomic: written by one thread, read by another
    std::vector&lt;std::string&gt; results;

public:
    searcher () : keep_running(true), done(false) { }

    void start_search (std::string const&amp; src)
    {
        std::vector&lt;std::string&gt; toSearch(1, src);
        while(keep_running &amp;&amp; !toSearch.empty())
        {
            std::vector&lt;std::string&gt; tmp;

            // pseudo-code:
            folder = toSearch.pop()
            for each file in folder
               tmp.push(file)      if file matches criterion
               toSearch.push(file) if file.type == folder

            lock(mutex);
            results.insert(results.end(), tmp.begin(), tmp.end());
            unlock(mutex);
        }
        done = true;
    }

    void stop_search ()
    {
        keep_running = false;
    }

    std::vector&lt;std::string&gt; get_results ()
    {
        std::vector&lt;std::string&gt; res;
        lock(mutex);
        res.swap(results);
        unlock(mutex);
        return res;
    }

    bool is_done () const
    {
        return done;
    }
};
</code></pre>

<p>This encapsulates the searching, but does not use a thread itself. The <code>get_results</code> member function though is thread safe, so a user can spawn a thread and call <code>start_search</code> in that thread. In the main thread a timer is started, and <code>get_results</code> is periodically called (together with <code>is_done</code>).</p>

<p>When <code>is_done</code> returns <code>true</code>, the main thread knows that the search is done and can stop the timer (and delete the <code>searcher</code> object).</p>
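
<p>A compilable variant of the class could look roughly like the sketch below. The in-memory <code>folder_map</code> standing in for the file system, the <code>needle</code> parameter, and the substring match are all inventions of this sketch; <code>std::atomic&lt;bool&gt;</code> and <code>std::mutex</code> are used for the flags and the lock.</p>

```cpp
#include <atomic>
#include <map>
#include <mutex>
#include <string>
#include <vector>

// Hypothetical stand-in for the file system: folder name -> entries.
// An entry that is itself a key in the map is treated as a sub-folder.
typedef std::map<std::string, std::vector<std::string>> folder_map;

class searcher
{
   std::atomic<bool> keep_running{true}, done{false};
   std::mutex mutex;
   std::vector<std::string> results;

public:
   // Meant to be called on the worker thread.
   void start_search (folder_map const& folders, std::string const& src, std::string const& needle)
   {
      std::vector<std::string> toSearch(1, src);
      while(keep_running && !toSearch.empty())
      {
         std::string folder = toSearch.back();
         toSearch.pop_back();

         // Collect matches locally so the lock is held only briefly.
         std::vector<std::string> tmp;
         auto it = folders.find(folder);
         if(it != folders.end())
         {
            for(std::string const& entry : it->second)
            {
               if(entry.find(needle) != std::string::npos)
                  tmp.push_back(entry);
               if(folders.count(entry))
                  toSearch.push_back(entry);
            }
         }

         std::lock_guard<std::mutex> lock(mutex);
         results.insert(results.end(), tmp.begin(), tmp.end());
      }
      done = true;
   }

   // The remaining functions are meant to be called from the main thread.
   void stop_search ()
   {
      keep_running = false;
   }

   // Hands over the results found since the previous call.
   std::vector<std::string> get_results ()
   {
      std::vector<std::string> res;
      std::lock_guard<std::mutex> lock(mutex);
      res.swap(results);
      return res;
   }

   bool is_done () const
   {
      return done;
   }
};
```

<p>A caller would run <code>start_search</code> on a worker thread (e.g. via <code>std::thread</code>) and poll from a timer. Reading <code>is_done</code> <em>before</em> calling <code>get_results</code> in each poll ensures the final batch is not missed, since the worker stores its last results before it sets <code>done</code>.</p>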

<h2>Advantages</h2>

<p>In addition to avoiding the potential deadlock and/or race conditions, two other advantages of this approach are:</p>

<ol>
<li>Separation of concerns. The search code is completely self-contained and does not need to incorporate knowledge about threads or message passing. This makes it easier to re-use it, e.g. if we are making unit tests we can test the code without needing to involve an actual worker thread.</li>
<li>Free throttling! In a user interface we don’t want to refresh the progress more than a few times per second, so we simply set the timer to fire e.g. 5 times per second. Had we instead made the worker thread signal the main thread, it would be difficult to control the number of messages sent. For example searching a 4600 RPM disk might only produce one new result per second, so here it might be ideal to signal the main thread whenever we get a new result, but say we are searching an SSD, the disk cache is hot, or there are hundreds of matches per folder, then we flood our main thread to a point where it may affect perceived performance.</li>
</ol>

<h2>Closing Remarks</h2>

<p>I started by writing that if A knows about B, B should not know about A. When deciding which of the two should know about the other, the component most likely to be re-used is the one which should not know about the other component.</p>

<p>In the above example we made the search code the candidate for re-use by not letting it have any dependencies (knowledge about other objects). In an MVC pattern it is the view and model we want to re-use, and so these do not know about any of the other parts.</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/07/30/worker-thread-protocol/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Simplifying Boolean Expressions</title>
		<link>http://sigpipe.macromates.com/2009/07/27/simplifying-boolean-expressions/</link>
		<comments>http://sigpipe.macromates.com/2009/07/27/simplifying-boolean-expressions/#comments</comments>
		<pubDate>Mon, 27 Jul 2009 12:54:51 +0000</pubDate>
		<dc:creator>Allan Odgaard</dc:creator>
				<category><![CDATA[General]]></category>

		<guid isPermaLink="false">http://sigpipe.macromates.com/2009/07/27/simplifying-boolean-expressions/</guid>
		<description><![CDATA[I recently had a boolean expression of the following form:

a &#124;&#124; (x &#38;&#38; b) &#124;&#124; (x &#38;&#38; y &#38;&#38; c) &#124;&#124; (x &#38;&#38; y &#38;&#38; z &#38;&#38; d)


It looked redundant with 10 instances of only 7 different variables.


Pretty much the only rule I know for manipulating boolean logic is how to change AND to OR [...]]]></description>
			<content:encoded><![CDATA[<p>I recently had a boolean expression of the following form:</p>

<pre><code>a || (x &amp;&amp; b) || (x &amp;&amp; y &amp;&amp; c) || (x &amp;&amp; y &amp;&amp; z &amp;&amp; d)
</code></pre>

<p>It looked redundant with 10 instances of only 7 different variables.</p>

<p><span id="more-25"></span>
Pretty much the only rule I know for manipulating boolean logic is how to change <code>AND</code> to <code>OR</code> so I didn’t have much luck working on the above until I realized that <code>AND</code> can be seen as <code>multiply</code> and <code>OR</code> as <code>plus</code> (with <code>false</code> being <code>0</code> and everything else being <code>true</code>).</p>

<p>Using that insight we instead have:</p>

<pre><code>a + (x · b) + (x · y · c) + (x · y · z · d)
</code></pre>

<p>It is now possible to use the <a href="http://en.wikipedia.org/wiki/Natural_numbers#Algebraic_properties">algebraic properties</a> with which most of us should be somewhat familiar. For example we can use the <a href="http://en.wikipedia.org/wiki/Distributivity#Definition">distributivity property</a> to only have one instance of <code>x</code>:</p>

<pre><code>a + x · (b + y · c + y · z · d)
</code></pre>

<p>We can do exactly the same with <code>y</code>:</p>

<pre><code>a + x · (b + y · (c + z · d))
</code></pre>

<p>We now only list each variable once, so this is optimal (assuming that all variables can affect the result of the expression). Going back to boolean operations we end with:</p>

<pre><code>a || x &amp;&amp; (b || y &amp;&amp; (c || z &amp;&amp; d))
</code></pre>
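
<p>With only seven variables it is cheap to check the result by brute force. The sketch below (function names made up) evaluates both forms for all 128 assignments and counts disagreements. The simplified form is written exactly as above, so a compiler may suggest extra parentheses.</p>

```cpp
// The expression we started with.
bool original_form (bool a, bool b, bool c, bool d, bool x, bool y, bool z)
{
   return a || (x && b) || (x && y && c) || (x && y && z && d);
}

// The simplified expression, relying on && binding tighter than ||.
bool simplified_form (bool a, bool b, bool c, bool d, bool x, bool y, bool z)
{
   return a || x && (b || y && (c || z && d));
}

// Try all 2^7 assignments and count those where the two forms disagree.
int count_mismatches ()
{
   int mismatches = 0;
   for(unsigned bits = 0; bits < (1u << 7); ++bits)
   {
      bool a = (bits >> 0) & 1, b = (bits >> 1) & 1, c = (bits >> 2) & 1, d = (bits >> 3) & 1;
      bool x = (bits >> 4) & 1, y = (bits >> 5) & 1, z = (bits >> 6) & 1;

      if(original_form(a, b, c, d, x, y, z) != simplified_form(a, b, c, d, x, y, z))
         ++mismatches;
   }
   return mismatches;
}
```

<p>If the derivation is right, <code>count_mismatches()</code> returns <code>0</code>.</p>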

<p>Another thing I gained from this exercise is that now I will always be able to remember that <code>AND</code> has higher <a href="http://en.wikipedia.org/wiki/Order_of_operations">precedence</a> than <code>OR</code> which is useful if you want to cut down on your use of parentheses.</p>
]]></content:encoded>
			<wfw:commentRss>http://sigpipe.macromates.com/2009/07/27/simplifying-boolean-expressions/feed/</wfw:commentRss>
		<slash:comments>11</slash:comments>
		</item>
	</channel>
</rss>
