Beating Binary Search
June 17th, 2010Jay from LinkedIn’s SNA team writes:
Quick, what is the fastest way to search a sorted array?
Binary search, right?
Wrong. There is actually a method called interpolation search
Jay from LinkedIn’s SNA team writes:
Quick, what is the fastest way to search a sorted array?
Binary search, right?
Wrong. There is actually a method called interpolation search
Whenever I see something that intrigues me, my mind makes a note of it and then subconsciously works toward finding a use-case for my newfound knowledge.
An example is that I recently learned how protected member data (C++) is actually not safe from outside pryers (even in clean code that does not use typecasts).
GCC 4.5.0 is out and their progress on implementing C++0x features is coming along nicely.
If you are on OS X and want to try it out you can install it via MacPorts:
sudo port install gcc45
The binary installed is named g++-mp-4.5 and you must use the -std=c++0x argument to enable the new features.
Of the supported C++0x features here are some of those that I find the most interesting (for my use of C++).
I ran some benchmarks which included PBZip2, a multi-threaded implementation of BZip2 (which is slow yet effective, so my preferred choice of compressor for basically everything).
Running the Burrows–Wheeler transform over the input blocks is a task well suited for being parallelized and the benchmarks show that Jeff Gilchrist did a great job at this:
| Compressor | Time | Archive Size |
|---|---|---|
| None (cat) | 2.3s | 50 MB |
| GZip | 4.0s | 34 MB |
| BZip2 | 16.3s | 29 MB |
| PBZip2 | 3.0s | 29 MB |
| LZip | 41.8s | 24 MB |
The timings were produced by running the code below 4 times and taking the average of the last 3 runs (for each compressor).
This was executed on a 2 × 2.8 GHz Quad Core Mac Pro where PBZip2 (correctly) auto-detected 8 cores.
I am running PBZip2 version 1.1.0 from MacPorts (sudo port install pbzip2).
for Z in cat gzip bzip2 pbzip2 lzip; do
time tar -cf "${Z}.res" --use-compress-prog="${Z}" Avian
done
Update: Added test with LZip (an LZMA based compresser). There is a multi-threaded implementation of this (plzip) but a quick ./configure && make did not cut it.
I just learned this neat thing about the cd shell command:
The variable
CDPATHdefines the search path for the directory containing «dir». Alternative directory names inCDPATHare separated by a colon (:). A null directory name is the same as the current directory. If «dir» begins with a slash (/), thenCDPATHis not used.
For example:
% export CDPATH=$HOME/Source:$HOME/Library/Application\ Support/TextMate
% cd Avian/
/Users/duff/Source/Avian
% cd Bundles/
/Users/duff/Library/Application Support/TextMate/Bundles
% cd Support/lib/
/Users/duff/Library/Application Support/TextMate/Support/lib
% cd Avian/Frameworks/
/Users/duff/Source/Avian/Frameworks
This works with tab completion (using bash 4.1.2) so regardless of the current directory, I can generally do cd Av⇥↩ to reach ~/Source/Avian.
This is part 2 of what I think will end up as four parts. This might be a bit of a rehash of the first part, but I skimmed lightly over why it actually is that I am so fond of make compared to most other build systems, so I will elaborate with some examples.
Part 3 will be a general post about declarative systems, not directly related to build automation. Part 4 should be about auto-generating the make files (which is part of the motivation for writing about declarative systems first).
A blog post about Ant vs. Maven concludes that “the best build tool is the one you write yourself” and the Programmer Competency Matrix has “can setup a script to build the system” as requirement for reaching the higher levels in the “build automation” row.
I have looked at a lot of build systems myself, and while I agree that the best build system is the one you create yourself I am also a big fan of make and believe that the best approach is to use generated Makefiles.
This post is a “getting started with make”. I plan to follow up with a part 2 about how to handle auto-generated self-updating Makefiles.
In a previous blog post I describe a data structure which require the use of a self-balancing binary search tree.
The Achilles’ heel of hashing is collision: When we want to insert a new value into the hash table and the slot is already filled, we use a fallback strategy to find another slot, for example linear probing.
The fallback strategy can affect lookup time since we need to do the same probing when a lookup results in an entry with wrong key, turning the nice O(1) time complexity into (worst case) O(n).
TextMate works with fixed-width fonts both because of the simplicity and because it is the immediate difference between a plain text editor and a word processor.
Though for version 2.0 I want it to do a richer layout, e.g. larger headings in markup languages, indented soft wrap, proper support for unicode, etc. So I had to bite the bullet and figure out how to allow this with reasonable performance, this article explains the problem and data structure I picked.
I have previously detailed how I fight comment spam using a JavaScript challenge.
I host two blogs, a wiki, and a ticket system, all targets for spam, so I have since generalized the system by using mod_rewrite to redirect all POSTs without a cookie to a page which uses JavaScript to set this cookie and resubmit the request (which is then no longer catched by mod_rewrite due to the cookie being set). This means “blocking” spam doesn’t require a plug-in written specifically for the particular web application.
Despite this JS challenge some spam still gets through, and that’s what this post is about.
It is sometimes useful to have a script check the OS version, for example the way to get the user’s full name was previously done using niutil but Apple removed that command in Leopard (it can now be done using dscl).
One of my path functions is normalize. It removes (redundant) slashes and references to directory meta entries (current and parent directory).
A lot of other path functions use or rely on normalize, for example my version of dirname() is simply: return normalize(path + "/..");.
I was recently tasked with rewriting normalize to be more efficient and it proved to be a bit of a challenge, so I’ll share what I came up with.
When two components are used together, let’s call them A and B, it is a good approach to figure out who is using whom, and if A is using B then B should not know about A and vice versa.
This rule of thumb lowers complexity and makes both refactoring and re-use of code easier.
One scenario where it might be appealing to ignore this rule is when outsourcing computation to a worker thread, but here it is actually more important to stick with it.
I recently had a boolean expression of the following form:
a || (x && b) || (x && y && c) || (x && y && z && d)
It looked redundant with 10 instances of only 7 different variables.
I was excited to use the “new” Universal Type Identifiers but excitement turned to confusion and a bit of disappointment. I will share my findings in this article.
One of the things I like about C++ is the ability to have the compiler create code for me that does actual work.
What do I mean? I am thinking about implicit conversions (wrapping) of data types and constructing/destructing data types when they go in/out of scope.
I will focus on the latter in this blog post, show how it can be used with Objective-C and how it can track leaks in C++ code.
Say you create a custom view with arbitrary many tracking rectangles (i.e. dynamically added).
Each time you add a rectangle you get back an
identifier for this rectangle which can’t be
stored in an NSArray as-is since it
is of the primitive type NSTrackingRectTag (an integer).
If you use Objective-C++ then you can use a
std::vector<NSTrackingRectTag> to avoid
having to box/unbox your identifiers but
if you have tried to put non-POD in the interface
declaration of your Objective-C class you have probably
seen that gcc does not like that.
Well, starting with 10.4 (so actually, some time ago)
Apple added a switch to gcc which allows
C++ objects as part of the instance data, and it will
call both constructor and destructor for your C++
objects when allocating/deallocating the Objective-C
object.
The flag you need to set is -fobjc-call-cxx-cdtors.
Occasionally it is convenient to pass a C++ object
to an Objective-C method. For example I have an NSString
initializer that takes a std::string as argument.
This works as long as you pass the object as a
reference (i.e. pass a pointer), but you can use
the “reference of” operator in the method signature
rather than at the call-site. By using a const
reference it will work for temporary/implicit objects.
So with the following method:
+ (NSString*)stringWithCxxString:(std::string const&)cxxString
{
return [[[NSString alloc] initWithBytes:cxxString.data()
length:cxxString.size()
encoding:NSUTF8StringEncoding] autorelease];
}
We can have code like this:
std::string dir = get_some_dir();
std::string file = get_some_file();
NSString* str = [NSString stringWithCxxString:dir + file];
I wanted to localize a shell command to give danish output and decided to look into the message catalog functions described in/by XPG4.
Two very nice shell commands that Apple has given us are the pbcopy and pbpaste commands. These allow stdin to go to the clipboard and the clipboard to be written to stdout.
Unfortunately the commands seem to use a combination of MacRoman and question marks for non-ASCII characters, which often makes them unusable for me, since I work with non-ASCII characters.
So today I decided to write a replacement for the two commands (yes, I did also file an enhancement report). You can download them here.
There's just one source, it compiles to a command which works as pbcopy, when called under that name, otherwise pbpaste.
What I've done is place the command in ~/bin and added a symbolic link from pbpaste to pbcopy, like this:
ln -s pbcopy ~/bin/pbpaste
And in addition ensured that my PATH contains ~/bin before anything else, i.e. by placing the following in my ~/.bash_profile (well, actually ~/.zshrc):
export PATH="$HOME/bin:/opt/local/bin:$PATH:/Developer/Tools"
The source is included in the archive, and it's very simple. No usage instructions etc., and it links with the Application Kit, since NSPasteboard is under that and not Foundation Kit.