Programming

Saturday, June 11, 2011

Human sorting algorithms

Whilst tidying the shed, I had to stack around 100 plant pots. This is a straightforward sorting algorithm.

So why don't humans use quicksort, like computers would?

The answer which occurred to me was that the effort to compare two items is not always equal. The other answer is that humans can't keep track of umpteen piles of stuff, and remember which pile needs to be merged into another pile. Each of these things takes overhead which isn't accounted for in the traditional analysis of sorting algorithms.

When there are a small number of possible sizes, the correct algorithm should be bin-packing. Just keep n separate piles, and put each item into their pile. As the number of sizes is small, arranging them is easy.

The algorithm which I used is as follows: spread everything out and pick the biggest and put it onto the pile. Repeat. If you make a mistake, do an insertion sort. This is efficient for humans, because our eyes are great at parallel processing, and is fine for around 100 pots.

Now, how to prove that this! Another real-life problem where the correct algorithm isn't obvious, is matching socks. In this case, the problem is finding a decent way of comparing them. I would suggest using a bin-packing algorithm to start with, and going from there. At each layer in the hierarchy, pick an appropriate discriminator.

What sorting algorithms don't analyse either is the likelihood of mistakes, and the cost of correcting them!

Labels:

Tuesday, June 03, 2008

Why virtualization is pointless

Virtualization is a growing trend in running computers in "containers" which emulate (virtualize) actual hardware. The benefit is that the container becomes portable, and can be run on different hardware, or migrated. Backups are a lot easier, there are security benefits, and hardware can be utilized more effectively. Virtualization is a growing trend and Microsoft has realized this and put it at the heart of Windows Server 2008.

The heart of the web - URLs (uniform resource locations) are a specification of the location of a resource - usually a web page, but can also refer to services (e.g. SOAP) and web applications.

The real problem is that machine name is being used as a proxy for resource location. In a URL - e.g. http://bbc.co.uk/news/12345.html - the bbc.co.uk part is specifying something physical - often a machine name - but certainly a physical place such as a router. In computer networks, machine name is critical in specifying the location of certain resources. This is the legacy of DNS (domain name system) which assigns names to computers.

Physical device is often used as a proxy for resource in the home. Your photos are stored on a physical disk in a physical computer, and you locate them by specifying where they are stored. When you are out and about, you usually don't have access to them because you don't have a network access to the physical device.

This approach of tying resources to hardware is frankly wrong. Why should I care where my photos are stored - providing I can access them and they are stored securely. Maybe hard disk or memory should be an intelligent cache and nothing more. The current approach does not protect users from data loss, and is hard for users to manage.

Virtualization is a solution to the wrong problem. Virtualization allows the easy migration of a machine - but why would you need to do that? It's only because we rely on certain services running on certain machine names that we even need to think of the "machine" as a unit. Virtualization is incredibly wasteful of machine resources, and is like hitting a walnut with a sledgehammer. It's far better to migrate the software, and have multiple software running on the same OS - like the good old days. We already have a perfectly good container, and it's called the "process".

The solution is to break the link between resource and machine name. I predict that in the future, data, services, resources and computing resources will exist in a redundant, replicating, self-healing "cloud". Running processes will be able to migrate between different computers - much like they can migrate to different processors at the moment. Different computers will form collectives and be given a catchy name like "swarm", "hive" or "cluster". In future, we will think about services, rather than physical or virtual machines. Virtualization is merely a stop-gap measure until we fundamentally rethink the way we provision resources.

Wednesday, April 02, 2008

Vista isn't as bad as everyone says it is

It's much worse than that.

I've just been waiting 20 minutes for my 64-bit quad-core RAIDed desktop to boot. I think I'll just have 30 minutes of work - and almost all of that time is lost to Vista hanging and crashing. This is about the 10th time my computer has hung up like this. The only reason I go with Vista is because we use the Microsoft toolchain at work which I quite like, and I foolishly think that I'll save time by using Vista. I mean, if it's one thing an OS need to be, it's stable. The kernel seems deadlocked and resource starved. This is SP1.

Other niggles are that its safe boot won't recognise my USB keyboard (even though the BIOS does), it hung and didn't work with my Wifi card (so I had to waste money on a new one). It bugs me about UAC each time my computer boots, and the whole kernel is very unstable coming out of hibernation. Each time I reach for the reset, I know my computer will be out of action for hours whilst it verifies its RAID. Really, Linux is looming.

* Harp music and a mirage *

Steve Bulmer (no relation to the fat fuck at Microsoft) is sitting on a park bench. Suddenly a 500kg gorilla grabs him from behind and...

* More harp music *

Ah, Vista shows me a logon screen. All is well. Next time Vista plays up, I'll start having fantasies about Bill Gates and a pack of hungry wolves.

Sunday, March 23, 2008

Speeding up builds

Here's a really simple tip if you are using GNU make. You can pass in the -j option to run the build steps in parallel. The default is -j1, i.e. no parallising.

The next question is what's the optimum number of jobs to run in parallel? An experiment on a build reveals:













NTime
16m13s
23m28s
32m38s
42m11s
51m59s
61m55s
71m53s
81m49s
121m47s
161m47s


So there you have it! On a 4-core CPU (Intel Q6600), it's best with -j4 or above. It's interesting that -j4 didn't max out the CPUs, and that you can keep getting good improvements up to at least -j8.

Thursday, January 17, 2008

Why isn't visual programming popular?

Visual programming is the idea that programmers (or even end-users) can program using a graphical notation. There are a number of reasons why this seems like a good idea:
  • Programs are structural, not linear/sequential
  • Textual syntax is difficult, and a barrier
  • Graphical notation is richer - allowing new paradigms that would not work with text
  • Direct manipulation - treating program components as physical things
  • Programmers use graphical notations widely, both formally and informally
Of course visual programming hasn't actually taken off as expected. A lot of excuses have come up. Here are some of them:
  • Yea, but we're stuck in a textual programming rut
  • Almost all new programming languages fail; this is to be expected
  • Workstations aren't graphical (not true any more)
  • All VP systems are toy systems unsuitable for real world problems (true to an extent)
  • Graphical notation is less compact (true, but there are solutions)
  • Because early VP system sucked, people assume all VP systems suck
  • Graphics is obviously better, we just haven't found the right approach yet.
Until recently, I would have had a great deal of sympathy with these views.

I've recently been working with use cases, which are a predominantly textual notation. The interesting thing is that there is also a graphical notation for use cases. The problem with the graphical notation is that it loses all precision, including extensions, preconditions, actors, goal level, stakeholders and interests, guarantees etc. In fact as a notation, UML use cases diagrams are very poor in comparison to text.

So we need to accept that fundamentally, some information is best presented as text.

Another problem is that the notation isn't as important as other aspects of software development. The text editor is just a convenient universal editor. It isn't the syntax which is hard with programming. It's annoying, sure, but it's not the crux of the problem. The crux of the problem is the design, the OO model, the functionality, memory allocation (if using C), the algorithms, the usability, the process, and the obsessive (male?) thinking. These are problems that aren't going to be changed by a graphical notation. In fact, using the text editor is probably the easiest part in the whole programming process!

What are the lessons from Visula?
In some ways, Visula is yet another VPL. A large part of it was written for fun - what would I come up with? In terms of scalability and practicality, it is orders of magnitude better than many other VPLs. If the question was: can VPLs be practical, usable and scalable, then the answer is a resounding YES! I didn't know this before. Hopefully that blows away some of the old prejudices, if anyone will listen.

Visula will not become a mainstream language though. It doesn't offer huge benefits over other scripting languages. I personally prefer the notation (which is really unique by the way), but that isn't enough on its own to persuade people to switch. Also the editor only runs on Windows, big mistake.

In hindsight, I should have based Visula on an existing scripting language, and created a graphical editor for the language. This would have been simpler to implement, and users would have no barrier to entry for the new notation. If they decide they don't like the editor, they can switch back to text. Creating a totally new language that nobody's going to use is fun, but pointless.

Recent trends
Before discussing new trends, let's review the current trends.
  • Types: Strong -> Dynamic
  • Style: Algorithms -> Glue
  • Data: Bits and bytes -> Objects and patterns
  • Expertise: High -> Low
  • Environment: Textual -> IDE -> Tools
  • UI: Text -> Desktop -> Web -> Web 2.0 -> Web apps.
  • Data: Files -> Databases -> Mobility -> Sharing -> Community
  • Paradigm: Procedural -> OO
  • Memory management: Manual -> Automatic
  • Data binding: Manual processing -> Drag and drop
  • Databases: File -> Relational -> ORM
  • Software: Small and buggy -> Big and bloated
  • Quality: Low -> High
  • Development process: Waterfall -> RUP -> Agile
  • Interoperability: None -> Low -> Inherent
  • Programmers: Generalist nerds -> Specialist communicators
  • Usability: Terrible -> Low -> Better
  • Platforms: Single platform -> Run anywhere. Desktop -> Web -> Mobile
  • Libraries: Small and fragmented -> Monolithic and comprehensive
  • Formats: Binary -> Text -> XML
  • Automation: Low level languages -> Code generators -> High level language
  • Reuse: None -> Libraries -> Platform
Future challenges
We still need to explore new notations - not because they are useful, but because they are interesting. It seems that all new work in HCI has to have proven immediate benefit backed up by experiments. This isn't blue-sky research, this is short-termism. Researchers should avoid making optimistic claims about the usefulness of their approach, but then they shouldn't need to.

There are a number of upcoming challenges to programming that software tools in general should address.
  • Scale. Libraries are now tens of thousands of methods large. How does one manage such a large amount of data?
  • Parallelism. We hear this all the time.
  • End user programming. Includes teaching programming to kids. How does this translate into grown-up programming?
  • Integrating programming with development process. Done in CASE tools. But there is a disjoint between CASE tools and programming languages.
  • Tracking, collaboration.
  • Web-based programming.
  • Textual transformations.
  • Reduce detail.
  • Navigation.
  • Searching.
  • Make it tangible.
  • Specify less.
  • Manage detail.
  • Guide the user. Often you stare at some code asking "what do I do next?"
  • Create a top-down workflow where the programmer assists the tool, not the other way around.
  • New paradigms. If I knew what they were, they wouldn't be new would they?
  • Hide complexity.
  • New abstractions, e.g. for parallelism.
  • Reduce cognitive load. (A posh way of saying make it easier).
  • Focus on the domain and the goal - not the technology.
  • XSLT or other transformations between notations.
  • Raising the level: Making programming more human and less nerdy.
What's clear is that transforming the textual notation alone isn't going to address these challenges. It might look pretty and nice, but it's not where the trouble with programming lies. I'm optimistic that graphics are better, but they are also a bit of a red herring.

As Fred Brookes said: There is no silver bullet. There is so much more research to do. Programming as we know it will change. I just wish I was clever enough to work out how.

Sunday, November 18, 2007

Write more libraries

We keep hearing that code reuse is a Good Idea, that we should do more of it, and that it can drastically reduce costs etcetera etcetera. It makes common sense in an ideal world.

The reality is very different. Code gets build into applications, dependencies build up, behaviour becomes fragile, code becomes less maintainable, and eventually after the people who wrote the code have moved on, decided that the original design was poor, the maintenance costs rise and becomes a fire-fighting exercise, so the code gets thrown away and rewritten. However this can be a cycle: new code eventually also rots and gets thrown away.

Refactoring is supposed to rejuvenate your code and keep it current. I'm sure the theory works fine.

The problem is that fundamentally, people don't write code for reuse. They put a good design in (that is appropriate at the time), and think that's the end of it. However that code won't get reused properly. It might get ripped out and copied-and-pasted into another application, but that is far from ideal code reuse.

Instead I would suggest putting as much code as possible into reusable libraries. This means that as much as possible should work stand-alone, without dependencies. When coding a library, imagine its use for the application at hand, but also treat it as an independent work, with proper documentation, and proper anticipation of the different ways the code could be used. It may mean generalising prematurely, but the code becomes infinitely more useful.

Now in a typical project, I would organise it as follows:

lib1
lib2
test1
test2
applogic1
applogic2
testapplogic1
testapplogic2
app1
app2

lib1 and lib2 are the main code - as independent libraries, They depend on nothing, except perhaps each other (and without circular dependencies). Most of the code should go into these libraries, and follow proper documentation and coding guidelines as appropriate for libraries. Where specific implementations are mandated, try to provide a strategy pattern to make the library not specific to the application. Where code is too interdependent, break the dependency by providing interfaces.

The test1 and test2 projects are the unit tests. These test the functionality in lib1 and lib2 respectively. Part of the build process should be to run the unit tests, and fail the build if the unit tests fail.

applogic1 and applogic2 are libraries of business or application logic (you could also have bl1 and bl2 if that suits your application better). There are also unit tests for applogic1 and applogic2, which should also be run as part of the build.

app1 and app2 are the different applications. The point is that app1 and app2 should be as thin as possible. This is the bit you're going to throw away or port to another platform, whereas everything else should be cross platform. app1 and app2 can share common functionality from lib1 and lib2, and the logic from applogic1 and applogic2.

In terms of longevity, app1 and app2 have the shortest lifespan. They will be full of platform-specific GUI-code which will be hard to unit test. applogic1 and applogic2 contain logic specific to your application, which can be shared between applications. If possible, make the logic extensible without going over the top. The application logic will have a longer lifespan because it can be reused for different applications, but is in general tied to particular applications. The libraries should have an indefinite lifespan.

If you are serious about code reuse, you should push as much code as possible into libraries. Think: does this unit have functionality beyond the application? Can it be generalised? If the answer is yes, then move the code into a library.

In hindsight, I don't see how any code could be reused unless it is in a library, and I don't see how unit tests can be linked against code embedded in an application. Librarifying (!) your code clearly separates out the parts that you intend to reuse, and the parts you intend to throw away. Libraries are your crown jewels.

Thursday, October 11, 2007

Reversing the #include idiom

There's an idiom in C and C++ whereby of you include a header file, you write
#ifndef FILE_H_INCLUDED
#define FILE_H_INCLUDED

// stuff

#endif
It occurs to me that it should actually be
#ifndef EXCLUDE_FILE1_H
#define EXCLUDE_FILE1_H

#endif
This is a minor point, however it means that you can legitimately exclude certain header files if for some reason you don't want to include them:
gcc foo.c -DEXCLUDE_FILE1_H