Friday, May 9, 2014

Copyright on APIs is a bad idea.

If you have not heard yet, there has been a change in the Oracle v. Google case. The original ruling that an API could not be copyrighted has been reversed. In essence it means that a company can release a description of a set of functions that they provide for developers, and no one is allowed to create an alternative implementation of those functions without permission.

To understand what this means to the future of software engineering you need to understand two things.

1) APIs are not complex pieces of code (which are and should be subject to copyright). They are very simple descriptions of what a piece of code will do and how to make it do it. 

In essence a single API is just one word (the name of the function) and a list of the pieces of data that should be given to it. It then specifies what will happen and the data that will be returned. An API is not its implementation; it is a high-level description of what an implementation should do.
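To make this concrete, here is a toy illustration (my own made-up example in R; the function name and behaviour are hypothetical, not anything from the case). The comment block is the API; either of the two function definitions below it is a legitimate implementation of that API.

# The API: a name, the data it takes, and what it promises to give back.
#   total_with_tax(prices, rate) -> the sum of the prices with tax added
# Either implementation below satisfies that description equally well.
total_with_tax <- function(prices, rate) {
  sum(prices) * (1 + rate)
}

total_with_tax <- function(prices, rate) {
  total <- 0
  for (p in prices) {
    total <- total + p * (1 + rate)
  }
  total
}

If APIs can be copyrighted, it is the three comment lines, not either of the implementations beneath them, that become the protected work.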

It is equivalent to me copyrighting the sentence 
"I am going to the shop, do you want anything?" 
When it is combined with the reply 
"Yes, some milk." 

It is really that simple. Imagine if novelists needed to pay a fee when they used that combination of sentences. Of course they could use "I will go to the shop, do you need me to get something?", or whatever other variant they need to produce in order to avoid infringing. But suggesting such sidesteps misses the point of copyright. Such small atomic combinations of the basic elements of a language are not significant pieces of work. They are not what copyright laws are designed to protect.

2) Secondly, you need to understand the purpose of APIs. They exist so that software programs are easier to write and easier to make work with each other. Their purpose is to let one programmer know how to interface with software written by someone else, someone they may never have met, and yet have it function perfectly. The API is a simple contract that says: if you want my code to do this, this is how you make it happen.

Another advantage of APIs (well used by software developers everywhere) is that if there are multiple competing programs that do the same thing and they all use the same API, a software developer can switch between them (almost) effortlessly.

If you are ever frustrated by software not working, Internet sites being unable to perform some task, apps not working on your phone, then I have some bad news for you. If copyrightable APIs become the legal norm, then everything will get much, much worse. Start-up companies and device manufacturers alike will need to protect themselves by ensuring that their APIs are unique and do not infringe anyone else's copyright. In order to make software that is compatible with something else, there will need to be long-term financial agreements in place. This will mean that the number of things (devices and programs) that just work together will begin to decrease.

The economic impact is the creation of significant barriers to entry for new technology companies, for the simple reason that you cannot create some great new product that works with the products people already have without infringing copyright. Consequently, many technical product possibilities will not be explored because of their legal risk. In general, copyright on APIs will result in an overall reduction in the pace of innovation.

To you as a consumer it will mean fewer things will just work together out of the box. It will mean that if you want devices and software to work with each other, then you will need to buy them all from the same vendor. This will be good for the large incumbents in the marketplace, but for consumers it is very bad.

The sad truth is that if this ruling is upheld you can look forward to less choice and less functionality in your digital world.


Monday, March 3, 2014

The Relative Proportion of Factors of Odd Versus Even Numbers



As I was riding home from work today I was thinking about odd and even numbers.

It is a funny thing that the product of two even numbers must be an even number, and the product of an even number with an odd number must also be even. Only two odd numbers, when multiplied, will give an odd number.

If you don't believe me, think about what makes a number odd or even: it is whether there is a remainder of one after you divide by two. When you multiply an odd by an odd, it is the same as multiplying the first odd number by the second number minus one, and then adding the first number. The first operation must give you an even number (odd times even), so adding an odd number to it must give you an odd number.
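Written out algebraically (a quick check, with the two odd numbers written as 2a + 1 and 2b + 1):

$$(2a+1)(2b+1) = 4ab + 2a + 2b + 1 = 2(2ab + a + b) + 1$$

which leaves a remainder of one when divided by two, and so is odd.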

This tells us some interesting things. Firstly, only even numbers can have both odd and even factors; odd numbers will only ever have odd factors.

It also means that if you take two numbers at random (each equally likely to be odd or even) then the probability of the product being odd is just 1/4. The reason is that there are 4 equally likely ways to draw the pair: odd and odd, odd and even, even and odd, even and even. Only one of those 4 options can produce an odd product.
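In symbols, assuming each number is independently equally likely to be odd or even:

$$P(\text{odd product}) = P(\text{first odd}) \times P(\text{second odd}) = \tfrac{1}{2} \times \tfrac{1}{2} = \tfrac{1}{4}$$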

This result could also mean that in general even numbers have more factors than odd numbers. I don't have an argument for it, but it seems to me to be the kind of thing for which there might be a formal proof; perhaps I was even shown it once and have forgotten. If you know of one, please point it out in the comments.

Anyway, these thoughts passed the time as I rode home today and helped me clear my mind of other things. Who would have thought that amateur number theory could be so satisfying.

Thursday, October 31, 2013

Maintaining Constant Probability of Data Loss with Increased Cluster Size in HDFS

In a conversation with a colleague some months ago I was asked if I knew how to scale the replication factor of a Hadoop Distributed File System (HDFS) cluster as the number of nodes increased, in order to keep the probability of experiencing any data loss below a certain threshold. My initial reaction to the question was that it would not be affected; I was naively thinking that the data loss probability depended on the replication factor only.

Thankfully, it didn't take me long to realize I was wrong. What is confusing is that, for a constant replication factor, as the cluster grows the probability of data loss increases, but the quantity of data lost when a loss occurs decreases (if the total quantity of data remains constant).

To see why, consider a situation in which we have N nodes in a cluster with replication factor K. Let the probability of a single node failing in a given time period be X. This time period needs to be sufficiently small that the server administrator will not have enough time to replace the machine or drive and recover the data. The probability of experiencing data loss in that time period is the probability of K or more nodes failing, the exact value of which is given by the following sum:
$$P_{\text{loss}} = \sum_{i=K}^{N} \binom{N}{i} X^{i} (1-X)^{N-i}$$
Although in general a good approximation (a consistent overestimate) is simply:
$$P_{\text{loss}} \approx \binom{N}{K} X^{K}$$

Clearly as the size of N increases this probability must get bigger.
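As a quick illustration, here is a small R check (illustrative values only; X = 0.01 and K = 3 are assumptions) comparing the exact tail probability with the approximation as N grows:

# The probability of K or more node failures grows with cluster size N,
# and the binomial-coefficient approximation consistently overestimates it.
X <- 0.01   # assumed probability of a single node failing in the time period
K <- 3      # replication factor
for (N in c(10, 20, 50, 100)) {
  exact  <- pbinom(K - 1, N, X, lower.tail = FALSE)  # P(K or more failures)
  approx <- choose(N, K) * X^K                       # union-bound overestimate
  cat("N =", N, " exact =", exact, " approx =", approx, "\n")
}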

This got me thinking about how to determine the way the replication factor should scale with the cluster size in order to keep the probability of data loss constant (ignoring the quantity). This problem may have been solved elsewhere, but it was an enjoyable mathematical exercise to go through.

In essence we want to know: if the number of nodes in the cluster increases by some value n, what is the minimum number k such that the probability of data loss remains the same or smaller? Using the approximation from above we can express this as:
$$\binom{N+n}{K+k} X^{K+k} \le \binom{N}{K} X^{K}$$

Now if we substitute in the formulas for N-choose-K and perform some simplifications we can transform this into:
$$\frac{(N+n)!\, K!\, (N-K)!}{N!\, (K+k)!\, (N+n-K-k)!}\, X^{k} \le 1$$

I optimistically thought that it might be possible to simplify this using Stirling's Approximation, but I am now fairly certain that this is not possible. Ideally we would be able to express k in terms of N, n, K and X, but I do not think that it is possible. If you are reading this and can see that I am wrong, please show me how.

In order to get a sense of the relationship between n and k I decided to do some quick numerical simulations in R to have a look at how k scales with n.

I tried various combinations of X, N and K. Interestingly, for a constant X the scaling was fairly robust when you varied the initial values of N and K. I have plotted the results for three different values of X so you can see the effect of different probabilities of machine failure. In all three plots the baseline case was a cluster of 10 nodes with a replication factor of 3.
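The core of the search is simple enough to sketch here (this is an illustrative reconstruction, not the exact script from the repository; the per-node failure probability X = 0.01 and the grid of n values are assumptions):

# For each increase n in cluster size, find the smallest extra replication k
# such that the approximate (union-bound) data-loss probability of the
# (N + n)-node cluster is no larger than that of the baseline N-node cluster.
approx_loss <- function(N, K, X) choose(N, K) * X^K

min_extra_replication <- function(n, N = 10, K = 3, X = 0.01) {
  baseline <- approx_loss(N, K, X)
  k <- 0
  while (approx_loss(N + n, K + k, X) > baseline) {
    k <- k + 1
  }
  k
}

# Scaling of k with n for the baseline case: 10 nodes, replication factor 3
n_values <- seq(0, 1000, by = 10)
k_values <- sapply(n_values, min_extra_replication)
plot(n_values, k_values, type = "s",
     xlab = "additional nodes (n)", ylab = "additional replicas (k)")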

You can grab the R code used to generate these plots from my GitHub repository.

Tuesday, October 29, 2013

Philosophical Zombies and the Physical Basis of Consciousness



Given that The Walking Dead is back with a new season and World War Z has just ripped through the public consciousness, I thought that the philosophical implications of zombies would be a worthwhile subject for a post. I would not be the first person to have thought about what the notion of a zombie means for consciousness; in fact the zombie has a well-entrenched place in a series of arguments about the nature of the relationship between mind and matter.

Before we get underway, it is worth noting that to philosophers versed in the age-old tradition of the Thought Experiment, a zombie is not a flesh-eating monster that shambles around slowly decomposing and smelling disgusting. A philosophical zombie is merely a person without consciousness. I can hear you all respond "Que?" followed by quizzical silence. The idea is to ask whether we can conceive of a person who acts and behaves as we do without the mental qualia of consciousness, that is, the internal experience of seeing red, smelling roses and the pain of being pricked by a thorn.

The way this is taken to bear on our understanding of mind and brain relies on a second philosophical trick: the notion of a conceivability argument. This is the idea that if we can conceive of something then it is in some sense at least possible. Usually this is taken as metaphysical possibility, i.e. it may not be possible in this universe, but it is possible in some other universe. If you think this is a pretty slippery way to argue, then you are in good company. Nevertheless, it persists as a philosophical tool, and for the sake of this post I am going to grant it temporary validity.

Ok. So.

The argument goes as follows: physicalist explanations of consciousness require that there be some configuration of matter that corresponds to conscious states. If we can conceive of a zombie, then it is metaphysically possible that a being could exist that acts as we do yet is not conscious. Such a zombie would have the configuration of brain matter that produces the conscious-like behavior while lacking consciousness, and therefore that configuration of brain matter cannot be the source of consciousness.

However, even allowing the conceivability argument, this is still an invalid argument. The reason is that just because in Homo sapiens we observe certain configurations of brain matter that give rise to a set of behaviors and conscious states, it does not preclude the existence of other arrangements that produce the former but not the latter. It is equivalent to observing a bunch of four-legged tables and concluding that table-ness and four-legged-ness are a necessary combination. In reality other arrangements of legs can also make tables, and four legs does not always a table make.

Strengthening this objection is the fact that we know the micro-structure of our brains differs between individuals. In fact, this is the source of our individuality. While the macro-structural features of our brains are shared (thalamus, hypothalamus, corpus callosum, regions of the cerebral cortex and their inter-connectedness), the fine-grained structures that control our thoughts and actions are (virtually) unique. This means that in reality there is not a single configuration of brain matter that gives rise to a given set of behaviors and their corresponding conscious states, but rather a family of configurations.

There is nothing preventing this family of configurations being broader than we know it to be, with a certain (as yet unobserved) set of its members having the property of giving rise to the behaviors without the conscious states. This might seem far-fetched, but as I can conceive of it, it must be metaphysically possible.



Tuesday, September 3, 2013

Using AWK for Data Science

Over the years I have become convinced that one of the essential tools needed by anyone whose job consists of working with data is the unix scripting language AWK. It will save you an awful lot of time when it comes to processing raw text data files.

For example, you can take a large delimited file of some sort and pre-process its columns to pull out just the data you want, perform basic calculations, or prepare it for entry into a program that requires a specific format.

AWK has saved me countless hours over the years, so now I am writing a super brief primer that should not only convince you it is worth learning but show you some examples.

The first thing you need to know about AWK is that it is data driven, unlike most other languages, in which execution is determined largely by the procedural layout of the instructions. AWK instructions are defined by patterns in the data to which actions should be applied. If you are familiar with the regular-expression style control structures available in Perl then this should seem like a comfortable idea.

The programs are also data driven in the sense that the entire program is applied to every line of the file (wherever a pattern matches), and the program has built-in access to the columns of data inside the file through the $0, $1, $2, ... variables: $0 contains the entire line and $1 upwards contain the data from the individual columns. By default the columns are split on whitespace, but you can follow your AWK script with FS=',' to use a comma or any other field separator.

To run a simple AWK script type:

   >awk 'AWK_SCRIPT_HERE' FILE_TO_PROCESS

The basic syntax of the scripts themselves consists of multiple pattern-action pairs defined like this:
PATTERN {ACTION}

One need not include a PATTERN, in which case the action will be applied to every line of the file to which the program is applied.

So, for example, the following program will print the sum of columns 3 and 4 for each line:

>awk '{print $3+$4}' FILENAME

If we only wanted this to happen when column 1 contained the value 'COSTS' we have a number of options. We could simply use the pattern equivalent of an IF statement as follows:

>awk '$1=="COSTS" {print $3+$4}' FILENAME

Alternatively we could use a regular expression PATTERN as follows:

>awk '/COSTS/ {print $3+$4}' FILENAME

The problem with the second solution is that if for some reason the word COSTS can appear in other fields or places in the file, then you may not get the results you are looking for. There is a trade-off between the power and flexibility of regular expression patterns and their ability to lull us into a false sense of security about what they are doing.

There are several special execution paths that can be included in the program. In place of the pattern you can include the reserved words BEGIN or END in order to execute a routine before or after the file processing occurs. This is particularly useful for something like calculating a mean, as shown below:

>awk '{sum+=$1; count+=1} END {print sum/count}' FILENAME

By now you should be seeing the appeal of AWK. You can manipulate your data quickly with small scripts that do not require loading an enormous file into a spreadsheet, or writing a more complicated Java or Python program.

Finally, here are a few of the kinds of tasks that I do with AWK all the time:

1) Convert a file with 10 or more columns into one that sums a few of them and reformats the others:

>awk '{print toupper($1) "," ($3/100) "," ($2+$4-$5)}' FILENAME


2) Calculate the mean and standard deviation of a column. (The following computes the sample standard deviation; change the n-1 in the END block to n for the population version.)

> awk 'pass==1 {sum+=$1; n+=1} pass==2 {mean=sum/n; ssd+=($1-mean)*($1-mean)} END {print sqrt(ssd/(n-1))}' pass=1 FILENAME pass=2 FILENAME


3) Calculate the Pearson correlation coefficient between a pair of columns. (The n versus n-1 normalisation cancels out of the ratio, so the same script works for a sample or for the entire population.)

> awk 'pass==1 {sx+=$1; sy+=$2; n+=1} pass==2 {mx=sx/n; my=sy/n; cov+=($1-mx)*($2-my); ssdx+=($1-mx)*($1-mx); ssdy+=($2-my)*($2-my)} END {print cov / ( sqrt(ssdx) * sqrt(ssdy) ) }' pass=1 FILENAME pass=2 FILENAME

If you have any great tips for getting more out of AWK let me know, I am always looking for shortcuts.


Friday, June 21, 2013

Configure Chef to Install Packages from a Custom Repo

Going Nuts

This was driving me completely nuts last week. I could write Chef recipes to install packages from standard repos but I could not get Chef set up so that the recipe would add a new repo and then install packages using the Chef package configuration.

Truth be told I could do this in a really crude way. I could add a repo by writing a file on the node, and then run a command to install a package directly. It just wouldn't be managed by Chef.

The wrong way

I first created a new cookbook called myinstaller. Then in the templates directory I created a file called custom.repo.erb (the Chef template format) with the following contents:


[custom]
name=MyPackages
baseurl=http://chef.vb:8088/yum/Redhat/6/x86_64
enabled=1
gpgcheck=0

Note that the baseurl parameter is pointing to a yum repository I have created on one of the virtual machines running on my virtual network.


I then edited the recipe file recipes/default.rb and added the following:


template "custom.repo" do
  path "/etc/yum.repos.d/custom.repo"
  source "custom.repo.erb"
end

execute "installjdk" do
    command "yum -y --disablerepo='*' --enablerepo='custom' install jdk.x86_64"
end

This works. But it is crude because I am not using Chef to manage the packages that are installed.

The right way

When you search around on this topic you will come across pages like this one that talk about adding packages with the yum_package command. However, if you change the install section above to use this command it will not work. This seems to be because simply adding a file to the yum repos directory on the node (while recognized by yum on the machine itself) is not recognized by Chef.

I dug deeper, tried many different ways of adding that repo configuration, and eventually started finding references to the command 'yum_repository'. However, if you try to whack this command into your recipe it doesn't bloody work. It turns out that this is because it is not a command that is built into Chef (unlike 'package' and 'yum_package'); it is in fact a resource that comes from this open source cookbook for installing yum packages.

If you do not want to use this entire cookbook, the critical files to grab are as follows:
yum/resources/repository.rb
yum/providers/repository.rb
yum/templates/default/*
(I took the three files from this last directory, which may not be strictly necessary).

Now before you can use the command there are a couple of gotchas.
1) If you copy all of this to a new cookbook called myyum, then the repository command will be 'myyum_repository'
2) You will need to edit the file yum/providers/repository.rb: go to the bottom where the repo config is being written and change the line
cookbook "yum"
so that the name of your cookbook appears there instead.

You will now be able to add a repository by putting the following in a recipe

myyum_repository "custom" do
  description "MyRepo"
  url "http://chef.vb:8088/yum/Redhat/6/x86_64"
  repo_name "custom"
  action :add
end

yum_package "mypackage" do
  action :install
  flush_cache [:before]
end

Just upload your new cookbook: 
sudo knife cookbook upload myyum

Add the recipe to your node: 
knife node run_list add node2.vb 'recipe[myyum::default]'

And execute: 
knife ssh name:node2.vb -x <USER> -P <PASSWORD> "sudo chef-client"

Amazing



Tuesday, June 18, 2013

Configuration Management with Chef


Have you ever been through a long, tedious process of setting up a server, messing with configuration files all over the machine trying to get some poorly documented piece of software working? Finally it works, but you have no idea what you did. This is a constant frustration for loads of people. Configuration Management may be the answer.

Stated simply, you write scripts that will configure the server. Need to change something? You modify the script and rerun it. Keep the script in version control and branch it so that you can keep track of multiple experiments. All of the advantages of managed code are brought over to managing a server.

We are currently investigating using Chef, which, as brilliant as it appears to be, is sorely lacking in straightforward, complete and accurate tutorials. What I need with every new tool I use is a bare-bones, get-up-and-running walk-through. I don't need a highly branched and complete set of instructions designed to tutor people who already know what they are doing. This blog post is my attempt at exactly that.

So here we go.

Configuration

In this walk-through we are creating an entire networked Chef system using virtual machines. To do this we need to set up a local DNS server that will map names to IP addresses on the local virtual network. <<THIS PART IS ASSUMED>>

DNS server config

Once you have the DNS server set up you need to make a few modifications:
1) Set it so that it will forward unknown names to your Gateway DNS server: change the named configuration to forward to <<Gateway DNS>>
2) Add entries for all components of the Chef network. The following are assumed to exist:

dns.vb (DNS server) 192.168.56.200
chef.vb (Server) 192.168.56.199
node.vb (Node) 192.168.56.101

Host Configuration

Configure your host machine

1) Add dns.vb to /etc/hosts
2) Disable wireless (or other connections)
3) Fix the DNS server in your wired connection
       - In the IPV4 setting tab add the IP address of dns.vb


Virtual Machine Config

Once this is done you will need to configure each and every machine that is added to the system (Chef Server and Nodes).

1) Give the machine a name
A) Edit the /etc/sysconfig/network file and change the HOSTNAME= field
B) Run the command hostname <myhostname>

2) Give the machine a static IP address and set the DNS server

vi /etc/sysconfig/network-scripts/ifcfg-eth1
BOOTPROTO=none
IPADDR=<<IP ADDRESS>>
NETMASK=255.255.255.0
DNS1=192.168.56.200


3) Add the machine's IP address and hostname combination to the DNS server
A) Edit the file /etc/named/db.vb and add a line at the bottom for each hostname/IP combination
B) Restart the DNS server: service named restart

4) Prevent the eth0 connection from setting the DNS
vi /etc/sysconfig/network-scripts/ifcfg-eth0
BOOTPROTO=dhcp
PEERDNS=no

Set up a Server


This blog entry pretty much covers it
http://www.opscode.com/blog/2013/03/11/chef-11-server-up-and-running/

You basically grab the right version of Chef Server
 wget https://opscode-omnibus-packages.s3.amazonaws.com/el/6/x86_64/chef-server-11.0.8-1.el6.x86_64.rpm

Install it
sudo yum localinstall chef-server-11.0.8-1.el6.x86_64.rpm --nogpgcheck

Configure and start
sudo chef-server-ctl reconfigure

Set up a Workstation

The workstation is your host machine, where you will write recipes and from which you will deploy them to the nodes.

I started following the instructions here: http://docs.opscode.com/install_workstation.html
But that got confusing and inaccurate pretty quickly.

In summary, what I did was:

Start up a new virtual machine (configure network settings as above), then:
curl -L https://www.opscode.com/chef/install.sh | sudo bash

When that is finished check the install with

chef-client -v

There are three config files the workstation needs:
1) knife.rb - generate it with:
knife configure --initial
2) admin.pem - copy it from the server:
scp root@chef.vb:/etc/chef-server/admin.pem ~/.chef
3) chef-validator.pem - copy it from the server:
scp root@chef.vb:/etc/chef-server/chef-validator.pem ~/.chef

Set up a Node

A Node is a machine whose configuration you will manage using Chef.
Start up a new virtual machine (configure network settings as above), then install the Chef client onto the Node using the bootstrap process. To do this, run the following command on the workstation:

knife bootstrap node1.vb -x <username> -P <password> --sudo

Once this is done you can add recipes to the node and deploy them.

Create your first Cookbook

Create your first cookbook using the following command on your workstation:

sudo knife cookbook create mytest

There will now be a cookbook in the following location
/var/chef/cookbooks/mytest

You can go in and edit the default recipe file:
/var/chef/cookbooks/mytest/recipes/default.rb

Add something simple; for example, we will write out a file from a template:

template "test.template" do
  path "~/test.txt"
  source "test.template.erb"
end


Then create the template file
/var/chef/cookbooks/mytest/templates/default/test.template.erb

Add whatever text you like to the file.

Applying the Cookbook


The first thing to do is upload the cookbook to the server:

sudo knife cookbook upload mytest

Then add the cookbook to the node

knife node run_list add mynode 'recipe[mytest]'

Then use Knife to apply the cookbook using the Chef-client on the node

knife ssh name:mynode  -x <username> -P <password> "sudo chef-client"

Done!!!!