John Hawkins: 2013

Thursday, October 31, 2013

Maintaining Constant Probability of Data Loss with Increased Cluster Size in HDFS

In a conversation with a colleague some months ago I was asked if I knew how to scale the replication factor of a Hadoop Distributed File System (HDFS) cluster as the number of nodes increased in order to keep the probability of experiencing any data loss below a certain threshold. My initial reaction to the question was that it would not be affected; I was naively thinking that the probability of data loss depended on the replication factor only.

Thankfully, it didn't take me long to realize I was wrong. What is confusing is that, for a constant replication factor, as the cluster grows the probability of data loss increases but the quantity of data lost decreases (if the total quantity of data remains constant).

To see why, consider a situation in which we have N nodes in a cluster with replication factor K. We let the probability of a single node failing in a given time period be X. This time period needs to be short enough that the server administrator will not have time to replace the machine or drive and recover the data. The probability of experiencing data loss in that time period is the probability of K or more nodes failing. The exact value is given by the following sum:
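Assuming node failures are independent, each with probability X, that sum is the upper tail of a binomial distribution. A reconstruction under that assumption:

```latex
P(\text{at least } K \text{ of } N \text{ nodes fail}) = \sum_{i=K}^{N} \binom{N}{i} X^{i} (1 - X)^{N - i}
```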

Although in general a good approximation (a consistent overestimate) is simply:
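One approximation that fits this description is the union bound over every possible set of K nodes. It counts overlapping failure events more than once, which is why it can only overestimate. This is a sketch of what the approximation plausibly looks like, not a confirmed copy of the original formula:

```latex
P(\text{data loss}) \approx \binom{N}{K} X^{K}
```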

Clearly, as N increases this probability must get bigger.

This got me thinking about how to determine the way the replication factor should scale with the cluster size in order to keep the probability of data loss constant (ignoring the quantity). This problem may have been solved elsewhere, but it was an enjoyable mathematical exercise to go through.

In essence, if the number of nodes in the cluster increases by some value n, we want to know the minimum number k by which the replication factor must increase so that the probability of data loss remains the same or smaller. Using the approximation from above we can express this as:
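Assuming the approximation is N-choose-K times X to the power K, and reading n and k as the increases in cluster size and replication factor, the condition can be sketched as:

```latex
\binom{N + n}{K + k} X^{K + k} \le \binom{N}{K} X^{K}
```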

Now if we substitute in the formulas for N-choose-K and perform some simplifications we can transform this into:
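The exact simplified form is not recoverable here, so the following is only one possible route. Starting from the assumed condition that (N+n)-choose-(K+k) times X^(K+k) is at most N-choose-K times X^K, dividing both sides by X^K and expanding the binomial coefficients into factorials gives:

```latex
\frac{(N + n)!}{(K + k)!\,(N + n - K - k)!}\, X^{k} \le \frac{N!}{K!\,(N - K)!}
\quad \Longleftrightarrow \quad
X^{k} \le \frac{N!\,(K + k)!\,(N + n - K - k)!}{(N + n)!\,K!\,(N - K)!}
```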

I optimistically thought that it might be possible to simplify this using Stirling's approximation, but I am now fairly certain that it is not. Ideally we would be able to express k in terms of N, n, K and X, but I do not think that this is possible. If you are reading this and can see that I am wrong, please show me how.

In order to get a sense of the relationship between n and k I decided to do some quick numerical simulations in R to have a look at how k scales with n.

I tried various combinations of X, N and K. Interestingly, for a constant X the scaling was fairly robust when I varied the initial values of N and K. I have plotted the results for three different values of X so you can see the effect of different probabilities of machine failure. In all three plots the baseline case was a cluster of 10 nodes with a replication factor of 3.
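The R code itself is not reproduced in this post. As a rough stand-in, here is a minimal sketch in shell and awk that searches for the smallest k satisfying the approximate condition C(N+n, K+k) * X^(K+k) <= C(N, K) * X^K. The function name min_extra_replicas and the use of that particular inequality are assumptions about what the simulation did, not the original code.

```shell
# Hypothetical helper: smallest k such that
#   C(N+n, K+k) * X^(K+k) <= C(N, K) * X^K
# Arguments: N K X n. Works in log space to avoid overflow.
min_extra_replicas() {
  awk -v N="$1" -v K="$2" -v X="$3" -v n="$4" '
    # log of (a choose b), built up as a sum of logs
    function logchoose(a, b,    i, s) {
      s = 0
      for (i = 1; i <= b; i++) s += log(a - b + i) - log(i)
      return s
    }
    BEGIN {
      target = logchoose(N, K) + K * log(X)
      for (k = 0; K + k <= N + n; k++) {
        if (logchoose(N + n, K + k) + (K + k) * log(X) <= target + 1e-12) {
          print k
          exit
        }
      }
      print "none"
    }'
}

# Baseline from the post: 10 nodes, replication factor 3, grown by 10 nodes.
min_extra_replicas 10 3 0.01 10
min_extra_replicas 10 3 0.1 10
```

Looping this over a range of n values for a fixed X gives the kind of k-versus-n curve described above.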

You can grab the R code used to generate these plots from my GitHub repository.

Tuesday, October 29, 2013

Philosophical Zombies and the Physical Basis of Consciousness

Given that The Walking Dead is back with a new season and World War Z has just ripped through the public consciousness, I thought that the philosophical implications of zombies would be a worthwhile subject for a post. I would not be the first person to have thought about what the notion of a zombie means for consciousness, and in fact the zombie has a well entrenched place in a series of arguments about the nature of the relationship between mind and matter.

Before we get under way, it is worth noting that to philosophers versed in the age old tradition of the Thought Experiment, a zombie is not a flesh eating monster that shambles around slowly decomposing and smelling disgusting. A philosophical zombie is merely a person without consciousness. I can hear you all respond "Que?" followed by quizzical silence. The idea is to ask if we can conceive of a person who acts and behaves like we do without the mental qualia of consciousness, that is the internal experience of seeing red, smelling roses and the pain of being pricked by a thorn.

The way this is taken to impact on our understanding of mind and brain relies on a second philosophical trick: the notion of a conceivability argument. This is the idea that if we can conceive of something then it is in some sense at least possible. Usually this is taken as metaphysical possibility, i.e. that it may not be possible in this universe, but in some other universe. If you think this is a pretty slippery way to argue, then you are in good company. Nevertheless, it persists as a philosophical tool, and for the sake of this post I am going to grant it temporary validity.

Ok. So.

The argument goes as follows: physicalist explanations of consciousness require that there be some configuration of matter that corresponds to conscious states. If we can conceive of a zombie, then it is metaphysically possible that a being could exist that can act as we do, yet is not conscious. That means the zombie must have the configuration of brain matter that allows the specific conscious-like behavior, and therefore that configuration of brain matter cannot be the source of consciousness.

However, even allowing the conceivability argument, this is still an invalid argument. The reason is that just because for homo sapiens we observe certain configurations of brain matter that give rise to the set of behaviors and conscious states, it does not preclude the existence of other arrangements that have the former but not the latter. It is equivalent to observing a bunch of four legged tables and concluding that table-ness and four-legged-ness are a necessary combination. In reality other arrangements of legs can also make tables, and four legs does not always a table make.

Strengthening this objection is the fact that the micro-structure of our brains differs between individuals. In fact, this is the source of our individuality. While the macro-structural features of our brains are shared (thalamus, hypothalamus, corpus callosum, regions of the cerebral cortex and their inter-connectedness), the fine grained structures that control our thoughts and actions are (virtually) unique. This means that in reality there is not a single configuration of brain matter that gives rise to a given set of behaviors and their corresponding conscious states, but rather a family of configurations.

There is nothing preventing this family of configurations being broader than we know them to be, and a certain (as of yet unobserved) set of them having the property of giving rise to behaviors without conscious states. This might seem far-fetched, but as I can conceive of it, it must be metaphysically possible.

Tuesday, September 3, 2013

Using AWK for Data Science

Over the years I have become convinced that one of the essential tools needed by anyone whose job consists of working with data is the Unix scripting language AWK. It will save you an awful lot of time when it comes to processing raw text data files.

For example, you can take a large delimited file of some sort and pre-process its columns to pull out just the data you want, perform basic calculations, or prepare it for entry into a program that requires a specific format.

AWK has saved me countless hours over the years, so now I am writing a super brief primer that should not only convince you it is worth learning but show you some examples.

The first thing you need to know about AWK is that it is data driven, unlike most other languages for which execution is constrained largely by procedural layout of the instructions. AWK instructions are defined by patterns in the data to which actions should be applied. If you are familiar with the regular expression type control structures available in PERL then this should seem like a comfortable idea.

The programs are also data driven in the sense that the entire program is applied to every line of the file (as long as there are patterns that match). Furthermore, the program has built-in access to the columns of data inside the file through the $0, $1, $2 ... variables, where $0 contains the entire line and $1 upwards hold the data from individual columns. By default the columns are split on runs of whitespace (spaces or tabs), but you can follow your AWK script with FS=',' to use a comma or any other field separator.

To run a simple AWK script type:

   >awk 'AWK_SCRIPT_HERE' FILE_TO_PROCESS
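For instance, here is a small sketch of the field variables and of two ways to set the separator. The throwaway file, its contents and the variable names (demo, whole, second, third) are invented for this example.

```shell
# Build a one-line comma-separated file to experiment with (contents invented).
demo=$(mktemp)
printf 'COSTS,rent,1200,300\n' > "$demo"

# $0 is the whole line, regardless of the separator.
whole=$(awk '{print $0}' "$demo")

# -F sets the field separator before the script runs.
second=$(awk -F',' '{print $2}' "$demo")

# Equivalent: assign FS after the script, before the file name.
third=$(awk '{print $3}' FS=',' "$demo")

echo "$whole"
echo "$second"
echo "$third"
rm -f "$demo"
```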

The basic syntax of the scripts themselves consists of multiple pattern-action pairs defined like this:

PATTERN {ACTION}

One need not include a PATTERN, in which case the action will be applied to every line of the file to which the program is applied.

So for example, the following program will output the sum of columns 3 and 4:

>awk '{print $3+$4}' FILENAME

If we only wanted this to happen when column 1 contained the value 'COSTS' we have a number of options. We could simply use the pattern equivalent of an IF statement as follows:

>awk '$1=="COSTS" {print $3+$4}' FILENAME

Alternatively we could use a PATTERN expression as follows

>awk '/COSTS/ {print $3+$4}' FILENAME

The problem with the second solution is that if the word COSTS can appear in other fields or places in the file, then you may not get the results you are looking for. The power and flexibility of regular expression patterns come with a trade-off: they can lull us into a false sense of security about what they are doing.

There are several special execution paths that can be included in the program. In place of the pattern you can include the reserved words BEGIN or END in order to execute a routine before or after the file processing occurs. This is particularly useful for doing something like calculating a mean, shown below:

>awk '{sum+=$1; count+=1} END {print sum/count}' FILENAME
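The mean example only uses END. Here is a sketch of BEGIN alongside it, assuming a comma-separated file whose second column holds the numbers (the sample data and the variable names prices and mean are invented):

```shell
prices=$(mktemp)
printf 'apple,2\npear,4\nplum,9\n' > "$prices"

# BEGIN runs once before any input is read: set the separator there.
# END runs once after the last line: guard against an empty file before dividing.
mean=$(awk 'BEGIN {FS=","} {sum+=$2; count+=1} END {if (count > 0) print sum/count}' "$prices")

echo "$mean"
rm -f "$prices"
```

Setting FS in BEGIN keeps the separator inside the script itself, which is handy once a one-liner grows into a saved program.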

By now you should be seeing the appeal of AWK. You can manipulate your data quickly with small scripts that do not require loading an enormous file into a spreadsheet, or writing a more complicated Java or Python program.

Finally here are a few of the kinds of tasks that I do with AWK all the time

1) Convert some file with 10 or more columns into one with a sum of a few columns and the others reformatted:

>awk '{print toupper($1) "," ($3/100) "," ($2+$4-$5)}' FILENAME

2) Calculate the mean and standard deviation of a column. (The following is for a sample; change the n-1 in the final square root to n for a complete population. The mean itself always divides by n.)

> awk 'pass==1 {sum+=$1; n+=1} pass==2 {mean=sum/n; ssd+=($1-mean)*($1-mean)} END {print mean, sqrt(ssd/(n-1))}' pass=1 FILENAME pass=2 FILENAME

3) Calculate the Pearson correlation coefficient between a pair of columns. The means divide by n, and the n-1 (or n) divisors of the covariance and the standard deviations cancel, so the same command works for a sample or a population.

> awk 'pass==1 {sx+=$1; sy+=$2; n+=1} pass==2 {mx=sx/n; my=sy/n; cov+=($1-mx)*($2-my); ssdx+=($1-mx)*($1-mx); ssdy+=($2-my)*($2-my)} END {print cov / ( sqrt(ssdx) * sqrt(ssdy) ) }' pass=1 FILENAME pass=2 FILENAME
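A quick way to sanity-check two-pass scripts like these is to run them on a tiny file where the answer is known. The sketch below uses invented data in which the second column is exactly twice the first, so the correlation should come out as 1. Note that it computes each mean as sum/n and keeps n-1 only for the sample standard deviation.

```shell
pairs=$(mktemp)
# Two whitespace-separated columns; y is exactly 2*x.
printf '1 2\n2 4\n3 6\n4 8\n' > "$pairs"

# Sample standard deviation of column 1 (two passes over the same file).
sd=$(awk 'pass==1 {sum+=$1; n+=1}
          pass==2 {mean=sum/n; ssd+=($1-mean)*($1-mean)}
          END {printf "%.4f\n", sqrt(ssd/(n-1))}' pass=1 "$pairs" pass=2 "$pairs")

# Pearson correlation between columns 1 and 2.
r=$(awk 'pass==1 {sx+=$1; sy+=$2; n+=1}
         pass==2 {mx=sx/n; my=sy/n; cov+=($1-mx)*($2-my); ssdx+=($1-mx)*($1-mx); ssdy+=($2-my)*($2-my)}
         END {printf "%.4f\n", cov/sqrt(ssdx*ssdy)}' pass=1 "$pairs" pass=2 "$pairs")

echo "sd=$sd r=$r"
rm -f "$pairs"
```

For the column 1, 2, 3, 4 the squared deviations sum to 5, so the sample standard deviation is sqrt(5/3), roughly 1.2910.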

If you have any great tips for getting more out of AWK let me know; I am always looking for shortcuts.

Friday, June 21, 2013

Configure Chef to Install Packages from a Custom Repo

Going Nuts

This was driving me completely nuts last week. I could write Chef recipes to install packages from standard repos but I could not get Chef set up so that the recipe would add a new repo and then install packages using the Chef package configuration.

Truth be told I could do this in a really crude way. I could add a repo by writing a file on the node, and then run a command to install a package directly. It just wouldn't be managed by Chef.

The wrong way

I first created a new cookbook called myinstaller. Then in the templates directory I created a file called custom.repo.erb (The Chef template format) with the following contents:

[custom]
name=MyPackages
baseurl=http://chef.vb:8088/yum/Redhat/6/x86_64
enabled=1
gpgcheck=0

Note that the baseurl parameter is pointing to a yum repository I have created on one of the virtual machines running on my virtual network.

I then edited the recipe file recipes/default.rb and added the following:

template "custom.repo" do
  path "/etc/yum.repos.d/custom.repo"
  source "custom.repo.erb"
end

execute "installjdk" do
  command "yum -y --disablerepo='*' --enablerepo='custom' install jdk.x86_64"
end

This works. But it is crude because I am not using Chef to manage the packages that are installed.

The right way

When you search around into this topic you will come across pages like this one that talk about adding packages with the yum_package command. However if you change the install section above to use this command it will not work. It seems to be related to the fact that simply adding a file to the yum repos directory on the node (while recognized on the machine itself) is not recognized by Chef.

I dug deeper and tried many different versions of adding that repo configuration, and I eventually started finding references to the command 'yum_repository'. However, if you try to whack this command into your recipe it doesn't bloody work. It turns out that this is because it is not a command that is built into Chef (unlike 'package' and 'yum_package'); it is in fact a command that comes from this open source cookbook for installing yum packages.

If you do not want to use this entire cookbook the critical files to grab are as follows

yum/resources/repository.rb

yum/providers/repository.rb

yum/templates/default/*

(I took the three files from this last directory, which may not be strictly necessary).

Now before you can use the command there are a couple of gotchas.

1) If you copy all of this to a new cookbook called myyum, then the repository command will now be 'myyum_repository'

2) You will need to edit the file yum/providers/repository.rb

go to the bottom where the repo config is being written and change the line:

cookbook "yum"

So that the name of your cookbook appears there instead.
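The copy-and-rename steps above can be sketched as a small shell helper. The function name copy_repository_lwrp and the example paths are hypothetical, and the sed pattern assumes the provider contains the line cookbook "yum" exactly as quoted above.

```shell
# Hypothetical helper: copy the repository resource, provider and templates
# from an unpacked "yum" cookbook into a new cookbook, then point the
# provider at the new cookbook name.
copy_repository_lwrp() {
  src=$1; dest=$2; name=$3
  mkdir -p "$dest/resources" "$dest/providers" "$dest/templates/default"
  cp "$src/resources/repository.rb" "$dest/resources/"
  cp "$src/providers/repository.rb" "$dest/providers/"
  cp "$src"/templates/default/* "$dest/templates/default/"
  # Rewrite: cookbook "yum"  ->  cookbook "<name>" in the copied provider.
  sed "s/cookbook \"yum\"/cookbook \"$name\"/" "$dest/providers/repository.rb" \
    > "$dest/providers/repository.rb.tmp" &&
    mv "$dest/providers/repository.rb.tmp" "$dest/providers/repository.rb"
}

# Usage (paths are placeholders):
#   copy_repository_lwrp ./yum ./myyum myyum
```

Writing to a temporary file and moving it into place avoids relying on sed's in-place flag, which behaves differently between GNU and BSD sed.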

You will now be able to add a repository by putting the following in a recipe

myyum_repository "custom" do
  description "MyRepo"
  url "http://chef.vb:8088/yum/Redhat/6/x86_64"
  repo_name "custom"
  action :add
end

yum_package "mypackage" do
  action :install
  flush_cache [:before]
end

Just upload your new cookbook:

sudo knife cookbook upload myyum

Add the recipe to your node:

knife node run_list add node2.vb 'recipe[myyum::default]'

And execute:

knife ssh name:node2.vb -x <USER> -P <PASSWORD> "sudo chef-client"

Amazing

Tuesday, June 18, 2013

Configuration Management with Chef

Have you ever been through a long tedious process of setting up a server, messing with configuration files all over the machine trying to get some poorly documented piece of software working? Finally it works, but you have no idea what you did. This is a constant frustration for loads of people. Configuration Management is (maybe) the answer.

Stated simply, you write scripts that will configure the server. Need to change something? You modify the script and rerun. Keep the script in version control and branch it so that you can keep track of multiple experiments. All of the advantages of managed code are brought over to managing a server.

We are currently investigating using Chef, which as brilliant as it appears to be, is sorely lacking in straightforward, complete and accurate tutorials. What I need with every new tool I use is a bare bones get up and running walk-through. I don't need to see a highly branched and complete set of instructions designed to tutor people who already know what they are doing. This blog post is my attempt at a bare bones attack.

So here we go.

Configuration

In this walk-through we are creating an entire networked Chef system using virtual machines. To do this we need to set up a local DNS server that will map names to IP addresses on the local virtual network. <<THIS PART IS ASSUMED>>

DNS server config

Once you have the DNS server set up you need to make a few modifications.
Set it so that it will forward unknown names to your Gateway DNS server
Change the named configuration to forward to <<Gateway DNS>>

Add entries for all components of the Chef networks. The following are assumed to exist

dns.vb (DNS server) 192.168.56.200

chef.vb (Server) 192.168.56.199

node.vb (Node) 192.168.56.101

Host Configuration

Configure your host machine

1) Add dns.vb to /etc/hosts
2) Disable wireless (or other connections)
3) Fix the DNS server in your wired connection
- In the IPV4 setting tab add the IP address of dns.vb
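Step 1 can be scripted. The following is a minimal sketch, assuming the dns.vb address listed earlier (192.168.56.200); the helper name add_host_entry is made up for this example, and it takes the hosts file path as an argument so it can be tried on a copy first.

```shell
# Hypothetical helper: append "IP NAME" to a hosts file unless NAME is already listed.
add_host_entry() {
  hosts_file=$1; ip=$2; name=$3
  # -F: match the name literally, -w: whole words only, -q: no output
  if ! grep -qwF "$name" "$hosts_file"; then
    printf '%s %s\n' "$ip" "$name" >> "$hosts_file"
  fi
}

# On the host this needs root privileges, for example from a root shell:
#   add_host_entry /etc/hosts 192.168.56.200 dns.vb
```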

Virtual Machine Config

Once this is done you will need to configure each and every machine that is added to the system (Chef Server and Nodes).

1) Give the machine a name
A) Edit the /etc/sysconfig/network file and change the HOSTNAME= entry
B) Run the command hostname <myhostname>

2) Give the machine a static IP address and set the DNS server

vi /etc/sysconfig/network-scripts/ifcfg-eth1
BOOTPROTO=none
IPADDR=<<IP ADDRESS>>
NETMASK=255.255.255.0
DNS1=192.168.56.200

3) Add the machine's IP address and hostname combination to the DNS server
A) Edit the file /etc/named/db.vb and add a line at the bottom for each hostname IP combination
B) Restart the DNS server : service named restart

4) Prevent the eth0 connection from setting the DNS
vi /etc/sysconfig/network-scripts/ifcfg-eth0
BOOTPROTO=dhcp
PEERDNS=no

Set up a Server

This blog entry pretty much covers it
http://www.opscode.com/blog/2013/03/11/chef-11-server-up-and-running/

You basically grab the right version of Chef Server
wget https://opscode-omnibus-packages.s3.amazonaws.com/el/6/x86_64/chef-server-11.0.8-1.el6.x86_64.rpm

Install it
sudo yum localinstall chef-server-11.0.8-1.el6.x86_64.rpm --nogpgcheck

Configure and start
sudo chef-server-ctl reconfigure

Set up a Workstation

The workstation is your host machine, where you will write recipes and from which you will deploy them to the nodes.

I started following the instructions here: http://docs.opscode.com/install_workstation.html
But that got confusing and inaccurate pretty quickly.

In summary, what I did was:

Start up a new virtual machine (configure network settings as above), then:
curl -L https://www.opscode.com/chef/install.sh | sudo bash

When that is finished check the install with

chef-client -v

There are three config files the workstation needs:

knife.rb
knife configure --initial

admin.pem
scp root@chef.vb:/etc/chef-server/admin.pem ~/.chef

chef-validator.pem
scp root@chef.vb:/etc/chef-server/chef-validator.pem ~/.chef

Set up a Node

A Node is a machine for which you will manage the configuration using Chef.
Start up a new virtual machine (configure network settings as above), then install the Chef client onto the Node using the bootstrap process.
To do this, run the following command on the workstation:

knife bootstrap node1.vb -x <username> -P <password> --sudo

Once this is done you can add recipes to the node and deploy them.

Create your first Cookbook

Create your first cookbook using the following command on your workstation:

sudo knife cookbook create mytest

There will now be a cookbook in the following location
/var/chef/cookbooks/mytest

You can go in and edit the default recipe file:
/var/chef/cookbooks/mytest/recipes/default.rb

Add something simple, for example we will write out a file from a template.

template "test.template" do
  # Chef does not expand "~", so give an absolute path
  path "/tmp/test.txt"
  source "test.template.erb"
end

Then create the template file
/var/chef/cookbooks/mytest/templates/default/test.template.erb

Add whatever text you like to the file.

Applying the Cookbook

First thing to do is upload the cookbook to the server

sudo knife cookbook upload mytest

Then add the cookbook to the node

knife node run_list add mynode 'recipe[mytest]'

Then use Knife to apply the cookbook using the Chef-client on the node

knife ssh name:mynode -x <username> -P <password> "sudo chef-client"

Done!!!!

Wednesday, May 29, 2013

Machine Learning for Hackers

I recently read "Machine Learning for Hackers" by Drew Conway and John Myles White.

I'd picked it up because I heard it was a good way to get familiar with the data mining capabilities of R. I also expected the case study based approach to be a good way to see how they approach a broad array of machine learning problems. In these respects I was reasonably well rewarded. You will find a bunch of R code scraps that can be reused with a little effort. Unfortunately the explanation of what the code does (and how) is often absent. In this sense the book is true to its name: you will learn some recipes for tackling certain problems, but you may not understand how the code works, let alone the technique being applied.

The one issue I found unforgivable is that in the instances where the authors talk about machine learning theory, or use its terms, they are often wrong. One example is the application of naive Bayes to spam classification. The scoring function they use is the commonly used likelihood times the prior, leaving off the evidence divisor.

As a method of scoring in Bayesian methods this is appropriate because it is proportional to calculating the full posterior probability, and much more efficient to compute. However, the resulting score is not a probability, yet the authors continuously refer to it as one. This may seem minor, but to me it undermined my confidence in their ability to communicate necessary details about the techniques they are applying.
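Written out for the spam case, with x standing for the features of a message (a symbol introduced here for illustration), the distinction is between the full posterior and the score that drops the evidence term P(x):

```latex
P(\text{spam} \mid x) = \frac{P(x \mid \text{spam})\, P(\text{spam})}{P(x)},
\qquad
\text{score}(x) = P(x \mid \text{spam})\, P(\text{spam}) \;\propto\; P(\text{spam} \mid x)
```

Because P(x) is the same for every class, comparing scores picks the same class as comparing posteriors, but the scores across classes do not sum to one.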

Another example: in the section on distance metrics the authors state that multiplying a matrix by its transpose computes “the correlation between every pair of columns in the original matrix.” This is also wrong. What they want to say is that it produces a matrix of scores that indicate the correlation between the rows. It is an approximation because the score depends on the length of the columns and whether they have been normalised. These values would not be comparable between matrices. What would be comparable between matrices is a correlation coefficient, but this is not what is being computed.

I am not suggesting that a hacker's guide to machine learning should include a thorough theoretical treatment of the subject. I think only that where terms and theory are introduced they should be used correctly. By this criterion this book is a failure. However, for my purposes (grabbing some code snippets for doing analysis with R) it was moderately successful. My largest disappointment was that, given the mistakes I noticed regarding the topics about which I have reasonable knowledge, I have no confidence in their explanation of those areas where I am ignorant.

Thursday, May 2, 2013

Top 8 Essential Tweaks for New Installations of Ubuntu 12.04

Having just upgraded to 12.04 there are a bunch of things that I found I needed to do to get it working how I wanted to.

1) Install the Classic Application menu

It is beyond me why the hierarchical applications menu has been removed in this version of Ubuntu. It also seems that the new left hand launcher only displays apps installed from the 'Ubuntu Software Centre.' Applications installed from Synaptic are lost and don't always seem to show up in the new Dash.

So to get the classic application menu: Open a terminal ( Ctrl – Alt – T ) and add the following PPA.

sudo apt-add-repository ppa:diesch/testing

Then update and install the classic menu

sudo apt-get update && sudo apt-get install classicmenu-indicator

2) Install the restricted extras

Allows you to listen to mp3s and watch loads of encrypted video formats.

sudo apt-get install ubuntu-restricted-extras

3) Enable 'Show Remaining Space Left' Option in Nautilus File Browser

Again, why this is not on by default is beyond me. Extremely useful.

Open Nautilus. Go to View - Statusbar. Enable it, nuff said.

4) Calculator Lens/Scope for Ubuntu 12.04

One upside of the new Ubuntu Dash is the bunch of information-rich widgets integrated into the OS. You can get info on weather, cities and films, and do calculations directly from the HUD.

sudo add-apt-repository ppa:scopes-packagers/ppa
sudo apt-get update
sudo apt-get install unity-lens-utilities unity-scope-calculator
sudo apt-get install unity-scope-rottentomatoes
sudo apt-get install unity-scope-cities

5) Open in Terminal Nautilus Extension

Allows you to open a terminal that is already inside the folder you are currently browsing with Nautilus. This saves me oodles of time.

sudo apt-get install nautilus-open-terminal

6) Install CPU/Memory Indicator Applet

Sweet little widget to view system resource usage stats

sudo add-apt-repository ppa:indicator-multiload/stable-daily
sudo apt-get update
sudo apt-get install indicator-multiload

7) Install Spotify

Music streaming service desktop client. This info comes directly from their laboratories:
https://www.spotify.com/au/download/previews/

Add the spotify repo by editing /etc/apt/sources.list
Add the line:
deb http://repository.spotify.com stable non-free

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 94558F59

sudo apt-get update && sudo apt-get install spotify-client

8) Install Synergy

Synergy is an application that lets you share your mouse and keyboard across computers. More than that, it also shares your clipboard, so you can copy text between machines. You can't copy files for the moment, but maybe if we all donate to the cause (http://synergy-foss.org/download/?donate) we can request that feature.

You can download a Debian package here:

http://synergy-foss.org/download/

Then just install it with

sudo dpkg -i <synergy package name here>

Monday, April 22, 2013

Ubuntu on Toshiba Satellite P870 / 05P

I just bought a new Toshiba Satellite P870 / 05P laptop. It has amazing specs, but ten minutes of playing with Windows 8 convinced me that I don't even want to dual boot. I just wanted it off my machine.

Unfortunately I then discovered that Ubuntu 12.04 has no device drivers for the wireless or Ethernet cards on this machine. Loads of head scratching and searching and I eventually found a link to a device driver on various Ubuntu forums that can be installed.

I followed the instruction about halfway down this page

However I had to use this archive instead of the one listed because Ubuntu 12.04 uses the 3.5 Kernel.

I really should read the source myself to make sure nothing nasty has been inserted into this code, but for now I am depending on the goodwill of my fellow Ubuntu users.

Tuesday, April 9, 2013

iOS Renewal Process

You would think that renewing your development membership would be all that you need to do on a yearly basis to keep working as an Apple developer. It should be just: pay the fee and keep on developing. Unfortunately it is not that simple.

Your certificates and provisioning profiles need to be renewed, regenerated, and installed before you can continue. As I have not found a reasonable walk-through for this process, either from Apple or on their forums, I will quickly sketch it out here:

1) Clear out your old Provisioning profiles in Xcode

Open the Xcode Organizer. Select "Provisioning Profiles," go through the list and delete all the expired profiles.

2) Remove your existing certificates in Keychain Access

Open the Utilities folder in your Mac's Applications. Open Keychain Access and then select "My Certificates." You will see your expired certificates (Dev and Dist) listed. Remove them both.

3) Create new certificates

Keeping Keychain Access open, click:
Keychain Access>Certificate Assistant>Request a Certificate From a Certificate Authority.
Choose "Save to Disk" and save the request file.

Open the Certificates section of the iOS Provisioning Portal.

Delete the existing Development Certificate.
Click the "+" Symbol to create a new development certificate.
Select the top option "iOS App Development"
Click Continue.
Upload the certificate request file you created and finish.

Click "+" again to create a distribution certificate.
Choose the "App Store and Ad Hoc" option and continue.
Upload the certificate request file and finish.

4) Regenerate the Provisioning Profiles

Click the "Provisioning Profiles" in the iOS Provisioning Portal.
Go through each of your development and distribution profiles and edit them.
When you edit them you will see an option to select the new certificate that you generated. Once selected the "Generate" button will become active, click to generate and download the new profiles.

5) Install the Certificates and Provisioning Profiles

Install the downloaded certificates and provisioning profiles by dragging them into Keychain Access and XCode respectively.

You can now test your development apps and distribute them to the app store just like before you renewed. Just remember to select the correct profile when you are building your app.

Friday, March 8, 2013

Configuring Apache for a Local Site on Ubuntu

Introduction

This is a simple task, setting up my local machine so that I can browse directly to a hostname such as "mysite" and Apache will find the right project. This is a great way to test that paths will work as you expect when a site goes onto the production server. You will just need a config file in the site so it knows when it is on the development server (your local machine) and when it is live.

As simple as this is, I have to look it up every time I do it.

So, in the interests of improving my own efficiency and maybe helping someone else I am blogging my process.

Dependencies

Ubuntu Lucid: 10.04
(Check this with: cat /etc/lsb-release )

Apache/2.2.14 (Ubuntu)
(Check this with: /usr/sbin/apache2 -v )

Process

First add the new site to your hosts file. Edit it with:

sudo vi /etc/hosts

Then change or add the line:

127.0.0.1       localhost mysite

Next you need to configure Apache to recognise the site. You need to create a config file for your site in the sites-available directory:

/etc/apache2/sites-available/mysite

with something like the following contents:

<VirtualHost *:80>
    ServerName mysite
    DocumentRoot /home/username/mysite/www
    <Directory /home/username/mysite/www>
        Options Indexes FollowSymLinks Includes
        AllowOverride All
        Order allow,deny
        Allow from all
    </Directory>
    RewriteEngine On
    RewriteOptions inherit
</VirtualHost>

Then, you just need to enable the site with the Apache script a2ensite, like this:

sudo a2ensite mysite

Then reload apache

sudo /etc/init.d/apache2 reload

...and voila!!! You can now browse directly to http://mysite

Wednesday, January 30, 2013

Maximum Likelihood Estimation

Maximum Likelihood Estimation is a widely applicable method for estimating the parameters of a probabilistic model.

Developed by R. A. Fisher in the 1920s, the principle behind it is that the ideal parameter settings of a model are the ones that make the observed data most likely.

It is applicable in any situation in which the model can be specified such that the probability of the desired variable y can be expressed as a parameterised function over the vector of observed variables (X).

P(y|X) = f(X,φ)

The parameters φ of the function f(X, φ) are what we want to estimate.

The model is designed to be a function in which the parameters are set and we get back a probability value for a given X. However, we need a process to determine these model parameters. The Likelihood function is defined to be equal to this function, but operating as a function over the parameter space of φ.

L(φ | y,X )= P(y|X,φ)

It is important to recognise that Likelihood is not the probability of the parameters, it is just equal to the probability of y given the parameters. As such it is not a probability distribution over φ.

If we have N independent observations in our data set, and we let D represent all N of these observations of X and y, then we can express the Likelihood function for this entire data set D as:

L(φ | D) = ∏_{i=1}^{N} P(y_i | X_i, φ)

Maximum Likelihood is then simply defined as the argmax over φ of this function. Finding the value of φ that maximises this function can be done in a number of ways.

To find an analytical solution to the Likelihood equation we find the partial derivative of the function with respect to each of the parameters. We then solve this system of equations for the parameter values such that the partial derivatives are equal to zero. This gives us a stationary point that may be a maximum, a minimum or a saddle point. We then find the second partial derivatives and make sure they are negative at the points found in the first step (with more than one parameter, that the matrix of second derivatives is negative definite). This will give us an analytical peak on the Likelihood surface.
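As a worked illustration of this recipe (not part of the original derivation), take the simplest case: h successes in n independent Bernoulli trials with unknown success probability p. The symbols h, n and p are introduced only for this example.

```latex
L(p \mid D) = p^{h} (1 - p)^{n - h}

\frac{\partial L}{\partial p} = p^{h-1} (1 - p)^{n-h-1} \left( h - n p \right) = 0
\quad \Longrightarrow \quad \hat{p} = \frac{h}{n}
```

For 0 < h < n the Likelihood is zero at p = 0 and p = 1 and positive in between, so this single interior stationary point is the maximum.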

The practicality of maximising the Likelihood by searching the parameter space depends a great deal on the problem. Numerous tricks are used to simplify the problem. The natural logarithm of the Likelihood function is often taken because the logarithm is monotonically increasing, so the MLE can be obtained by maximising the log of the Likelihood. In addition, taking the log turns the product into a sum, which can improve the chance of finding an analytical solution and improve the computational tractability of finding a numerical solution.
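To make the numerical route concrete, here is a minimal sketch of a brute-force search over a one-parameter space. It assumes a Bernoulli model with h successes in n trials, whose log-Likelihood is h*log(p) + (n-h)*log(1-p); the function name mle_grid and the 0.01 grid spacing are choices made for this example only.

```shell
# Hypothetical helper: grid search for the p in {0.01, ..., 0.99} that
# maximises the Bernoulli log-likelihood for h successes in n trials.
mle_grid() {
  awk -v h="$1" -v n="$2" 'BEGIN {
    best = -1
    for (i = 1; i <= 99; i++) {
      p = i / 100
      # Sum of log-probabilities instead of a product of probabilities.
      ll = h * log(p) + (n - h) * log(1 - p)
      if (best < 0 || ll > bestll) { best = p; bestll = ll }
    }
    printf "%.2f\n", best
  }'
}

# 7 successes in 10 trials.
mle_grid 7 10
```

A grid search like this only scales to a handful of parameters, but it shows why the sum form is convenient: each observation simply adds a term.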

In the next post I will summarise the use of the Expectation Maximisation algorithm for situations in which the Likelihood function cannot be solved analytically.