John Hawkins: 2014

Wednesday, December 3, 2014

Programmatic is the Future of All Advertising

If you do not have contact with the world of media and advertising, you might never have heard the term programmatic. Even if you do, it is likely that you have no idea what it means. To put it simply, programmatic refers to a radical overhaul in the way advertising is bought and sold. Specifically, it refers to advertisers (or media agencies) being able to bid on ad positions in real time, enabling them to shift ad spend between publishers at their own discretion.

Understandably, many people in the publishing industry are very nervous about this. Advertising has historically been a very closed game: publishers have demanded high premiums for ad spaces while providing very little justification for them. The digital age has forced them to compete with bloggers, app developers and a multitude of other Internet services that all require ad revenue to survive.

Programmatic began with the rise of text-based search ads. Innovators like DoubleClick homed in on the idea that advertisers should be able to pay for what they want (a click), and that the ad server should be able to calculate expected returns for all possible ad-units by looking at the historical click-through rate for each ad-unit and the amount being bid. The idea soon moved to display advertising, allowing advertisers to bid a fixed cost per thousand ad impressions for display ads (designed for brand awareness more than attracting clicks). Those ideas have now spawned an industry that contains thousands of technology providers and ad networks. Ad space from the biggest publishers in the world down to the most obscure bloggers is now available to be bought and sold across a range of different bidding platforms.
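
As a rough illustration of the expected-return calculation described above, the sketch below ranks ad-units by expected revenue per impression: historical click-through rate multiplied by the amount bid per click. The ad names, rates and bids are made up for the example, and real ad servers use far more elaborate models than this.

```python
# Hypothetical ad-units: (name, historical click-through rate, bid per click in dollars)
ad_units = [
    ("shoes", 0.020, 0.50),
    ("insurance", 0.005, 3.00),
    ("holidays", 0.010, 1.20),
]

def expected_return(ctr, bid_per_click):
    """Expected revenue from showing the ad once: P(click) * price of a click."""
    return ctr * bid_per_click

def rank_ad_units(units):
    """Sort ad-units from highest to lowest expected return per impression."""
    return sorted(units, key=lambda u: expected_return(u[1], u[2]), reverse=True)

ranked = rank_ad_units(ad_units)
for name, ctr, bid in ranked:
    print(f"{name}: {expected_return(ctr, bid):.4f} expected dollars per impression")
```

Note that the ad with the highest click-through rate does not come first here: a rarely clicked ad with a high bid can still be the most valuable one to show.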

Exactly the same thing is starting to happen with digital radio stations and it is coming for billboard displays as well. Some of the new crop of music distribution channels (like Spotify and Pandora) will be rolling out products that allow them to coordinate audio ads and display ads within their apps. Behind the scenes they are developing technology to schedule these things like an online banner ad, and once that happens selling those slots in an online bidding auction that combines audio and display is not far away.

The video ads you see on YouTube and many other publisher sites can already be bought this way using tools like TubeMogul. In the not too distant future people will be watching ads on their televisions that are being placed there by the bidding actions of a media buying specialist. US-based ad tech company Turn is already investigating this possibility. Sure, there will be latency problems: video and audio are large files, so they will need to be uploaded to an ad server close to where they will be delivered. But these technologies are already being developed to cope with the increasing complexity of display ads with rich media capability.

The rise of programmatic advertising is changing what it means to be a media agency. It is no longer sufficient to build monopoly relationships with publishers and then employ a suite of young professionals who build relationships with brands. Instead, media agencies need a room full of a new kind of media geek that specializes in understanding how to buy media on a variety of platforms called Demand Side Platforms (DSPs).

These new divisions within agencies are called trading desks. They are staffed with people whose job it is to understand all the kinds of media that are available to buy, how much you can expect to pay for them, and what kinds of ads will work where. It is a new job, and to be perfectly honest people still have a lot to learn. That learning curve will only get steeper; at the moment they are just buying display ads on desktop and mobile. The ad capabilities of mobile will increase as the field matures, and then they will have to deal with buying video advertising and audio. At some point that will spread beyond just YouTube, first to other online video services, then to smaller niche digital TV channels, then to set-top boxes and cable TV. Finally, broadcast television (if it is still in business) will grudgingly accept that it needs to make its inventory available.

Before any of this happens, most of the advertising on social networks will have become available programmatically. Facebook is making this transition, and Twitter will follow, as will the others. They will each struggle with the balance between keeping their unique ad formats and maximizing the return on their inventory. Everything we have seen in desktop display indicates that this problem can be solved with online auctions, which means fully programmatic social media is coming.

This is an awe-inspiring future for digital advertising. The DSP of the far future will be a tool that is able to buy inventory on desktop, mobile web, mobile apps, radio stations, TV channels, electronic billboards and a suite of social media. Ideally it will contain rules that allow the purchase of media across these channels to be co-ordinated and optimized in real time.

For example, imagine a system with configuration rules that allow TV time to be purchased when Twitter activity for certain key hashtags reaches threshold volumes (an independent way of checking that TV viewership is what the networks claim it is). That could be followed with social media advertising, and with digital billboards the following morning during the daily commute. The possibilities for marketers to test and learn what actually works will be immense.

When you contemplate the possibilities for investigating and improving ROI using this approach to media planning and spending you really need to ask yourself:

Why would anyone buy advertising the old-fashioned way?

Monday, November 10, 2014

Basic Guide to Setting Up Single Node Hadoop 2.5.1 Cluster on Ubuntu

So, you have decided you are interested in big data and data science and exploring what you can do with Hadoop and Map Reduce.

But... you find most of the tutorials too hard to wade through, inconsistent, or you simply encounter problems that you just can't solve. Hadoop is evolving so fast that often the documentation is unable to keep up.

Here I will run you through the process I followed to get the latest version of Hadoop (2.5.1) running so I could use it to test my Map Reduce programs.

You can see the official Apache Docs here.

Part One: Java

You need to make sure you have a compatible version of Java on your machine.

Jump into your terminal and type

java -version

You preferably need an installation of Java 7.
When I run this I get:

java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Part Two: Other Software

You will need ssh and rsync installed. Chances are that they already are, but if not just run:

sudo apt-get install ssh
sudo apt-get install rsync

Part Three: Grab a Release

Head to the Apache Hadoop Releases page, choose a mirror and grab the tarball (.tar.gz). Make sure you do not grab the source file by mistake (src).

Remember: in this walk-through I have grabbed release: 2.5.1

Part Four: Unpack & Configure

Copy the tarball to wherever you want Hadoop to reside. I like to put it in the directory

/usr/local/hadoop

and then extract the contents with

tar -xvf hadoop-2.5.1.tar.gz

Then you will need to do some configuration. Open the file

vi hadoop-2.5.1/etc/hadoop/hadoop-env.sh

You will need to modify the line that currently looks like this
export JAVA_HOME=${JAVA_HOME}

You need to point this to your Java installation. If you are not sure where that is, just run

which java

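and then copy the path (minus the bin/java at the end) into the hadoop config file to replace the text ${JAVA_HOME}. If which java gives you a symlink such as /usr/bin/java, resolve it first with readlink -f $(which java) and use that path instead.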
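
If you would rather not work the path out by hand, the following sketch does the same thing: it looks up java on the PATH, follows any symlinks, and strips the trailing bin/java. The function name java_home_from_binary is made up for this example, and the result is only a candidate value that you should check before pasting it into hadoop-env.sh.

```python
import os
import shutil

def java_home_from_binary(java_path):
    """Resolve symlinks, then strip the trailing bin/java to get a JAVA_HOME candidate."""
    real_path = os.path.realpath(java_path)
    bin_dir = os.path.dirname(real_path)   # the .../bin directory
    return os.path.dirname(bin_dir)        # the parent of bin

java_on_path = shutil.which("java")  # the same lookup that `which java` performs
if java_on_path is None:
    print("java was not found on the PATH")
else:
    print("export JAVA_HOME=" + java_home_from_binary(java_on_path))
```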

Part Five: Test

First run a quick check that you have configured Java correctly. The following command should show you the version of hadoop and its compilation information.

hadoop-2.5.1/bin/hadoop version

Part Six: Run Standalone

The simplest thing you can do with hadoop is run a map reduce job as a stand alone script.

The Apache Docs give a great simple example: grepping a collection of files.

Run these commands:

mkdir input
cp hadoop-2.5.1/etc/hadoop/*.xml input
hadoop-2.5.1/bin/hadoop jar hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'

When hadoop completes that process you can open up the results file and have a look.

vi output/part-r-00000

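You should see a single line for each distinct string that matched the regular expression, along with the number of times it occurred. Try changing the expression and seeing what you get (delete the output directory first, because Hadoop will not overwrite an existing one). Now you can use this installation to test your map reduce jars against Hadoop 2.5.1.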
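
To get a feel for what the grep job computes, here is a small sketch that does the same counting locally without Hadoop. It assumes the job simply counts every distinct string matching the pattern across all input files; the two demo files are made up so that the sketch runs anywhere.

```python
import re
import tempfile
from collections import Counter
from pathlib import Path

def grep_counts(input_dir, pattern):
    """Count every distinct string matching `pattern` across all files in input_dir."""
    regex = re.compile(pattern)
    counts = Counter()
    for path in sorted(Path(input_dir).iterdir()):
        if path.is_file():
            text = path.read_text(errors="replace")
            counts.update(regex.findall(text))  # find the matches and tally them
    return counts

# Tiny made-up input; call grep_counts("input", ...) to compare with the Hadoop run.
with tempfile.TemporaryDirectory() as demo_dir:
    Path(demo_dir, "a.xml").write_text("dfs.replication dfs.name.dir")
    Path(demo_dir, "b.xml").write_text("dfs.replication")
    demo_counts = grep_counts(demo_dir, r"dfs[a-z.]+")

for match, count in demo_counts.most_common():
    print(f"{count}\t{match}")
```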

Coming Next: Running Hadoop 2.5.1 in Pseudo Distributed Mode

Sunday, November 9, 2014

Wittgenstein's Beetle Book Review

Wittgenstein's Beetle by Martin Cohen

Summary: Very disappointing.

What could have been a great primer on one of the essential tools of philosophy is held back by the author's mediocre understanding of many of the issues he discusses. The prime example is the 'thought experiment' by Wittgenstein that serves as the name of the book. Wittgenstein held that the idea of private language was incoherent because languages were games played between people. His beetle experiment was designed to make this idea concrete by proposing a world in which we all owned a private box containing a beetle. Mr Cohen provides a direct quote from Wittgenstein's Investigations in which he (Wittgenstein) clearly states that the word beetle, if used in such a society, could not be referring to the thing in the box. Mr Cohen then turns around and tells us that the point of Wittgenstein's experiment is to show that we assume that because we use the same word as other people we are talking about the same thing. This is not what Wittgenstein said, as the quoted text makes clear.

To make matters worse, Mr Cohen returns to pick on Wittgenstein's Beetle at the end of the book as an example of a poorly done thought experiment. It fails to meet several of Mr Cohen's criteria for successful thought experiments. One needs to note that it is Mr Cohen who has massaged the definition of a thought experiment to get Wittgenstein's beetle in, and then he criticises its performance, all the while failing to understand it.

I am not going to mention the numerous fallacies the author pens on many topics of science, and his horrendous attempts at jokes. The only reason I am giving the book 2 stars is because the discussion of Searle's Chinese room argument is excellent. Read this chapter and then throw the book away.

Saturday, November 1, 2014

Appcelerator Titanium Android Woes on Mac OSX

I have been having ongoing problems getting Appcelerator to build and install Android Apps again.

The very first time I built an Android App it took me some time to get the configuration right. Now that I have been through system upgrades I seem to have come back to step one again. As before, the official Appcelerator Guide helps me refresh how to get the device itself configured. However, it will not prepare you for the grand cluster of configuration issues you will face getting all the toys to play nicely together.

Problem

Appcelerator does not recognize your Android device, even though you can see it listed when you run adb devices.

Solution

I still don't have a solution for this (most people suggest uninstalling everything and starting again, which to my mind constitutes giving up, not solving it). I do have a workaround though: build the app without installing it and then use adb to install it independently. This definitely works in the absence of a better solution.

To build

Try the command titanium build,
- or -
Just use the distribute app dialog in Titanium Studio.
You can generate a signed APK easily this way.

To install

Just use the adb command line utility:

   adb install ../Desktop/MyApp.apk

Problem solved... sort of.

Problem

adb does not even recognize your Android device.
This seems to happen randomly, depending on what I had for breakfast.

Solution

I generally find this requires a little fiddling around. This particular combination is currently working for me:
1) Unplug your device.
2) Kill the adb server (adb kill-server).
3) Plug your device back in.
4) Run adb devices.
This seems to kickstart the adb server in such a way that it correctly finds the attached devices.

Problem

Your Android app almost builds an APK but red errors flash up at the end. Appcelerator tells you it was built but there is nothing in the build directory. You see a bunch of uninformative Python error messages referring to problems with the file builder.py, for example:

line 2528, in <module>
[ERROR]     builder.build_and_run(False, avd_id, debugger_host=debugger_host, profiler_host=profiler_host)

For me it turned out that some executables had been moved around between distributions of the Android SDK.

The problem is outlined in this note from the Appcelerator forums, which fixed it for me.

Solution

Create symlinks to aapt and dx in /Applications/Android-sdk/platform-tools:

ln -s /Applications/Android-sdk/build-tools/17.0.0/aapt aapt

ln -s /Applications/Android-sdk/build-tools/17.0.0/dx dx

Friday, October 3, 2014

Logistic Regression with R

Logistic Regression

Regression is performed when you want to produce a function that will predict the value of something you don't know (the dependent variable) on the basis of a collection of things you do know (the independent variables).

The problem is that regression is typically done with a linear function and very few real world processes are linear. Hence, a great deal of statistics and machine learning research concerns methods for fitting non-linear functions while controlling for the explosion in complexity that comes with them.

Logistic Regression is one of the methods that tries to solve this problem. In particular, Logistic Regression produces an output between 0 and 1 which can be interpreted as the probability of your target event happening.

Let's look at the form of Logistic Regression to get a better understanding:

You start with the goal of a function that approximates the probability of the target T for any input vector X:

p(T) = F(X)

In order to ensure that F(X) takes the form of a valid probability (i.e. always between 0 and 1) we make use of the logistic function 1/(1+e^-K). If K is a big positive number then e^-K approaches 0 and hence the output of the logistic function approaches 1. If on the other hand K is a big negative number then e^-K becomes very large and the output of the logistic function approaches 0.
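
A quick numerical check makes this behaviour concrete. The sketch below evaluates the logistic function at a few arbitrary values of K; the two-branch form is only there to avoid overflow when K is a very large negative number.

```python
import math

def logistic(k):
    """The logistic function 1 / (1 + e^-k), written to avoid overflow for large |k|."""
    if k >= 0:
        return 1.0 / (1.0 + math.exp(-k))
    exp_k = math.exp(k)          # safe: k is negative, so this is at most 1
    return exp_k / (1.0 + exp_k)  # algebraically the same as 1 / (1 + e^-k)

for k in (-10, -1, 0, 1, 10):
    print(f"K = {k:>3}  ->  {logistic(k):.5f}")
```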

So we are fitting the following function:

p(T) = 1 / [ 1 + e^-g(X) ]

We have added the function g(X) to afford us some flexibility in how we feed the input vector X into the logistic function. Here is where we place our usual linear regression function. We say

g(X) = B_0 + B_1 * X_1 + B_2 * X_2 + ...... + B_N * X_N

i.e. a linear function over all the dimensions of the input X.

Now, in order to perform our linear regression, we need to transform the function definition. You can do the transformation yourself if you like. With some re-arrangement you will find that the function g(X) is equal to:

g(X) = - ln [ (1-p) / p ]

And by exploiting the properties of the logarithm you can further re-arrange to get the log odds (also known as the logit).

g(X) = ln [ p / (1-p) ]
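
If you want to see the steps: starting from p = 1 / [ 1 + e^-g ], take the reciprocal to get 1/p = 1 + e^-g, so e^-g = (1-p)/p, and taking logarithms gives -g = ln [ (1-p) / p ]. The sketch below checks numerically that the log odds really do undo the logistic function, using a few arbitrary values of g.

```python
import math

def logistic(g):
    """p = 1 / (1 + e^-g)"""
    return 1.0 / (1.0 + math.exp(-g))

def log_odds(p):
    """g = ln[p / (1 - p)], defined only for 0 < p < 1."""
    return math.log(p / (1.0 - p))

for g in (-3.0, -0.5, 0.0, 2.0):
    p = logistic(g)
    print(f"g = {g:>4}  p = {p:.4f}  log odds of p = {log_odds(p):.4f}")
```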

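An astute reader might notice a problem. For a target value of 1 (i.e. p=1) the fraction is undefined, and for a target value of 0 the logarithm is undefined. Using the properties of the logarithm again, the log odds can be written as

ln [ p / (1-p) ] = ln(p) - ln(1-p)

...and this is the quantity that the linear function is fitted to. Rewriting it this way does not remove the problem at p=0 and p=1, which is why in practice the parameters are estimated by maximum likelihood on the observed outcomes rather than by regressing directly on transformed targets.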

In other words you fit the value of the parameters (the Bs) so that

B_0 + B_1*X_1 + B_2*X_2 + ...... + B_N*X_N = ln(p) - ln(1-p)

That is all well and good, you might say, but how can we do that with R?

Well, I have gone ahead and converted some code from a bunch of different tutorials into a little R workbook that will take you through applied Logistic Regression in R. You can find the Logistic Regression Code Example in my GitHub account right here.

It all boils down to using R's Generalised Linear Model function, glm, with the binomial family.

This R function will fit your Logistic Regression for you.
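
To show what such a fit involves, here is a minimal sketch in Python rather than R. It fits a one-variable model by plain gradient ascent on the log-likelihood, which is a simpler stand-in for the fitting routine glm uses internally, not a copy of it. The data, learning rate and step count are all made up for illustration.

```python
import math

def logistic(k):
    return 1.0 / (1.0 + math.exp(-k))

def fit_logistic(xs, ys, learning_rate=0.1, steps=5000):
    """Fit p(T) = logistic(b0 + b1 * x) by gradient ascent on the log-likelihood."""
    b0, b1 = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the average log-likelihood: mean of (observed - predicted) * input
        grad0 = sum(y - logistic(b0 + b1 * x) for x, y in zip(xs, ys)) / n
        grad1 = sum((y - logistic(b0 + b1 * x)) * x for x, y in zip(xs, ys)) / n
        b0 += learning_rate * grad0
        b1 += learning_rate * grad1
    return b0, b1

# Made-up data: the target becomes more likely as x grows, with some overlap.
xs = [0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0]
ys = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

b0, b1 = fit_logistic(xs, ys)
print(f"B_0 = {b0:.3f}, B_1 = {b1:.3f}")
print(f"p(T | x = 1) = {logistic(b0 + b1 * 1.0):.3f}")
print(f"p(T | x = 5) = {logistic(b0 + b1 * 5.0):.3f}")
```

The overlap in the data matters: if the two classes could be separated perfectly, the fitted parameters would keep growing with more steps instead of settling down.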

If you follow that code example to the end you will get a plot like the one below, which shows you the original data in green, the model fitted to that data in black, and some predictions for unseen parts of the input space in red.

Logistic Regression allows you a great deal of flexibility in your model. The parameterized linear model can be changed however you want by adding or removing independent variables. You can even add higher order combinations of the independent variables.

A common Machine Learning process is to experiment with different forms of this model and examine how the statistical significance of the fit changes.

Just be wary of the pernicious problem of over-fitting.

Friday, May 9, 2014

Copyright on APIs is a bad idea.

If you have not heard yet, there has been a change in the Google Vs Oracle case. The original ruling that an API could not be copyrighted has been reversed. In essence it means that a company can release a description of a set of functions that they provide for developers, and no one is allowed to create an alternative implementation of those functions without permission.

To understand what this means to the future of software engineering you need to understand two things.

1) APIs are not complex pieces of code (which are and should be subject to copyright). They are very simple descriptions of what a piece of code will do and how to make it do it.

In essence a single API is just one word (the name of the function) and a list of pieces of data that should be given to it. It then specifies what will happen and the data that will be returned. An API is not its implementation; it is a high level description of what an implementation should do.

It is equivalent to me copyrighting the sentence

"I am going to the shop, do you want anything?"

When it is combined with the reply

"Yes, some milk."

It is really that simple. Imagine if novelists needed to pay a fee when they used that combination of sentences. Of course they could use "I will go to the shop, do you need me to get something?" Or whatever other variant they need to produce in order to avoid infringing. But suggesting those sidesteps misses the point of copyright. Such small atomic combinations of the basic elements of a language are not significant pieces of work. They are not what copyright laws are designed to protect.

2) Secondly you need to understand the purpose of APIs. They exist so that software programs are easier to write and easier to make communicate with each other. Their purpose is to let one programmer know how to interface with software written by someone else, someone they may have never met, and yet have it function perfectly. The API is a simple contract that says if you want my code to do this, this is how you make it happen.

Another advantage of APIs (well used by software developers everywhere) is that if there are multiple competing programs that do the same thing, then if they all use the same API a software developer can switch between them (almost) effortlessly.
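
A small sketch may help make both points concrete. Everything here is hypothetical: KeyValueStore stands in for the API (names, inputs and outputs only), while DictStore and ListStore are two independent implementations of it that client code can swap freely.

```python
from typing import Protocol

class KeyValueStore(Protocol):
    """The 'API': just names, inputs and outputs, with no implementation."""
    def put(self, key: str, value: str) -> None: ...
    def get(self, key: str) -> str: ...

class DictStore:
    """One vendor's implementation, backed by a dictionary."""
    def __init__(self) -> None:
        self._data = {}
    def put(self, key: str, value: str) -> None:
        self._data[key] = value
    def get(self, key: str) -> str:
        return self._data[key]

class ListStore:
    """A competing implementation with completely different internals."""
    def __init__(self) -> None:
        self._pairs = []
    def put(self, key: str, value: str) -> None:
        self._pairs = [(k, v) for k, v in self._pairs if k != key] + [(key, value)]
    def get(self, key: str) -> str:
        for k, v in self._pairs:
            if k == key:
                return v
        raise KeyError(key)

def remember_shopping(store: KeyValueStore) -> str:
    """Client code written against the API only; it never names an implementation."""
    store.put("shop", "milk")
    return store.get("shop")

for implementation in (DictStore(), ListStore()):
    print(type(implementation).__name__, "->", remember_shopping(implementation))
```

The part that would fall under an API copyright is the handful of lines in KeyValueStore, not the work that went into either implementation.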

If you are ever frustrated by software not working, Internet sites being unable to perform some task, or apps not working on your phone, then I have some bad news for you. If copyrightable APIs become the legal norm, then everything will get much, much worse. Start-up companies and device manufacturers alike will need to protect themselves by ensuring that their APIs are unique and not infringing anyone else's copyright. In order to make software that is compatible with something else there will need to be long term financial agreements in place. This will mean that the number of things (devices and programs) that just work together will begin to decrease.

The economic impact is the creation of significant barriers to entry for new technology companies, for the simple reason that you cannot create some great new product that will work with products people already have without infringing copyright. Consequently many technical product possibilities will not be explored because of their legal risk. In general copyright on APIs will result in an overall reduction in the pace of innovation.

To you as a consumer it will mean fewer things will just work together out of the box. It will mean that if you want devices and software to work with each other, then you will need to buy them all from the same vendor. This will be good for the large incumbents in the market place, but for consumers it is very bad.

The sad truth is that if this ruling is upheld you can look forward to less choice and less functionality in your digital world.

Monday, March 3, 2014

The Relative Proportion of Factors of Odd Versus Even Numbers

As I was riding home from work today I was thinking about odd and even numbers.

It is a funny thing that the product of two even numbers must be an even number, while the product of an even number with an odd number must also be even. Only two odd numbers will always give an odd number when multiplied.

If you don't believe me, think about what makes a number odd or even: it is whether there is a remainder of one after you divide by two. When you multiply an odd by an odd, it is the same as multiplying the first odd number by the second number minus one, and then adding the first number. The first operation must give you an even number (odd times even), so adding an odd number to it must give you an odd number.

This tells us some interesting things. Firstly, only even numbers can have factors that are both odd and even. Odd numbers will only ever have odd factors.

It also means that if you take two random whole numbers, each equally likely to be odd or even, then the probability of the product being odd is just 1/4. The reason is that there are 4 equally likely combinations when you draw two such numbers: odd and odd, odd and even, even and odd, even and even. Only one of those 4 combinations can produce an odd product.
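
In symbols, two odd numbers can be written as 2a+1 and 2b+1, and their product is (2a+1)(2b+1) = 2(2ab + a + b) + 1, which leaves a remainder of one when divided by two. The sketch below checks both claims by brute force over every pair of numbers from 1 to 50 (an arbitrary range chosen because it contains as many odd numbers as even ones).

```python
from itertools import product

def is_odd(n):
    return n % 2 == 1

# Check the parity rule on every pair of positive integers up to a small limit.
limit = 50
pairs = list(product(range(1, limit + 1), repeat=2))
rule_holds = all(is_odd(a * b) == (is_odd(a) and is_odd(b)) for a, b in pairs)

# The range 1..50 has as many odd as even numbers, so each parity is equally likely.
odd_products = sum(1 for a, b in pairs if is_odd(a * b))
fraction_odd = odd_products / len(pairs)

print("product is odd only when both factors are odd:", rule_holds)
print("fraction of pairs with an odd product:", fraction_odd)
```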

This result could also mean that in general even numbers have more factors than odd numbers. I don't have an argument for it, but it seems to me to be the kind of thing for which there might be a formal proof; perhaps I was even shown it and have forgotten. If you know of one, please point it out in the comments.
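
One possible line of argument, offered as a sketch rather than a formal proof: write an even number as n = 2^k * m, where m is odd and k is at least 1. Every factor of n is an odd factor of m multiplied by one of 1, 2, 4, ..., 2^k, so n has (k + 1) times as many factors as m. An even number therefore has at least twice as many factors as its odd part. The code below checks that relationship and compares the average number of factors of odd and even numbers up to an arbitrary limit.

```python
def count_factors(n):
    """Count the positive divisors of n by trial division."""
    count = 0
    d = 1
    while d * d <= n:
        if n % d == 0:
            count += 1 if d * d == n else 2  # d and n // d, counted once if equal
        d += 1
    return count

def split_power_of_two(n):
    """Return (k, m) with n == 2**k * m and m odd."""
    k = 0
    while n % 2 == 0:
        n //= 2
        k += 1
    return k, n

limit = 1000  # arbitrary cut-off for the experiment
odd_counts = [count_factors(n) for n in range(1, limit + 1, 2)]
even_counts = [count_factors(n) for n in range(2, limit + 1, 2)]
average_odd = sum(odd_counts) / len(odd_counts)
average_even = sum(even_counts) / len(even_counts)

# Check the (k + 1) * factors(m) relationship for every number in range.
relation_holds = all(
    count_factors(n) == (k + 1) * count_factors(m)
    for n in range(1, limit + 1)
    for k, m in [split_power_of_two(n)]
)

print(f"average factors, odd numbers up to {limit}:  {average_odd:.2f}")
print(f"average factors, even numbers up to {limit}: {average_even:.2f}")
print("factors(n) == (k + 1) * factors(m) for all n:", relation_holds)
```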

Anyway, these thoughts passed the time as I rode home today and helped me clear my mind of other things. Who would have thought that amateur number theory could be so satisfying?