tag:blogger.com,1999:blog-78911601469334254022024-03-14T01:13:23.794-07:00John HawkinsData Science & App DevAnonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.comBlogger33125tag:blogger.com,1999:blog-7891160146933425402.post-18623106323734793202017-09-26T23:13:00.000-07:002017-10-02T19:33:33.146-07:00Mean Imputation in Apache Spark<div dir="ltr" style="text-align: left;" trbidi="on">
If you are interested in building predictive models on Big Data, then there is a good chance you are looking to use Apache Spark, either with MLlib or with one of the growing number of machine learning extensions built to work with Spark, such as <a href="https://github.com/maxpumperla/elephas">Elephas, which lets you use Keras and Spark together</a>.<br />
<div>
<br /></div>
<div>
For all its flexibility, you will find certain things about working with Spark a little tedious.</div>
<div>
<br /></div>
<div>
For one thing, you will have to write a lot of boilerplate code to clean and prepare features before training models. MLlib only accepts numerical data, and will not handle NULL values. There are some great library functions for string indexing and one-hot encoding, but you still need to apply all of these functions explicitly yourself. Compared with using a product like <a href="http://h2o.ai/">H2O</a>, which does all this via configuration options on the model, writing this code can be a chore.</div>
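To give a sense of that boilerplate, here is a minimal sketch of indexing and then one-hot encoding a single string column. The column names are placeholders, and this assumes a Spark 2.x DataFrame called df is already in scope:

```scala
import org.apache.spark.ml.feature.{StringIndexer, OneHotEncoder}

// "category" is a hypothetical string column in df
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("category_idx")

val encoder = new OneHotEncoder()
  .setInputCol("category_idx")
  .setOutputCol("category_vec")

// StringIndexer is an Estimator (needs fit); OneHotEncoder here is a plain Transformer
val indexed = indexer.fit(df).transform(df)
val encoded = encoder.transform(indexed)
```

Multiply that by every categorical column in your data set and you can see where the tedium comes from.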
<div>
<br /></div>
<div>
As frustrating as that is, the flexibility of building reusable machine learning pipelines, in which your feature generation becomes part of the meta-parameters, will make it all worthwhile in the end.</div>
<div>
<br /></div>
<div>
To this end, here is a little bit of reusable code that will help with one common data preparation task: mean imputation of missing values.</div>
<div>
<br /></div>
<div>
<br /></div>
<div>
If you are lucky enough to be on Spark 2.2.0 or later, then there is a library transformer you can use to do this. The following was lifted directly from <a href="https://stackoverflow.com/questions/40057563/replace-missing-values-with-mean-spark-dataframe">this StackOverflow post</a>:</div>
<div>
<br /></div>
<div>
<div>
<pre class="codeblock">import org.apache.spark.ml.feature._
def imputeMeans(df: org.apache.spark.sql.DataFrame, feet: Array[String]): (org.apache.spark.sql.DataFrame, Array[String]) = {
val outcols = feet.map(c => s"${c}_imputed")
val imputer = new Imputer().setInputCols(feet).setOutputCols(outcols).setStrategy("mean")
val resdf = imputer.fit(df).transform(df)
(resdf, outcols)
}
</pre>
</div>
</div>
<div>
<br /></div>
<div>
This should be fairly self-explanatory: you pass the function a DataFrame and an array of column names, and you get back a new DataFrame with additional columns whose names carry the '_imputed' suffix.</div>
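For example (the column names here are hypothetical), a call looks like this:

```scala
// Returns the augmented DataFrame and the names of the new columns
val (imputedDf, imputedCols) = imputeMeans(df, Array("age", "income"))
imputedDf.select(imputedCols.map(imputedDf(_)): _*).show
```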
<div>
<br /></div>
<div>
<br /></div>
<div>
If you are on an older version of Spark, then you can do the following:</div>
<div>
<br /></div>
<pre class="codeblock">import org.apache.spark.sql.functions.mean

def imputeToMean(df: org.apache.spark.sql.DataFrame, col: String): org.apache.spark.sql.DataFrame = {
val meanVal = df.select(mean(df(col))).collectAsList().get(0).getDouble(0)
df.na.fill(meanVal, Seq(col))
}
def imputeMeans(df: org.apache.spark.sql.DataFrame, cols: Array[String]): org.apache.spark.sql.DataFrame = {
cols.foldLeft(df)( (dfin, col) => imputeToMean(dfin, col) )
}
</pre>
<div>
<br /></div>
<div>
Again the imputeMeans function takes a DataFrame and an array of column names. This time it just returns a new DataFrame in which the required columns have had their NULLs replaced with the column mean. This version can be time-consuming to run, so I suggest you cache the result once it is done.</div>
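For example (again with hypothetical column names), calling it and caching the result looks like:

```scala
val cleaned = imputeMeans(df, Array("age", "income")).cache()
cleaned.count()  // an action forces evaluation, so the cached result is materialized
```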
<div>
<br /></div>
<div>
Hope that helps.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com6tag:blogger.com,1999:blog-7891160146933425402.post-54959799264067800182017-09-06T21:08:00.001-07:002017-09-06T21:10:59.997-07:00Apache Spark - Cumulative Aggregations over Groups using Window Functions<div dir="ltr" style="text-align: left;" trbidi="on">
Many of the tasks you want to perform over a data set amount to some form of aggregation over groups, but done in a way that takes some kind of ordering into consideration.<br />
<div>
<br /></div>
<div>
For example, calculating the cumulative total sales per salesperson over the course of each financial year: you want the data grouped by salesperson and financial year, then transformed to calculate the cumulative total up to and including each sale.</div>
<div>
<br /></div>
<div>
These kinds of functions are possible in SQL, but they are complicated and difficult to read. Apache Spark, by comparison, has a very elegant and easy-to-use API for generating these kinds of results.</div>
<div>
<br /></div>
<div>
Here is a simple example in which we want the average sales per salesperson, leading up to and including their current sale.</div>
<div>
<br />
First let's create a dummy data set and look at it. (In reality you will usually be loading an enormous set from your cluster, but this lets us experiment with the API.)</div>
<div>
<br /></div>
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">val redy = sc.parallelize(
Seq((201601, "Jane", 10), (201602, "Tim", 20), (201603, "Jane", 30),
(201604, "Tim", 40), (201605, "Jane", 50), (201606, "Tim", 60),
(201607, "Jane", 70), (201608, "Jane", 80), (201609, "Tim", 90),
(201610, "Tim", 100), (201611, "Jane", 110), (201612, "Tim", 120)
)
)
case class X(id: Int, name: String, sales: Int)
val redy2 = redy.map( in => X(in._1, in._2, in._3) )
val df = sqlContext.createDataFrame(redy2)
df.show
</pre>
</div>
<div>
<br />
...and this is what it looks like:</div>
<div>
<br /></div>
<pre style="line-height: 125%; margin: 0;">+------+----+-----+
| id|name|sales|
+------+----+-----+
|201601|Jane| 10|
|201602| Tim| 20|
|201603|Jane| 30|
|201604| Tim| 40|
|201605|Jane| 50|
|201606| Tim| 60|
|201607|Jane| 70|
|201608|Jane| 80|
|201609| Tim| 90|
|201610| Tim| 100|
|201611|Jane| 110|
|201612| Tim| 120|
+------+----+-----+
</pre>
<div>
<br />
To add our additional column with the aggregation over the grouped and ordered data, it is as simple as:</div>
<div>
<br /></div>
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.avg

df.withColumn("avg_sales", avg(df("sales"))
       .over( Window.partitionBy("name").orderBy("id") )
  ).show
</pre>
</div>
<div>
<br />
<br />
...which will produce the following output:<br />
<br />
<br /></div>
<pre style="line-height: 125%; margin: 0;">+------+----+-----+------------------+
| id|name|sales| avg_sales|
+------+----+-----+------------------+
|201602| Tim| 20| 20.0|
|201604| Tim| 40| 30.0|
|201606| Tim| 60| 40.0|
|201609| Tim| 90| 52.5|
|201610| Tim| 100| 62.0|
|201612| Tim| 120| 71.66666666666667|
|201601|Jane| 10| 10.0|
|201603|Jane| 30| 20.0|
|201605|Jane| 50| 30.0|
|201607|Jane| 70| 40.0|
|201608|Jane| 80| 48.0|
|201611|Jane| 110|58.333333333333336|
+------+----+-----+------------------+
</pre>
<div>
<br />
<br />
Voila. An additional column containing the average sales per salesperson leading up to and including the current sale. You can modify this to change the aggregation function, or add additional columns to the grouping or the ordering. It is clean, readable, and fast.<br />
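To get the cumulative total from the opening example, rather than the running average, you only need to swap the aggregation function:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

// Cumulative total sales per salesperson, ordered by period
df.withColumn("total_sales", sum(df("sales"))
  .over( Window.partitionBy("name").orderBy("id") )
).show
```

The default window frame for an ordered window runs from the start of the partition to the current row, which is exactly what a cumulative total needs.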
<br />
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com2tag:blogger.com,1999:blog-7891160146933425402.post-92108257334299585462017-05-23T22:08:00.001-07:002017-05-23T22:08:11.910-07:00Inspecting Changed Files in a Git Commit<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
When you are getting set up with Git for data science work, you will likely have some teething issues getting your process and flow right.<br />
<br />
A common one: you have a Git repo that controls your production code, but you make config changes in the prod checkout and code changes in your working copy. If you have not graduated to pull requests from dev branches, then you might find yourself unable to push your changes without first merging in what is on prod.<br />
<br />
Here are some small things that are useful in understanding what has changed before you flick that switch.<br />
<br />
First, look at the commits that are not in your current working version:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">git log HEAD..origin/master
</pre>
</div>
<br />
This will give you something like:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">commit 30d1cb6a3564e09753078288a9317f1b2309ab81
Author: John Hawkins <johawkins@blahblahblah.com>
Date: Tue May 23 10:03:10 2017 +1000
clean up
commit d69a1765bcfa4257573707e9af2bb014c412d7d8
Author: Added on 2016-12-06i <johawkins@blahblahblah.com>
Date: Tue May 16 05:39:14 2017 +0000
adding deployment notes
</pre>
</div>
<br />
You can then inspect the changes in the individual commits using:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;"> git show --pretty="" --name-only d69a1765bcfa4257573707e9af2bb014c412d7d8
</pre>
</div>
<br />
This will give you something like:<br />
<br />
<div style="background: #ffffff; border-width: 0.1em 0.1em 0.1em 0.8em; border: solid gray; overflow: auto; padding: 0.2em 0.6em; width: auto;">
<pre style="line-height: 125%; margin: 0;">commit d69a1765bcfa4257573707e9af2bb014c412d7d8
Author: Added on 2016-12-06i <johawkins@blahblahblah.com>
Date: Tue May 16 05:39:14 2017 +0000
adding deployment notes
drill_udfs/README.md
</pre>
</div>
<br />
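If you want to see the actual content changes rather than just the file names, the same commit range works with git diff. The file path below is just the one from the example output above:

```shell
# Full diff between your checkout and the remote branch
git diff HEAD..origin/master

# Restrict the diff to a single file
git diff HEAD..origin/master -- drill_udfs/README.md
```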
Note that git show lists the changed files at the end, which is often the most important thing to know. At least this helps me understand what I am bringing in.</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-57218524727948344762016-02-02T02:23:00.002-08:002016-02-02T02:23:21.128-08:00Creating PEM files for an APNS Push Notification Server<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<br />
If you are curious enough to want to build your own push notification backend system, then you will come to a point in time when you need to get the certificates that let your server communicate with the Apple APNS service. When you start searching the web for tutorials on this subject you will find lots of out-of-date material showing screenshots from previous versions of iTunes Connect, Xcode or the Keychain software.<br />
<br />
As of right now, the only accurate account I have found is this <a href="http://stackoverflow.com/questions/21250510/generate-pem-file-used-to-setup-apple-push-notification">excellent Stack Overflow post</a>.<br />
<br />
Hope that helps save someone some time.</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com1tag:blogger.com,1999:blog-7891160146933425402.post-90754555579800871392015-07-04T21:37:00.001-07:002017-05-23T22:19:15.227-07:00Development Process with Git<div dir="ltr" style="text-align: left;" trbidi="on">
I have been using various version control tools for years, however it has taken me a long time to make version control a core part of my work process. All of the ideas in this post come from other people... so as usual I am indebted to my colleagues for making me a more productive coder.<br />
<br />
I now generally create a repo on the remote server, and then push the initial commit. This requires running the following on your local machine:<br />
<br />
<pre class=" language-bash"><code class=" language-bash"><code class=" language-bash">git init </code></code></pre>
<pre class=" language-bash"><code class=" language-bash"><code class=" language-bash"></code>git add <span class="token operator">*</span>
git commit <span class="token operator">-</span>m <span class="token string">"Initial Commit"</span>
git remote add origin git@remote<span class="token punctuation">.</span>com<span class="token punctuation">:</span>project<span class="token punctuation">.</span>git
git push <span class="token operator">-</span>u origin master</code></pre>
<br />
<br />
If you want other people to work with you, then they can now clone the project.<br />
<br />
<pre class=" language-bash"><code class=" language-bash">git clone git@remote<span class="token punctuation">.</span>com<span class="token punctuation">:</span>project<span class="token punctuation">.</span>git</code></pre>
<br />
Now for the interesting part: committing something into the repo when your branch has diverged from master. You can of course just run git pull and it will merge the results, unless there are drastic conflicts you need to resolve. But this creates a non-linear commit lineage: you cannot read the list of commits as one single history. To get a linear commit history you need to do the following.<br />
<br />
<pre class=" language-bash"><code class=" language-bash"><code class=" language-bash">git fetch </code></code></pre>
<pre class=" language-bash"><code class=" language-bash"><code class=" language-bash"></code>git rebase<span class="token operator"></span></code></pre>
<br />
This will rewind your local commits back to the point where your branch diverged, apply the changes from the master branch, and then replay your sequence of changes on top. Voila, a linear commit history.<br />
<br />
There are a number of other key ideas in effectively using git.<br />
<br />
<h3 style="text-align: left;">
Tag your releases</h3>
This is pretty simple: every time I push a major piece of code to production, release an app on the App Store, etc., I make sure to tag the contents of the repository, so that the release can be recovered with minimal stuffing around.<br />
<br />
<pre data-code-language="console" data-type="programlisting">git tag -a v1.2 -m <code class="s1">'Version 1.2 - Better than Version 1.1'</code></pre>
<br />
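One thing to remember is that tags are not pushed to the remote by default, so after tagging you need to push them explicitly (this assumes your remote is named origin, as in the setup above):

```shell
# Push a single tag to the remote
git push origin v1.2

# ...or push all local tags at once
git push origin --tags
```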
<h3 style="text-align: left;">
Create Branches</h3>
<br />
Branching can be a scary experience the first time you do it. However, if you want to do a major refactoring of code that is in production, and that code needs to be supported while the changes are being made, then this is the only way to do it. In fact, I don't know how people did this before tools like git.<br />
<br />
Create a branch and check it out in a single command<br />
<br />
<pre data-code-language="console" data-type="programlisting">git checkout -b refactor</pre>
Go back to master if you need to patch something quickly<br />
<br />
<pre data-code-language="console" data-type="programlisting">git checkout master</pre>
<br />
When you are done go back to your branch<br />
<br />
<pre data-code-language="console" data-type="programlisting">git checkout refactor</pre>
<br />
If you changed things in the master that you need inside your refactorization work, then merge master into it:<br />
<br />
<code>git merge master</code> <br />
<br />
Once you are done with all of your changes, run your tests etc., then you can merge that branch back into your production code.<br />
<br />
<pre data-code-language="console" data-type="programlisting">git checkout master</pre>
<pre data-code-language="console" data-type="programlisting">git merge refactor</pre>
Once it is all merged and deployed you don't need the branch any more, so delete it.<br />
<br />
<pre data-code-language="console" data-type="programlisting">git branch -d refactor</pre>
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-34665707106642891762014-12-03T22:40:00.004-08:002014-12-03T22:40:33.551-08:00Programmatic is the Future of All Advertising<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Y5_SXrp80xriHWIF2XPMXaW_96IO1INlJ643nc2p3V-fyVamSahW5uKL6FYQKnS5ziOh6e7cHuCxLgx-_dQDnmAF2MIYDqL3ZeLqzi-EYJ2id9qVqy_nT5ngiPFNc-DXV8gGsxPDMnn2/s1600/Programmatic-Is-The-Future-of-Advertising.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj_Y5_SXrp80xriHWIF2XPMXaW_96IO1INlJ643nc2p3V-fyVamSahW5uKL6FYQKnS5ziOh6e7cHuCxLgx-_dQDnmAF2MIYDqL3ZeLqzi-EYJ2id9qVqy_nT5ngiPFNc-DXV8gGsxPDMnn2/s1600/Programmatic-Is-The-Future-of-Advertising.jpg" height="223" width="320" /></a></div>
<br />
<br />
If you do not have contact with the world of media and advertising, you might never have heard the term <i>programmatic</i>. Even if you do, it is likely that you have <a href="http://www.imediaconnection.com/content/37419.asp">no idea what it means</a>. To put it simply, programmatic refers to a radical overhaul in the way advertising is bought and sold. Specifically, it refers to advertisers (or media agencies) being able to bid on ad positions in real time, enabling them to shift ad spend around between publishers at their own discretion.<br />
<br />
Understandably many people in the publishing industry are very nervous about this. Advertising has historically been a very closed game, publishers have demanded high premiums for ad spaces that they could provide very little justification for. The digital age has forced them to compete with bloggers, app developers and a multitude of other Internet services that all require ad revenue to survive.<br />
<br />
Programmatic began with the rise of text-based search ads. Innovators like DoubleClick homed in on the idea that advertisers should be able to pay for what they want (a click), and that the ad server should calculate expected returns for every possible ad unit by looking at its historical click-through rate and the amount being bid. The idea soon moved to display advertising, allowing advertisers to bid a fixed cost per thousand impressions for display ads (designed for brand awareness more than attracting clicks). Those ideas have now spawned an industry containing thousands of technology providers and ad networks. All the ad space, from the biggest publishers in the world down to the most obscure bloggers, is available to be bought and sold across a range of bidding platforms.<br />
<br />
Exactly the same thing is starting to happen with <a href="http://www.adexchanger.com/data-driven-thinking/programmatic-a-big-part-of-the-future-of-radio/">digital radio stations</a> and it is coming for <a href="http://www.adnews.com.au/adnews/sydney-to-get-big-digital-billboards">billboard displays as well</a>. Some of the new crop of music distribution channels (like Spotify and Pandora) will be rolling out products that allow them to coordinate audio ads and display ads within their apps. Behind the scenes they are developing technology to schedule these things like an online banner ad, and once that happens, selling those slots in an online bidding auction that combines audio and display is not far away.<br />
<br />
The video ads you see on YouTube and many other publisher sites can already be bought this way using tools like <a href="http://www.tubemogul.com/">TubeMogul</a>. In the not too distant future, people will be watching ads on their televisions that have been placed there by the bidding actions of a media buying specialist. US-based ad tech company Turn is <a href="http://www.adweek.com/news/technology/programmatic-revolution-will-be-televised-157534">already investigating this possibility</a>. Sure, there will be latency problems: video and audio are large files, so they will need to be uploaded to an ad server close to where they will be delivered. But these technologies are already being developed to cope with the increasing complexity of display ads with rich media capability.<br />
<br />
The rise of programmatic advertising is changing what it means to be a media agency. It is no longer sufficient to build monopoly relationships with publishers and then employ a suite of young professionals who build relationships with brands. Instead, media agencies need a room full of a new kind of media geek, one who specializes in understanding how to buy media on a variety of platforms called Demand Side Platforms (DSPs).<br />
<br />
These new divisions within agencies are called trading desks. They are staffed with people whose job it is to understand all the kinds of media that are available to buy, how much you can expect to pay for it, and what kinds of ads will work where. It is a new job, and to be perfectly honest people still have a lot to learn. That learning curve will only increase: at the moment they are just buying display ads on desktop and mobile. The ad capabilities of mobile will increase as the field matures, and then they will have to deal with buying video advertising and audio. At some point that will spread beyond just YouTube, first to other online video services, then to smaller niche digital TV channels, then to set-top boxes and cable TV. Finally, broadcast television (if it is still in business) will grudgingly accept that they need to make their inventory available.<br />
<br />
Before any of this happens, most of the advertising on social networks will have become available <span class="il">programmatically</span>. Facebook is making this transition, and Twitter will follow, as will the others. They will each struggle with the balance of keeping their unique ad formats and maximizing the return on their inventory. Everything we have seen in desktop display indicates that this problem can be solved with online auctions, which means fully programmatic social media is coming.<br />
<br />
This is an awe-inspiring future for digital advertising. The DSP of the far future will be a tool that is able to buy inventory on desktop, mobile web, mobile apps, radio stations, TV channels, electronic billboards and a suite of social media. Ideally it will contain rules that allow the purchase of media across these channels to be co-ordinated and optimized in real time.<br />
<br />
For example, imagine a system with configuration rules that purchase TV time when Twitter activity on certain key hashtags reaches threshold volumes (an independent way of checking that TV viewership is what the networks claim it is), then follow up with social media advertising, and with digital billboards the next morning during the daily commute. The possibilities for marketers to test and learn what actually works will be immense.<br />
<br />
When you contemplate the possibilities for investigating and improving ROI using this approach to media planning and spending you really need to ask yourself:<br />
<br />
<i>Why would anyone buy advertising the old fashioned way?</i><br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-91592287179907132652014-11-10T23:27:00.003-08:002014-11-10T23:27:43.165-08:00Basic Guide to Setting Up Single Node Hadoop 2.5.1 Cluster on Ubuntu<div dir="ltr" style="text-align: left;" trbidi="on">
<div style="text-align: left;">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFFUiz5Yecst-uPrDcyN-PET4at09qTf7SppLRcHoNJvX1t6FIf6_evMEg4fuHPHNa3C3yOIpGxcxh_IZALOa7FZ7bYGwsriKL_sGEeL06bEppKalywjoYRljcD-XmQQiGSHDU7fu_FP10/s1600/25ea7a2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgFFUiz5Yecst-uPrDcyN-PET4at09qTf7SppLRcHoNJvX1t6FIf6_evMEg4fuHPHNa3C3yOIpGxcxh_IZALOa7FZ7bYGwsriKL_sGEeL06bEppKalywjoYRljcD-XmQQiGSHDU7fu_FP10/s1600/25ea7a2.png" /></a></div>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">So, you have decided you are interested in big data and data science and exploring what you can do with Hadoop and Map Reduce.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">But... you find most of the tutorials too hard to wade through, inconsistent, or you simply encounter problems that you just can't solve. Hadoop is evolving so fast that often the documentation is unable to keep up. </span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Here I will run you through the process I followed to get</span><span style="font-family: Arial, Helvetica, sans-serif;"> the latest version of Hadoop (2.5.1) running so I could use it to test my Map Reduce programs. </span></div>
<div style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span></div>
<div style="text-align: left;">
</div>
<span style="font-family: Arial, Helvetica, sans-serif;">You can see the official <a href="http://hadoop.apache.org/docs/r2.5.1/hadoop-project-dist/hadoop-common/SingleCluster.html">Apache Docs here</a>.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<br />
<h3 style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Part One: Java</span></h3>
<span style="font-family: Arial, Helvetica, sans-serif;">You need to make sure you have a compatible version of Java on your machine.</span><br />
<br />
<span style="font-family: Arial, Helvetica, sans-serif;">Jump into your terminal and type</span><br />
<pre class="codeblock">java -version</pre>
<span style="font-family: Arial, Helvetica, sans-serif;">You preferably need an installation of Java 7.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;">When I run this I get:</span><br />
<br />
<pre class="codeblock">java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1~0.12.04.2)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)
</pre>
<br />
<br />
<h3 style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Part Two: Other Software</span></h3>
<span style="font-family: Arial, Helvetica, sans-serif;">You will need ssh and rsync installed. Chances are that they already are, but if not just run:</span>
<br />
<pre class="codeblock">sudo apt-get install ssh
sudo apt-get install rsync
</pre>
<br />
<br />
<h3 style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Part Three: Grab a Release</span></h3>
<span style="font-family: Arial, Helvetica, sans-serif;">Head to the <a href="http://www.apache.org/dyn/closer.cgi/hadoop/common/">Apache Hadoop Releases</a> page, choose a mirror and grab the tarball (.tar.gz). Make sure you do not grab the source file by mistake (src).</span>
<br />
<pre><span style="font-family: Arial, Helvetica, sans-serif;"><span style="white-space: normal;">Remember: in this walk-through I have grabbed release: 2.5.1</span></span></pre>
<br />
<h3 style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Part Four: Unpack & Configure</span></h3>
<span style="font-family: Arial, Helvetica, sans-serif;">Copy the tarball to wherever you want Hadoop to reside. I like to put it in the directory</span><br />
<pre class="codeblock">/usr/local/hadoop
</pre>
<span style="font-family: Arial, Helvetica, sans-serif;">and then extract the contents with</span>
<br />
<pre class="codeblock">tar -xvf hadoop-2.5.1.tar.gz
</pre>
<span style="font-family: Arial, Helvetica, sans-serif;">Then you will need to do some configuration. Open the file</span>
<br />
<pre class="codeblock">vi hadoop-2.5.1/etc/hadoop/hadoop-env.sh
</pre>
<span style="font-family: Arial, Helvetica, sans-serif;">You will need to modify the line that currently looks like this</span>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">export JAVA_HOME=${JAVA_HOME}</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">You need to point this to your java installation. If you are not sure where that is, just run</span>
<br />
<pre class="codeblock">which java
</pre>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">and then copy the path (minus the bin/java at the end) into the hadoop config file to replace the text </span><span style="font-family: Arial, Helvetica, sans-serif;">${JAVA_HOME}.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
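On a typical Ubuntu OpenJDK 7 installation the result would look something like the line below; the exact path here is an assumption, so use whatever path <code>which java</code> reported on your machine:

```shell
# Hypothetical path -- substitute the JDK directory found via `which java`
export JAVA_HOME=/usr/lib/jvm/java-7-openjdk-amd64
```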
<br />
<h3 style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Part Five: Test</span></h3>
<span style="font-family: Arial, Helvetica, sans-serif;">First run a quick check that you have configured java correctly. The following command should show you the version of hadoop and its compilation information.</span><br />
<br />
<pre class="codeblock">hadoop-2.5.1/bin/hadoop version
</pre>
<br />
<h3 style="text-align: left;">
<span style="font-family: Arial, Helvetica, sans-serif;">Part Six: Run Standalone</span></h3>
<span style="font-family: Arial, Helvetica, sans-serif;">The simplest thing you can do with hadoop is run a map reduce job as a standalone script.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">The Apache Docs give a great simple example: grepping a collection of files.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Run these commands:</span>
<br />
<pre class="codeblock">mkdir input
cp hadoop-2.5.1/etc/hadoop/*.xml input
hadoop-2.5.1/bin/hadoop jar hadoop-2.5.1/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.5.1.jar grep input output 'dfs[a-z.]+'
</pre>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">When hadoop completes that process you can open up the results file and have a look.</span>
<br />
<pre class="codeblock">vi output/part-r-00000
</pre>
<span style="font-family: Arial, Helvetica, sans-serif;">You should see a single line for each match of the regular expression. Try changing the expression and see what you get. Now you can use this installation to test your map reduce jars against Hadoop 2.5.1.</span><br />
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;"><br /></span>
<span style="font-family: Arial, Helvetica, sans-serif;">Coming Next: Running Hadoop 2.5.1 in Pseudo Distributed Mode</span></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-27383817116785871762014-11-09T21:54:00.000-08:002017-10-02T21:37:45.111-07:00Wittgenstein's Beetle Book Review<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipYYn7E8I3ZaSqB0PeWnZDt_bzTGhqod2tYGJDyE4NW_Pj8owsM1ZLi5xLEWCNrVBQXaKwKSx87eMcMjm7uE27bTRafxgGbmmdEIs73HjahZYS7nQim4XwYVOF3Y4PlQ3wPJHoTdt1c0g7/s1600/51tskc2xupL._SY344_BO1,204,203,200_.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="320" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEipYYn7E8I3ZaSqB0PeWnZDt_bzTGhqod2tYGJDyE4NW_Pj8owsM1ZLi5xLEWCNrVBQXaKwKSx87eMcMjm7uE27bTRafxgGbmmdEIs73HjahZYS7nQim4XwYVOF3Y4PlQ3wPJHoTdt1c0g7/s1600/51tskc2xupL._SY344_BO1,204,203,200_.jpg" width="213" /></a></div>
<span style="-webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); -webkit-tap-highlight-color: rgba(26, 26, 26, 0.296875); font-family: HelveticaNeue-Light; line-height: 22px; text-align: -webkit-auto; white-space: nowrap;"><br /></span>
<span style="-webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); -webkit-tap-highlight-color: rgba(26, 26, 26, 0.296875); font-family: HelveticaNeue-Light; line-height: 22px; text-align: -webkit-auto; white-space: nowrap;"><br /></span>
<span style="-webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); -webkit-tap-highlight-color: rgba(26, 26, 26, 0.296875); font-family: HelveticaNeue-Light; line-height: 22px; text-align: -webkit-auto; white-space: nowrap;">Wittgenstein's Beetle by Martin </span><span style="-webkit-composition-fill-color: rgba(175, 192, 227, 0.230469); -webkit-composition-frame-color: rgba(77, 128, 180, 0.230469); -webkit-tap-highlight-color: rgba(26, 26, 26, 0.292969); font-family: HelveticaNeue-Light; line-height: 22px; text-align: -webkit-auto; white-space: nowrap;">Cohen </span><br />
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: rgba(255, 255, 255, 0);"><br /></span></div>
<div dir="ltr" style="text-align: left;" trbidi="on">
<span style="background-color: rgba(255, 255, 255, 0);">Summary: Very disappointing.</span><br />
<span style="background-color: rgba(255, 255, 255, 0);"><br /></span>
<span style="background-color: rgba(255, 255, 255, 0);">What could have been a great primer on one of the essential tools of philosophy, is held back by the author's mediocre understanding of many of the issues he discusses. The prime example is the 'thought experiment' by Wittgenstein that serves as the name of the book. Wittgenstein held that the idea of private language was incoherent because languages were games played between people. His beetle experiment was designed to make this idea concrete by proposing a world in which we all owned a private box containing a beetle. Mr Cohen provides a direct quote from Wittgenstein's Investigations in which he (Wittgenstein) clearly states that the word beetle, if used in such a society, could not be referring to the thing in the box. Mr Cohen then turns around and tells us that the point of Wittgenstein's experiment is to show that we assume that because we use the same word as other people we are talking about the same thing. This is not what Wittgenstein said, and he says this clearly in the text.</span><br />
<span style="background-color: rgba(255, 255, 255, 0);"><br /></span>
<span style="background-color: rgba(255, 255, 255, 0);">To make matters worse, Mr Cohen returns to pick on Wittgenstein's Beetle at the end of the book as an example of a poorly done thought experiment. It fails to meet several of Mr Cohen's criteria for successful thought experiments. One needs to note that it is Mr Cohen who has massaged the definition of a thought experiment to get Wittgenstein's beetle in, and then he criticises its performance, all the while failing to understand it.</span><br />
<span style="background-color: rgba(255, 255, 255, 0);"><br /></span>
<span style="background-color: rgba(255, 255, 255, 0);">I am not going to mention the numerous fallacies the author pens on many topics of science, and his horrendous attempts at jokes. The only reason I am giving the book 2 stars is because the discussion of Searle's Chinese room argument is excellent. Read this chapter and then throw the book away.</span></div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-39354068736047985772014-11-01T22:07:00.001-07:002014-11-01T22:07:03.872-07:00Appcelerator Titanium Android Woes on Mac OSX<div dir="ltr" style="text-align: left;" trbidi="on">
I have been having ongoing problems getting Appcelerator to build and install Android Apps again.<br />
<br />
The very first time I built an Android app it took me some time to get the configuration right. Now that I have been through system upgrades I seem to have come back to step one again. As before, the <a href="https://wiki.appcelerator.org/display/guides2/Deploying+to+Android+devices">official Appcelerator Guide</a> helps me refresh how to get the device itself configured. However, it will not prepare you for the grand cluster of configuration issues you will face getting all the toys to play nicely together.
<br />
<br />
<h3 style="text-align: left;">
Problem </h3>
Appcelerator does not recognize your Android device,<br />
even though running <b>adb devices</b> shows it listed.
<br />
<br />
<h3 style="text-align: left;">
Solution </h3>
I still don't have a solution for this (most people suggest uninstalling everything and starting again, which to my mind constitutes giving up, not solving it). I do have a workaround, though: build the app without installing it, and then use adb to install it independently. This definitely works in the absence of a better solution.<br />
<br />
<h4 style="text-align: left;">
To build</h4>
Try the command <b>titanium build</b>,<br />
<i> - or - </i><br />
Just use the distribute app dialog in Titanium Studio.<br />
You can generate a signed APK easily this way.
<br />
<br />
<h4 style="text-align: left;">
To install</h4>
Just use the adb command line utility:<br />
<br />
<pre class="codeblock"> adb install ../Desktop/MyApp.apk
</pre>
<br />
Problem solved... sort of.<br />
<br />
<br />
<h3 style="text-align: left;">
Problem</h3>
adb does not even recognize your Android device.<br />
This seems to happen randomly, depending on what I had for breakfast.<br />
<br />
<br />
<h3 style="text-align: left;">
Solution </h3>
I generally find this requires a little fiddling around. This particular combination is currently working for me:<br />
1) Unplug your device.<br />
2) Kill the adb server (<b>adb kill-server</b>).<br />
3) Plug your device back in.<br />
4) Run <b>adb devices</b>.<br />
This seems to kickstart the adb server in such a way that it correctly finds the attached devices.
<br />
<br />
<h3 style="text-align: left;">
Problem</h3>
Your Android app almost builds an APK, but red errors flash up at the end. Appcelerator tells you it was built, but there is nothing in the build directory. You see a bunch of uninformative Python errors referring to problems with the file builder.py, for example:<br />
<br />
<pre class="javascript codeblock">line 2528, in &lt;module&gt;
[ERROR] builder.build_and_run(False, avd_id, debugger_host=debugger_host, profiler_host=profiler_host)</pre>
<br />
For me, it turned out that some executables were moved around between distributions of the Android SDK.<br />
<br />
The fix outlined in this note from the <a href="http://developer.appcelerator.com/question/152497/titanium-sdk-310-error-typeerror-argument-of-type-nonetype-is-not-iterable-on-building-android-app">Appcelerator forums</a> worked for me.<br />
<br />
<h3 style="text-align: left;">
Solution </h3>
Create symlinks to aapt and dx in
/Applications/Android-sdk/platform-tools:
<br />
<br />
<pre class="codeblock">ln -s /Applications/Android-sdk/build-tools/17.0.0/aapt aapt
ln -s /Applications/Android-sdk/build-tools/17.0.0/dx dx
</pre>
<br />
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-69026910397723790702014-10-03T20:15:00.001-07:002017-09-14T16:03:46.982-07:00Logistic Regression with R<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij-KHFM4crWHO9kt26ddnyTcqLnAskyQEmCqySu8JY4u8CG-GTDCVHdLa_G6selJgeSzC_xhq4GnTQDPzFGGUjAaFHKt7Lvv72E2kU52tnMotk7QWcH1s1Rpd1gfvWk077PKa6jhKnAnIl/s1600/LogReg_1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="171" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEij-KHFM4crWHO9kt26ddnyTcqLnAskyQEmCqySu8JY4u8CG-GTDCVHdLa_G6selJgeSzC_xhq4GnTQDPzFGGUjAaFHKt7Lvv72E2kU52tnMotk7QWcH1s1Rpd1gfvWk077PKa6jhKnAnIl/s320/LogReg_1.png" width="320" /></a></div>
<br />
<h2 style="text-align: left;">
Logistic Regression</h2>
<br />
Regression is performed when you want to produce a function that will predict the value of something you don't know (the dependent variable) on the basis of a collection of things you do know (the independent variables).<br />
<br />
The problem is that regression is typically done with a linear function and very few real world processes are linear. Hence, a great deal of statistics and machine learning research concerns methods for fitting non-linear functions, but controlling for the explosion in complexity that comes with it.<br />
<br />
Logistic Regression is one of the methods that tries to solve this problem. In particular, Logistic Regression produces an output between 0 and 1 which can be interpreted as the probability of your target event happening.<br />
<br />
Let's look at the form of Logistic Regression to get a better understanding:<br />
<br />
You start with the goal of a function that approximates the probability of the target T for any input vector X :<br />
<br />
p(T) = F(X)<br />
<br />
In order to ensure that F(X) takes the form of a valid probability (i.e. always between 0 and 1) we make use of the logistic function 1/(1+e^-K). If K is a large positive number then e^-K approaches 0 and hence the output of the logistic function approaches 1. If, on the other hand, K is a large negative number, then e^-K becomes very large and the output of the logistic function approaches 0.<br />
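This saturation behaviour is easy to verify numerically. The following is just an illustrative sketch (in Python, rather than the R used later in this post):

```python
import math

def logistic(k):
    """The logistic function 1 / (1 + e^-k)."""
    return 1.0 / (1.0 + math.exp(-k))

# A large positive K drives the output towards 1,
# a large negative K drives it towards 0,
# and K = 0 sits exactly in the middle.
print(logistic(10))   # very close to 1
print(logistic(-10))  # very close to 0
print(logistic(0))    # exactly 0.5
```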
<br />
<br />
So we are fitting the following function:<br />
<br />
p(T) = 1 / [ 1 + e^-g(X) ]<br />
<br />
We have added the function g(X) to afford us some flexibility in how we feed the input vector X into the logistic function. Here is where we place our usual linear regression function. We say<br />
<br />
g(X) = B_0 + B_1 * X_1 + B_2 * X_2 + ...... + B_N * X_N<br />
<br />
i.e. a linear function over all the dimensions of the input X.<br />
<br />
Now, in order to perform our linear regression, we need to transform the function definition. You can do the transformation yourself if you like; with some re-arrangement you will find that the function g(X) is equal to:<br />
<br />
g(X) = - ln [ (1-p) / p ]<br />
<br />
And by exploiting the properties of the logarithm you can further re-arrange to get the log odds ratio.<br />
<br />
g(X) = ln [ p / (1-p) ]<br />
<br />
An astute reader might notice a problem. For a target value of 1 (i.e. p=1) the fraction involves division by zero, and for p=0 the logarithm itself is undefined. Using the properties of the logarithm we can re-write the target as<br />
<br />
ln [ p / (1-p) ] = ln(p) - ln(1-p)<br />
<br />
...though note that this re-arrangement does not remove the problem at p=0 or p=1; in practice this is why logistic regression is fitted by maximum likelihood rather than by ordinary least squares on the raw log odds. Conceptually, though, the log odds is the target value onto which you perform the linear regression.<br />
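A quick numerical check confirms that the log odds is exactly the inverse of the logistic function, so g(X) can always be recovered from p whenever 0 &lt; p &lt; 1. A small illustrative sketch (Python here purely for illustration; the function names are my own):

```python
import math

def logistic(g):
    """p = 1 / (1 + e^-g)"""
    return 1.0 / (1.0 + math.exp(-g))

def log_odds(p):
    """ln[p / (1-p)] = ln(p) - ln(1-p); only defined for 0 < p < 1."""
    return math.log(p) - math.log(1.0 - p)

# Round trip: taking the log odds of the logistic output recovers g(X).
for g in (-3.0, -0.5, 0.0, 2.0):
    assert abs(log_odds(logistic(g)) - g) < 1e-9
```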
<br />
In other words you fit the value of the parameters (the Bs) so that <br />
<br />
B_0 + B_1*X_1 + B_2*X_2 + ...... + B_N*X_N = ln(p) - ln(1-p)<br />
<br />
<br />
That is all well and good, but how can we do that with R? you might ask.<br />
<br />
Well, I have gone ahead and converted some code from a bunch of different tutorials into a little R workbook that will take you through applied Logistic Regression in R. You can find the Logistic Regression Code Example in my GitHub account <a href="https://github.com/john-hawkins/MLWorkBook/blob/master/src/logisticRegression.R">right here</a>.<br />
<br />
It all boils down to using the Generalised Linear Model function, glm, with the binomial family.<br />
<br />
This R function will fit your Logistic Regression for you.<br />
<br />
If you follow that code example to the end you will get a plot like the one below, which shows you the original data in green, the model fitted to that data in black, and some predictions for unseen parts of the input space in red.<br />
<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7IXGOdHs9M0gz11fPffAXDt3Y5wtGPBskjqDCFb5urxrOuII45HMZIXINuz1z8hVscjymi_sA06wSSO3QKtF26DJ0QBhe5_vCDfRkzxDCoux4eDptISHBijTq-vdPef_tSoLPSWQXVI-b/s1600/CoalMiners.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="317" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEj7IXGOdHs9M0gz11fPffAXDt3Y5wtGPBskjqDCFb5urxrOuII45HMZIXINuz1z8hVscjymi_sA06wSSO3QKtF26DJ0QBhe5_vCDfRkzxDCoux4eDptISHBijTq-vdPef_tSoLPSWQXVI-b/s1600/CoalMiners.png" width="320" /></a></div>
<br />
Logistic Regression allows you a great deal of flexibility in your model. The parameterized linear model can be changed however you want, adding or removing independent variables. You can even add higher-order combinations of the independent variables.<br />
<br />
A common Machine Learning process is to experiment with different forms of this model and examine how the statistical significance of the fit changes.<br />
<br />
Just be wary of the pernicious problem of over-fitting.<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-1511743356058527122014-05-09T13:37:00.001-07:002014-05-09T15:01:19.335-07:00Copyright on APIs is a bad idea.If you have not heard yet, there has been a change in the Google vs Oracle case. The original ruling that an API could not be copyrighted has been reversed. In essence it means that a company can release a description of a set of functions that they provide for developers, and no one is allowed to create an alternative implementation of those functions without permission.<div><br></div><div>To understand what this means to the future of software engineering you need to understand two things.</div><div><br></div><div>1) <span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">APIs are not complex pieces of code (which are and should be subject to copyright). They are very simple descriptions of what a piece of code will do and how to make it do it. </span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">In essence a single API is just one word (the name of the function) and a list of pieces of data that should be given to it. It then specifies what will happen and the data that will be returned. An API is not its implementation; it is a high-level description of what an implementation should do.</span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">It is equivalent to me copyrighting the sentence </span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">"I am going to the shop, do you want anything?" 
</span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">When it is combined with the reply </span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">"Yes, some milk." </span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">It is really that simple. Imagine if novelists needed to pay a fee when they used that combination of sentences. Of course they could use "I will go to the shop, do you need me to get something?" or whatever other variant they need to produce in order to avoid infringing. But suggesting those sidesteps misses the point of copyright. Such small atomic combinations of the basic elements of a language are not significant pieces of work. They are not what copyright laws are designed to protect.</span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);">2) Secondly, you need to understand the purpose of APIs. They exist so that software programs are easier to write and easier to make communicate with each other. Their purpose is to let one programmer know how to interface with software written by someone else, someone they may have never met, and yet have it function perfectly. 
The API is a simple contract that says if you want my code to do this, this is how you make it happen.</span></div><div><span style="-webkit-text-size-adjust: auto; background-color: rgba(255, 255, 255, 0);"><br></span></div><div><span style="-webkit-text-size-adjust: auto;">Another advantage of APIs (well used by software developers everywhere) is that if there are multiple competing programs that do the same thing, then if they all use the same API a software developer can switch between them (almost) effortlessly.</span></div><div><span style="-webkit-text-size-adjust: auto;"><br></span></div><div><span style="-webkit-text-size-adjust: auto;">If you are ever frustrated by software not working, Internet sites being unable to perform some task, apps not working on your phone, then I have some bad news for you. If copyrightable APIs become the legal norm, then everything will get much, much worse. Start-up companies and device manufacturers alike will need to protect themselves by ensuring that their APIs are unique and not infringing anyone else's copyright. In order to make software that is compatible with something else there will need to be long term financial agreements in place. This will mean that the number of things (devices and programs) that just work together will begin to decrease.</span></div><div><span style="-webkit-text-size-adjust: auto;"><br></span></div><div><span style="-webkit-text-size-adjust: auto;">The economic impact is the creation of significant barriers to entry for new technology companies, for the simple reason that you cannot create some great new product that will work with products people already have without infringing copyright. Consequently many technical product possibilities will not be explored because of their legal risk. 
In general copyright on APIs will result in an overall reduction in the pace of innovation.</span></div><div><span style="-webkit-text-size-adjust: auto;"><br></span></div><div><span style="-webkit-text-size-adjust: auto;">To you as a consumer it will mean fewer things will just work out of the box together. It will mean that if you want devices and software to work with each other, then you will need to buy them all from the same vendor. This will be good for the large incumbents in the marketplace, but for consumers it is very bad. </span></div><div><span style="-webkit-text-size-adjust: auto;"><br></span></div><div><span style="-webkit-text-size-adjust: auto;">The sad truth is that if this ruling is upheld you can look forward to less choice and less functionality in your digital world.</span></div><div><span style="-webkit-text-size-adjust: auto;"><br></span></div><div><span style="-webkit-text-size-adjust: auto;"><br></span></div>Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-77404016819728714372014-03-03T04:19:00.001-08:002014-11-12T01:04:50.767-08:00The Relative Proportion of Factors of Odd Versus Even Numbers<div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSZtyidQKTL_UpMsGRtx1iq1pCMej5If1y8rF6K-Cevkec5zkFko27OYLqzzxHu4M5RJVRS7gLC1mekEu76EnIzMxZc3RVAn_s5_H1KkRxssNTkMUAwyPQpgG3CGzf2xyArl1UljqkMMA8/s1600/number-theory-1.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="213" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgSZtyidQKTL_UpMsGRtx1iq1pCMej5If1y8rF6K-Cevkec5zkFko27OYLqzzxHu4M5RJVRS7gLC1mekEu76EnIzMxZc3RVAn_s5_H1KkRxssNTkMUAwyPQpgG3CGzf2xyArl1UljqkMMA8/s320/number-theory-1.jpg" width="320" /></a></div>
<br />
<br />
As I was riding home from work today I was thinking about odd and even numbers.<br />
<div>
<br /></div>
<div>
It is a funny thing that the product of two even numbers must be an even number, while the product of an even number with an odd number must also be even. Only two odd numbers multiplied together will always give an odd number.</div>
<div>
<br /></div>
<div>
If you don't believe me, think about what makes a number odd or even: whether there is a remainder of one after you divide by two. When you multiply an odd by an odd, it is the same as multiplying the first odd number by the second number minus one, and then adding the first number. The first operation must give you an even number (odd times even), so then adding an odd number must give you an odd one.</div>
<div>
<br /></div>
<div>
This tells us some interesting things. Firstly, only even numbers can have factors that are both odd and even. Odd numbers will only ever have odd factors.</div>
<div>
<br /></div>
<div>
It also means that if you take two random numbers then the probability of the product being odd is just 1/4. The reason is that there are 4 possible ways to draw two random numbers: odd+odd, odd+even, even+odd, even+even. Only one of those 4 options can produce an odd number. </div>
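Both claims — that a product is odd only when both factors are odd, and that the product of two random numbers is odd with probability 1/4 — can be checked by brute force. A small illustrative Python sketch:

```python
from itertools import product as pairs

# The parity rule: a * b is odd exactly when both a and b are odd.
for a, b in pairs(range(1, 100), repeat=2):
    assert ((a * b) % 2 == 1) == (a % 2 == 1 and b % 2 == 1)

# Over a range containing equally many odd and even numbers, exactly one
# of the four parity combinations (odd * odd) yields an odd product.
all_pairs = list(pairs(range(100), repeat=2))
odd_fraction = sum((a * b) % 2 for a, b in all_pairs) / len(all_pairs)
print(odd_fraction)  # 0.25
```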
<div>
<br /></div>
<div>
This result could also mean that in general even numbers have more factors than odd numbers. I don't have an argument for it, but it seems to me to be the kind of thing for which there might be a formal proof, perhaps I was even shown it and have forgotten. If you know of one please point it out in the comments.</div>
<div>
<br /></div>
<div>
Anyway, these thoughts passed the time as I rode home today and helped me clear my mind of other things. Who would have thought that amateur number theory could be so satisfying.</div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0Ashfield Ashfield-33.889163 151.123379tag:blogger.com,1999:blog-7891160146933425402.post-42460386851152285632013-10-31T02:01:00.001-07:002013-10-31T02:01:50.981-07:00Maintaining Constant Probability of Data Loss with Increased Cluster Size in HDFS<div dir="ltr" style="text-align: left;" trbidi="on">
In a conversation with a colleague some months ago I was asked if I knew how to scale the replication factor of a Hadoop Distributed File System (HDFS) cluster as the number of nodes increased, in order to keep the probability of experiencing any data loss below a certain threshold. My initial reaction to the question was that it would not be affected; I was naively thinking the data loss probability was determined by the replication factor alone.<br />
<br />
Thankfully, it didn't take me long to realize I was wrong. What is confusing is that, for a constant replication factor, as the cluster grows the probability of data loss increases, but the quantity of data lost decreases (if the quantity of data remains constant).<br />
<br />
To see why, consider a situation in which we have N nodes in a cluster with replication factor K. We let the probability of a single node failing in a given time period be X. This time period needs to be sufficiently small that the server administrator will not have enough time to replace the machine or drive and recover the data. The probability of experiencing data loss in that time period is the probability of K or more nodes failing, the exact value of which is calculated with the following sum:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeSzeGtBVZpCQLwcVZxjXbsV7qQwZA2LbF2XFYcqE5xYMcEk2JSsKg2oQ4fkn9nLbiilKfIBkznNZ-GCuWJovpUZTtz0KZ6VfgOnX5exl2T5DJRWVnd6nlXa1l6KYt_dNlvQ2mcWPmd4PR/s1600/math1.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="108" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjeSzeGtBVZpCQLwcVZxjXbsV7qQwZA2LbF2XFYcqE5xYMcEk2JSsKg2oQ4fkn9nLbiilKfIBkznNZ-GCuWJovpUZTtz0KZ6VfgOnX5exl2T5DJRWVnd6nlXa1l6KYt_dNlvQ2mcWPmd4PR/s320/math1.png" width="320" /></a></div>
Although in general a good approximation (a consistent overestimate) is simply:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjydZcSg8Q_b77Z5u7MC3GbKvdQ3Zatsgoeu9bG_uNCz3wrwKYYQLD0Fn08qAmkiru77OGhSNygILINMpa0-ca0cV4rSOeScT84OtfYPtlm84PvpTI0Yc-2ggubvcHJvmGjO8C4HVDldQ6Z/s1600/formula2.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjydZcSg8Q_b77Z5u7MC3GbKvdQ3Zatsgoeu9bG_uNCz3wrwKYYQLD0Fn08qAmkiru77OGhSNygILINMpa0-ca0cV4rSOeScT84OtfYPtlm84PvpTI0Yc-2ggubvcHJvmGjO8C4HVDldQ6Z/s1600/formula2.png" /></a></div>
<br />
Clearly as the size of N increases this probability must get bigger.<br />
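Both quantities are easy to compute directly. Here is a short Python sketch (the function names are my own) of the exact binomial tail and the approximation; the approximation overestimates because it is effectively a union bound over all groups of K nodes:

```python
from math import comb

def p_data_loss_exact(n, k, x):
    """Probability that k or more of n nodes fail, each node failing
    independently with probability x (the exact binomial tail sum)."""
    return sum(comb(n, i) * x**i * (1 - x)**(n - i) for i in range(k, n + 1))

def p_data_loss_approx(n, k, x):
    """The simpler overestimate: C(n, k) * x^k."""
    return comb(n, k) * x**k

exact = p_data_loss_exact(10, 3, 0.01)
approx = p_data_loss_approx(10, 3, 0.01)
assert exact <= approx  # the approximation consistently overestimates

# Growing the cluster at a fixed replication factor raises the probability.
assert p_data_loss_exact(100, 3, 0.01) > exact
```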
<br />
This got me thinking about how to determine the way the replication factor should scale with the cluster size in order to keep the probability of data loss constant (ignoring the quantity). This problem may have been solved elsewhere, but it was an enjoyable mathematical exercise to go through.<br />
<br />
In essence we want to know: if the number of nodes in the cluster increases by some value n, what is the minimum replication factor k such that the probability of data loss remains the same or smaller? Using the approximation from above we can express this as:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm1udEMSHemHf5kHzWUYpRyHaf7tEIzytj9-B0VOXsanOLNZCavZuCn-wOOsSrHKthgc5ka7uK6jGqm4pHCciD_0nTLGHHzaAF5uwbdj7g7T8ZEKxWIZVRROfE_IuCjDBChel3ywPNf-5x/s1600/formula3.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="89" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEjm1udEMSHemHf5kHzWUYpRyHaf7tEIzytj9-B0VOXsanOLNZCavZuCn-wOOsSrHKthgc5ka7uK6jGqm4pHCciD_0nTLGHHzaAF5uwbdj7g7T8ZEKxWIZVRROfE_IuCjDBChel3ywPNf-5x/s320/formula3.png" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
Now if we substitute in the formulas for N-choose-K and perform some simplifications we can transform this into:<br />
<br />
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSmz_R89trormU1AdWnMUOB68Uh2ngEdxGq0VT2frCPlwwll6rLcnjTGDADNpKskg_58V-fY-SvViQpSJXswzj-N1NlSwx69s7jWdzmuMQDigukaMsEjlejNxxfbY38R8CneqE5De4mPwQ/s1600/formula4.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="50" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEhSmz_R89trormU1AdWnMUOB68Uh2ngEdxGq0VT2frCPlwwll6rLcnjTGDADNpKskg_58V-fY-SvViQpSJXswzj-N1NlSwx69s7jWdzmuMQDigukaMsEjlejNxxfbY38R8CneqE5De4mPwQ/s320/formula4.png" width="320" /></a></div>
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
I optimistically thought that it might be possible to simplify this using <a href="http://en.wikipedia.org/wiki/Stirling%27s_approximation">Stirling's Approximation</a>, but I am now fairly certain that this is not possible. Ideally we would be able to express k in terms of N,n,K,X, but I do not think that it is possible. If you are reading this and can see that I am wrong please show me how.<br />
<br />
In order to get a sense of the relationship between n and k I decided to do some quick numerical simulations in R to have a look at how k scales with n.<br />
<br />
I tried various combinations of X, N and K. Interestingly for a constant X the scaling was fairly robust when you varied the initial values of N and K. I have plotted the results for three different values of X so you can see the effect of different probability of machine failure. In all three plots the baseline case was a cluster of 10 nodes with a replication factor of 3.<br />
<br />
You can grab the <a href="https://github.com/john-hawkins/Experiments/blob/master/ReplicationFactorScaling/HDFS_replication_factor_scaling.r">R code used to generate these plots from my GitHub repository</a>.<br />
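If you prefer not to run the R, the same search can be sketched in a few lines of Python (using the overestimate as the data-loss probability; the function name is my own):

```python
from math import comb

def min_replication(n_nodes, base_n, base_k, x):
    """Smallest replication factor k for a cluster of n_nodes whose
    approximate data-loss probability C(n_nodes, k) * x^k does not
    exceed that of the baseline cluster (base_n nodes, factor base_k)."""
    baseline = comb(base_n, base_k) * x**base_k
    k = base_k
    while comb(n_nodes, k) * x**k > baseline:
        k += 1
    return k

# Baseline case from the plots: 10 nodes, replication factor 3,
# with a 1% node failure probability in the time window.
for n in (10, 20, 50, 100):
    print(n, min_replication(n, 10, 3, 0.01))
```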
<br />
<div class="separator" style="clear: both; text-align: center;">
</div>
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbIPzSzbAwurW_g2tyEUpL1ZuPeZCVmi0pTRcXWtG07yZkXRDzK1cllLPxfmAOWAstpryL63Vz5rJwaoz0g4EHBonGVz39H25PFhGxQ4sirL7MZ0DEmodEMSIiSHa1SJ1GcyaDDNojfqn_/s1600/Plot.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="640" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEgbIPzSzbAwurW_g2tyEUpL1ZuPeZCVmi0pTRcXWtG07yZkXRDzK1cllLPxfmAOWAstpryL63Vz5rJwaoz0g4EHBonGVz39H25PFhGxQ4sirL7MZ0DEmodEMSIiSHa1SJ1GcyaDDNojfqn_/s640/Plot.png" width="636" /></a></div>
<br />
<br />
<br />
<br />
<br />
<br />
<br />
<br />
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com1tag:blogger.com,1999:blog-7891160146933425402.post-32261624356551609082013-10-29T04:50:00.000-07:002014-11-12T01:10:03.689-08:00Philosophical Zombies and the Physical Basis of Consciousness <div dir="ltr" style="text-align: left;" trbidi="on">
<div class="separator" style="clear: both; text-align: center;">
<a href="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiK0Slfu44BCj1aVh7QlvMeb3psLeYEJBH92KQoo83WJFpxNCjQDyfhibBWFBDFBpQnMwAsB0bD_pVGwT72CPe7dx2h5TS3mJ2RK_f_GJYIX_YOyAvVjDcWCQH6bXoJW3X7QXo-SwWzrq_F/s1600/p_zombies.jpg" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" src="https://blogger.googleusercontent.com/img/b/R29vZ2xl/AVvXsEiK0Slfu44BCj1aVh7QlvMeb3psLeYEJBH92KQoo83WJFpxNCjQDyfhibBWFBDFBpQnMwAsB0bD_pVGwT72CPe7dx2h5TS3mJ2RK_f_GJYIX_YOyAvVjDcWCQH6bXoJW3X7QXo-SwWzrq_F/s1600/p_zombies.jpg" /></a></div>
<br />
<br />
Given that the Walking Dead is back with a new season and World War Z has just ripped through the public consciousness, I thought that the philosophical implications of zombies would be a worthwhile subject for a post. I would not be the first person to have thought about what the notion of a zombie means for consciousness; in fact, the zombie has a well-entrenched place in a series of arguments about the nature of the relationship between mind and matter.<br />
<br />
Before we get under way, it is worth noting that to philosophers versed in the age-old tradition of the <i>Thought Experiment</i>, a zombie is not a flesh-eating monster that shambles around slowly decomposing and smelling disgusting. A philosophical zombie is merely a person without consciousness. I can hear you all respond "Que?" followed by quizzical silence. The idea is to ask whether we can conceive of a person who acts and behaves like we do without the mental <i>qualia</i> of consciousness, that is, the internal experience of seeing red, smelling roses, and the pain of being pricked by a thorn.<br />
<br />
The way this is taken to impact on our understanding of mind and brain relies on a second philosophical trick: the notion of a <i>conceivability argument</i>. This is the idea that if we can conceive of something then it is in some sense at least possible. Usually this is taken as metaphysical possibility, i.e. that it may not be possible in this universe, but in some other universe. If you think this is a pretty slippery way to argue, then you are in good company. Nevertheless, it persists as a philosophical tool, and for the sake of this post I am going to grant it temporary validity.<br />
<br />
Ok. So.<br />
<br />
The argument goes as follows: physicalist explanations of consciousness require that there be some configuration of matter that corresponds to conscious states. If we can conceive of a zombie, then it is metaphysically possible that a being could exist that acts as we do, yet is not conscious. Such a zombie would possess the configuration of brain matter that produces the conscious-like behavior without the conscious states themselves; therefore that configuration of brain matter cannot be the source of consciousness.<br />
<br />
However, even allowing the conceivability argument, this is still an invalid argument. The reason is that just because in <i>homo sapiens</i> we observe certain configurations of brain matter that give rise to a set of behaviors and conscious states, it does not preclude the existence of other arrangements that have the former but not the latter. It is equivalent to observing a bunch of four-legged tables and concluding that <i>table-ness </i>and <i>four-legged-ness</i> are a necessary combination. In reality other arrangements of legs can also make tables, and four legs do not always a table make.<br />
<br />
Strengthening this objection is the fact that we know the micro-structure of our brains is different between individuals. In fact, this is the source of our individuality. While the macro-structural features of our brains are shared (thalamus, hypothalamus, corpus callosum, regions of the cerebral cortex and their inter-connectedness), the fine-grained structures that control our thoughts and actions are (virtually) unique. This means that in reality there is not a single configuration of brain matter that gives rise to a given set of behaviors and their corresponding conscious states, but rather a family of configurations.<br />
<br />
There is nothing preventing this family of configurations being broader than we know it to be, with a certain (as yet unobserved) subset having the property of giving rise to behaviors without conscious states. This might seem far-fetched, but as I can conceive of it, it must be metaphysically possible.<br />
<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com1tag:blogger.com,1999:blog-7891160146933425402.post-11430110581036719702013-09-03T06:44:00.001-07:002014-11-01T04:11:53.386-07:00Using AWK for Data Science<div dir="ltr" style="text-align: left;" trbidi="on">
Over the years I have become convinced that one of the essential tools needed by anyone whose job consists of working with data is the Unix scripting language AWK. It will save you an awful lot of time when it comes to processing raw text data files.<br />
<br />
For example, taking a large delimited file of some sort and pre-processing its columns to pull out just the data you want, perform basic calculations or prepare it for entry into a program that requires a specific format.<br />
<br />
AWK has saved me countless hours over the years, so now I am writing a super brief primer that should not only convince you it is worth learning but show you some examples.<br />
<br />
The first thing you need to know about AWK is that it is data driven, unlike most other languages, in which execution is constrained largely by the procedural layout of the instructions. AWK instructions are defined by patterns in the data to which actions should be applied. If you are familiar with the regular-expression-style control structures available in Perl then this should seem like a comfortable idea.<br />
<br />
The programs are also data driven in the sense that the entire program is applied to every line of the file (as long as there are patterns that match), and furthermore the program has inbuilt access to the columns of data inside the file through the $0, $1, $2 ... variables: $0 contains the entire line and $1 upwards contain the data from individual columns. By default the columns are expected to be TAB separated, but you can follow your AWK script with FS=',' to use a comma or any other field separator.<br />
<br />
To run a simple AWK script type:<br />
<br />
<pre class='codeblock'>
>awk 'AWK_SCRIPT_HERE' FILE_TO_PROCESS
</pre>
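For instance, to sum the second column of a comma-separated file, you can place the FS assignment after the script and before the file name (the file name and data below are hypothetical, just to make the example self-contained):

```shell
# Create a small two-column CSV: name,count (hypothetical data)
printf 'alice,3\nbob,7\n' > /tmp/data.csv

# FS=',' before the file name sets the field separator for that file
awk '{sum += $2} END {print sum}' FS=',' /tmp/data.csv   # prints 10
```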
<br />
The basic syntax of the scripts themselves consists of multiple pattern-action pairs defined like this:<br />
PATTERN {ACTION}<br />
<br />
One need not include a PATTERN, in which case the action will be applied to every line inside the file to which the program is applied.<br />
<br />
So for example, the following program will output the sum of columns 3 and 4:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">>awk '{print $3+$4}' FILENAME</span><br />
<br />
If we only wanted this to happen when column 1 contained the value 'COSTS' we have a number of options. We could simply use the pattern equivalent of an IF statement as follows:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">>awk '$1=="COSTS" {print $3+$4}' FILENAME</span><br />
<br />
Alternatively we could use a PATTERN expression as follows<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">>awk '/COSTS/ {print $3+$4}' FILENAME</span><br />
<br />
The problem with the second solution is that if, for some reason, the word COSTS can appear in other fields or places in the file, then you may not get the results you are looking for. There is a trade-off in using the power and flexibility of regular expression patterns: they can lull us into a false sense of security about what they are doing.<br />
<br />
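The difference is easy to see on a small hypothetical file where COSTS also appears in a second, descriptive column:

```shell
# Two tab-separated lines; COSTS appears in column 1 of the first
# line and in column 2 of the second (hypothetical data)
printf 'COSTS\tq1\t10\t5\nSALES\tCOSTS\t100\t50\n' > /tmp/ledger.txt

# Field test: fires only when column 1 is exactly COSTS
awk '$1=="COSTS" {print $3+$4}' /tmp/ledger.txt    # prints 15

# Regex pattern: fires on any line containing COSTS anywhere
awk '/COSTS/ {print $3+$4}' /tmp/ledger.txt        # prints 15 then 150
```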
There are several special execution paths that can be included in the program. In place of the pattern you can use the reserved words BEGIN or END in order to execute a routine before or after the file processing occurs. This is particularly useful for doing something like calculating a mean, shown below:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">>awk '{sum+=$1; count+=1} END {print sum/count}' FILENAME</span><br />
<br />
By now you should be seeing the appeal of AWK. You can manipulate your data quickly with small scripts that do not require loading an enormous file into a spreadsheet, or writing a more complicated JAVA or PYTHON program.<br />
<br />
Finally here are a few of the kinds of tasks that I do with AWK all the time<br />
<br />
<b>1) </b>Convert a file with 10 or more columns into one that sums a few of them and reformats the others:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">>awk '{print toupper($1) "," ($3/100) "," ($2+$4-$5)}' FILENAME</span><br />
<br />
<br />
<b>2) </b>Calculate the standard deviation of a column (the mean is computed along the way). The following gives the sample statistic; change the n-1 in the END block to n for the full population:<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">> awk 'pass==1 {sum+=$1; n+=1} pass==2 {mean=sum/n; ssd+=($1-mean)*($1-mean)} END {print sqrt(ssd/(n-1))}' pass=1 FILENAME pass=2 FILENAME</span><br />
<span style="font-family: Courier New, Courier, monospace;"><br /></span>
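A quick sanity check of the two-pass approach on some made-up numbers (note that the mean is the plain sum over n; the n-1 belongs only in the variance divisor):

```shell
# Eight hypothetical values with mean 5 and sum of squared deviations 32
printf '2\n4\n4\n4\n5\n5\n7\n9\n' > /tmp/vals.txt

# Pass 1 accumulates sum and count; pass 2 uses the resulting mean
awk 'pass==1 {sum+=$1; n+=1}
     pass==2 {mean=sum/n; ssd+=($1-mean)*($1-mean)}
     END {print sqrt(ssd/(n-1))}' pass=1 /tmp/vals.txt pass=2 /tmp/vals.txt
# sample standard deviation: sqrt(32/7) ~ 2.138
```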
<br />
<b>3) </b>Calculate the Pearson correlation coefficient between a pair of columns. (Here the choice of n-1 versus n is moot: the normalising divisors cancel out of the final ratio, so the same script serves for both sample and population data.)<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">> awk 'pass==1 {sx+=$1; sy+=$2; n+=1} pass==2 {mx=sx/n; my=sy/n; cov+=($1-mx)*($2-my); ssdx+=($1-mx)*($1-mx); ssdy+=($2-my)*($2-my);} END {print cov / ( sqrt(ssdx) * sqrt(ssdy) ) }' pass=1 FILENAME pass=2 FILENAME</span><br />
<div>
<br /></div>
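A useful sanity check: two perfectly linearly related columns should give a coefficient of exactly 1 (the data below is hypothetical, and the means are the sums divided by n):

```shell
# y = 2x, so the correlation must come out as 1
printf '1\t2\n2\t4\n3\t6\n' > /tmp/xy.txt

awk 'pass==1 {sx+=$1; sy+=$2; n+=1}
     pass==2 {mx=sx/n; my=sy/n; cov+=($1-mx)*($2-my);
              ssdx+=($1-mx)*($1-mx); ssdy+=($2-my)*($2-my)}
     END {print cov/(sqrt(ssdx)*sqrt(ssdy))}' pass=1 /tmp/xy.txt pass=2 /tmp/xy.txt
# prints 1
```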
<div>
If you have any great tips for getting more out of AWK let me know, I am always looking for shortcuts.</div>
<div>
<br /></div>
<div>
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com7tag:blogger.com,1999:blog-7891160146933425402.post-34455101058762271302013-06-21T00:07:00.000-07:002013-06-21T00:07:30.472-07:00Configure Chef to Install Packages from a Custom Repo<div dir="ltr" style="text-align: left;" trbidi="on">
<h3 style="text-align: left;">
Going Nuts</h3>
This was driving me completely nuts last week. I could write Chef recipes to install packages from standard repos but I could not get Chef set up so that the recipe would add a new repo and then install packages using the Chef package configuration.<br />
<br />
Truth be told I could do this in a really crude way. I could add a repo by writing a file on the node, and then run a command to install a package directly. It just wouldn't be managed by Chef.<br />
<br />
<h3 style="text-align: left;">
The wrong way</h3>
I first created a new cookbook called myinstaller. Then in the templates directory I created a file called custom.repo.erb (The Chef template format) with the following contents:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">[custom]</span><br />
<span style="font-family: Courier New, Courier, monospace;">name=MyPackages</span><br />
<span style="font-family: Courier New, Courier, monospace;">baseurl=http://chef.vb:8088/yum/Redhat/6/x86_64</span><br />
<span style="font-family: Courier New, Courier, monospace;">enabled=1</span><br />
<span style="font-family: Courier New, Courier, monospace;">gpgcheck=0</span><br />
<div>
<br /></div>
<div>
Note that the baseurl parameter is pointing to a yum repository I have created on one of the virtual machines running on my virtual network.</div>
<br />
<br />
I then edited the recipe file <b>recipes/default.rb</b><br />
and added the following:<br />
<br />
<br />
<span style="font-family: Courier New, Courier, monospace;">template "custom.repo" do</span><br />
<span style="font-family: Courier New, Courier, monospace;"> path "/etc/yum.repos.d/custom.repo"</span><br />
<span style="font-family: Courier New, Courier, monospace;"> source "custom.repo.erb"</span><br />
<span style="font-family: Courier New, Courier, monospace;">end</span><br />
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">execute "installjdk" do</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> command "yum -y --disablerepo='*' --enablerepo='custom' install jdk.x86_64"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">end</span></div>
<div>
<br /></div>
<div>
This works. But it is crude because I am not using Chef to manage the packages that are installed.</div>
<div>
<br /></div>
<h3 style="text-align: left;">
The right way</h3>
<div>
When you search around into this topic you will come across pages <a href="http://docs.opscode.com/resource_yum.html">like this one</a> that talk about adding packages with the <b>yum_package</b> command. However if you change the install section above to use this command it will not work. It seems to be related to the fact that simply adding a file to the yum repos directory on the node (while recognized on the machine itself) is not recognized by Chef.</div>
</div>
<div>
<br /></div>
<div>
I dug deeper and tried many different versions of adding that repo configuration and I eventually <a href="http://tickets.opscode.com/browse/COOK-2860">started finding references</a> to the command '<b>yum_repository</b>'. However, if you try to whack this command into your recipe it doesn't bloody work. It turns out that this is because it is not a command that is built into Chef (unlike '<b>package</b>' and '<b>yum_package</b>') it is in fact a command that comes from <a href="https://github.com/opscode-cookbooks/yum">this open source cookbook for installing yum packages</a>.</div>
<div>
<br /></div>
<div>
If you do not want to use this entire cookbook the critical files to grab are as follows</div>
<div>
<b>yum/resources/repository.rb</b></div>
<div>
<b>yum/providers/repository.rb</b></div>
<div>
<b>yum/templates/default/*</b></div>
<div>
(I took the three files from this last directory, which may not be strictly necessary).</div>
<div>
<br /></div>
<div>
Now before you can use the command there are a couple of gotchas. </div>
<div>
1) If you copy all of this to a new cookbook called myyum, then the repository command will now be '<b>myyum_repository</b>'</div>
<div>
2) You will need to edit the file <b>yum/providers/repository.rb</b> </div>
<div>
go to the bottom where the repo config is being written and change the line:</div>
<div>
cookbook "yum"</div>
<div>
So that the name of your cookbook appears there instead.</div>
<div>
<br /></div>
<div>
You will now be able to add a repository by putting the following in a recipe</div>
<div>
<br /></div>
<div>
<div>
<span style="font-family: Courier New, Courier, monospace;">myyum_repository "custom" do</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> description "MyRepo"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> url "http://chef.vb:8088/yum/Redhat/6/x86_64"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> repo_name "custom"</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> action :add</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">end</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"><br /></span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">yum_package "mypackage" do</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> action :install</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;"> flush_cache [:before]</span></div>
<div>
<span style="font-family: Courier New, Courier, monospace;">end</span></div>
<div>
<br /></div>
</div>
<div>
Just upload your new cookbook: </div>
<div>
<b>sudo knife cookbook upload myyum</b></div>
<div>
<br /></div>
<div>
Add the recipe to your node: </div>
<div>
<b>knife node run_list add node2.vb 'recipe[myyum::default]'</b></div>
<div>
<br /></div>
<div>
And execute: </div>
<div>
<b>knife ssh name:node2.vb -x <USER> -P <PASSWORD> "sudo chef-client"</b></div>
<div>
<br /></div>
<div>
Amazing</div>
<div>
<br /></div>
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com3tag:blogger.com,1999:blog-7891160146933425402.post-38994072955154581042013-06-18T01:01:00.001-07:002013-06-18T01:01:59.433-07:00Configuration Management with Chef<div dir="ltr" style="text-align: left;" trbidi="on">
<h2 style="text-align: left;">
<b>Configuration Management with Chef</b></h2>
Have you ever been through a long, tedious process of setting up a server, messing with configuration files all over the machine, trying to get some poorly documented piece of software working? Finally it works, but you have no idea what you did. This is a constant frustration for loads of people. Configuration Management may be the answer.<br />
<br />
Stated simply, you write scripts that will configure the server. Need to change something? You modify the script and rerun it. Keep the script in version control and branch it so that you can track multiple experiments. All the advantages of managed code are brought over to managing a server.<br />
<br />
We are currently investigating using Chef, which, as brilliant as it appears to be, is sorely lacking in straightforward, complete and accurate tutorials. What I need with every new tool is a bare-bones get-up-and-running walk-through. I don't need a highly branched and complete set of instructions designed to tutor people who already know what they are doing. This blog post is my attempt at a bare-bones walk-through.<br />
<br />
So here we go.<br />
<h3 style="text-align: left;">
Configuration</h3>
In this walk-through we are creating an entire networked Chef system using virtual machines. To do this we need to set up a local DNS server that will map names to IP addresses on the local virtual network. <<THIS PART IS ASSUMED>><br />
<br />
<h3>
DNS server config</h3>
Once you have the DNS server set up you need to make a few modifications.<br />
Set it so that it will forward unknown names to your Gateway DNS server<br />
Change the <b>named</b> configuration to forward to <<Gateway DNS>><br />
<div>
Add entries for all components of the Chef networks. The following are assumed to exist</div>
<br class="Apple-interchange-newline" /><div>
dns.vb (DNS server) 192.168.56.200</div>
<div>
chef.vb (Server) 192.168.56.199</div>
<div>
node.vb (Node) 192.168.56.101</div>
<div>
<h3>
Host Configuration</h3>
Configure your host machine<br />
<br />
1) Add dns.vb to /etc/hosts<br />
2) Disable wireless (or other connections)<br />
3) Fix the DNS server in your wired connection<br />
- In the IPV4 setting tab add the IP address of dns.vb<br />
<br /></div>
<br />
<h3 style="text-align: left;">
Virtual Machine Config</h3>
Once this is done you will need to configure each and every machine that is added to the system (Chef Server and Nodes )<br />
<br />
1) Give the machine a name<br />
A) Edit the /etc/sysconfig/network file and change the HOSTNAME= field<br />
B) Run the command hostname <myhostname><br />
<br />
2) Give the machine a static IP address and set the DNS server<br />
<br />
vi /etc/sysconfig/network-scripts/ifcfg-eth1<br />
BOOTPROTO=none<br />
IPADDR=<<IP ADDRESS>><br />
NETMASK=255.255.255.0<br />
DNS1=192.168.56.200<br />
<br />
<br />
3) Add the machine's IP address and hostname combination to the DNS server<br />
A) Edit the file /etc/named/db.vb and add a line at the bottom for each hostname IP combination<br />
B) Restart the DNS server : service named restart<br />
<br />
4) Prevent the eth0 connection from setting the DNS<br />
vi /etc/sysconfig/network-scripts/ifcfg-eth0<br />
BOOTPROTO=dhcp<br />
PEERDNS=no<br />
<br />
<h2 style="text-align: left;">
<b>Set up a Server</b></h2>
<br />
This blog entry pretty much covers it<br />
http://www.opscode.com/blog/2013/03/11/chef-11-server-up-and-running/<br />
<br />
You basically grab the right version of Chef Server<br />
wget https://opscode-omnibus-packages.s3.amazonaws.com/el/6/x86_64/chef-server-11.0.8-1.el6.x86_64.rpm<br />
<br />
Install it<br />
sudo yum localinstall chef-server-11.0.8-1.el6.x86_64.rpm --nogpgcheck<br />
<br />
Configure and start<br />
sudo chef-server-ctl reconfigure<br />
<br />
<h2 style="text-align: left;">
<b>Set up a Workstation</b></h2>
The workstation is your host machine, where you will write recipes and from which you will deploy them to the nodes.<br />
<br />
I started following the instructions here: http://docs.opscode.com/install_workstation.html<br />
But that got confusing and inaccurate pretty quickly.<br />
<br />
In summary, what I did was:<br />
<br />
Start up a new virtual machine (configure network settings as above), then:<br />
<b>sudo curl -L https://www.opscode.com/chef/install.sh | bash</b><br />
<br />
When that is finished check the install with<br />
<br />
<b>chef-client -v</b><br />
<div>
<br /></div>
<div>
There are three config files the workstation needs</div>
<div>
<b>knife.rb</b></div>
<div>
knife configure --initial</div>
<b>admin.pem</b><br />
scp root@chef.vb:/etc/chef-server/admin.pem ~/.chef<br />
<b>chef-validator.pem</b><br />
scp root@chef.vb:/etc/chef-server/chef-validator.pem ~/.chef<br />
<br />
<h2 style="text-align: left;">
<b>Set up a Node</b></h2>
A Node is a machine for which you will manage the configuration using Chef.<br />
Start up a new virtual machine (configure network settings as above), then:<br />
install the Chef client onto the Node using the bootstrap process.<br />
To do this run the command on the workstation:<br />
<br />
<b>knife bootstrap node1.vb -x <username> -P <password> --sudo</b><br />
<br />
Once this is done you can add recipes to the node and deploy them.<br />
<br />
<h2 style="text-align: left;">
<b>Create your first Cookbook</b></h2>
Create your first cookbook using the following command on your workstation:<br />
<br />
<b>sudo knife cookbook create mytest</b><br />
<br />
There will now be a cookbook in the following location<br />
/var/chef/cookbooks/mytest<br />
<br />
You can go in and edit the default recipe file:<br />
/var/chef/cookbooks/mytest/recipes/default.rb<br />
<br />
Add something simple, for example we will write out a file from a template.<br />
<br />
template "test.template" do<br />
path "/tmp/test.txt"<br />
source "test.template.erb"<br />
end<br />
<br />
<br />
Then create the template file<br />
/var/chef/cookbooks/mytest/templates/default/test.template.erb<br />
<br />
add whatever text you like to the file.<br />
<br />
<h2 style="text-align: left;">
<b>Applying the Cookbook</b></h2>
<br />
First thing to do is upload the cookbook to the server<br />
<b><br /></b>
<b>sudo knife cookbook upload mytest</b><br />
<b><br /></b>
<div style="text-align: left;">
Then add the cookbook to the node</div>
<b><br /></b>
<b>knife node run_list add mynode 'recipe[mytest]'</b><br />
<b><br /></b>
Then use <b>Knife</b> to apply the cookbook using the <b>Chef-client</b> on the node<br />
<b><br /></b>
<b>knife ssh name:mynode </b><b> -x <username> -P <password> </b><b>"sudo chef-client"</b><br />
<b><br /></b>
Done!!!!<br />
<b><br /></b></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-29449609343682974662013-05-29T23:28:00.000-07:002017-10-02T21:36:21.413-07:00Machine Learning for Hackers<div dir="ltr" style="text-align: left;" trbidi="on">
<div id=":4zh" style="overflow: hidden;">
<div style="text-align: left;">
<span style="font-family: inherit; font-size: 18px; line-height: 24px;">I recently read "Machine Learning for Hackers" by Drew Conway and John Myles White. </span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;"><span style="font-size: 18px; line-height: 24px;">I'd picked it up because I heard it was a good way to get familiar
with the data mining capabilities of </span><span style="font-size: 18px; line-height: 24px;"><span style="font-size: 18px; line-height: 24px;">R</span>. I also expected the case
study based approach to be a good way to see how they approach a broad
array of machine learning problems. In these respects I was reasonably
well rewarded. You will find a bunch of R code scraps that can be reused
with a little effort. Unfortunately the explanation of what the code
does (and how) is often absent. In this sense the book is true to its
name: you will learn some recipes for tackling certain problems, but you
may not understand how the code works, let alone the technique being
applied.</span></span><br />
<div style="font-size: 18px; line-height: 24px;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-size: 18px; line-height: 24px;">
<span style="font-family: inherit;">The
one issue I found unforgivable is that in the instances where the
authors talk about machine learning theory, or use its terms, they are
often wrong. One example is the application of naive Bayes to spam
classification. The scoring function they use is the commonly used likelihood
times the prior, leaving off the evidence divisor.</span><br />
<span style="font-family: inherit;"><br /></span>
<span style="font-family: inherit;">As a method of scoring in Bayesian methods this is appropriate because it is proportional to calculating the full posterior
probability, and much more efficient to compute. However, the resulting
score is not a probability, yet the authors continuously refer to it as
one. This may seem minor, but to me it undermined my confidence in their
ability to communicate necessary details about the techniques they are
applying.</span></div>
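The point is easy to illustrate with made-up numbers (the likelihoods and priors below are entirely hypothetical, and an awk one-liner stands in for the book's R code): the raw scores are proportional to the posteriors, but they are not probabilities until you divide by the evidence term.

```shell
awk 'BEGIN {
  score_spam = 0.003  * 0.2   # P(words | spam) * P(spam)   (hypothetical)
  score_ham  = 0.0005 * 0.8   # P(words | ham)  * P(ham)    (hypothetical)
  print "raw scores:", score_spam, score_ham       # not probabilities
  z = score_spam + score_ham                       # the evidence divisor
  print "posteriors:", score_spam/z, score_ham/z   # these sum to 1
}'
```

The raw scores still rank the classes correctly, which is why dropping the divisor is fine for classification; it is only calling them probabilities that is wrong.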
<div style="font-size: 18px; line-height: 24px;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-size: 18px; line-height: 24px;">
<span style="font-family: inherit;">Another
example: in the section on distance metrics the authors state that
multiplying a matrix by its transpose computes “the correlation between
every pair of columns in the original matrix.” This is also wrong. What
they want to say is that it produces a matrix of scores that indicate
the correlation between the rows. It is an approximation because the
score depends on the length of the columns and whether they have been
normalised. These values would not be comparable between matrices. What
would be comparable between matrices is a correlation coefficient, but
this is not what is being computed.</span></div>
<div style="font-size: 18px; line-height: 24px;">
<span style="font-family: inherit;"><br /></span></div>
<div style="font-size: 18px; line-height: 24px;">
<span style="font-family: inherit;">I
am not suggesting that a hacker's guide to machine learning should
include a thorough theoretical treatment of the subject. I think only
that where terms and theory are introduced they should be used
correctly. By this criterion this book is a failure. However, for my
purposes (grabbing some code snippets for doing analysis with R) it was
moderately successful. My largest disappointment was that given the
mistakes I noticed regarding the topics about which I have reasonable
knowledge, I have no confidence in their explanation of those areas
where I am ignorant.</span></div>
</div>
</div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-71967291067395622532013-05-02T19:59:00.001-07:002013-05-02T19:59:24.065-07:00Top 8 Essential Tweaks for New Installations of Ubuntu 12.04<div dir="ltr" style="text-align: left;" trbidi="on">
<div>
<br /></div>
Having just upgraded to 12.04 there are a bunch of things that I found I needed to do to get it working how I wanted to.<br />
<br />
<h3 style="text-align: left;">
1) Install the Classic Application menu</h3>
It is beyond me why the hierarchical applications menu has been removed in this version of Ubuntu. It also seems that the new left-hand launcher only displays apps installed from the 'Ubuntu Software Centre.' Applications installed from Synaptic are lost and don't always seem to show up in the new Dash.<br />
<br />
So to get the classic application menu: Open a terminal ( Ctrl – Alt – T ) and add the following PPA.<br />
<br />
sudo apt-add-repository ppa:diesch/testing<br />
<br />
Then update and install the classic menu<br />
<br />
sudo apt-get update && sudo apt-get install classicmenu-indicator<br />
<br />
<h3 style="text-align: left;">
2) Install the restricted extras</h3>
Allows you to listen to mp3s and watch loads of encrypted video formats.<br />
<br />
sudo apt-get install ubuntu-restricted-extras<br />
<br />
<br />
<h3 style="text-align: left;">
3) Enable 'Show Remaining Space Left' Option in Nautilus File Browser</h3>
Again, why this is not on by default is beyond me. Extremely useful.<br />
<br />
Open Nautilus. Go to View - Statusbar. Enable it, nuff said.<br />
<br />
<br />
<h3 style="text-align: left;">
4) Calculator Lens/Scope for Ubuntu 12.04</h3>
One upside of the new Ubuntu Dash is a bunch of information-rich widgets integrated into the OS. You can get info on weather, cities, and films, and do calculations, directly from the HUD.<br />
<br />
sudo add-apt-repository ppa:scopes-packagers/ppa<br />
sudo apt-get update<br />
sudo apt-get install unity-lens-utilities unity-scope-calculator<br />
sudo apt-get install unity-scope-rottentomatoes<br />
sudo apt-get install unity-scope-cities<br />
<br />
<br />
<h3 style="text-align: left;">
5) Open in Terminal Nautilus Extension</h3>
Allows you to open a terminal that is already inside the folder you are currently browsing with Nautilus. This saves me oodles of time.<br />
<br />
sudo apt-get install nautilus-open-terminal<br />
<br />
<br />
<h3 style="text-align: left;">
6) Install CPU/Memory Indicator Applet</h3>
Sweet little widget to view systems resource usage stats<br />
<br />
sudo add-apt-repository ppa:indicator-multiload/stable-daily<br />
sudo apt-get update<br />
sudo apt-get install indicator-multiload<br />
<br />
<h3 style="text-align: left;">
7) Install Spotify</h3>
Music streaming service desktop client. This info comes directly from their laboratories:<br />
<a href="https://www.spotify.com/au/download/previews/">https://www.spotify.com/au/download/previews/</a><br />
<br />
Add the spotify repo by editing /etc/apt/sources.list<br />
Add the line:<br />
deb http://repository.spotify.com stable non-free<br />
<br />
sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 94558F59<br />
<br />
sudo apt-get update && sudo apt-get install spotify-client<br />
<br />
<h3 style="text-align: left;">
8) Install Synergy</h3>
Synergy is an application that lets you share your mouse and keyboard across computers. More than that, it also shares your clipboard, so you can copy text between machines. You can't copy files for the moment, but maybe if <a href="http://synergy-foss.org/download/?donate">we all donate to the cause</a> we can request that feature.<br />
<br />
You can download a Debian package here:<br />
<br />
http://synergy-foss.org/download/<br />
<br />
Then just install it with<br />
<br />
sudo dpkg -i <synergy package name here><br />
<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-50540805624517529362013-04-22T20:48:00.000-07:002013-04-22T20:48:47.330-07:00Ubuntu on Toshiba Satellite P870 / 05P<div dir="ltr" style="text-align: left;" trbidi="on">
I just bought a new Toshiba Satellite P870 / 05P laptop with amazing specs, but ten minutes of playing with Windows 8 convinced me that I don't even want to dual boot. I just wanted it off my machine.<br />
<br />
Unfortunately I then discovered that Ubuntu 12.04 has no device drivers for the wireless or Ethernet cards on this machine. After loads of head scratching and searching, I eventually found, on various Ubuntu forums, a link to a device driver that can be installed.<br />
<br />
I followed the instruction about halfway down <a href="http://askubuntu.com/questions/139632/wireless-card-realtek-rtl8723ae-bt-is-not-recognized">this page</a><br />
<br />
However I had to use <a href="http://www.liteon.com/UserFiles/driver/Module/Network/WLAN/RTL/rtl_92ce_92se_92de_8723ae_linux_mac80211_0007.0809.2012.tar.gz">this archive</a> instead of the one listed because Ubuntu 12.04 uses the 3.5 Kernel.<br />
<br />
I really should read the source myself to make sure nothing nasty has been inserted into this code, but for now I am depending on the goodwill of my fellow Ubuntu users.</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-66825170762381239702013-04-09T04:15:00.001-07:002013-04-09T04:15:17.421-07:00iOS Renewal Process<div dir="ltr" style="text-align: left;" trbidi="on">
You would think that renewing your development membership would be all that you need to do on a yearly basis to keep working as an Apple developer.<br />
It should be just: pay the fee and keep on developing. Unfortunately it is not that simple.<br />
<br />
Your certificates and provisioning profiles need to be renewed, regenerated, and installed before you can continue. As I have not found a reasonable walk-through for this process, either from Apple or on their forums, I will quickly sketch it out here:<br />
<br />
<br />
1) <b>Clear out your old Provisioning profiles in Xcode</b><br />
<br />
Open the Xcode Organizer, select "Provisioning Profiles," and go through the list deleting all the expired profiles.<br />
<br />
<br />
2) <b>Remove your existing certificates in Keychain Access</b><br />
<br />
Open the Utilities folder in your Mac's Applications, open Keychain Access, and then select "My Certificates." You will see your expired certificates (Dev and Dist) listed. Remove them both.<br />
<br />
<br />
3) <b>Create new certificates</b><br />
<br />
Keeping Keychain Access open, click: <br />
Keychain Access>Certificate Assistant>Request a Certificate From a Certificate Authority.<br />
Choose "Save to Disk" and save the request file.<br />
<br />
Open the Certificates section of the iOS Provisioning Portal.<br />
<br />
Delete the existing Development Certificate.<br />
Click the "+" Symbol to create a new development certificate.<br />
Select the top option "iOS App Development"<br />
Click Continue.<br />
Upload the certificate request file you created and finish.<br />
<br />
Click "+" again to create a distribution certificate.<br />
Choose the "App Store and Ad Hoc" option and continue.<br />
Upload the certificate request file and finish.<br />
<br />
<br />
4) <b>Regenerate the Provisioning Profiles</b><br />
<br />
Click the "Provisioning Profiles" in the iOS Provisioning Portal.<br />
Go through each of your development and distribution profiles and edit them.<br />
When you edit them you will see an option to select the new certificate that you generated. Once selected the "Generate" button will become active, click to generate and download the new profiles.<br />
<br />
<br />
5) <b>Install the Certificates and Provisioning Profiles</b><br />
<b> </b> <br />
Install the downloaded certificates and provisioning profiles by dragging them into Keychain Access and Xcode respectively.<br />
<br />
You can now test your development apps and distribute them to the App Store just as before you renewed. Just remember to select the correct profile when you are building your app. <br />
<br />
<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com2tag:blogger.com,1999:blog-7891160146933425402.post-40079461817790739932013-03-08T03:15:00.000-08:002013-03-08T03:15:50.619-08:00Configuring Apache for a Local Site on Ubuntu<h2>
Introduction </h2>
This is a simple task: setting up my local machine so that I can browse directly to a hostname such as "mysite" and have Apache find the right project. This is a great way to test that paths will work as you expect when a site goes onto the production server. You will just need a config file in the site so it knows when it is on the development server (your local machine) and when it is live.<br />
<br />
As simple as this is, I always have to look it up every time I do it.<br />
<br />
So, in the interests of improving my own efficiency and maybe helping someone else I am blogging my process.<br />
<br />
<h3>
Dependencies</h3>
Ubuntu Lucid 10.04 (check this with: cat /etc/lsb-release)<br />
<br />
Apache/2.2.14 (Ubuntu)<br />
(check this with: /usr/sbin/apache2 -v)<br />
<br />
<h3>
Process</h3>
First, add the new site to your hosts file by editing:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">sudo vi /etc/hosts</span><br />
<br />
Then change or add the line:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">127.0.0.1 localhost mysite</span><br />
<br />
<br />Next you need to configure Apache to recognise the site. You need to create a config file for your site in the sites-available directory:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">/etc/apache2/sites-available/mysite </span><br />
<br />
with something like the following contents:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;"><VirtualHost *:80><br /> ServerName mysite<br /> DocumentRoot /home/username/mysite/www<br /> <Directory /home/username/mysite/www><br /> Options Indexes FollowSymLinks Includes<br /> AllowOverride All<br /> Order allow,deny<br /> Allow from all<br /> </Directory><br /> RewriteEngine On<br /> RewriteOptions inherit<br /></VirtualHost></span><br /><br />
<br />
Then, you just need to enable the site with the Apache script a2ensite, like this:<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">sudo a2ensite mysite</span><br />
<br />
Then reload apache<br />
<br />
<span style="font-family: "Courier New",Courier,monospace;">sudo /etc/init.d/apache2 reload</span><br />
<br />
...and voilà! You can now browse directly to http://mysite<br />
<br />
<br />
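The introduction mentions a config file that tells the site whether it is on the development server or live. As a minimal sketch of that idea (the names here are my own, made up for illustration): switch on the Host header that the web server passes through.

```python
# Hypothetical environment switch (names are made up for illustration):
# decide dev vs live from the Host header the web server passes through.
DEV_HOSTS = {"mysite", "localhost", "127.0.0.1"}

def load_config(http_host):
    host = http_host.split(":")[0]  # strip any port
    if host in DEV_HOSTS:
        return {"env": "development", "base_url": "http://mysite"}
    return {"env": "production", "base_url": "https://www.example.com"}

print(load_config("mysite")["env"])           # development
print(load_config("www.example.com")["env"])  # production
```

Because the dev hostname only resolves on your local machine, the same code can ship to production untouched.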
<br />Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-9860070059561780942013-01-30T20:58:00.000-08:002013-01-30T21:29:45.902-08:00Maximum Likelihood Estimation<div dir="ltr" style="text-align: left;" trbidi="on">
<br />
<div style="margin-bottom: 0cm;">
Maximum Likelihood Estimation is a widely applicable method for estimating the parameters of a probabilistic model.</div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
Developed by R. A. Fisher in the 1920s, the principle behind it is that the ideal parameter settings of a model are those that make the observed data most likely.</div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
It is applicable in any situation in
which the model can be specified such that the probability of the
desired variable y can be expressed as a parameterised function over
the vector of observed variables (X).</div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">P(y|X</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">)</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> = f(X,</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">)</span></div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
<span class="Apple-style-span">The parameters </span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span"> of the function f(X, </span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span">) are what we want to estimate.</span></div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
The model is designed to be a function in which the parameters are set and we get back a probability value for a given x. However, we need a process to determine these model parameters. The Likelihood function is defined to be equal to this function, but operating as a function over the parameter space of <span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ.</span><br />
<div style="margin-bottom: 0cm;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
</div>
<div style="margin-bottom: 0cm;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">L(</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ | </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">y,X</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">)= </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">P(y|X,</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">)</span></div>
</div>
<div style="margin-bottom: 0cm;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
<br /></div>
</div>
<div style="margin-bottom: 0cm;">
<div style="margin-bottom: 0px; margin-left: 0px; margin-right: 0px; margin-top: 0px;">
It is important to recognise that Likelihood is not the probability of the parameters, it is just equal to the probability of y given the parameters. As such it is not a probability distribution over <span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span" style="color: #6e380a; font-family: 'Courier New', Courier, monospace; font-size: 16px; line-height: 20px;">.</span></div>
</div>
</div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
If we have N observations in our data
set, and we let D represent all N of these observations of X and y, then we can express the Likelihood function for this entire data set D as :</div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
<span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">L(</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ | </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">D</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;"> </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">)</span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;"> = </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">∏<sup>N</sup><sub>i=1</sub> P(y<sub>i</sub>|X<sub>i</sub>,</span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ </span><span class="Apple-style-span" style="font-family: 'Courier New', Courier, monospace;">)</span></div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
<span class="Apple-style-span">Maximum Likelihood is then simply defined as Argmax </span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span"> over this function. Finding the value of </span><span class="Apple-style-span" style="color: #6e380a; font-family: Cambria, Garamond, 'Palatino Linotype', serif; font-size: 16px; line-height: 20px;">φ</span><span class="Apple-style-span"> that maximises this function can be done in a number of ways.</span><br />
<br />
To find an analytical solution to the Likelihood equation we find the partial derivative of the function with respect to each of the parameters. We then solve this system of equations for the parameter values at which the partial derivatives are equal to zero. This gives us a stationary point that is either a maximum or a minimum. We then check that the second partial derivative with respect to each parameter is negative at the points found in the first step, which confirms an analytical peak on the Likelihood surface.</div>
<div style="margin-bottom: 0cm;">
<br /></div>
<div style="margin-bottom: 0cm;">
The reality of maximising the Likelihood by searching the parameter space depends a great deal on the problem, and numerous tricks exist to simplify it. The natural logarithm of the Likelihood function is often taken: because the two are monotonically related, the MLE can be obtained by maximising the log of the Likelihood instead. Taking the log also turns the product into a sum, which improves both the chance of finding an analytical solution and the computational tractability of finding a numerical one.<br />
<br />
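To make this concrete, here is a small Bernoulli example in Python (a sketch of my own, with simulated data): the analytical solution obtained by setting the derivative of the log-likelihood to zero and a brute-force numerical search over the parameter space land on the same estimate.

```python
import math
import random

random.seed(0)

# Observed data: 100 flips of a biased coin with true P(heads) = 0.7.
data = [1 if random.random() < 0.7 else 0 for _ in range(100)]

def log_likelihood(phi, data):
    # log L(phi | D) = sum_i log P(y_i | phi) for a Bernoulli model
    return sum(math.log(phi) if y == 1 else math.log(1.0 - phi) for y in data)

# Numerical approach: search a grid over the parameter space of phi.
grid = [i / 1000.0 for i in range(1, 1000)]
phi_numerical = max(grid, key=lambda p: log_likelihood(p, data))

# Analytical approach: setting the derivative of the log-likelihood to
# zero and solving gives phi = (number of heads) / N, the sample mean.
phi_analytical = sum(data) / len(data)

print(phi_numerical, phi_analytical)  # the two estimates agree
```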
In the next post I will summarise the use of the Expectation Maximisation algorithm for situations in which the Likelihood function cannot be solved analytically.<br />
<br /></div>
<div style="margin-bottom: 0cm;">
<br /></div>
</div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-3107636303379875562012-12-05T03:33:00.000-08:002012-12-05T03:33:55.777-08:00Simplicity in the worldTo my mind one of the most puzzling aspects of science is the success of the reasoning principle known as <i>Occam's Razor</i>. This is the notion that whenever two competing theories explain the known facts equally well, then the simpler theory is generally correct.<br />
<br />
As a rule of thumb <i>Occam's Razor</i> helps us wade through the infinite number of potential theories that might be put forward to explain any given phenomenon. To demonstrate this I will use a trivial example that is not particularly deeply scientific.<br />
<br />
When trying to come up with a principle to describe the observation that the sun comes over the horizon every 24 hours, we could generate an infinite set of theories as follows:<br />
<br />
1) The earth rotates at a constant speed such that the sun appears on the horizon at regular intervals.<br />
<br />
Then we may add an infinite set of exceptions.<br />
<br />
2) The earth rotates at a constant speed such that the sun appears on the horizon at regular intervals. Except on Thursday the 6th December 2018, when the earth will stop rotating for 24 hours and then resume.<br />
<br />
3) The earth rotates at a constant speed such that the sun appears on the horizon at regular intervals. Except on Thursday the 6th December 2018, when the earth will stop rotating for 24 hours and then rotate backwards.<br />
<br />
4) The earth rotates at a constant speed such that the sun appears on the horizon at regular intervals. Except once the New York Yankees have won the World Series 100 times in a row, after which it will slow down to half its speed.<br />
<br />
Etc, etc.<br />
<br />
Finding a set of theories that all equally explain the given evidence is easy. In this case we could easily decide between these theories through empirical means: because they make slightly different predictions, we just wait for the predicted outcomes to diverge. However, as there are in fact an infinite number of these alternate theories, in practice this is not possible. Instead, we rely on the rule of thumb known as <i>Occam's Razor</i> to remove the alternative theories.<br />
<br />
In the realm of data mining, Occam's Razor turns out to be incredibly practical. If you can fit multiple models to a data set with approximately equal error, then the simplest model will more often than not produce the best predictions. This principle has been critical in the design of many modern machine learning algorithms.
<br />
<br />
Interestingly, the predictive power of simplicity extends beyond this. As Google research director Peter Norvig discusses in the presentation below, we are finding that the critical factor in solving many modern computer science problems is data volume. As our data sets grow in size, we see that the best predictive models come not from painstakingly building custom models for particular data, but from using an array of simple models and letting the data speak.<br />
<br />
<br />
<p><object width="500" height="284"><param name="movie" value="http://www.youtube.com/v/yvDCzhbjYWs?version=3&hl=en_US&rel=0"></param><param name="allowFullScreen" value="true"></param><param name="allowscriptaccess" value="always"></param><embed src="http://www.youtube.com/v/yvDCzhbjYWs?version=3&hl=en_US&rel=0" type="application/x-shockwave-flash" width="500" height="284" allowscriptaccess="always" allowfullscreen="true"></embed></object></p>
<p><span id="more-7296"></span></p>
<br />
Read more about this issue in the paper <a href="http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/35179.pdf">The Unreasonable Effectiveness of Data.</a> Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0tag:blogger.com,1999:blog-7891160146933425402.post-70384457369206750322012-11-10T20:09:00.002-08:002012-12-12T02:46:39.212-08:00Python Please<div dir="ltr" style="text-align: left;" trbidi="on">
I have friends who swear by Python.
<br />
<br />
They rarely program with any other language, and I understand in theory the appeal. You have a language that forces you to write nicely formatted code just to make it work, and you do away with redundant structure-imposing syntax like braces and semicolons. No longer do you have to waste time formatting someone else's crappy code before you can work with it.<br />
<br />
What I can't understand is how they manage to live with its horrendous approach to string processing. Python is a late-bound, dynamically typed, interpreted language with an approach to string processing that belongs in C or Assembly language. At this, all the purists are going to cry:<br />
<br />
"Just because you don't understand encodings!!!"
<br />
<br />
Sigh. Yes. Every time I have to work with Python I go and reread Joel's fantastic article: <a href="http://www.joelonsoftware.com/articles/Unicode.html">The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)</a>, just to make sure I haven't missed something. Every time, I confirm that I am not a complete idiot, and I come back to wrestle with Python and try to work out where in the process of passing a string around I went wrong.
<br />
<br />
The problem is partly that I use Python predominantly to scrape webpages. This means that I am always loading up badly formatted text with incomplete or missing meta-data. So to be fair maybe programmers who do not engage in this process never see the problems I see. But it is not only me, look at <a href="http://stackoverflow.com/questions/6616476/python-problem-with-accented-chars-when-scraping-data-from-website">this thread on Stackoverflow</a> to see how ridiculous the situation is.<br />
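For what it's worth, the closest thing I have to a working pattern is to decode bytes to text exactly once, at the boundary where the data enters the program. A defensive sketch (assuming Python 3, where str is always Unicode; the function name is my own): try the page's declared charset, then UTF-8, then fall back to Latin-1, which never raises.

```python
def to_text(raw_bytes, declared_charset=None):
    """Decode scraped bytes to str: try the page's declared charset,
    then UTF-8, then Latin-1 (which never fails) as a last resort."""
    for encoding in (declared_charset, "utf-8"):
        if encoding is None:
            continue
        try:
            return raw_bytes.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue  # wrong or unknown charset, try the next option
    return raw_bytes.decode("latin-1")

print(to_text("café".encode("utf-8")))      # decoded via UTF-8
print(to_text(b"\xe9t\xe9", "iso-8859-1"))  # decoded via declared charset
```

The Latin-1 fallback may mangle accented characters, but it guarantees you get a str back rather than an exception deep inside a parsing pipeline.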
<br />
I want to propose something to Python enthusiasts. Let's say you are right, and the problems are entirely mine (real Python programmers love having to monitor string encodings continuously). OK, sure. Then:
<br />
<br />
Why not have a mode for the language that will just force all strings to be a single encoding, say UTF-8 ?
<br />
<br />
The chorus will yell back, we do. You just do: X-Y-Z
<br />
<br />
Well, people, I have tried all of those X-Y-Zs and they do not work. Perhaps again it is something to do with my approach. I use a bunch of libraries to process the data: maybe urllib, or Beautiful Soup, which I use to parse things. I don't know, I am not an expert, and I shouldn't need to be just to parse strings reliably.
<br />
<br />
I don't understand why it just doesn't work. I have never wasted so much time dealing with string coding problems with any other language than I have with Python.
<br />
<br />
It should not be so hard. It really shouldn't.
<br />
<br />
<br /></div>
Anonymoushttp://www.blogger.com/profile/00154374715549563752noreply@blogger.com0