# Blog

## Farewell to astronomy

Recently I took the decision to shift my life onto a new track and accepted a job offer outside of astrophysics. Given that leaving academia has long been a frequent source of conversation among the postdocs and PhD students I interact with on a daily basis, I thought that --- inspired by Marcel Haas' article titled Leaving the field, becoming an extronomer --- I'd write a few words about why I'm leaving, and where I'm going.

Jessica Kirkpatrick has written a lot about the difference between academia and industry (Astronomer to data scientist, Astronomy vs. data science) and I'd encourage anybody who is interested to read those articles. They reflect my feelings about academia more coherently than I can myself. Therefore, I'm only going to cover my personal circumstances and say that as much as I love doing science for a living, I do feel that I'll be just as fulfilled elsewhere.

### Where have I been?

I have been incredibly fortunate throughout my career. During my PhD at Durham University and post-doctoral positions at Leiden and Chicago I have been surrounded by wonderful and caring people who, over the course of each contract, have become close friends. Since moving to Chicago I have married a wonderful woman, we are lucky enough to live close to her family (this is by no means to be expected for those of us on short-term academic contracts!), and I have found a group of friends who make me incredibly happy to be around.

The process of moving institutions every few years is both an upside and a downside of early-career academia. It's great to meet new people, and to experience new countries. At the same time, it is exhausting. Each time I move, the process of meeting a whole new circle of friends and stocking another apartment with Ikea furniture feels less like fun and more like a horrible chore, especially with the knowledge that my time at the new institution has an expiration date just a few years in the future.

If I try to project my career forwards and objectively place myself next to others who would be competing with me for jobs, I do feel like I would likely be able to find some sort of faculty position. However, it is unlikely that I'd be able to land something at a prestigious research institution, or in a city that I'd love as much as I do Chicago. I'd probably either end up stuck in a holding pattern, taking more postdoctoral positions in new cities until finally settling, or I would accept a faculty position in a city and department where the things I have grown to love about both the big city and the big department would be largely unavailable to my family and me. Either way, continuing an astronomy career would, for me, involve more years of uncertainty and big life changes with no obvious path to resolution.

It is a decision that is personal to everybody, but, for me, when I weigh my academic career against my personal happiness I know it's time for me to make some decisions that prioritize my family and my non-academic life above chasing a tenure track position.

### Where am I going?

A while ago I filled out a LinkedIn profile with all of the computer and data-analysis experience that I gained doing astronomy. As a result, I get the occasional call or email from a recruiter asking if I'd be interested in opportunity X or job Y. Up until now I have always politely brushed these offers off because nothing appealed to me more than spending more time doing science, even though I already knew that in the long term academia would not be for me.

However, a few weeks ago I received one of these emails from a recruiter on behalf of a company called Narrative Science, with the description that Narrative Science was born out of Northwestern University and "Our artificial intelligence platform automatically analyzes data and produces narratives that are contextually relevant, actionable and tailored to any audience." Basically, they turn tables full of data into stories written in the English language. Narrative Science made a bit of a splash a few years ago with software that takes the play-by-play from a baseball game and turns it into an article in English, which was covered at the time by various (human-written) articles.

They have since widened their scope to try to write stories in a huge number of subject domains, and the problems they're trying to solve in changing 'data' into 'stories' sounded both challenging and incredibly interesting. I agreed to an interview.

During this interview, in addition to being subjected to a few hours of questioning (including my first ever attempts at writing code on a white board!), I discovered a lively atmosphere full of fun people working hard on really interesting problems. Furthermore their data team are creating something cool that I think I'll be able to contribute to in a very significant way.

Happily, they liked me too and I'm really excited to say that I'll be joining the team at Narrative Science on October 30th.

## The height and weight of every active football player

While brushing up my Python skills the other week, I decided to figure out web scraping and to do a little data analysis project. Specifically, I decided to look at the height and weight of every active NFL player, with the data that is found in tabular form on nfl.com. For anybody who wants to play with this data for themselves, here is a link to an IPython notebook with the source showing how I made the dataset (a minimal sketch of the scraping idea follows the list below). This script also requires the file teams.txt, which is up on github via the link. Then, given this database, it is easy to extract the heights and weights of each player and see how they all stack up against one another. A lot of the questions about the graphs were asking about the outlying points, so I'll quickly summarize them here:

• The shortest player is Trindon Holliday, who is 5 feet 5 inches tall
• The 6'8" quarterback is Brock Osweiler
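For anybody who doesn't want to dig through the notebook, the scraping step boils down to fetching each team's roster page and walking its HTML table. The sketch below shows the idea using requests and BeautifulSoup; the URL and column layout here are hypothetical stand-ins, and the real details live in the notebook linked above.

```python
# A minimal scraping sketch. The roster URL and table layout are
# hypothetical stand-ins; the linked notebook handles the real nfl.com markup.
import requests
from bs4 import BeautifulSoup

def scrape_roster(team_url):
    """Return (name, position, height, weight) tuples from a roster page."""
    html = requests.get(team_url).text
    soup = BeautifulSoup(html, "html.parser")
    players = []
    for row in soup.find_all("tr"):
        cells = [td.get_text(strip=True) for td in row.find_all("td")]
        if len(cells) >= 4:  # skip header and malformed rows
            name, position, height, weight = cells[:4]
            players.append((name, position, height, weight))
    return players
```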

You can click here for a metric version of the chart, translated by the Polish American Football team Angels Toruń! The average weight and height of every position is as follows:

| Position | Height | Weight (lb) |
| --- | --- | --- |
| Safeties | 6'0" | 207.6 |
| Linebackers | 6'2" | 246.3 |
| Defensive Tackles | 6'3" | 309.8 |
| Defensive Ends | 6'4" | 283.1 |
| Cornerbacks | 5'11" | 193.4 |
| Centers | 6'3" | 306.2 |
| Tight Ends | 6'4" | 254.7 |
| Running Backs | 5'11" | 215.3 |
| Guards | 6'4" | 314.5 |
| Quarterbacks | 6'3" | 223.8 |
| Offensive Tackles | 6'5" | 313.5 |
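Computing these averages from the scraped data is only a few lines of pandas. Here is a sketch, assuming heights come out of the scraper as strings like '6-2' and weights in pounds; the column names and toy rows are hypothetical.

```python
# Averaging height and weight by position. The column names and the
# "feet-inches" height format are assumptions about the scraped data.
import pandas as pd

def height_to_inches(height):
    feet, inches = height.split("-")
    return int(feet) * 12 + int(inches)

df = pd.DataFrame({"position": ["QB", "QB", "C"],
                   "height": ["6-3", "6-4", "6-3"],
                   "weight": [225, 230, 305]})  # toy stand-in data
df["inches"] = df["height"].map(height_to_inches)
print(df.groupby("position")[["inches", "weight"]].mean())
```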

The same plot for punters and kickers is here. I moved punters and kickers into a separate graph just because the offense and defense are already sufficiently crowded that it's not entirely easy to read.

After I posted these visualizations to Reddit, they got picked up by a whole lot of websites.

When I add up the number of people who have seen and/or shared these graphs, it's obvious that they have been viewed well over a quarter of a million times (125,000 from a link on imgur.com alone). This is more people than have ever seen the astrophysics that I've worked on for the last decade. I'm not sure how to feel about that.

## Python new-style string formatting reference

I am forever forgetting how to format strings in Python and have probably Googled questions related to Python's format specification mini-language over a thousand times now. For my own reference I put together a quick-reference for the commands I type most frequently. Maybe it will be useful to other chronic Googlers of the same thing, so feel free to use this for anything that you like.

The new-style Python string formatting quick reference is also available as a PDF here.
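For a flavour of the mini-language, here are a few of the patterns I find myself typing most often:

```python
# A few common uses of Python's format specification mini-language.
x = 3.14159
print("{:.2f}".format(x))           # '3.14'       two decimal places
print("{:>10}".format("right"))     # '     right' right-align, width 10
print("{:0>5d}".format(42))         # '00042'      zero-pad to width 5
print("{:+.1e}".format(x))          # '+3.1e+00'   signed scientific notation
print("{:,}".format(1234567))       # '1,234,567'  thousands separator
print("{1}, {0}".format("a", "b"))  # 'b, a'       positional reordering
```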

## Crowd Sourcing NFL Predictions #5: Week 3 & 4 Results

Busy couple of weeks, so I got a bit behind with producing these graphs. I'm still very busy so I'll present the graphs here and defer any detailed discussion until later in the season.

### Week 3

This week the games are

• Chargers @ Titans
• Lions @ Redskins
• Giants @ Panthers
• Falcons @ Dolphins
• Bills @ Jets
• Bears @ Steelers

The predictions from the Guardian's Pick Six competition look like this:

Not too bad, 4/6 predicted correctly. And here is how the crowd prediction stacks up against the experts of ESPN, CBS and Yahoo:

Again, the average crowd prediction seems to be holding its own against the experts quite admirably, finishing well above the average expert score this week.

### Week 4

Here are this week's games

• Steelers @ Vikings
• Cardinals @ Buccaneers
• Bears @ Lions
• Seahawks @ Texans
• Cowboys @ Chargers
• Patriots @ Falcons

Along with the Guardian Pick Six aggregate predictions:

The Steelers @ Vikings game was predicted to be incredibly close, 106 votes to 105 in favour of the Steelers, although you can't see that because the bar is too small. The crowd called only 2/6 of the games correctly this week, and the experts?

A much wider spread, with many of them outperforming the crowd this week. It's weird that the distribution seems so bimodal; it would be interesting to check whether the experts who predicted one game correctly and those who predicted three are clustered around picking AFC or NFC teams, or something similar. I have no idea if there was something significant about the games chosen that made this occur, or if it's all in the noise.

### How Good is the Crowd?

I showed this graph with the first few points a couple of weeks ago but have now extended it to include the data through week 4.

There definitely appears to be a positive correlation between the strength with which the crowd leans one way and the score differential, but I still don't want to try and fit anything to it until there is more data. More pressing than the correlation, though, is matplotlib's horrible default color set. I promise to do something about that before presenting further results.
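For what it's worth, swapping out the palette is a one-liner in a modern matplotlib. A minimal sketch, with an arbitrary example palette:

```python
import matplotlib.pyplot as plt

# Replace the default color cycle with a hand-picked palette.
# The hex colors here are just an arbitrary example.
plt.rcParams["axes.prop_cycle"] = plt.cycler(
    color=["#348ABD", "#A60628", "#7A68A6", "#467821", "#D55E00"])
```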

## Crowd Sourcing NFL Predictions #4: Experts vs. The Crowd

I decided to see how, on a person by person basis, the predictions of the experts compared to those made by the general public. On the Guardian Pick Six page, the organizer Paolo Bandini with help from Hamza Mohamed and Andrew Deering posted a comment showing exactly how many games each person predicted correctly. It was easy to turn this comment into a histogram and compare it to the results given by the experts.

Here, the blue bars show the results from the competition entries and the red shows the predictions of the experts. The difference is really striking! We can quantify this by using the $\chi^2$ statistic to ask what the probability is that the expert scores come from the same parent distribution as the crowd scores. The answer is that the probability of them coming from the same distribution is vanishingly small.
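For the record, a test like this is only a few lines of scipy. The sketch below uses made-up score histograms rather than the real competition numbers:

```python
# Chi-squared test of whether two score histograms share a parent
# distribution. The counts below are hypothetical placeholders.
from scipy.stats import chi2_contingency

# Rows: crowd, experts. Columns: entries scoring 0..6 games correct.
observed = [[10, 40, 150, 400, 300, 90, 10],   # crowd (made up)
            [1, 5, 6, 7, 3, 1, 0]]             # experts (made up)
chi2, p_value, dof, expected = chi2_contingency(observed)
print("chi2 = {:.1f}, p = {:.2g}".format(chi2, p_value))
```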

At least for this first week of results, it appears that the experts are doing worse than the average person. My guess is that because their picks are made in public, the experts have an incentive to make wild and unlikely predictions so they can grab at the glory should those predictions somehow come true, although I would be interested to hear anybody else's explanations.

## Crowd Sourcing NFL Predictions #3: Week 2 Results

This is one in my series of posts about comparing NFL predictions made by experts to those sourced from the crowd. In the last post I showed results from six games in week one of the NFL season. In this post, I extend these results to include the second week of games. The games that were chosen for the Pick Six competition on the Guardian this week were:

• Miami Dolphins @ Indianapolis Colts
• Carolina Panthers @ Buffalo Bills
• Dallas Cowboys @ Kansas City Chiefs
• Detroit Lions @ Arizona Cardinals
• Denver Broncos @ New York Giants
• San Francisco 49ers @ Seattle Seahawks

To see how the data was collected, see this post. Let's look at what we find. First, how did the crowdsourced predictions fare? I show here the percentage of predictions that go to each team. Green bars show correct predictions, red ones show errors. The crowds didn't do too well.

There were only two games for which over 50% of the predictions were correct. That's a complete turnaround from last week, when the 'average guess' was correct for 5/6 of the games. We don't yet have enough statistics to do anything really interesting, but I'm looking forward to seeing where this goes. Let us now compare the crowdsourced predictions to those made by a sample of 23 experts who are posting predictions publicly on ESPN, CBS and Yahoo.

The experts also had a rough week, with the majority of them calling only two out of the six games correctly, although the average expert called 3.08 games correctly.

So, how does the crowdsourced prediction compare to the actual results of the games? The following scatterplot shows the actual score in each game as a function of the percentage of predictions for an away win in each game.

The point color shows what week the game is from. A correlation is beginning to appear, but at this point there are only 12 points on the graph and the correlation is largely driven by the red point to the top right, which is the Denver Broncos 41-23 victory over the New York Giants in Week 2.

As we only have two weeks' data (12 games) so far, I'm not going to start trying to show how accurate each of the sources of data has turned out to be. That'll have to wait a couple of weeks. For now, I'm happy that this is looking to be an interesting dataset.

## The Naive Bayes Classifier

I've needed to derive and use the naive Bayes classifier a few times over the past few years, so for my own reference I'm writing down here all of the steps in the derivation of the classifier, along with a couple of example use cases, so that I can refresh my memory as needed.

### The Problem

The problem we are solving is as follows: Imagine a training dataset described by $n$ features, $F_1, F_2, ... F_n$. Each entry in this training set is classified by hand into one of the classes $C$. The classifier needs to, given a new set of $n$ features, predict which of the classes $C$ the new data corresponds to. That is, we want to calculate $P(C|F_1\cap F_2\cap ... \cap F_n)$.

### The Naive Bayesian Probabilistic Model

In the case where there are millions of features, or many values that each feature can take, looking up $P(C|F_1\cap F_2\cap ... \cap F_n)$ in probability tables is not feasible, so we turn to Bayes' rule:

$$P(C|F_1\cap F_2\cap \dots \cap F_n)=\frac{P(C)\,P(F_1\cap F_2\cap \dots \cap F_n|C)}{P(F_1\cap F_2\cap \dots \cap F_n)}$$

Only the numerator of the fraction is interesting for classification; the denominator is just a constant number that depends on the particular set of features chosen, so:

$$P(C|F_1\cap F_2\cap \dots \cap F_n)\propto P(C)\,P(F_1\cap F_2\cap \dots \cap F_n|C)=P(C\cap F_1\cap F_2\cap \dots \cap F_n)$$

We can make the numerator that remains more tractable by noting that we can break down a bunch of joint probabilities through repeated application of the definition of conditional probability. Concretely, for the case of four variables, $A_1$ through $A_4$:

$$P(A_1\cap A_2\cap A_3\cap A_4)=P(A_1)\,P(A_2|A_1)\,P(A_3|A_1\cap A_2)\,P(A_4|A_1\cap A_2\cap A_3)$$

More generally, for the case of $n$ variables:

$$P(A_1\cap \dots \cap A_n)=\prod_{i=1}^{n}P(A_i|A_1\cap \dots \cap A_{i-1})$$

We use this to write for our problem:

$$P(C\cap F_1\cap \dots \cap F_n)=P(C)\,P(F_1|C)\,P(F_2|C\cap F_1)\cdots P(F_n|C\cap F_1\cap \dots \cap F_{n-1})$$

Now we need to make the assumption that each of the features $F_i$ is conditionally independent of all of the other features. That is, $P(F_i|C\cap F_j)=P(F_i|C)$ and $P(F_i|C\cap F_j\cap F_l)=P(F_i|C)$ and so on. This removes all of the conditions above and we can simplify the previous equation to:

$$P(C|F_1\cap \dots \cap F_n)\propto P(C)\prod_{i=1}^{n}P(F_i|C)$$

### The Binary Classification Problem

So far we have derived the naive Bayes probability model; to make a classifier we need to add a decision rule. One very common choice is just to pick the most probable class (the maximum a posteriori, or MAP, decision rule). In general, we do not expect the features to be independent, but the naive Bayesian model is still surprisingly useful. If the features are dependent we should not expect the calculated class-membership probabilities to be accurate, but as long as they are ranked in the correct order the MAP decision rule will still return the correct class.

Imagine now a situation where, for a given set of $n$ features $F_1, F_2, F_3,...,F_n$ we wish to decide whether a particular entry belongs to a class ($C$) or does not ($\bar{C}$). We begin by defining the likelihood ratio:

$$\Lambda=\frac{P(C|F_1\cap \dots \cap F_n)}{P(\bar{C}|F_1\cap \dots \cap F_n)}=\frac{P(C)}{P(\bar{C})}\prod_{i=1}^{n}\frac{P(F_i|C)}{P(F_i|\bar{C})}$$

Here, $P(C)$ and $P(\bar{C})$ represent our knowledge of the prior probabilities. This prior probability distribution might be based on our knowledge of frequencies in the larger population, or on frequency in the training set. It is also important to note that the product is carried out over all features, including those that are negative in the sample we're considering.

If the likelihood ratio is greater than 1, the naive Bayesian classifier predicts that a given entry belongs to class $C$. In practice, there are frequently many thousands of features and the probabilities of any given feature being present may be very small, so we work in terms of the log likelihood:

$$\ln\Lambda=\ln\frac{P(C)}{P(\bar{C})}+\sum_{i=1}^{n}\ln\frac{P(F_i|C)}{P(F_i|\bar{C})}$$
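To make this concrete, here is a minimal sketch of the binary classifier just derived. The word counts and class sizes are hypothetical placeholders, and the smoothing shown is simple add-one pseudocounting (a slightly different choice from the 'smallest possible number' trick used in the worked example below).

```python
# A minimal sketch of the binary naive Bayes classifier using the
# log-likelihood ratio. All counts here are hypothetical placeholders.
import math

def log_likelihood_ratio(features, counts_pos, counts_neg,
                         n_pos, n_neg, prior_pos=0.5, prior_neg=0.5):
    """Return ln(Lambda); positive favours class C, negative its complement."""
    llr = math.log(prior_pos / prior_neg)
    for f in features:
        # A pseudocount of 1 keeps unseen features from zeroing the product.
        p_f_pos = (counts_pos.get(f, 0) + 1) / (n_pos + 2)
        p_f_neg = (counts_neg.get(f, 0) + 1) / (n_neg + 2)
        llr += math.log(p_f_pos / p_f_neg)
    return llr

counts_pos = {"love": 120, "great": 200}   # hypothetical word counts
counts_neg = {"hate": 150, "broken": 90}   # hypothetical word counts
print(log_likelihood_ratio(["love", "broken"],
                           counts_pos, counts_neg, n_pos=5000, n_neg=5000))
```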

### Worked Example: Sentiment Analysis

Imagine that you have the following set of messages about a mobile phone:

We wish to be able to automatically classify new messages that we receive either as having a positive or negative sentiment. From the overall population, $P(+)=5000/10000=0.5$ and $P(-)=5000/10000=0.5$ . We can now calculate the conditional probabilities for each word being in a positive and negative message.

Note here that we have employed 'smoothing' or 'pseudocounting'. Where a term doesn't appear in a given class (e.g. the word hate appears in no positive messages) we don't want to return a zero probability, as that sends the likelihood calculation haywire. Instead we make it the smallest possible non-zero number. Now consider trying to automatically classify the following message:

The prefactors at the front of the likelihood ratio cancel out ($P(+)=P(-)$), and multiplying out the remaining products gives:

I don't know which way this works out, but it highlights one big weakness of naive Bayes: we have assumed the message is a bag of words, where the relative positions of words are unimportant. For example, the presence of no in front of pleasure makes this a negative statement that naive Bayes interprets incorrectly.

## Crowd Sourcing NFL Predictions #2: Week 1 Results

### Introduction

This is the second in my series of posts on comparing the accuracy of crowdsourced NFL game predictions from the Guardian's Pick Six competition against the predictions made by experts. This being week 1, I don't have too much data to work with, but here are a couple of graphs.

First, how did the crowdsourced predictions fare? In short, well. Of the six games, the masses selected the winner on 5/6 occasions:

Here, I show for each of the six games, the percentage of the votes that went to each team (the zero in the middle means that the vote was split exactly 50/50). Team names in bold represent the winners, and the bars in green show cases where the aggregate prediction went the same way as the result. The predictions match the actual results with the exception of the one that people felt most strongly about, the Eagles' surprise win over the Redskins.

Now, what about the experts? Again, this being the first week we don't have much data, but the following graph shows how each of the 23 experts performed:

The horizontal red line in this graph shows the crowdsourced result. Averaging across the experts gives a mean score of 2.95, although I have not yet aggregated the experts' picks into a mean expert prediction to compare against the crowdsourced result. It is clear that, at least for this tiny dataset, the crowdsourced results are as good as those of the very best experts and better than 90% of them.

Oh, and the expert who only got one of six correct was Josh Katzowitz, who apparently makes a living with bold predictions, including today's article about how the (2-14 last year) Chiefs are going to surprise the (8-8 last year) Cowboys in week 2.

## Crowd Sourcing NFL Predictions #1: Data Collection

### Introduction

"The Wisdom of Crowds" is a concept that, since its publication in 2004, has become part of the popular imagination. In short, the wisdom of crowds is the idea that if you aggregate together a large crowd of people, the average decisions they reach in response to difficult questions are likely to be better than those given by any one individual. This comes about because each member of the crowd brings their own viewpoint and set of information to the question, and the crowd as a whole has a more complete and diverse set of information about the problem at hand than does any of its members. The quintessential example of this is the tale of a crowd of people being asked to guess how many jellybeans are in a large jar, or guess the weight of a cow. For large crowds, the average guess is usually closer to the correct answer than any one person's estimate. Indeed, I even remember doing this experiment at school as a child in school assembly and seeing it work first hand.

The start of the football season got me thinking about this, especially given the huge number of football predictions that are available on the internet. Of course, not all crowds are 'wise'. Investors in a stock market bubble, for example, all make the same errors. But it did strike me that crowdsourced NFL predictions probably do meet all of the criteria necessary for a crowd to make very good judgments (cribbed from the Wikipedia page on The Wisdom of Crowds):

• Diversity of opinion (each person should have private information, even if it's just an eccentric interpretation of the known facts)
• Independence (people's opinions are not fixed by those of their neighbors)
• Decentralization (people can draw on local knowledge)
• Aggregation (some mechanism exists for turning private judgments into a collective decision)

This certainly sounds a lot like football predictions on the Internet. People from the world over have different sets of information and prejudices about their teams, they're sitting alone in front of a computer, and we can easily pull together a bunch of predictions to provide an aggregate prediction for the outcome of games.

I decided to run a little experiment. For the 2013 NFL season, I will compare the accuracy of a crowd of people predicting NFL game results (The Guardian's Pick Six competition) to the set of experts on ESPN, Fox and the like. I describe below where each dataset comes from, then in future posts I'll show the results.

### Crowd Sourced Data

This data comes from The Guardian's NFL Pick Six competition. This is a bit of fun on the website of the Guardian newspaper where six games are selected each week, and the readers are invited to make their predictions for the winner of each one. I like using this dataset because it immediately picks out the more interesting matchups each week. The organizer of Pick Six selects the games where the outcome is less than obvious.

Getting the data from the website is a non-trivial process. People submit their predictions as comments on the article so we need to do a significant amount of cleaning to retrieve the data. Here are a few examples of prediction posts from week 1:

This is not a homogeneous dataset. Some people list team names, others list cities, some leave comments or jokes, and we need to worry about typos. This is a job that a human can deal with very easily. Indeed, I tallied up the first 100 (of 700 total) predictions by hand, but it's a time-consuming job that I do not have time to do every week. Here are the results for the first 100 or so predictions:

We need, therefore, to find an automated way of getting the predictions from the posts. I extracted an approximate set of predictions automatically with the following steps:

• Take a copy of the complete text of all of the comments from the webpage
• Extract and count up the number of lines that match the following criteria:
  • the line begins with the team name (case insensitive), AND
  • the line does not include the character '@'

This is all encapsulated in this simple command line:

```bash
cat week1.dat | grep -i ^teamname | grep -v '@' | wc -l
```

The reason for removing the @ symbol is that a number of people write predictions in the form Bengals@Bears, and use boldface text to highlight the winner. Without removing the lines with @ symbols, this simple method assigns all predictions formatted in this way to the team listed first (the away team). This is obviously a simplified method that makes some errors. For example, it misses anybody who uses the city as a proxy for the team name, or makes typos, or writes variations on team names (e.g. a number of people purposely use 'Washington' or a euphemism in place of 'Redskins' because of the controversy surrounding this name). Nevertheless, over 85% of posts are of a form (team names, one per line) that can easily be handled by this script.
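The same heuristic is easy to express in Python if you'd rather avoid the shell. A minimal sketch, reading the same week1.dat dump as above:

```python
# Count prediction lines that start with a team name (case-insensitive)
# and don't contain '@' (which marks 'Away @ Home' style posts).
def count_predictions(filename, teamname):
    total = 0
    with open(filename) as f:
        for line in f:
            if line.lower().startswith(teamname.lower()) and "@" not in line:
                total += 1
    return total

print(count_predictions("week1.dat", "bengals"))
```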

To check how accurate the results from this simple data scraper are, I compared its results with the ones that I obtained doing everything by hand. In the following graph, I show the percentage of people who predict a victory for the away team for the exact (i.e. where I read every post and tally the correct result) and the approximate (i.e. I ran the command line above to extract the number of predictions) methods:

The two always agree to within a percentage point or two, so we now have an automated method of extracting how many Guardian Pick Six predictions favor each team. Throughout the rest of this series I will be using this method on every prediction posted before the kickoff of the relevant games.

### Data from "Experts"

A number of news outlets publish NFL predictions in tabular form on their websites. I'll be taking the picks from the following experts:

• ESPN provides picks from 13 NFL experts
• CBS provides picks from another 8 experts
• Yahoo provides picks from 2 experts

This gives us a total sample of 23 NFL experts, and I'll also aggregate the experts together into a single 'expert prediction'. Results to be posted soon after the Monday night football finishes tonight.

## Analysis of 250,000 geotagged tweets from Chicago

The City of Chicago provides a rather comprehensive data portal, full of datasets related to the city. I wanted to take a look at these as an excuse to learn some R, a language I don't have too much experience with as yet. I decided that on top of the official city dataset, I would add one of my own and as such wrote (in Python) a robot that watches the Twitter public stream for geotagged tweets from the Chicago area. Whenever it finds one it stores the following in an SQL database:

• The tweet text, username and date
• The GPS coordinates from which the tweet was sent
• Which of Chicago's 77 community areas the tweet was sent from
• Whether or not the tweet has a positive sentiment

The neighborhood that a tweet comes from is calculated by comparing the GPS coordinates to the shapefiles that define the edges of Chicago's neighborhoods, as supplied by the City of Chicago. The final quantity, sentiment, is calculated in a simple way. I just compare every word in the tweet against a wordlist that contains, for 2,500 common words, an estimate of how positive or negative each one is (full disclosure: I got this list as part of an assignment while taking Coursera's Data Science course). A few examples of words in here:

```
abhors      -3
abilities    2
attraction   2
awesome      4
awful       -3
awkward     -2
```

The happiness of each tweet is then simply the sum of the happiness scores of its constituent words. This is obviously not a well-calibrated method, but it should allow us to compare different populations of tweets.
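Concretely, the scoring step looks something like the sketch below, with a tiny stand-in wordlist and naive whitespace tokenization:

```python
# Score a tweet by summing the scores of any words found in the
# sentiment wordlist. The dictionary here is a tiny stand-in for
# the full 2,500-word list.
scores = {"abhors": -3, "abilities": 2, "attraction": 2,
          "awesome": 4, "awful": -3, "awkward": -2}

def tweet_happiness(text):
    return sum(scores.get(word, 0) for word in text.lower().split())

print(tweet_happiness("Good morning Chicago you are awesome"))  # 4
```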

I ran this bot for a full week in May of 2013 and ended up collecting just over a quarter of a million tweets to analyse. Here are a few examples of happy tweets (anonymized):

Had fun great time with my boyfriend... Time for bed.. I hope every had a great night because I did.. Good Night..

Good morning lake. Good morning runners. Good morning coffee shop. Good morning hipsters. Good morning Chicago.

@anonymous I love you and admire you. Thanks for making Louis happy. Have a lovely day. #HappyBirthday

I don't really feel comfortable sharing people's most upset tweets, even if they are anonymized, as there is some surprisingly personal material in there (along with a LOT of swearing). We can then put all of this together to see which of Chicago's neighborhoods have the happiest tweets. I don't think that the results will be much of a surprise to anybody who lives here:

The well-off north of the city seems to produce systematically happier tweets than the neighborhoods to the south. Additionally, you can't see this from the map, but 100% of the tweets sent from out on the water were happy, which I guess is a good argument for getting a boat. Anyway, this map got me interested in asking how the economic status of a neighborhood affected its conversation, and I show this here:

Here, each point is a different Chicago neighborhood. The size of each point is proportional to the number of tweets that were sent, the color of each point is proportional to the number of violent crimes per person (purple is low, yellow is high), and the axes show the average happiness of the tweets as a function of the local unemployment rate.

The correlation between average mood and unemployment rate is strikingly strong, with the exception of one neighborhood to the top-right of the graph. This is Garfield Park, and on further investigation I found that it was one single person who tweeted "Good Night, @anonymous, My Love" to over 200 people, accounting for a good fraction of the neighborhood's total tweets.

This is a pretty cool dataset, and this is only a very preliminary analysis, but with many thousands of words per neighborhood I plan to look at how word usage clusters around neighborhoods of different types at some point in the future.