Why are London Crime Rates rising?

A data-driven investigation into the factors associated with higher crime rates in London boroughs.

Dominic Ong
13 min read · Jul 6, 2020

2019 has been labelled London's bloodiest year, with the city experiencing a wave of knife crime and a resurgence in crime rates across its boroughs.

Photo by Jaanus Jagomägi on Unsplash

This project aims to study how crime rates across London boroughs have been affected by police deployment numbers, as well as broader demographic and socio-economic factors.

Data Preparation and Processing

Initially, I sourced various datasets from publicly available sources online. Crime data was taken from London's open government data website. Socioeconomic factors were obtained from census demographic data. Policing data was taken from publicly available Metropolitan Police data on force strength by London borough. The table below breaks down the datasets that I ultimately used to construct the variables included in my regression analyses.

Next, I classified the 33 different types of crime committed across London boroughs into 3 levels of severity: Low-intensity, Moderate-intensity, and High-intensity Crime. This classification was based on the threat to life, the extent of physical injury caused, and the monetary value lost. I then summed the crime counts in each category to obtain the variables y1 (Low-intensity), y2 (Moderate-intensity), and y3 (High-intensity) for each borough in each year.
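To make this step concrete, here is a minimal sketch of the aggregation, assuming a long-format data frame crime_raw with hypothetical columns Borough, Year, CrimeType and Count, and a lookup table mapping each crime type to a severity level (neither object name comes from the original analysis):

library(dplyr)
library(tidyr)
#Hypothetical lookup table: each of the 33 crime types mapped to a severity level
severity_map = data.frame(
CrimeType = c("Theft from Shops", "Common Assault", "Murder"), #illustrative examples only
Severity = c("y1", "y2", "y3") #low / moderate / high intensity
)
#Sum counts per borough, year and severity, then spread into y1/y2/y3 columns
crime_by_severity = crime_raw %>%
left_join(severity_map, by = "CrimeType") %>%
group_by(Borough, Year, Severity) %>%
summarise(Count = sum(Count), .groups = "drop") %>%
pivot_wider(names_from = Severity, values_from = Count)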

Data Visualizations using Tableau

Firstly, I created data visualizations in Tableau to get an overview of the spread of crime data across the different borough regions of London. The first plot shows the general layout of the boroughs. In total, there are 32 boroughs, each indicated by a different colour.

Following that, I decided to only visualize plots of moderate-intensity crime (y2) and high-intensity crime (y3) as there was no significant change in distribution or frequency of low-intensity crime (y1) between 2008 and 2018.

Heat map: Moderate-intensity Crime

Looking at the heat maps of moderate-intensity crime across boroughs (Diagrams 2 and 3), I observed that boroughs like Brent and Westminster experienced significant increases in the incidence of crime between 2008 and 2018. Within these 10 years, boroughs like Enfield, Haringey and Tower Hamlets have also experienced such significant increases in crime rates that they have displaced other boroughs to move into the top 10 rankings over time.

Top 10 London Boroughs in terms of the annual incidence of moderate-level crime (2008 vs 2018):

Heat map: High-intensity Crime

The heat maps for high-intensity crime across boroughs show a general decline in the incidence of crime across the top 10 crime-prevalent boroughs in London. However, the sharp increase in high-intensity crime in boroughs like Brent was worrying to see. These findings prompted me to investigate the data in more depth, using other data analysis methods to further examine these particular trends.

Multiple Linear Regression

For the latest year of my dataset (2018), I ran several multiple linear regressions, regressing each yi against the full set of explanatory factors. The dependent variables were the crime counts (total crime, low-severity crime y1, moderate crime y2, and severe crime y3), while the independent variables were: the number of people earning under the minimum wage (minwage), population density (density), the proportions of poor, medium and best dwellings (dwell_poor, dwell_mid, dwell_best), the diversity proportion (diversity), police deployment numbers (police), and white and ethnic employment rates (emp_white, emp_ethnic). The relationships between these factors and crime rates are likely to differ significantly across severities of crime because the offences are materially very different. Hence, I ran separate models for low-, moderate- and high-intensity crime to obtain more precise relationships.

For the subsequent results of the regression analyses, I looked at the p-values of each regressor to determine its statistical significance, as well as the R² values to understand the explanatory power of the regressors. The R² value measures the proportion of variance in the crime rate that is explained by all the regressors, and it increases whenever more regressors are added. The adjusted R² balances this by penalizing the addition of unnecessary variables: it decreases if a new term improves the model less than would be expected by chance. A large difference between the multiple and adjusted R² would indicate overfitting.
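For reference, with n observations and k regressors, the adjusted R² applies this penalty explicitly:

Adjusted R² = 1 − (1 − R²) × (n − 1) / (n − k − 1)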

2018 Low-intensity Crime

lm_low = lm(y1 ~ minwage + density + dwell_poor + dwell_mid + dwell_best + diversity + police + emp_white + emp_ethnic, data=Crime_Data_2018)
summary(lm_low)
## Call:
## lm(formula = y1 ~ minwage + density + dwell_poor + dwell_mid +
## dwell_best + diversity + police + emp_white + emp_ethnic,
## data = Crime_Data_2018)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5450.8 -803.5 67.6 1021.6 3224.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.999e+04 2.353e+04 1.275 0.21574
## minwage 2.102e-01 5.633e-02 3.732 0.00116 **
## density -3.061e+01 3.537e+01 -0.865 0.39616
## dwell_poor -2.886e+04 1.988e+04 -1.452 0.16074
## dwell_mid -4.152e+04 2.011e+04 -2.064 0.05098 .
## dwell_best -2.782e+04 2.539e+04 -1.096 0.28505
## diversity 2.823e+03 3.561e+03 0.793 0.43627
## police 2.845e+01 4.655e+00 6.110 3.77e-06 ***
## emp_white 9.781e+01 9.231e+01 1.060 0.30084
## emp_ethnic -8.326e+01 6.043e+01 -1.378 0.18213
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1954 on 22 degrees of freedom
## Multiple R-squared: 0.945, Adjusted R-squared: 0.9225
## F-statistic: 42 on 9 and 22 DF, p-value: 8.392e-12

Collectively, the factors chosen capture most of the variability in the crime rate, as seen from the high R² value. As the discrepancy between the multiple and adjusted R² was minimal, I concluded that the model was not overfitted and the fit is appropriate. Significant variables were defined as those with p < 0.1, with a lower p-value indicating stronger evidence of an effect. The variables with p < 0.1 were minwage, dwell_mid, and police.

2018 Moderate-intensity Crime

lm_med = lm(y2 ~ minwage + density + dwell_poor + dwell_mid + dwell_best + diversity + police + emp_white + emp_ethnic, data=Crime_Data_2018)
summary(lm_med)
## Call:
## lm(formula = y2 ~ minwage + density + dwell_poor + dwell_mid +
## dwell_best + diversity + police + emp_white + emp_ethnic,
## data = Crime_Data_2018)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2131.16 -776.17 -96.02 631.83 3112.75
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.573e+04 1.592e+04 1.616 0.1203
## minwage 9.542e-02 3.812e-02 2.503 0.0202 *
## density 2.621e+00 2.393e+01 0.109 0.9138
## dwell_poor -1.149e+04 1.345e+04 -0.854 0.4025
## dwell_mid -1.855e+04 1.361e+04 -1.363 0.1867
## dwell_best -2.778e+04 1.718e+04 -1.617 0.1202
## diversity 3.560e+03 2.410e+03 1.478 0.1537
## police 3.896e+00 3.150e+00 1.237 0.2292
## emp_white -7.156e+01 6.247e+01 -1.146 0.2643
## emp_ethnic -1.140e+01 4.089e+01 -0.279 0.7829
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1322 on 22 degrees of freedom
## Multiple R-squared: 0.7603, Adjusted R-squared: 0.6623
## F-statistic: 7.755 on 9 and 22 DF, p-value: 4.544e-05

The results of this regression showed a notably lower R², indicating the possible presence of other confounders. Furthermore, the discrepancy between the multiple and adjusted R² was notable, suggesting that the model may be overfitted, perhaps due to multicollinearity between the nine regressors. The only variable with p < 0.1 was minwage; loosening the criterion to p < 0.2 would also include dwell_mid, dwell_best, and diversity.

2018 High-intensity Crime

lm_high = lm(y3 ~ minwage + density + dwell_poor + dwell_mid + dwell_best + diversity + police + emp_white + emp_ethnic, data=Crime_Data_2018)
summary(lm_high)
## Call:
## lm(formula = y3 ~ minwage + density + dwell_poor + dwell_mid +
## dwell_best + diversity + police + emp_white + emp_ethnic,
## data = Crime_Data_2018)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1818.0 -653.8 -157.1 460.3 3654.1
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4.148e+03 1.517e+04 0.274 0.7870
## minwage 9.850e-02 3.631e-02 2.713 0.0127 *
## density 3.437e+01 2.280e+01 1.508 0.1459
## dwell_poor 5.898e+03 1.282e+04 0.460 0.6498
## dwell_mid -8.014e+02 1.297e+04 -0.062 0.9513
## dwell_best -5.573e+03 1.636e+04 -0.341 0.7367
## diversity 4.285e+03 2.295e+03 1.867 0.0753 .
## police -2.781e+00 3.001e+00 -0.927 0.3641
## emp_white -5.795e+01 5.950e+01 -0.974 0.3407
## emp_ethnic 1.259e+01 3.895e+01 0.323 0.7496
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1259 on 22 degrees of freedom
## Multiple R-squared: 0.6498, Adjusted R-squared: 0.5065
## F-statistic: 4.535 on 9 and 22 DF, p-value: 0.001822

The results of this regression showed that overfitting, possibly due to multicollinearity, has led to a noticeably smaller adjusted R². Variables with p < 0.1 were minwage and diversity. Across all severities of crime, the variables that most consistently showed low p-values were minwage, diversity and dwell_mid.

Furthermore, to account for the discrepancy between multiple and adjusted R², I checked for possible multicollinearity. Multicollinearity increases the standard errors of the coefficients, and risks wrongly rendering some variables statistically insignificant. In order to investigate potential multicollinearity between our variables, I also produced a matrix plot to visualize correlations between the regressors. Strong correlations between regressors are indicative of multicollinearity and my matrix plot does indeed provide some evidence of this.

Matrix Plot of Regressors

As seen from the plot above, there are notable positive correlations for some variable pairs, such as 0.47 between minwage and police deployment (police), or 0.33 between poor dwellings (dwell_poor) and police deployment (police), which corroborates other contextual evidence that more police resources are deployed in poorer areas. Multicollinearity could be further addressed via Instrumental Variable (IV) methods, Partial Least Squares (PLS) regression, or Principal Component Analysis (PCA).
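A plot along these lines can be reproduced with a simple correlation matrix. The snippet below is a minimal sketch; the package used for the original matrix plot is not specified, so here I assume the corrplot package and the Crime_Data_2018 data frame from the regressions above:

library(corrplot)
#Correlations between the nine regressors used in the 2018 models
regressors = Crime_Data_2018[, c("minwage", "density", "dwell_poor", "dwell_mid", "dwell_best", "diversity", "police", "emp_white", "emp_ethnic")]
cor_matrix = cor(regressors, use = "complete.obs")
corrplot(cor_matrix, method = "number") #display pairwise correlations as numbers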

Decision Trees

The regression methodology above suffers from statistical weaknesses in the data, such as multicollinearity. Using alternative methods to select variables therefore provides a more robust comparison, so I subsequently constructed a regression tree. Because a tree is not built on assumptions about relationships between features, but instead splits on single features to create terminal nodes, it circumvents the multicollinearity weakness. There is, however, a strong risk of overfitting when only one decision tree is constructed, which requires pruning or averaging across many trees. I therefore supplemented the single tree with pruning, bagging, random forests, and boosting methods to improve my model.

As my consolidated dataset was relatively small (352 rows, i.e. 32 boroughs over the 11 years from 2008 to 2018) and sufficient data was required for an effective learning process, I used an 80/20 train-test split. All in all, I ran 3 separate regression trees, one for each of the 3 levels of crime, to yield the following variable selections.

Regression Tree for Low-Intensity Crime

##Regression Tree for Level 1 Crime
library(rpart) #recursive partitioning trees
library(rpart.plot) #plotting rpart trees
regtree.data1 = subset(Crime_Data, select = -c(ytotal, y2, y3, Borough))
set.seed(7)
training_index_Lv1 = sample(1:nrow(regtree.data1), nrow(regtree.data1)*0.80)
training_set_Lv1 = regtree.data1[training_index_Lv1,] #dataset for training
testing_set_Lv1 = regtree.data1[-training_index_Lv1,] #dataset for testing
reg_tree1 = rpart(
formula = y1~.,
data = regtree.data1,
subset = training_index_Lv1,
method = "anova"
)
rpart.plot(reg_tree1,type=5, extra = 1)

For low-intensity crime, the 3 most significant variables were police, minwage and diversity.

Regression Tree for Moderate-Intensity Crime

#Regression Tree for Level 2 Crime
regtree.data2 = subset(Crime_Data, select = -c(ytotal, y1, y3, Borough))
set.seed(7)
training_index_Lv2 = sample(1:nrow(regtree.data2), nrow(regtree.data2)*0.80)
training_set_Lv2 = regtree.data2[training_index_Lv2,] #dataset for training
testing_set_Lv2 = regtree.data2[-training_index_Lv2,] #dataset for testing
reg_tree2 = rpart(
formula = y2~.,
data = regtree.data2,
subset = training_index_Lv2,
method = "anova"
)
rpart.plot(reg_tree2,type=5, extra = 1)

For moderate-intensity crime, the 3 most significant variables were police, dwell_poor and diversity.

Regression Tree for High-Intensity Crime

#Regression Tree for Level 3 Crime
regtree.data3 = subset(Crime_Data, select = -c(ytotal, y1, y2, Borough))
set.seed(7)
training_index_Lv3 = sample(1:nrow(regtree.data3), nrow(regtree.data3)*0.80)
training_set_Lv3 = regtree.data3[training_index_Lv3,] #dataset for training
testing_set_Lv3 = regtree.data3[-training_index_Lv3,] #dataset for testing
reg_tree3 = rpart(
formula = y3~.,
data = regtree.data3,
subset = training_index_Lv3,
method = "anova"
)
rpart.plot(reg_tree3,type=5, extra = 1)

For high-intensity crime, the 4 most significant variables were police, density, dwell_mid, and diversity.

Looking at the data at an aggregate level, I also visualized the regression tree of total crime, consisting of all crime across the different intensity levels (a sketch of this step is shown below).
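The code for the total-crime tree is not shown above; the following is a minimal sketch that follows the same pattern as the severity-level trees, assuming regtree.data keeps ytotal and drops the per-severity counts (these are the object names reused in the pruning, bagging and boosting steps below):

#Regression Tree for Total Crime (sketch)
regtree.data = subset(Crime_Data, select = -c(y1, y2, y3, Borough))
set.seed(7)
training_index = sample(1:nrow(regtree.data), nrow(regtree.data)*0.80)
training_set = regtree.data[training_index,] #dataset for training
testing_set = regtree.data[-training_index,] #dataset for testing
reg_tree = rpart(
formula = ytotal~.,
data = regtree.data,
subset = training_index,
method = "anova"
)
rpart.plot(reg_tree,type=5, extra = 1)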

Based on the results, the regression tree has identified police and minwage as significant predictor variables of total crime.

Pruning of the Regression Tree

In order to validate that the decision tree I plotted was the optimal one, I applied the printcp and plotcp functions to check for overfitting of the data. I identified which tree size would yield the lowest cross-validation error, given by the output “xerror”, and pruned my tree accordingly.

par(mfrow=c(1,2)) 
plotcp(reg_tree) #visualize cross-validation results
printcp(reg_tree) #display the results
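The prune() call itself is not shown in the original write-up; a minimal sketch of that step, selecting the complexity parameter (cp) with the lowest cross-validated error from the cp table:

#Prune back to the cp with the lowest cross-validation error
best_cp = reg_tree$cptable[which.min(reg_tree$cptable[, "xerror"]), "CP"]
pruned_tree = prune(reg_tree, cp = best_cp)
rpart.plot(pruned_tree, type=5, extra = 1)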

After pruning, my tree and the predictor variables identified did not change significantly — the pruned tree had 7 terminal nodes (versus the original 9 nodes), with the splits according to the same factors. This could be because the original tree was already maximally pruned in its initial construction. Indeed, the relative error of the tree decreased with a larger size, but plateaued at 7 terminal nodes. Hence, it is reasonable to assume that the best pruned tree has 7 terminal nodes.

In the use of classification trees, ROC curves and misclassification rates are usually used as indicators of a model’s accuracy. However, since total crime is a continuous rather than a discrete variable in this study, I compared the mean squared error of the models as the indicator of accuracy. My pruned regression tree model yields a mean squared residual error of 14,825,820. To decrease this error, I employed bagging, random forest and boosting methods.
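A minimal sketch of how such an error can be computed on the held-out data, assuming the pruned_tree and testing_set objects from the sketches above:

#Evaluate the pruned tree on the held-out 20%
pred_pruned = predict(pruned_tree, newdata = testing_set)
mean((testing_set$ytotal - pred_pruned)^2) #mean squared prediction error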

Bagging

Firstly, I used bagging (bootstrap aggregating), which generates additional training sets by resampling the original data with replacement. A tree is then fitted to each bootstrap sample, which decreases variance and improves prediction accuracy. I grew 500 trees, a standard choice that provides a sufficient number of trees for the bagging process. The tree from each individual bootstrap sample is grown deep and left unpruned, so it has low bias but high variance; averaging the trees reduces this variance, allowing for both low bias and low variance.

library(randomForest) #bagging and random forests
bagging.TotalCrime = randomForest(ytotal~., data=regtree.data, subset=training_index, mtry=10, ntree=500, importance=T)
bagging.TotalCrime #prints the mean of squared residuals

Using bagging, the new model now yielded a mean squared residual error of 9,019,557.

Random Forest

Secondly, to improve on bagging by de-correlating the trees, I used the random forest function. A tree is still built for each bootstrap sample, but this time a different, randomly selected subset of m predictors out of the p available is considered at each split, where m is roughly the square root of p.

rf.TotalCrime = randomForest(ytotal~., data=regtree.data, subset=training_index, mtry=5, ntree=500, importance=T)
rf.TotalCrime #To view MSE of the RF model

I then applied this to the testing data, obtaining a much lower mean squared residual error of 6,818,547.
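A minimal sketch of that evaluation step, again assuming the testing_set split from the total-crime sketch above:

#Evaluate the random forest on the held-out 20%
pred_rf = predict(rf.TotalCrime, newdata = testing_set)
mean((testing_set$ytotal - pred_rf)^2) #test-set mean squared error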

Boosting

Lastly, I applied the iterative technique of boosting, in which the model learns slowly from previously grown trees. The number of trees is a tuning parameter, and the construction of each tree depends strongly on the trees that have already been grown. Hence, I set the number of trees to 500 in order to prevent potential overfitting while still providing a large enough ensemble. The relative influence statistics highlighted police and minwage as the most important variables. Next, I produced partial dependence plots to illustrate the marginal effect of these variables on the crime rate after integrating out the other variables.

library(gbm) #gradient boosting machines
boost.TotalCrime = gbm(ytotal~., data=regtree.data[training_index,], distribution = "gaussian", n.trees=500, interaction.depth = 3)
summary(boost.TotalCrime) #relative influence of each variable
#Partial Dependence Plots
plot(boost.TotalCrime, i="police")
plot(boost.TotalCrime, i="minwage")

The model highlighted police and minwage as key variables, in line with the previous models. Using boosting, the model now yielded a mean squared residual error of 4,149,869.

Summary of Findings

To understand the key drivers of borough-level crime in London, I employed the following methods: linear regression, regression trees, bagging, random forests and boosting. Across all of these statistical methods, police, minwage, diversity and density were consistently highlighted as significant. Hence, these should be the focus for present and future crime reduction initiatives.

Comparing the mean squared residual errors across the different methods, the error fell from 14,825,820 for the pruned tree to 4,149,869 with boosting, an overall improvement of roughly 72%, with a non-trivial fall at each step. Thus, I can conclude that each method has played a significant role in refining the fit of the model to the data.

Police deployment resources are expected to continue falling over the next decade due to manpower shortages. The findings of this study suggest the need to increase the national budget and manpower allocation to ramp up police resources.

However, if there are constraints that render increased policing unfeasible, solutions may need to be found in the other key variables: dwelling quality, the number of people earning below the minimum wage, and ethnic diversity. Authorities can consider socioeconomic policies to foster job creation and provide better housing, or targeted measures to lift low-income individuals above the minimum wage. It may also be feasible to redeploy limited police resources to a few ‘hotspot’ boroughs selected according to these socioeconomic risk factors.

