The code above helps format the notebook so that outputs appear inline within the document.
1.0 Data Preprocessing
This loads all the modules used in this analysis. The os module provides a way to interact with the machine's operating system, and the glob module is used to retrieve the path names needed. Pandas, NumPy, Seaborn, Matplotlib.pyplot and Folium are modules used to interact with the dataset and plot visualisations.
import os
import glob
from os import path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium import plugins
sns.set()
We will modify the code below, adapted from https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/, to create a function that merges all 22 monthly files into one single file for each region.
To-do In this phase we prepare the dataset into a format suitable for analysis.
Warning!!!!! Since the document is copied from one machine to another, the paths specified in this document will not be the same as those on the user's machine, so ensure you use the right paths!
Details
- The libraries used are Pandas, os and glob, which need to be imported before use
- To use this chunk of code you will need two folders: one containing the South Wales data and the other holding the North Wales data. Copy the paths to these folders and keep them somewhere, as they are crucial to this program.
- When using this program you will be required to specify file paths; ensure you use the paths you have copied
- Before running the program below, delete any file with the same name as the output file you want to create, as the code will not merge if a file with that name already exists in the directory
- The merged files for both the North and South Wales data will be written to the specified paths
- Use the Python help method to see what the function does.
#define the function that takes two arguments: the path of the folder and the output file name
def combine_all(fpath, fname):
    '''This is a function that merges all the files in a directory and compares the row count of the
    combined file to the sum of the rows of all the component individual files. It is used to confirm the
    merged file is correct.'''
    #use the os module to switch to the specified directory
    os.chdir(fpath)
    #check if the file exists in the directory and merge only if it does not, using glob (to get file names) and the concat method
    if path.isfile(fname):
        print('The Merged file already exists in directory please delete first or proceed with existing file')
    else:
        #pick the file extension
        extension = 'csv'
        #assign all file names with the csv extension to a list
        all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
        #combine all files in the list
        combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
        #export to csv under the inputted file name
        combined_csv.to_csv(fname, index=False, encoding='utf-8-sig')
        #read the merged csv back in
        df = pd.read_csv(fname)
        #confirm the row count of the merged data
        tot_row_count = df['Reported by'].count()
        #get the individual counts of all the files and compare with the combined total
        file_row = []
        for i in all_filenames:
            if i != fname:
                file_count = pd.read_csv(i)
                row = file_count['Reported by'].count()
                file_row.append(row)
        file_row_count = sum(file_row)
        return tot_row_count, file_row_count
Warning!!!!! Ensure you insert the path to the folder that contains the North Wales dataset only!
#Merge all the Northern Wales files, change this directory to the directory where the files are
combine_all("/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Northwales",'North_wales_tot.csv')
The Merged file already exists in directory please delete first or proceed with existing file
Warning!!!!! Ensure you insert the path to the folder that contains the South Wales dataset only!
#Merge all the Southern Wales files, change this directory to the directory where the files are
combine_all("/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Southwales",'South_wales_tot.csv')
The Merged file already exists in directory please delete first or proceed with existing file
Outcome of Data Preprocessing
We now have a unified file for each of the regions being studied: the North and South Wales data for the 22 months under study (January 2019 to October 2020) has been merged. From the row counts returned above, South Wales has almost double the total crime count of North Wales.
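As a quick sanity check on the "almost double" claim, the ratio of the total row counts (taken from the shape checks printed later in this notebook) can be computed directly:

```python
# Row counts as reported by the shape checks in this notebook:
# North Wales 132,506 rows, South Wales 268,738 rows.
north_rows = 132506
south_rows = 268738
ratio = south_rows / north_rows
print(round(ratio, 2))  # ≈ 2.03, i.e. South Wales has roughly double the rows
```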
Data cleaning 1.0
To-do We have successfully merged our data and will now proceed with the data cleaning process.
Warning!!!!! Ensure you specify the paths where the files were merged using the combine_all function.
Details
- The library used is Pandas, which needs to be imported before use
- To use this chunk of code, ensure you specify the right paths when reading in the CSVs with the pandas read_csv method
#reading in the Combine North wales dataset and the combined Southwales dataset
#function takes the location of the files and the file output name
n_wales = pd.read_csv('/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Northwales/North_wales_tot.csv')
s_wales = pd.read_csv('/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Southwales/South_wales_tot.csv')
#checking the shape of the data and missing values
print("Shape of the North wales data is {}".format(n_wales.shape))
print(f"Number of missing values: {n_wales.isnull().sum().sum()}")
print("Shape of the South wales data is {}".format(s_wales.shape))
print(f"Number of missing values: {s_wales.isnull().sum().sum()}")
Shape of the North wales data is (132506, 12)
Number of missing values: 190634
Shape of the South wales data is (268738, 12)
Number of missing values: 427140
#getting the number of missing values for North Wales
n_wales.isnull().sum()
Crime ID 29048
Month 0
Reported by 0
Falls within 0
Longitude 8
Latitude 8
Location 0
LSOA code 8
LSOA name 8
Crime type 0
Last outcome category 29048
Context 132506
dtype: int64
#getting the number of missing values for South Wales
s_wales.isnull().sum()
Crime ID 69219
Month 0
Reported by 0
Falls within 0
Longitude 4991
Latitude 4991
Location 0
LSOA code 4991
LSOA name 4991
Crime type 0
Last outcome category 69219
Context 268738
dtype: int64
#viewing the first 3 rows of the North Wales data
n_wales.head(3)
#viewing the first 3 rows of the South Wales data
s_wales.head(3)
Observations from Data cleaning 1.0
- North Wales has 132,506 rows of data, with each row representing a crime committed in the region
- South Wales has 268,738 rows of data, around twice the rows recorded for North Wales, suggesting South Wales has around twice as many crimes committed
- The shape of the data is the same for both datasets, and the columns and number of columns match
- From viewing the first 3 rows of the North Wales and South Wales datasets we can see that each row represents a crime committed, with the Crime ID being the unique identifier for each row. The row captures the location of the crime, the month it occurred, the outcome and the context in which the crime occurred
- The Month column is used to aggregate the crimes committed monthly; the month in which a crime occurred is what is captured
- Crime ID, Last outcome category and Context are responsible for the missing values in both datasets, with the null values in the South Wales data higher than in the North, which is expected since South Wales has more rows of data
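The observations above treat each row as a distinct crime even though Crime ID is partly missing. A minimal sketch of how that assumption can be checked; the values below are made up for illustration, only the column names match the real dataset:

```python
import pandas as pd

# Toy frame standing in for n_wales (made-up rows, real column names).
df = pd.DataFrame({
    'Crime ID': ['a1', 'a2', None, 'a3', None],
    'Crime type': ['Burglary', 'Drugs', 'Anti-social behaviour',
                   'Robbery', 'Anti-social behaviour'],
})

# Count missing Crime IDs, and check the IDs that are present are unique.
n_missing = df['Crime ID'].isnull().sum()
n_dupes = df['Crime ID'].dropna().duplicated().sum()
print(n_missing, n_dupes)  # 2 missing IDs, 0 duplicated IDs
```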
Data cleaning 2.0
To-do We are comparing the crimes committed between the two police jurisdictions, so we will not need columns like Last outcome category, Context and Crime ID, since it is established that each row is a distinct crime.
Details
- The library used is Pandas which needs to be imported before use
- We will be dropping redundant columns from the dataset
#We will be dropping columns that would not impact the analysis using the drop method
n_wales = n_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
n_wales.head()
#We will be dropping columns that would not impact the analysis using the drop method
s_wales = s_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
s_wales.head()
#re-checking the shape of the data and missing values
print("Shape of the North wales data is {}".format(n_wales.shape))
print(f"Number of missing values: {n_wales.isnull().sum().sum()}")
print("Shape of the South wales data is {}".format(s_wales.shape))
print(f"Number of missing values: {s_wales.isnull().sum().sum()}")
Shape of the North wales data is (132506, 9)
Number of missing values: 32
Shape of the South wales data is (268738, 9)
Number of missing values: 19964
#show missing values
n_wales.isnull().sum()
Month 0
Reported by 0
Falls within 0
Longitude 8
Latitude 8
Location 0
LSOA code 8
LSOA name 8
Crime type 0
dtype: int64
#show missing values
s_wales.isnull().sum()
Month 0
Reported by 0
Falls within 0
Longitude 4991
Latitude 4991
Location 0
LSOA code 4991
LSOA name 4991
Crime type 0
dtype: int64
#confirm we have same columns in dataset of both regions
n_wales_columns = set(list(n_wales.columns))
s_wales_columns = set(list(s_wales.columns))
#sets have no guaranteed iteration order, so compare them directly rather than pairing with zip
assert n_wales_columns == s_wales_columns, \
    'There is a column in one of the regions not present in the other'
Observations from Data cleaning 2.0
To-do We will be fixing missing and null values
Details
- We removed 3 columns that would not contribute to our analysis and which held the bulk of the missing values. After removing these columns we checked for missing values again; although they have reduced, we can still notice missing values for Longitude, Latitude, LSOA code and LSOA name, which are location details.
- We asserted that we have the same columns and number of columns in both datasets, which means we can merge the datasets down the line. For now we use them individually so we can confirm they have similar distributions.
Data cleaning 3.0 : Fixing all missing values
In this stage of our data cleaning procedure we will fix all null values for the missing location details. From the analysis above we can see that although the other location details have missing values, the Location variable (column) has no missing values, so this will be used to fix the missing values in our dataset.
Details
- The library used is Pandas which needs to be imported before use
- We will be identifying and categorising rows with missing location details
#checking for Null values in the longitude column
s_wales[s_wales['Longitude'].isnull()]
#checking for Null values in the longitude column
n_wales[n_wales['Longitude'].isnull()]
#checking count of rows with "No location"
n_wales[n_wales['Location'] == 'No Location'].count()
Month 8
Reported by 8
Falls within 8
Longitude 0
Latitude 0
Location 8
LSOA code 0
LSOA name 0
Crime type 8
dtype: int64
#checking count of rows with "No location"
s_wales[s_wales['Location'] == 'No Location'].count()
Month 4991
Reported by 4991
Falls within 4991
Longitude 0
Latitude 0
Location 4991
LSOA code 0
LSOA name 0
Crime type 4991
dtype: int64
From the above we can see that although the Location column shows no missing values, its entries indicate that no location is available. We will leave this dataset as it is and group the crimes with no location as an unknown location. All the missing values have now been accounted for and we will proceed with the analysis.
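Grouping the 'No Location' crimes under an explicit unknown label could be sketched as follows; a minimal illustration on made-up rows (the notebook itself leaves the data unchanged):

```python
import pandas as pd

# Rows whose Location reads 'No Location' have no coordinates, so one option is
# to relabel them as an explicit 'Unknown location' category instead of dropping them.
df = pd.DataFrame({
    'Location': ['On or near High Street', 'No Location', 'On or near Park Road'],
    'Longitude': [-3.17, None, -3.18],
})
df['Location'] = df['Location'].replace('No Location', 'Unknown location')
print(df['Location'].tolist())
```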
n_wales['Month'].count()
132506
2.0 Data Wrangling and Normalisation
In this section of the analysis we will shape our data into formats suitable for analysis. We will create individual tables for the two regions and also merge the dataframes into one for analytical purposes.
The dataset to be used in this step has 9 columns. We can divide these columns into two groups.
- Columns that show crime details (Month, Reported by, Crime type)
- Columns that show crime location details (Falls within, Longitude, Latitude, Location, LSOA code, LSOA name)
The Month column contains the month the crime was committed; a row of the data is a crime committed in the month it was recorded. Most crimes have recorded location details, but some do not. We will not be dealing with crime-specific locations, so these will not be part of the columns wrangled in this step.
Reference for Month Number to Month name conversion in code below - https://stackoverflow.com/questions/37625334/python-pandas-convert-month-int-to-month-name/37625467
We will normalise the data using population: since our data is a region-based dataset, population might influence our outcome. Formula = (Total crime count per region / region population) x 10,000
That means that the numbers shown in the normalised data represent the total number of crimes per 10,000 people.
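A worked example of the normalisation, using the South Wales population figure that appears in the code below (299,239):

```python
# Hedged sketch: normalise a raw crime count to crimes per 10,000 people.
def normalise(count, population, per=10_000):
    return count / population * per

# The South Wales median monthly count of 12,066.5 (from the describe() output
# later in the notebook) works out to about 403.24 crimes per 10,000 people.
rate = normalise(12_066.5, 299_239)
print(round(rate, 2))
```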
Step 1: Creating dataset for count of Crime type by month and season
In this step we will create the s_wales_grouped, n_wales_grouped and n_s_wales_grouped dataframes. Most of the wrangling is done using built-in methods and attributes of the Pandas module. In the dataset created in this step, each row is a crime type, its normalised crime count per 10,000 people and the corresponding month. This was done to give a way to compare the crime types across different time periods. s_wales_grouped contains the South Wales data while n_wales_grouped contains the North Wales data. n_s_wales_grouped concatenates both dataframes into one.
steps used
- For each region the crime count was grouped by Month and crime type to give a dataset that shows the count of a particular crime type for a given month.
- The output of the grouping resulted in two Month columns, so we dropped one of them
- The column names were then renamed to sensible names that describe the datapoints in the column
- We then reset the index to remove the Month column as an index, since we need it in our dataset
- We then used a string literal to tag each row with the region it represents
- The string split method was used to split the Month column into the year and the month number, which was then formatted to the corresponding month name
- A seasons mapping dictionary was created to hold the different months and the corresponding season each falls into; this dictionary was then used to categorise the months into the four seasons using a lambda function and the pandas apply method
- A dictionary covid_mapping was created to group the dataset into pre-Covid and Covid periods and was applied to the dataset with the same technique as above
- A crime dictionary was also created to encode each crime type with up to 3 characters. This was then used to create a Crime code column, a code that represents each crime type
- The crime count was then normalised into total crimes per 10,000 people using the formula (Total crime count per region / region population) x 10,000
- Finally, the two datasets created for North and South Wales were concatenated into one single dataset.
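The steps above can be sketched on a few made-up rows. This is a minimal illustration only; `.size()` and `.map` are used here as compact alternatives to the notebook's count/apply calls, and the seasons mapping is truncated:

```python
import pandas as pd

# Toy rows standing in for s_wales (two months, two crime types).
df = pd.DataFrame({
    'Month': ['2019-01', '2019-01', '2019-07'],
    'Crime type': ['Burglary', 'Burglary', 'Drugs'],
})

# Group by month and crime type; .size() avoids the duplicate Month column
# that the notebook's count() produces and then drops.
grouped = (df.groupby(['Month', 'Crime type'])
             .size()
             .reset_index(name='Crime count'))
# Split the Month string and convert the month number to a three-letter name.
grouped[['Year', 'Month Num']] = grouped['Month'].str.split('-', expand=True)
grouped['Month Name'] = (pd.to_datetime(grouped['Month Num'], format='%m')
                           .dt.month_name().str.slice(stop=3))
# Map month names to seasons (truncated mapping for this sketch).
seasons_mapping = {'Jan': 'Winter', 'Jul': 'Summer'}
grouped['season'] = grouped['Month Name'].map(seasons_mapping)
print(grouped[['Month Name', 'Crime type', 'Crime count', 'season']])
```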
#Grouping data using the groupby method to group the data by month and crime type for a particular region
#southwales data grouped and count of crime type derived
s_wales_temp = s_wales[['Month', 'Crime type']].groupby([s_wales['Month'], s_wales['Crime type']]).count()
#dropping the Month column
s_wales_temp = s_wales_temp.drop('Month',axis=1)
#renaming the crime type to column to crime count
s_wales_grouped = s_wales_temp.rename(columns={"Crime type": "Crime count"})
#resetting index to remove Crime type from being index
s_wales_grouped.reset_index(inplace=True)
#using string literal to assign Region name to dataframe
s_wales_grouped['Region'] = 'South Wales'
#splitting the month column into the year and the month number
s_wales_grouped[['Year','Month Name']] = s_wales_grouped.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
s_wales_grouped['Month Name'] = pd.to_datetime(s_wales_grouped['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#Creating a new column called season to group months into seasons
#create variable mapping for Months
seasons_mapping = {'Jan':'Winter','Dec':'Winter','Feb':'Winter','Mar':'Spring',\
'Apr':'Spring','May':'Spring','Jun':'Summer','Jul':'Summer','Aug':'Summer',\
'Sep':'Autumn','Oct':'Autumn','Nov':'Autumn'}
#use lambda function to apply function across the column top down
s_wales_grouped['season']= s_wales_grouped['Month Name'].apply(lambda x:seasons_mapping[x])
#create column to group months into pre covid and covid periods
#according to timeline in link https://bfpg.co.uk/2020/04/covid-19-timeline/ covid first patient in UK was Jan
# We will assume 2019 is pre covid and 2020 is covid
covid_mapping = {'2019':'Pre-Covid','2020':'Covid'}
#use lambda function to apply function across the column top down
s_wales_grouped['Covid']= s_wales_grouped['Year'].apply(lambda x:covid_mapping[x])
#creating a key for the crime types
crime_mapping = {'Other theft':'OT', 'Theft from the person':'TFP',
'Violence and sexual offences':'VSO', 'Anti-social behaviour':'ASB',
'Burglary':'BUG', 'Criminal damage and arson':'CDA', 'Drugs':'DRG', 'Public order':'PO',
'Other crime':'OC', 'Possession of weapons':'POW', 'Robbery':'ROB', 'Vehicle crime':'VC',
'Shoplifting':'SPL', 'Bicycle theft':'BT'}
#use lambda function to apply function across the column top down
s_wales_grouped['Crime code']= s_wales_grouped['Crime type'].apply(lambda x:crime_mapping[x])
#since our data is location based we will use the population of each region to normalise the data
s_wales_grouped['Norm Crime count']= s_wales_grouped['Crime count'].apply(lambda x:(x/299239) * 10000)
#Northwales data grouped and count of crime type derived
n_wales_temp = n_wales[['Month', 'Crime type']].groupby([n_wales['Month'], n_wales['Crime type']]).count()
n_wales_temp = n_wales_temp.drop('Month',axis=1)
n_wales_grouped = n_wales_temp.rename(columns={"Crime type": "Crime count"})
#resetting index to remove Crime type from being index
n_wales_grouped.reset_index(inplace=True)
#using string literal to assign Region name to dataframe
n_wales_grouped['Region'] = 'North Wales'
#splitting the month column into the year and the month number
n_wales_grouped[['Year','Month Name']] = n_wales_grouped.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
n_wales_grouped['Month Name'] = pd.to_datetime(n_wales_grouped['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#Creating a new column called season to group months into seasons
#create variable mapping for Months
#use lambda function to apply function across the column top down
n_wales_grouped['season']= n_wales_grouped['Month Name'].apply(lambda x:seasons_mapping[x])
#create column to group months into pre covid and covid periods
#according to timeline in link https://bfpg.co.uk/2020/04/covid-19-timeline/ covid first patient in UK was Jan
# We will assume 2019 is pre covid and 2020 is covid
#use lambda function to apply function across the column top down
n_wales_grouped['Covid']= n_wales_grouped['Year'].apply(lambda x:covid_mapping[x])
#use lambda function to apply function across the column top down
n_wales_grouped['Crime code']= n_wales_grouped['Crime type'].apply(lambda x:crime_mapping[x])
#since our data is location based we will use the population of each region to normalise the data
n_wales_grouped['Norm Crime count']= n_wales_grouped['Crime count'].apply(lambda x:(x/70073) * 10000)
#concatenating the two regions dataset
n_s_wales_grouped = pd.concat([n_wales_grouped,s_wales_grouped])
n_wales_grouped.head()
Step 2: Dataset for monthly total crime count by region
This dataset was created to hold the total monthly crime count of each region, separately and also in a single dataframe. The crime count was then normalised using the formula (Total crime count per region / region population) x 10,000. This basically holds the monthly distribution of crime in North and South Wales. The pandas framework and its methods were used extensively in this step.
steps used
- The total crime count was grouped by month and the columns were renamed using the rename method in pandas
- A string literal was then used to assign the region to each row
- The data was normalised with the North and South Wales populations, multiplied by 10,000 to show crimes per 10,000 people
- The index was reset to remove the month as the index
- The string split method was used to split the Month column into the year and the month number, which was then formatted to the corresponding month name
- The wrangled datasets for the two regions were then concatenated into one dataframe.
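The recipe above can be sketched on a few made-up rows; 70,073 is the North Wales population figure used in the notebook:

```python
import pandas as pd

# Toy rows standing in for n_wales.
df = pd.DataFrame({
    'Month': ['2019-01', '2019-01', '2019-02'],
    'Crime type': ['Burglary', 'Drugs', 'Burglary'],
})
# Count crimes per month, rename the count column, then normalise per 10,000 people.
dist = df[['Crime type', 'Month']].groupby('Month').count()
dist = dist.rename({'Crime type': 'Monthly Crime count'}, axis=1)
dist['Norm Monthly Crime'] = dist['Monthly Crime count'] / 70073 * 10000
dist.reset_index(inplace=True)
print(dist)
```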
#getting the monthly total crime distribution of crime in north wales
#group crime count by month using group method
dist_n_wales = n_wales[['Crime type','Month']].groupby('Month').count()
#rename columns using rename method
dist_n_wales = dist_n_wales.rename({'Crime type': 'Monthly Crime count'}, axis=1)
#assign region to data
dist_n_wales['Region'] = 'North Wales'
#normalise data using population to crime per 10,000 people
dist_n_wales['Norm Monthly Crime']= dist_n_wales['Monthly Crime count'].apply(lambda x:(x/70073) * 10000)
#reset index
dist_n_wales.reset_index(inplace=True)
#split the Month column into year and month name
dist_n_wales[['Year','Month Name']] = dist_n_wales.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
dist_n_wales['Month Name'] = pd.to_datetime(dist_n_wales['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#getting the monthly total crime distribution of crime in south wales
#group crime count by month using group method
dist_s_wales = s_wales[['Crime type','Month']].groupby('Month').count()
#rename columns using rename method
dist_s_wales = dist_s_wales.rename({'Crime type': 'Monthly Crime count'}, axis=1)
#assign region to data
dist_s_wales['Region'] = 'South Wales'
#normalise data using population to crime per 10,000 people
dist_s_wales['Norm Monthly Crime']= dist_s_wales['Monthly Crime count'].apply(lambda x:(x/299239) * 10000)
#reset index
dist_s_wales.reset_index(inplace=True)
#split the Month column into year and month name
dist_s_wales[['Year','Month Name']] = dist_s_wales.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
dist_s_wales['Month Name'] = pd.to_datetime(dist_s_wales['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#concatenating both dataframes into one table using the Concat method in pandas
Monthly_crime_count_tot = pd.concat([dist_n_wales,dist_s_wales])
Monthly_crime_count_tot.head()
Step 3: Dataset for percentage of a crime type in a particular region
This dataset was created to hold the total crime count per crime type for both South and North Wales; we then create a unified dataframe to hold both datasets.
steps used
- We used the value_counts method of pandas to get the counts of the unique crime types present in our dataset, then used the dict method to convert this into a dictionary
- The pandas DataFrame method was chained with the from_dict method to create a dataframe from the dictionary
- The columns attribute was used to name the columns of our new dataset
- We then derived the crime percentage per crime type, but this was later dropped as it was not used
- We normalised our data with the North and South Wales populations, multiplied by 10,000 to show crimes per 10,000 people
- Finally, the two datasets created were concatenated into one.
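A minimal sketch of the same idea on toy data. Note that chaining rename_axis and reset_index builds the same table without the intermediate dictionary; this is offered as an alternative to the notebook's dict/from_dict route, not a change to it:

```python
import pandas as pd

# Toy crime-type series; value_counts already returns the per-type totals.
s = pd.Series(['Drugs', 'Burglary', 'Drugs', 'Drugs'], name='Crime type')
crime_cnt = (s.value_counts()
               .rename_axis('Crime type')
               .reset_index(name='Tot_cnt'))
# Percentage of each crime type, as in the notebook's Percent column.
crime_cnt['Percent'] = crime_cnt['Tot_cnt'] / crime_cnt['Tot_cnt'].sum() * 100
print(crime_cnt)
```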
#crime type by region
#Creating the total rate of crime in North Wales grouped by the crime types
#get the count of each crime type and then use dict method to convert it into a dictionary
crime_cnt_n_wales = dict(n_wales['Crime type'].value_counts())
#Use pandas to convert the dictionary into a dataframe
crime_cnt_n_wales = pd.DataFrame.from_dict(crime_cnt_n_wales,orient='index').reset_index()
#rename the new columns using the pandas columns attribute
crime_cnt_n_wales.columns = ['Crime type','Tot_cnt']
#use string literal to assign region
crime_cnt_n_wales['Region']='North Wales'
#calculate crime percentage of each crime type
crime_cnt_n_wales['Percent'] = (crime_cnt_n_wales['Tot_cnt'] / crime_cnt_n_wales['Tot_cnt'].sum()) * 100
#normalise data using population
crime_cnt_n_wales['norm']= crime_cnt_n_wales['Tot_cnt'].apply(lambda x:(x/70073) * 10000)
#Creating the total rate of crime in South Wales grouped by the crime types
#get the count of each crime type and then use dict method to convert it into a dictionary
crime_cnt_s_wales = dict(s_wales['Crime type'].value_counts())
#Use pandas to convert the dictionary into a dataframe
crime_cnt_s_wales = pd.DataFrame.from_dict(crime_cnt_s_wales,orient='index').reset_index()
#rename the new columns using the pandas columns attribute
crime_cnt_s_wales.columns = ['Crime type','Tot_cnt']
#use string literal to assign region
crime_cnt_s_wales['Region']='South Wales'
#calculate crime percentage of each crime type
crime_cnt_s_wales['Percent'] = (crime_cnt_s_wales['Tot_cnt'] / crime_cnt_s_wales['Tot_cnt'].sum()) * 100
#normalise data using population
crime_cnt_s_wales['norm']= crime_cnt_s_wales['Tot_cnt'].apply(lambda x:(x/299239) * 10000)
#concatenating the total rate of crime in South and North Wales grouped by the crime types
crime_cnt_tot = pd.concat([crime_cnt_n_wales,crime_cnt_s_wales])
#drop the percentage column
crime_cnt_tot =crime_cnt_tot.drop(['Percent'], axis=1)
crime_cnt_tot.head()
Outputs From Wrangling
- We need the data to be grouped by two variables, the month of the observation and the crime type, and then the count of the crime type taken. The output of the wrangling in Step 1 above produced a dataframe which satisfies this requirement
- A dataframe was needed to hold the distribution of the monthly measurement by the two regions. We have three outputs from this step: the monthly data for South Wales, for North Wales, and a unified table holding all the monthly crimes grouped by region
- The dataset was normalised using the population of each region. The output gives the crime rate per 10,000 people in both regions
Output dataframes (tables)
- n_s_wales_grouped: a combined dataframe of the North and South Wales datasets, with crime counts grouped by crime type
- s_wales_grouped: a dataframe of the South Wales dataset, with crime counts grouped by crime type
- n_wales_grouped: a dataframe of the North Wales dataset, with crime counts grouped by crime type
- dist_s_wales: the monthly crime count in the South Wales police district
- dist_n_wales: the monthly crime count in the North Wales police district
- Monthly_crime_count_tot: the combined monthly crime count in the North and South Wales police districts
- crime_cnt_n_wales: the total crime count grouped by crime type for the North Wales region
- crime_cnt_s_wales: the total crime count grouped by crime type for the South Wales region
- crime_cnt_tot: the total crime count grouped by crime type, combined for the North and South Wales regions
3.0 Data Exploration
In this section we will start exploring our dataset graphically and also produce some summary statistics to understand the data. We will show plots to visualise our data.
To-do Exploring the dataset using the different dataframes created during the wrangling phase
Details
- The libraries used are Pandas, Numpy, Seaborn and Matplotlib.pyplot, which need to be imported before use
- To use this chunk of code you will need the dataframes listed in the wrangling phase above
- Each section states the analysis question being answered, and each section has an output summarising the results
- Ensure the function defined in a section (for example police_patrol) is called, otherwise its plots will not display
- The section starts with summary statistics, which give an overview of the data
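As a reminder of what describe() reports, a tiny made-up series of monthly counts (the real summary statistics follow below):

```python
import pandas as pd

# describe() reports count, mean, std, min, quartiles and max for a series.
counts = pd.Series([10322, 12066, 15541])
summary = counts.describe()
print(summary['mean'], summary['min'], summary['max'])
```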
#this dataset shows the monthly total crime count for South Wales
print(dist_s_wales.describe())
Monthly Crime count Norm Monthly Crime
count 22.000000 22.000000
mean 12215.363636 408.214291
std 1235.897112 41.301338
min 10322.000000 344.941669
25% 11271.250000 376.663804
50% 12066.500000 403.239551
75% 13000.250000 434.443706
max 15541.000000 519.350753
- The standard deviation shows that on average the monthly crime count varies by around ±1,236 crimes from the mean, or around ±41 crimes per 10,000 people once population is factored in
- The month with the least crime had about 10,322 crimes, which is around 345 crimes per 10,000 people
- The month with the most crime had 15,541 crimes, around 519 crimes per 10,000 people
print(dist_n_wales.describe())
Monthly Crime count Norm Monthly Crime
count 22.000000 22.000000
mean 6023.000000 859.532202
std 468.567828 66.868527
min 5348.000000 763.204087
25% 5675.750000 809.976739
50% 5991.500000 855.036890
75% 6380.250000 910.514749
max 6964.000000 993.820730
- The standard deviation shows that on average the monthly crime count varies by around ±469 crimes from the mean, or around ±67 crimes per 10,000 people once population is factored in
- The month with the least crime had about 5,348 crimes, which is around 763 crimes per 10,000 people
- The month with the most crime had 6,964 crimes, around 994 crimes per 10,000 people
- Although the overall crime count in South Wales is double the figures for North Wales, North Wales has a higher crime rate per 10,000 people than South Wales
- South Wales has a smaller normalised standard deviation than North Wales, showing that its monthly values sit closer to its mean than North Wales's values do to theirs
#South Wales Crime count per crime type
crime_cnt_s_wales.value_counts()
Crime type Tot_cnt Region Percent norm
Violence and sexual offences 77116 South Wales 28.695607 2577.070502 1
Vehicle crime 13155 South Wales 4.895102 439.615157 1
Theft from the person 1886 South Wales 0.701799 63.026544 1
Shoplifting 15989 South Wales 5.949661 534.322064 1
Robbery 945 South Wales 0.351644 31.580108 1
Public order 24342 South Wales 9.057893 813.463486 1
Possession of weapons 1350 South Wales 0.502348 45.114440 1
Other theft 15967 South Wales 5.941475 533.586865 1
Other crime 4057 South Wales 1.509649 135.577248 1
Drugs 8815 South Wales 3.280146 294.580586 1
Criminal damage and arson 22278 South Wales 8.289859 744.488519 1
Burglary 10363 South Wales 3.856172 346.311811 1
Bicycle theft 3256 South Wales 1.211589 108.809346 1
Anti-social behaviour 69219 South Wales 25.757057 2313.167735 1
dtype: int64
#North Wales Crime count per crime type
crime_cnt_n_wales.value_counts()
Crime type Tot_cnt Region Percent norm
Violence and sexual offences 49913 North Wales 37.668483 7123.000300 1
Vehicle crime 2960 North Wales 2.233861 422.416623 1
Theft from the person 335 North Wales 0.252819 47.807287 1
Shoplifting 6698 North Wales 5.054865 955.860317 1
Robbery 372 North Wales 0.280742 53.087494 1
Public order 11008 North Wales 8.307548 1570.933170 1
Possession of weapons 603 North Wales 0.455074 86.053116 1
Other theft 7856 North Wales 5.928788 1121.116550 1
Other crime 2171 North Wales 1.638416 309.819759 1
Drugs 2683 North Wales 2.024814 382.886418 1
Criminal damage and arson 12842 North Wales 9.691637 1832.660226 1
Burglary 5280 North Wales 3.984725 753.499922 1
Bicycle theft 737 North Wales 0.556201 105.176031 1
Anti-social behaviour 29048 North Wales 21.922026 4145.391235 1
dtype: int64
- The top 5 crimes in North and South Wales come from similar crime types
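That similarity can be checked directly from the two tables: the sketch below transcribes the Tot_cnt figures printed above and intersects the top-five lists.

```python
import pandas as pd

# totals transcribed from the crime-type tables above (top six of each region)
south = pd.Series({"Violence and sexual offences": 77116, "Anti-social behaviour": 69219,
                   "Public order": 24342, "Criminal damage and arson": 22278,
                   "Shoplifting": 15989, "Other theft": 15967})
north = pd.Series({"Violence and sexual offences": 49913, "Anti-social behaviour": 29048,
                   "Criminal damage and arson": 12842, "Public order": 11008,
                   "Other theft": 7856, "Shoplifting": 6698})

# crime types that appear in both regions' top five
shared = set(south.nlargest(5).index) & set(north.nlargest(5).index)
print(sorted(shared))
```

Four of the five top crime types overlap; only the fifth place differs (Shoplifting in the South, Other theft in the North).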
def police_patrol():
'''This function shows the top 10 locations within the different regions that have the highest crime count;
this is to recommend locations in need of frequent police patrol to curb crime in the area.'''
# define a temporary variable to hold the count of crimes grouped by crime location
n_policing_temp = n_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
n_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
n_policing = n_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= n_policing)\
.set_title("Top ten crime locations in North Wales")
plt.show()
# define a temporary variable to hold the count of crimes grouped by crime location
s_policing_temp = s_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
s_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
s_policing = s_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= s_policing)\
.set_title("Top ten crime locations in South Wales")
plt.show()
#call help method on this function to show what it does to the user
help(police_patrol)
police_patrol()
Help on function police_patrol in module __main__:
police_patrol()
This function shows the top 10 locations within the different regions that have the highest crime count;
this is to recommend locations in need of frequent police patrol to curb crime in the area.
To answer this question we will use the n_s_wales_grouped dataframe. We want to know the most common crime types in South and North Wales and see whether the two regions share similar crime types. We will use Seaborn's FacetGrid method, which plots a bar chart for every month in a grid: the rows show the month while the columns show the region currently being analysed.
def Common_crimes():
'''This function creates a grid plot using seaborn showing a monthly barplot of the crime rate for
each crime type. The rows show the month while the columns show the region.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and this is where we select dataset to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Month', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order )
grid.fig.subplots_adjust(top=0.96)
#adding the title of the plot
_ = grid.fig.suptitle('Most common crime types per 10,000 people in North and South Wales', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Common_crimes)
#show the crime codes
print(crime_mapping)
Common_crimes()
Help on function Common_crimes in module __main__:
Common_crimes()
This function creates a grid plot using seaborn showing a monthly barplot of the crime rate for
each crime type. The rows show the month while the columns show the region.
{'Other theft': 'OT', 'Theft from the person': 'TFP', 'Violence and sexual offences': 'VSO', 'Anti-social behaviour': 'ASB', 'Burglary': 'BUG', 'Criminal damage and arson': 'CDA', 'Drugs': 'DRG', 'Public order': 'PO', 'Other crime': 'OC', 'Possession of weapons': 'POW', 'Robbery': 'ROB', 'Vehicle crime': 'VC', 'Shoplifting': 'SPL', 'Bicycle theft': 'BT'}
- From the grid plot above we can see that across all the months and in both regions Violence and sexual offences(VSO) and Anti-social behaviour(ASB) have the highest crime rate for both South and North Wales. We have similar trends in crime type for both North and South Wales.
- North Wales has more than twice the rate of Violence and sexual offences per 10,000 people as South Wales, although South Wales has the higher total count of these crimes.
We will use the n_s_wales_grouped dataframe to examine the effects of the Covid lockdown on crime and see whether any crime type took a dip during the lockdown. We will use the normalized crime count, so we are looking at crime per 10,000 people in North and South Wales. The seaborn FacetGrid method is used again: the rows show pre-Covid and during Covid while the columns show the region.
def Covid_crimes():
'''This function displays a barplot comparing the crime rate before Covid and during Covid.
It was created using the seaborn FacetGrid method'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Covid', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order2 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order2)
grid.fig.subplots_adjust(top=0.8)
#adding the title of the plot
_ = grid.fig.suptitle('Difference in crime rate before and during COVID 19 per 10,000 people', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Covid_crimes)
#show the crime codes
print(crime_mapping)
Covid_crimes()
Help on function Covid_crimes in module __main__:
Covid_crimes()
This function displays a barplot comparing the crime rate before Covid and during Covid.
It was created using the seaborn FacetGrid method
{'Other theft': 'OT', 'Theft from the person': 'TFP', 'Violence and sexual offences': 'VSO', 'Anti-social behaviour': 'ASB', 'Burglary': 'BUG', 'Criminal damage and arson': 'CDA', 'Drugs': 'DRG', 'Public order': 'PO', 'Other crime': 'OC', 'Possession of weapons': 'POW', 'Robbery': 'ROB', 'Vehicle crime': 'VC', 'Shoplifting': 'SPL', 'Bicycle theft': 'BT'}
Reduced during Covid
- From the plot above we can see that Burglary fell to almost half of its pre-Covid level.
- Criminal damage and arson also fell slightly during Covid compared with the period before.
- Robbery reduced to almost non-existence during Covid.
- Shoplifting and Vehicle crime fell slightly during Covid.
Increased during Covid
- We saw an increase in Anti-social behaviour during Covid, which may be due to the new social regulations around the Covid lockdown.
Remained the same
- Drugs, Other crime, Other theft, Possession of weapons, Theft from the person, Violence and sexual offences and Bicycle theft all remained almost unchanged during Covid.
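The bullet points above can be quantified with a percent-change table. A sketch on a toy frame whose column names mirror n_s_wales_grouped (the values and the 'Pre'/'Covid' labels are invented for illustration):

```python
import pandas as pd

# toy data; column names mirror n_s_wales_grouped, the values are invented
toy = pd.DataFrame({
    "Covid": ["Pre", "Pre", "Covid", "Covid"] * 2,
    "Crime code": ["BUG", "ROB"] * 4,
    "Norm Crime count": [40, 10, 22, 2, 38, 9, 20, 1],
})

# mean rate per crime type before and during Covid, then the percent change
change = toy.groupby(["Crime code", "Covid"])["Norm Crime count"].mean().unstack()
change["pct_change"] = (change["Covid"] - change["Pre"]) / change["Pre"] * 100
print(change.round(1))
```

A strongly negative pct_change corresponds to a bar that shrank in the grid plot, so the table gives a numeric reading of the visual comparison.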
This analysis checks for differences in the Burglary rate between summer and winter. We will use the n_s_wales_grouped dataframe and a FacetGrid: the rows show the season while the columns show the region.
def Seasonal_crimes():
'''This function displays a barplot comparing the crime rates grouped by season.
It was created using the seaborn FacetGrid method'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='season', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order3 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order=crime_order3)
plt.show()
#call help method on this function to show what it does to the user
help(Seasonal_crimes)
#show the crime codes
print(crime_mapping)
Seasonal_crimes()
Help on function Seasonal_crimes in module __main__:
Seasonal_crimes()
This function displays a barplot comparing the crime rates grouped by season.
It was created using the seaborn FacetGrid method
{'Other theft': 'OT', 'Theft from the person': 'TFP', 'Violence and sexual offences': 'VSO', 'Anti-social behaviour': 'ASB', 'Burglary': 'BUG', 'Criminal damage and arson': 'CDA', 'Drugs': 'DRG', 'Public order': 'PO', 'Other crime': 'OC', 'Possession of weapons': 'POW', 'Robbery': 'ROB', 'Vehicle crime': 'VC', 'Shoplifting': 'SPL', 'Bicycle theft': 'BT'}
To find the relationship between the crime rate in North and South Wales we will be using the scatterplot feature of Seaborn.
def crime_corr():
'''This function displays the correlation between the normalized crime rates of North and South Wales,
this was created using the scatterplot method of the seaborn library'''
#splice the required columns of South wales data into a variable
scat_s = dist_s_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_s =scat_s.rename(columns={"Norm Monthly Crime": "s_wales_cnt"})
#splice the required columns of North wales data into a variable
scat_n = dist_n_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_n =scat_n.rename(columns={"Norm Monthly Crime": "n_wales_cnt",'Month':'Month2'})
#concatenate the north and south wales data into one dataframe
scatter_w = pd.concat([scat_s, scat_n], axis=1)
#reset the index to change the current index
scatter_w = scatter_w.reset_index()
#drop one of the duplicated months
scatter_w = scatter_w.drop(['Month2'], axis=1)
#use seaborn to plot a scatterplot for crime rate and set title of plot
_ = sns.scatterplot(data=scatter_w, x="s_wales_cnt", y="n_wales_cnt").\
set_title("Correlation Plot for North and South Wales crime rate")
plt.show()
#call help method on this function to show what it does to the user
help(crime_corr)
crime_corr()
Help on function crime_corr in module __main__:
crime_corr()
This function displays the correlation between the normalized crime rates of North and South Wales,
this was created using the scatterplot method of the seaborn library
To check for the region with the highest crime rate per 10,000 people we will be using the swarmplot feature of seaborn which helps plot the normalized crime rate for each month and separates them into regions.
def Highest_crime():
'''This is a bee swarm plot showing the normalized crime rate in each region, i.e. crime rate per 10,000
people in North and South Wales. This was created using the seaborn library'''
#insert the required columns into the method and select the data to be used
_ = sns.swarmplot(x = 'Region', y = 'Norm Monthly Crime', data= Monthly_crime_count_tot).\
set_title("Swarm Plot for North and South Wales crime rate per 10,000 people")
#create the x axis label
_ = plt.xlabel('Region')
#create the y axis label
_ = plt.ylabel('Crime rate per 10,000 people')
plt.show()
#call help method on this function to show what it does to the user
help(Highest_crime)
Highest_crime()
Help on function Highest_crime in module __main__:
Highest_crime()
This is a bee swarm plot showing the normalized crime rate in each region, i.e. crime rate per 10,000
people in North and South Wales. This was created using the seaborn library
We will use a time series plot to show the evolution of crime rate over time, plotting the normalized crime count from the dist_n_wales and dist_s_wales dataframes with the pandas plot method.
def crime_evo():
'''This function shows the evolution of crime over time in North and South Wales. This function
uses the normalized dataset which means it shows crime per 10,000 people in both regions'''
#splice the required columns from the dataframe and set index for north data
n_plot = dist_n_wales[['Month','Norm Monthly Crime']].set_index('Month')
#splice the required columns from the dataframe and set index for south data
s_plot = dist_s_wales[['Month','Norm Monthly Crime']].set_index('Month')
#assign the northern plot to a variable
ax = n_plot.plot()
#use the northern variable to create the southern plot
s_plot.plot(ax=ax)
#define the legend for the plot
ax.legend(["North Wales", "South Wales"])
plt.show()
#call help method on this function to show what it does to the user
help(crime_evo)
crime_evo()
Help on function crime_evo in module __main__:
crime_evo()
This function shows the evolution of crime over time in North and South Wales. This function
uses the normalized dataset which means it shows crime per 10,000 people in both regions
The plot above shows that crime per 10,000 people in North Wales is taking a dip while the crime rate in South Wales is gradually rising.
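The dip and rise can also be quantified by fitting a straight line to each monthly series; a sketch on synthetic stand-ins (the real inputs would be the Norm Monthly Crime columns of dist_n_wales and dist_s_wales):

```python
import numpy as np

# synthetic stand-ins for 22 months of normalized crime rates
months = np.arange(22)
north = 900.0 - 3.0 * months   # declining, like the North Wales series
south = 700.0 + 2.0 * months   # rising, like the South Wales series

# slope of the least-squares line: negative means a dip, positive a rise
slope_n = np.polyfit(months, north, 1)[0]
slope_s = np.polyfit(months, south, 1)[0]
print(round(slope_n, 2), round(slope_s, 2))  # -3.0 2.0
```

The sign and size of the fitted slope give a single number for each region's trend, which is easier to compare than eyeballing the two lines.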
def pop_crime_type():
'''This function displays pie charts of the most popular crime types for each region. It was created
using the plot.pie method in pandas'''
#plot the crime types of North wales on a piechart
plot = crime_cnt_n_wales.set_index('Crime type').plot.pie(y='Percent',figsize=(15, 15))
#plot the crime types of south wales on a piechart
plot = crime_cnt_s_wales.set_index('Crime type').plot.pie(y='Percent',figsize=(15, 15))
plt.show()
pop_crime_type()
4.0 Statistical Testing
Several statistical tests will be performed in this section using the scipy.stats module; we will use different functions from the module.
Research Question 1.0 Is the monthly crime rate for North and South Wales normally distributed?
To-do We will test whether the crime rates for North and South Wales follow a normal distribution.
Details
- We will be using the Shapiro-Wilk test to check for normality.
- The scipy.stats module is needed and we need to import the shapiro function from it
- The function takes a 1-dimensional dataset as input.
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
#is the Monthly crime rate for north and south wales normal?
#Interpretation
#import shapiro from scipy module
from scipy.stats import shapiro
#assign the required column to a variable
data_s = dist_s_wales['Norm Monthly Crime']
#assign the output of the test to two variables
stat, p = shapiro(data_s)
#print out the stat and p-value
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level we do not have enough evidence to reject the null hypothesis\
that the South wales dataset has a Gaussian distribution')
else:
print('At a 5% significance level we have enough evidence to reject the null hypothesis\
and accept alternative hypothesis that the South wales dataset does not have a Gaussian distribution.')
stat=0.948, p=0.285
At a 5% significance level we do not have enough evidence to reject the null hypothesis that the South wales dataset has a Gaussian distribution
dist_n_wales
from scipy.stats import shapiro
data_n = dist_n_wales['Norm Monthly Crime']
stat, p = shapiro(data_n)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level we do not have enough evidence to reject the null hypothesis\
that the North Wales dataset has a Gaussian distribution')
else:
print('At a 5% significance level we have enough evidence to reject the null hypothesis\
and accept the alternative hypothesis that the North Wales dataset does not have a Gaussian distribution.')
stat=0.942, p=0.217
At a 5% significance level we do not have enough evidence to reject the null hypothesis that the North Wales dataset has a Gaussian distribution
Research Question 2.0 What type of relationship exists between the North and South Wales crime rates?
To-do We will test the relationship between the North and South Wales crime rates.
Details
- We will be using the Spearman correlation test to check for correlation.
- The scipy.stats module is needed and we need to import the spearmanr function from it
- The spearmanr function takes two arguments: the two datasets to be compared, each as a 1-d array
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
- The correlation will be performed on the normalized crime count, i.e. crime rate per 10,000 people
H0: North and South Wales crime rate are independent. H1: North and South Wales crime rate are dependent.
# Example of the Spearman's Rank Correlation Test
from scipy.stats import spearmanr
#splice the required columns of South wales data into a variable
scat_s = dist_s_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_s =scat_s.rename(columns={"Norm Monthly Crime": "s_wales_cnt"})
#splice the required columns of North wales data into a variable
scat_n = dist_n_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_n =scat_n.rename(columns={"Norm Monthly Crime": "n_wales_cnt",'Month':'Month2'})
#concatenate the north and south wales data into one dataframe
scatter_w = pd.concat([scat_s, scat_n], axis=1)
#reset the index to change the current index
scatter_w = scatter_w.reset_index()
#drop one of the duplicated months
scatter_w = scatter_w.drop(['Month2'], axis=1)
stat, p = spearmanr(scatter_w['s_wales_cnt'], scatter_w['n_wales_cnt'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the\
null hypothesis that the crime rate of North and South Wales are independent')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the\
null hypothesis that the crime rate of North and South Wales are independent')
stat=0.333, p=0.130
At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the crime rate of North and South Wales are independent
Research Question 3.0 Is there a difference in the combined crime types observed for North and South Wales?
To-do
In this experiment we have grouped the monthly crime rate across North and South Wales by crime type. We will use a parametric ANOVA test to check whether the mean monthly crime count differs among the different crime types.
Our assumptions are:
- The subjects have been selected at random from the total groups.
- The dependent variable/response is normally distributed in each group.
- The dependent variable has the same variance in each group.
Details
- We will be using the one-way ANOVA test to check for differences between the crime rates.
- The scipy.stats module is needed and we need to import the f_oneway function from it
- The f_oneway function takes multiple arguments: one sample for each group to be compared.
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
The hypotheses in this case are:
H0: the mean crime rate is the same across the different crime types. H1: the mean crime rate differs for at least one of the crime types.
#create a list of all unique crimes
crime_list = list(n_s_wales_grouped['Crime type'].unique())
#create an empty list
crime_type_cnt = []
#create a for loop to append each crime type and their monthly crime to the list crime_type_cnt
for i in crime_list:
crime_temp = s_wales_grouped[s_wales_grouped['Crime type']== i]
crime_temp = crime_temp[['Norm Crime count']]
crime_temp = crime_temp.rename(columns={"Norm Crime count": i})
crime_type_cnt.append(crime_temp)
#import the method needed for this test
from scipy.stats import f_oneway
#apply the f_oneway method, unpacking the list of 14 crime-type samples
stat, p = f_oneway(*crime_type_cnt)
#print results
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the\
null hypothesis that the means of the different crime types are equal')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the null\
hypothesis that the means of the different crime types are equal')
stat=123.629, p=0.000
At a 5% significance level, we have enough statistical evidence to reject the null hypothesis that the means of the different crime types are equal
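A one-way ANOVA only says that at least one group mean differs; it does not say which. A post-hoc pairwise test such as Tukey's HSD (scipy.stats.tukey_hsd, available from SciPy 1.9) identifies the differing pairs; a sketch on synthetic samples, not the notebook's data:

```python
import numpy as np
from scipy.stats import tukey_hsd

# three synthetic monthly-rate samples standing in for three crime types
rng = np.random.default_rng(0)
asb = rng.normal(30, 1, 22)   # e.g. Anti-social behaviour
vso = rng.normal(30, 1, 22)   # e.g. Violence and sexual offences
bug = rng.normal(10, 1, 22)   # e.g. Burglary, clearly lower

# pairwise p-values; rows and columns follow the argument order (asb, vso, bug)
res = tukey_hsd(asb, vso, bug)
print(res.pvalue.round(3))
```

The pairs involving the clearly lower sample come out with near-zero p-values, pinpointing which crime types drive the ANOVA result.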
Research Question 4.0 Is there a difference in Burglary rate during winter and Summer in North Wales?
Details
- We will be using the paired t-test to check for differences between the seasonal crime rates.
- The scipy.stats module is needed and we need to import the ttest_rel function from it
- The ttest_rel function takes two arguments: the 1-d arrays of the paired samples to be compared.
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
To-do In this experiment the goal is to compare the Burglary rate in the two seasons, Winter and Summer. Because we are comparing two measurements of the same response on the same subjects, we use the paired t-test.
H0: the crime rate in winter and summer are the same. H1: the crime rate in winter and summer are not the same.
# Import the ttest_rel module from scipy
from scipy.stats import ttest_rel
#assign the north wales normed data as pair_1_n_temp
#create a dataframe pair_1_n_temp of all monthly crime rate in Winter
pair_1_n_temp = n_wales_grouped[n_wales_grouped['season']== 'Winter']
pair_1_n_temp = pair_1_n_temp [pair_1_n_temp['Year'] == '2019']
pair_1_n_temp = pair_1_n_temp[pair_1_n_temp['Crime type'] == 'Burglary']
pair_1_n = pair_1_n_temp['Norm Crime count']
#create a dataframe pair_2_n_temp of all monthly crime rate in Summer
pair_2_n_temp = n_wales_grouped[n_wales_grouped['season']== 'Summer']
pair_2_n_temp = pair_2_n_temp [pair_2_n_temp['Year'] == '2019']
pair_2_n_temp = pair_2_n_temp[pair_2_n_temp['Crime type'] == 'Burglary']
pair_2_n = pair_2_n_temp['Norm Crime count']
#assign outputs into stat and p
stat, p = ttest_rel(pair_1_n, pair_2_n)
#print output
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the null hypothesis and accept alternative hypothesis that the burglary crime rate during Winter and summer are different')
stat=-2.312, p=0.147
At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same
Research Question 5.0 Is there a difference in Burglary rate during winter and Summer in South Wales?
To-do In this experiment the goal is to compare the Burglary rate in the two seasons, Winter and Summer. Because we are comparing two measurements of the same response on the same subjects, we use the paired t-test.
H0: the crime rate in winter and summer are the same. H1: the crime rate in winter and summer are not the same.
# Import the ttest_rel module from scipy
from scipy.stats import ttest_rel
#assign the south wales normed data as pair_1_n_temp
#create a dataframe pair_1_s of all monthly crime rate in Winter
pair_1_s_temp = s_wales_grouped[s_wales_grouped['season']== 'Winter']
pair_1_s_temp = pair_1_s_temp [pair_1_s_temp['Year'] == '2019']
pair_1_s_temp = pair_1_s_temp[pair_1_s_temp['Crime type'] == 'Burglary']
pair_1_s = pair_1_s_temp['Norm Crime count']
#create a dataframe pair_2_s of all monthly crime rate in Summer
pair_2_s_temp = s_wales_grouped[s_wales_grouped['season']== 'Summer']
pair_2_s_temp = pair_2_s_temp [pair_2_s_temp['Year'] == '2019']
pair_2_s_temp = pair_2_s_temp[pair_2_s_temp['Crime type'] == 'Burglary']
pair_2_s = pair_2_s_temp['Norm Crime count']
#assign outputs into stat and p
stat, p = ttest_rel(pair_1_s, pair_2_s)
#print output
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the null hypothesis and accept alternative hypothesis that the burglary crime rate during Winter and summer are different')
stat=-0.453, p=0.695
At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same
5.0 Heatmap
This is a simple heat map created using the Folium library to model the crime recorded by both the North and South Wales police districts
To-do In this task we would be drawing a heatmap for crime rate in both North and South Wales
Details
- The library used is Folium, which needs to be imported before use
- To use this chunk of code you will need to import the joined csv files
- The columns 'Context', 'Last outcome category' and 'Crime ID' need to be dropped from both dataframes
- Both dataframes need to be concatenated into a single one
- Use the zip method to combine the values in the Latitude and Longitude columns
- Use the folium.Map method to set the default location of the map
- Set the Latitude and Longitude to be plotted
Warning!!!!! Ensure you insert the paths to the concatenated North and South Wales csv files!
def heat_map(path_n,path_s):
'''This is a function for creating heatmaps. The function takes the paths to the concatenated North
and South Wales files and uses them to produce a heatmap of the crime locations in these regions'''
#read the north wales file from path_n
map_n_wales = pd.read_csv(path_n)
#read the south wales file from path_s
map_s_wales = pd.read_csv(path_s)
#drop the unwanted columns
map_n_wales = map_n_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
#drop the unwanted columns
map_s_wales = map_s_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
#drop rows with missing values
map_s_wales = map_s_wales.dropna()
#drop rows with missing values
map_n_wales = map_n_wales.dropna()
#concatenate the cleaned north and south wales data
map_s_n_wales = pd.concat([map_s_wales,map_n_wales])
#zip two required columns and assign to a variable
crime_locations = list(zip(map_s_n_wales.Latitude, map_s_n_wales.Longitude))
# generate map
base_map = folium.Map(location=[52.2417, -3.3777], zoom_start=6.5)
heatmap = plugins.HeatMap(crime_locations, radius=3, blur=2)
return base_map.add_child(heatmap)
#call the function with two arguments: the paths to the North and South Wales files
heat_map('/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Northwales/North_wales_tot.csv',\
'/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Southwales/South_wales_tot.csv')
Data Exploration App
This is a simple app created using the Tkinter and Pillow libraries to explore the crime recorded by both the North and South Wales police districts. The dataset used for this application has been normalized, so it shows crime rate per 10,000 people in both regions.
To-do In this task we will create an application for exploring the dataset.
Details
- The libraries used are Tkinter and Pillow, which need to be imported before use
- To use this chunk of code, ensure the data wrangling phase of this document has been completed
- We also need to confirm that the logo file is in the same directory as this document
- This code contains 8 functions, each with its own button on the GUI; clicking a button prompts the app to run the related function and display the output
- The Seaborn module was used extensively in this code to create the plots
- The app displays the output of any button clicked in this console
- On running this block of code the application is initiated and you should see a pop-up window where you can interact with the application
- Always close the application after use to avoid interference when using other parts of this document
import tkinter as tk
from PIL import Image, ImageTk
#set the root of the app
root = tk.Tk()
#define the space the app would cover
canvas = tk.Canvas(root, width=900, height=300)
#define the columns the app would span
canvas.grid(columnspan = 3)
#create a logo using the pillow image library
#create a variable logo to hold the image
logo = Image.open('logo.png')
#use the photoimage method to insert the image as logo
logo = ImageTk.PhotoImage(logo)
#assign the label
logo_label = tk.Label(image = logo)
logo_label.image = logo
#select the row and column image should appear
logo_label.grid(column=1, row =0)
#instructions
instructions = tk.Label(root, text='Data Explorer', font = "Raleway")
instructions.grid(columnspan = 3, column = 0, row=1)
def police_patrol():
'''This function shows the top 10 locations within the different regions that have the highest crime count;
this is to recommend locations in need of frequent police patrol to curb crime in the area.'''
# define a temporary variable to hold the count of crimes grouped by crime location
n_policing_temp = n_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
n_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
n_policing = n_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= n_policing)\
.set_title("Top ten Crime locations in North Wales per 10,000 people")
plt.show()
# define a temporary variable to hold the count of crimes grouped by crime location
s_policing_temp = s_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
s_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
s_policing = s_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= s_policing)\
.set_title("Top ten Crime locations in South Wales per 10,000 people")
plt.show()
#call help method on this function to show what it does to the user
help(police_patrol)
#switch the button status to report generated
browse_text.set("Report Generated, check console")
def Common_crimes():
    '''This function creates a grid plot using seaborn to show a monthly barplot of the crime rate
    for each crime type. The rows show the month while the columns show the region.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and this is where we select dataset to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Month', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order )
grid.fig.subplots_adjust(top=0.96)
#adding the title of the plot
_ = grid.fig.suptitle('Most common crime types per 10,000 people in North and South Wales', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Common_crimes)
#show the crime codes
print(crime_mapping)
browse_text2.set("Report Generated, check console")
def Covid_crimes():
    '''This function displays a barplot showing a comparison between the crime rate before COVID
    and during COVID. This was created using the seaborn FacetGrid.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Covid', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order2 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order2)
grid.fig.subplots_adjust(top=0.8)
#adding the title of the plot
_ = grid.fig.suptitle('Difference in crime rate before and during COVID 19 per 10,000 people', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Covid_crimes)
#show the crime codes
print(crime_mapping)
browse_text3.set("Report Generated, check console")
def Seasonal_crimes():
    '''This function displays a barplot showing a comparison of the crime rate grouped by season.
    This was created using the seaborn FacetGrid.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='season', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order3 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order=crime_order3)
plt.show()
#call help method on this function to show what it does to the user
help(Seasonal_crimes)
#show the crime codes
print(crime_mapping)
browse_text4.set("Report Generated, check console")
def crime_corr():
    '''This function displays the correlation between the normalized crime rates of North and South Wales.
    This was created using the scatterplot method of the seaborn library.'''
#splice the required columns of South wales data into a variable
scat_s = dist_s_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_s =scat_s.rename(columns={"Norm Monthly Crime": "s_wales_cnt"})
#splice the required columns of North wales data into a variable
scat_n = dist_n_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_n =scat_n.rename(columns={"Norm Monthly Crime": "n_wales_cnt",'Month':'Month2'})
#concatenate the north and south wales data into one dataframe
scatter_w = pd.concat([scat_s, scat_n], axis=1)
#reset the index to change the current index
scatter_w = scatter_w.reset_index()
#drop one of the duplicated months
scatter_w = scatter_w.drop(['Month2'], axis=1)
#use seaborn to plot a scatterplot for crime rate and set title of plot
_ = sns.scatterplot(data=scatter_w, x="s_wales_cnt", y="n_wales_cnt").\
set_title("Correlation Plot for North and South Wales crime rate")
plt.show()
#call help method on this function to show what it does to the user
help(crime_corr)
browse_text5.set("Report Generated, check console")
def Highest_crime():
    '''This is a bee swarm plot showing the normalized crime rate in each region, i.e. crime rate
    per 10,000 people in North and South Wales. This was created using the seaborn library.'''
    #insert the required columns into the method and select the data to be used
_ = sns.swarmplot(x = 'Region', y = 'Norm Monthly Crime', data= Monthly_crime_count_tot).\
set_title("Swarm Plot for North and South Wales crime rate per 10,000 people")
#create the x axis label
_ = plt.xlabel('Region')
#create the y axis label
_ = plt.ylabel('Monthly Crime count')
plt.show()
#call help method on this function to show what it does to the user
help(Highest_crime)
browse_text6.set("Report Generated, check console")
def crime_evo():
'''This function shows the evolution of crime over time in North and South Wales. This function
uses the normalized dataset which means it shows crime per 10,000 people in both regions'''
#splice the required columns from the dataframe and set index for north data
n_plot = dist_n_wales[['Month','Norm Monthly Crime']].set_index('Month')
#splice the required columns from the dataframe and set index for south data
s_plot = dist_s_wales[['Month','Norm Monthly Crime']].set_index('Month')
#assign the northern plot to a variable
ax = n_plot.plot()
#use the northern variable to create the southern plot
s_plot.plot(ax=ax)
#define the legend for the plot
ax.legend(["North Wales", "South Wales"])
plt.show()
#call help method on this function to show what it does to the user
help(crime_evo)
browse_text7.set("Report Generated, check console")
def pop_crime_type():
    '''The goal of this function is to display a pie chart of the most popular crime types. This was
    created using the plot.pie method in pandas.'''
    #plot the crime types of North Wales on a pie chart
    _ = crime_cnt_n_wales.set_index('Crime type').plot.pie(y='Percent', figsize=(15, 15))
    #plot the crime types of South Wales on a pie chart
    _ = crime_cnt_s_wales.set_index('Crime type').plot.pie(y='Percent', figsize=(15, 15))
plt.show()
browse_text8.set("Report Generated, check console")
#police_patrol button
#assign the browse text as a string
browse_text = tk.StringVar()
#create a button to call the police_patrol function on click and assign the parameters
browse_btn = tk.Button(root, textvariable=browse_text,command = lambda:police_patrol(), \
font="Raleway", bg="#20bebe", fg="white", height=2, width=50)
#set the name of the button
browse_text.set("Generate areas in need of patrol")
#assign the column and row the button should appear
browse_btn.grid(column=1, row=2)
#Common_crimes button
#assign the browse text as a string
browse_text2 = tk.StringVar()
#create a button to call the Common_crimes function on click and assign the parameters
browse_btn2 = tk.Button(root, textvariable=browse_text2, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Common_crimes(), \
height=2, width=50)
#set the name of the button
browse_text2.set("Common Crimes in North and South Wales")
#assign the column and row the button should appear
browse_btn2.grid(column=1, row=4)
#Covid_crimes button
#assign the browse text as a string
browse_text3 = tk.StringVar()
#create a button to call the Covid_crimes function on click and assign the parameters
browse_btn3 = tk.Button(root, textvariable=browse_text3, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Covid_crimes(), \
height=2, width=50)
#set the name of the button
browse_text3.set("Pre-Covid Crimes Vs Covid Crimes")
#assign the column and row the button should appear
browse_btn3.grid(column=1, row=6)
#Seasonal_crimes button
#assign the browse text as a string
browse_text4 = tk.StringVar()
#create a button to call the Seasonal_crimes function on click and assign the parameters
browse_btn4 = tk.Button(root, textvariable=browse_text4, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Seasonal_crimes(), \
height=2, width=50)
#set the name of the button
browse_text4.set("Crime rate by Season")
#assign the column and row the button should appear
browse_btn4.grid(column=1, row=7)
#crime_corr button
#assign the browse text as a string
browse_text5 = tk.StringVar()
#create a button to call the crime_corr function on click and assign the parameters
browse_btn5 = tk.Button(root, textvariable=browse_text5, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:crime_corr(), \
height=2, width=50)
#set the name of the button
browse_text5.set("North and South Wales Crime correlation")
#assign the column and row the button should appear
browse_btn5.grid(column=1, row=8)
#Highest_crime button
#assign the browse text as a string
browse_text6 = tk.StringVar()
#create a button to call the Highest_crime function on click and assign the parameters
browse_btn6 = tk.Button(root, textvariable=browse_text6, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Highest_crime(), \
height=2, width=50)
#set the name of the button
browse_text6.set("Region With Highest Crime Rate")
#assign the column and row the button should appear
browse_btn6.grid(column=1, row=9)
#crime_evo button
#assign the browse text as a string
browse_text7 = tk.StringVar()
#create a button to call the crime_evo function on click and assign the parameters
browse_btn7 = tk.Button(root, textvariable=browse_text7, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:crime_evo(), \
height=2, width=50)
#set the name of the button
browse_text7.set("Evolution of crime over time")
#assign the column and row the button should appear
browse_btn7.grid(column=1, row=10)
#pop_crime_type button
#assign the browse text as a string
browse_text8 = tk.StringVar()
#create a button to call the pop_crime_type function on click and assign the parameters
browse_btn8 = tk.Button(root, textvariable=browse_text8, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:pop_crime_type(), \
height=2, width=50)
#set the name of the button
browse_text8.set("Popular Crime type in North and South Wales")
#assign the column and row the button should appear
browse_btn8.grid(column=1, row=12)
canvas = tk.Canvas(root, width=600, height=250)
#set how many columns the app should span
canvas.grid(columnspan=3)
root.mainloop()
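The top-10 logic inside `police_patrol` (count crimes per LSOA, sort descending, keep the head) can be sketched on toy data without pandas using a stdlib `Counter`; the notebook itself uses `groupby('LSOA name').count()` followed by `sort_values(ascending=False).head(10)`. The LSOA names below are illustrative placeholders, not values from the real dataset.

```python
from collections import Counter

#toy stand-ins for the 'LSOA name' column (illustrative, not real data)
lsoa = ["Cardiff 001", "Swansea 002", "Cardiff 001",
        "Newport 003", "Cardiff 001", "Swansea 002"]

#count crimes per location and keep the busiest entries, mirroring
#groupby(...).count() then sort_values(ascending=False).head(n)
top = Counter(lsoa).most_common(2)
print(top)  # [('Cardiff 001', 3), ('Swansea 002', 2)]
```

`most_common(n)` already returns the counts sorted in descending order, so it plays the role of both the `sort_values` and `head` steps in the pandas version.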