The code above helps format the notebook so that outputs appear inline within the document.
1.0 Data Preprocessing
This loads all the modules used in this analysis. The os module provides a way to interact with the machine's operating system, and the glob module is used to retrieve the path names needed. Pandas, NumPy, Seaborn, Matplotlib.pyplot and Folium are modules used to interact with the dataset and plot visualisations.
import os
import glob
from os import path
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import folium
from folium import plugins
sns.set()
We will modify the code below, adapted from https://www.freecodecamp.org/news/how-to-combine-multiple-csv-files-with-8-lines-of-code-265183e0854/, to create a function that merges all 22 monthly files into one single file for each region.
To-do In this phase we prepare the dataset into a format suitable for analysis.
Warning!!!!! Since the document is copied from one machine to another, the paths specified in this document will not be the same as those on the user's machine, so ensure you use the right paths!
Details
- The libraries used are Pandas, os and glob, which need to be imported before use
- To use this chunk of code you will need two folders: one containing the South Wales data and the other holding the North Wales data. Copy the paths to these folders and keep them somewhere, as they are crucial to this program.
- When using this program you will be required to specify file paths; ensure you use the paths you have copied
- Before running the program below, delete any file with the same name as the output file you want to create, as the code will not merge if a file with that name already exists in the directory
- The merged files for both the North and South Wales data will be written to the specified paths
- Use the Python help method to see what the function does.
#define the function that takes two arguments: the path of the folder and the output file name
def combine_all(fpath, fname):
    '''This is a function that merges all the files in a directory and compares the row count of the
    combined file to the sum of the rows of all the component individual files. It is used to confirm the
    merged file is correct.'''
    #use the os module to switch to the specified directory
    os.chdir(fpath)
    #check if the file exists in the directory and merge only if it does not, using glob (to get file names) and the concat method
    if path.isfile(fname):
        print('The Merged file already exists in directory please delete first or proceed with existing file')
    else:
        #pick the file extension
        extension = 'csv'
        #assign all file names with the csv extension to a list
        all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
        #combine all files in the list
        combined_csv = pd.concat([pd.read_csv(f) for f in all_filenames])
        #export to csv under the inputted file name
        combined_csv.to_csv(fname, index=False, encoding='utf-8-sig')
        #read the merged csv back in
        df = pd.read_csv(fname)
        #confirm the row count of the merged data
        tot_row_count = df['Reported by'].count()
        #get the individual counts of all the files and compare with the combined total
        file_row = []
        for i in all_filenames:
            if i != fname:
                file_count = pd.read_csv(i)
                row = file_count['Reported by'].count()
                file_row.append(row)
        file_row_count = sum(file_row)
        return tot_row_count, file_row_count
Warning!!!!! Ensure you insert the path to the folder that contains the North Wales dataset only!
#Merge all the Northern Wales files, change this directory to the directory where the files are
combine_all("/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Northwales",'North_wales_tot.csv')
The Merged file already exists in directory please delete first or proceed with existing file
Warning!!!!! Ensure you insert the path to the folder that contains the South Wales dataset only!
#Merge all the Southern Wales files, change this directory to the directory where the files are
combine_all("/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Southwales",'South_wales_tot.csv')
The Merged file already exists in directory please delete first or proceed with existing file
Outcome of Data Preprocessing
We now have a unified file for each of the regions being studied: the North and South Wales data for the 22 months under study (January 2019 to October 2020) has been merged. From the row counts returned above, South Wales has almost double the total crime count of North Wales.
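As a quick sanity check on the "almost double" claim, the ratio of the total row counts (taken from the shape checks printed later in this notebook) can be computed directly:

```python
# Row counts as reported by the shape checks in this notebook:
# North Wales 132,506 rows, South Wales 268,738 rows.
north_rows = 132506
south_rows = 268738
ratio = south_rows / north_rows
print(round(ratio, 2))  # ≈ 2.03, i.e. South Wales has roughly double the rows
```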
Data cleaning 1.0
To-do We have successfully merged our data and will now proceed with the data cleaning process.
Warning!!!!! Ensure you specify the paths where the files were merged using the combine_all function.
Details
- The library used is Pandas, which needs to be imported before use
- To use this chunk of code, ensure you specify the right paths when reading in the CSVs with the pandas read_csv method
#reading in the Combine North wales dataset and the combined Southwales dataset
#function takes the location of the files and the file output name
n_wales = pd.read_csv('/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Northwales/North_wales_tot.csv')
s_wales = pd.read_csv('/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Southwales/South_wales_tot.csv')
#checking the shape of the data and missing values
print("Shape of the North wales data is {}".format(n_wales.shape))
print(f"Number of missing values: {n_wales.isnull().sum().sum()}")
print("Shape of the South wales data is {}".format(s_wales.shape))
print(f"Number of missing values: {s_wales.isnull().sum().sum()}")
Shape of the North wales data is (132506, 12)
Number of missing values: 190634
Shape of the South wales data is (268738, 12)
Number of missing values: 427140
#getting the number of missing values for North Wales
n_wales.isnull().sum()
Crime ID 29048
Month 0
Reported by 0
Falls within 0
Longitude 8
Latitude 8
Location 0
LSOA code 8
LSOA name 8
Crime type 0
Last outcome category 29048
Context 132506
dtype: int64
#getting the number of missing values for South Wales
s_wales.isnull().sum()
Crime ID 69219
Month 0
Reported by 0
Falls within 0
Longitude 4991
Latitude 4991
Location 0
LSOA code 4991
LSOA name 4991
Crime type 0
Last outcome category 69219
Context 268738
dtype: int64
#viewing the first 3 rows of the North Wales data
n_wales.head(3)
#viewing the first 3 rows of the South Wales data
s_wales.head(3)
Observations from Data cleaning 1.0
- North Wales has 132,506 rows of data, with each row representing a crime committed in the region
- South Wales has 268,738 rows of data, around twice the rows recorded for North Wales, suggesting South Wales has around twice as many crimes committed
- The shape of the data is the same for both datasets, and the columns and number of columns match
- From viewing the first 3 rows of the North Wales and South Wales datasets we can see that each row represents a crime committed, with the Crime ID being the unique identifier for each row. The row captures the location of the crime, the month it occurred, the outcome and the context in which the crime occurred
- The Month column is used to aggregate the crimes committed monthly; the month in which a crime occurred is what is captured
- Crime ID, Last outcome category and Context are responsible for the missing values in both datasets, with the null values in the South Wales data higher than in the North, which is expected since South Wales has more rows of data
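The observations above treat each row as a distinct crime even though Crime ID is partly missing. A minimal sketch of how that assumption can be checked; the values below are made up for illustration, only the column names match the real dataset:

```python
import pandas as pd

# Toy frame standing in for n_wales (made-up rows, real column names).
df = pd.DataFrame({
    'Crime ID': ['a1', 'a2', None, 'a3', None],
    'Crime type': ['Burglary', 'Drugs', 'Anti-social behaviour',
                   'Robbery', 'Anti-social behaviour'],
})

# Count missing Crime IDs, and check the IDs that are present are unique.
n_missing = df['Crime ID'].isnull().sum()
n_dupes = df['Crime ID'].dropna().duplicated().sum()
print(n_missing, n_dupes)  # 2 missing IDs, 0 duplicated IDs
```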
Data cleaning 2.0
To-do We are comparing the crimes committed between the two police jurisdictions, so we will not need columns like Last outcome category, Context and Crime ID, since it is established that each row is a distinct crime.
Details
- The library used is Pandas which needs to be imported before use
- We will be dropping redundant columns from the dataset
#We will be dropping columns that would not impact the analysis using the drop method
n_wales = n_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
n_wales.head()
#We will be dropping columns that would not impact the analysis using the drop method
s_wales = s_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
s_wales.head()
#re-checking the shape of the data and missing values
print("Shape of the North wales data is {}".format(n_wales.shape))
print(f"Number of missing values: {n_wales.isnull().sum().sum()}")
print("Shape of the South wales data is {}".format(s_wales.shape))
print(f"Number of missing values: {s_wales.isnull().sum().sum()}")
Shape of the North wales data is (132506, 9)
Number of missing values: 32
Shape of the South wales data is (268738, 9)
Number of missing values: 19964
#show missing values
n_wales.isnull().sum()
Month 0
Reported by 0
Falls within 0
Longitude 8
Latitude 8
Location 0
LSOA code 8
LSOA name 8
Crime type 0
dtype: int64
#show missing values
s_wales.isnull().sum()
Month 0
Reported by 0
Falls within 0
Longitude 4991
Latitude 4991
Location 0
LSOA code 4991
LSOA name 4991
Crime type 0
dtype: int64
#confirm we have same columns in dataset of both regions
n_wales_columns = set(list(n_wales.columns))
s_wales_columns = set(list(s_wales.columns))
#sets have no guaranteed iteration order, so compare them directly rather than pairing with zip
assert n_wales_columns == s_wales_columns, \
    'There is a column in one of the regions not present in the other'
Observations from Data cleaning 2.0
To-do We will be fixing missing and null values
Details
- We removed 3 columns that would not contribute to our analysis and which held the bulk of the missing values. After removing these columns we checked for missing values again; although they have reduced, we can still notice missing values for Longitude, Latitude, LSOA code and LSOA name, which are location details.
- We asserted that we have the same columns and number of columns in both datasets, which means we can merge the datasets down the line. For now we use them individually so we can confirm they have similar distributions.
Data cleaning 3.0 : Fixing all missing values
In this stage of our data cleaning procedure we will fix all null values for the missing location details. From the analysis above we can see that although the other location details have missing values, the Location variable (column) has no missing values, so this will be used to fix the missing values in our dataset.
Details
- The library used is Pandas which needs to be imported before use
- We will be identifying and categorising rows with missing location details
#checking for Null values in the longitude column
s_wales[s_wales['Longitude'].isnull()]
#checking for Null values in the longitude column
n_wales[n_wales['Longitude'].isnull()]
#checking count of rows with "No location"
n_wales[n_wales['Location'] == 'No Location'].count()
Month 8
Reported by 8
Falls within 8
Longitude 0
Latitude 0
Location 8
LSOA code 0
LSOA name 0
Crime type 8
dtype: int64
#checking count of rows with "No location"
s_wales[s_wales['Location'] == 'No Location'].count()
Month 4991
Reported by 4991
Falls within 4991
Longitude 0
Latitude 0
Location 4991
LSOA code 0
LSOA name 0
Crime type 4991
dtype: int64
From the above we can see that although the Location column shows no missing values, its entries indicate that no location is available. We will leave this dataset as it is and group the crimes with no location as an unknown location. All the missing values have now been accounted for and we will proceed with the analysis.
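Grouping the 'No Location' crimes under an explicit unknown label could be sketched as follows; a minimal illustration on made-up rows (the notebook itself leaves the data unchanged):

```python
import pandas as pd

# Rows whose Location reads 'No Location' have no coordinates, so one option is
# to relabel them as an explicit 'Unknown location' category instead of dropping them.
df = pd.DataFrame({
    'Location': ['On or near High Street', 'No Location', 'On or near Park Road'],
    'Longitude': [-3.17, None, -3.18],
})
df['Location'] = df['Location'].replace('No Location', 'Unknown location')
print(df['Location'].tolist())
```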
n_wales['Month'].count()
132506
2.0 Data Wrangling and Normalisation
In this section of the analysis we will shape our data into formats suitable for analysis. We will create individual tables for the two regions and also merge the dataframes into one for analytical purposes.
The dataset to be used in this step has 9 columns. We can divide these columns into two groups.
- Columns that show crime details (Month, Reported by, Crime type)
- Columns that show crime location details (Falls within, Longitude, Latitude, Location, LSOA code, LSOA name)
The Month column contains the month the crime was committed; a row of the data is a crime committed in the month it was recorded. Most crimes have recorded location details, but some do not. We will not be dealing with crime-specific locations, so these will not be part of the columns wrangled in this step.
Reference for Month Number to Month name conversion in code below - https://stackoverflow.com/questions/37625334/python-pandas-convert-month-int-to-month-name/37625467
We will normalise the data using population: since our data is a region-based dataset, population might influence our outcome. Formula = (Total crime count per region / region population) x 10,000
That means that the numbers shown in the normalised data represent the total number of crimes per 10,000 people.
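A worked example of the normalisation, using the South Wales population figure that appears in the code below (299,239):

```python
# Hedged sketch: normalise a raw crime count to crimes per 10,000 people.
def normalise(count, population, per=10_000):
    return count / population * per

# The South Wales median monthly count of 12,066.5 (from the describe() output
# later in the notebook) works out to about 403.24 crimes per 10,000 people.
rate = normalise(12_066.5, 299_239)
print(round(rate, 2))
```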
Step 1: Creating dataset for count of Crime type by month and season
In this step we will create the s_wales_grouped, n_wales_grouped and n_s_wales_grouped dataframes. Most of the wrangling is done using built-in methods and attributes of the Pandas module. In the dataset created in this step, each row is a crime type, its normalised crime count per 10,000 people and the corresponding month. This was done to give a way to compare the crime types across different time periods. s_wales_grouped contains the South Wales data while n_wales_grouped contains the North Wales data. n_s_wales_grouped concatenates both dataframes into one.
steps used
- For each region the crime count was grouped by Month and crime type to give a dataset that shows the count of a particular crime type for a given month.
- The output of the grouping resulted in two Month columns, so we dropped one of them
- The column names were then renamed to sensible names that describe the datapoints in the column
- We then reset the index to remove the Month column as an index, since we need it in our dataset
- We then used a string literal to tag each row with the region it represents
- The string split method was used to split the Month column into the year and the month number, which was then formatted to the corresponding month name
- A seasons mapping dictionary was created to hold the different months and the corresponding season each falls into; this dictionary was then used to categorise the months into the four seasons using a lambda function and the pandas apply method
- A dictionary covid_mapping was created to group the dataset into pre-Covid and Covid periods and was applied to the dataset with the same technique as above
- A crime dictionary was also created to encode each crime type with up to 3 characters. This was then used to create a Crime code column, a code that represents each crime type
- The crime count was then normalised into total crimes per 10,000 people using the formula (Total crime count per region / region population) x 10,000
- Finally, the two datasets created for North and South Wales were concatenated into one single dataset.
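The steps above can be sketched on a few made-up rows. This is a minimal illustration only; `.size()` and `.map` are used here as compact alternatives to the notebook's count/apply calls, and the seasons mapping is truncated:

```python
import pandas as pd

# Toy rows standing in for s_wales (two months, two crime types).
df = pd.DataFrame({
    'Month': ['2019-01', '2019-01', '2019-07'],
    'Crime type': ['Burglary', 'Burglary', 'Drugs'],
})

# Group by month and crime type; .size() avoids the duplicate Month column
# that the notebook's count() produces and then drops.
grouped = (df.groupby(['Month', 'Crime type'])
             .size()
             .reset_index(name='Crime count'))
# Split the Month string and convert the month number to a three-letter name.
grouped[['Year', 'Month Num']] = grouped['Month'].str.split('-', expand=True)
grouped['Month Name'] = (pd.to_datetime(grouped['Month Num'], format='%m')
                           .dt.month_name().str.slice(stop=3))
# Map month names to seasons (truncated mapping for this sketch).
seasons_mapping = {'Jan': 'Winter', 'Jul': 'Summer'}
grouped['season'] = grouped['Month Name'].map(seasons_mapping)
print(grouped[['Month Name', 'Crime type', 'Crime count', 'season']])
```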
#Grouping data using the groupby method to group the data by month and crime type for a particular region
#southwales data grouped and count of crime type derived
s_wales_temp = s_wales[['Month', 'Crime type']].groupby([s_wales['Month'], s_wales['Crime type']]).count()
#dropping the Month column
s_wales_temp = s_wales_temp.drop('Month',axis=1)
#renaming the crime type to column to crime count
s_wales_grouped = s_wales_temp.rename(columns={"Crime type": "Crime count"})
#resetting index to remove Crime type from being index
s_wales_grouped.reset_index(inplace=True)
#using string literal to assign Region name to dataframe
s_wales_grouped['Region'] = 'South Wales'
#splitting the month column into the year and the month number
s_wales_grouped[['Year','Month Name']] = s_wales_grouped.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
s_wales_grouped['Month Name'] = pd.to_datetime(s_wales_grouped['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#Creating a new column called season to group months into seasons
#create variable mapping for Months
seasons_mapping = {'Jan':'Winter','Dec':'Winter','Feb':'Winter','Mar':'Spring',\
'Apr':'Spring','May':'Spring','Jun':'Summer','Jul':'Summer','Aug':'Summer',\
'Sep':'Autumn','Oct':'Autumn','Nov':'Autumn'}
#use lambda function to apply function across the column top down
s_wales_grouped['season']= s_wales_grouped['Month Name'].apply(lambda x:seasons_mapping[x])
#create column to group months into pre covid and covid periods
#according to timeline in link https://bfpg.co.uk/2020/04/covid-19-timeline/ covid first patient in UK was Jan
# We will assume 2019 is pre covid and 2020 is covid
covid_mapping = {'2019':'Pre-Covid','2020':'Covid'}
#use lambda function to apply function across the column top down
s_wales_grouped['Covid']= s_wales_grouped['Year'].apply(lambda x:covid_mapping[x])
#creating a key for the crime types
crime_mapping = {'Other theft':'OT', 'Theft from the person':'TFP',
'Violence and sexual offences':'VSO', 'Anti-social behaviour':'ASB',
'Burglary':'BUG', 'Criminal damage and arson':'CDA', 'Drugs':'DRG', 'Public order':'PO',
'Other crime':'OC', 'Possession of weapons':'POW', 'Robbery':'ROB', 'Vehicle crime':'VC',
'Shoplifting':'SPL', 'Bicycle theft':'BT'}
#use lambda function to apply function across the column top down
s_wales_grouped['Crime code']= s_wales_grouped['Crime type'].apply(lambda x:crime_mapping[x])
#since our data is location based we will use the population of each region to normalise the data
s_wales_grouped['Norm Crime count']= s_wales_grouped['Crime count'].apply(lambda x:(x/299239) * 10000)
#Northwales data grouped and count of crime type derived
n_wales_temp = n_wales[['Month', 'Crime type']].groupby([n_wales['Month'], n_wales['Crime type']]).count()
n_wales_temp = n_wales_temp.drop('Month',axis=1)
n_wales_grouped = n_wales_temp.rename(columns={"Crime type": "Crime count"})
#resetting index to remove Crime type from being index
n_wales_grouped.reset_index(inplace=True)
#using string literal to assign Region name to dataframe
n_wales_grouped['Region'] = 'North Wales'
#splitting the month column into the year and the month number
n_wales_grouped[['Year','Month Name']] = n_wales_grouped.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
n_wales_grouped['Month Name'] = pd.to_datetime(n_wales_grouped['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#Creating a new column called season to group months into seasons
#create variable mapping for Months
#use lambda function to apply function across the column top down
n_wales_grouped['season']= n_wales_grouped['Month Name'].apply(lambda x:seasons_mapping[x])
#create column to group months into pre covid and covid periods
#according to timeline in link https://bfpg.co.uk/2020/04/covid-19-timeline/ covid first patient in UK was Jan
# We will assume 2019 is pre covid and 2020 is covid
#use lambda function to apply function across the column top down
n_wales_grouped['Covid']= n_wales_grouped['Year'].apply(lambda x:covid_mapping[x])
#use lambda function to apply function across the column top down
n_wales_grouped['Crime code']= n_wales_grouped['Crime type'].apply(lambda x:crime_mapping[x])
#since our data is location based we will use the population of each region to normalise the data
n_wales_grouped['Norm Crime count']= n_wales_grouped['Crime count'].apply(lambda x:(x/70073) * 10000)
#concatenating the two regions dataset
n_s_wales_grouped = pd.concat([n_wales_grouped,s_wales_grouped])
n_wales_grouped.head()
Step 2: Dataset for monthly total crime count by region
This dataset was created to hold the total monthly crime count of each region, separately and also in a single dataframe. The crime count was then normalised using the formula (Total crime count per region / region population) x 10,000. This basically holds the monthly distribution of crime in North and South Wales. The pandas framework and its methods were used extensively in this step.
steps used
- The total crime count was grouped by month and the columns were renamed using the rename method in pandas
- A string literal was then used to assign the region to each row
- The data was normalised with the North and South Wales populations, multiplied by 10,000 to show crimes per 10,000 people
- The index was reset to remove the month as the index
- The string split method was used to split the Month column into the year and the month number, which was then formatted to the corresponding month name
- The wrangled datasets for the two regions were then concatenated into one dataframe.
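The recipe above can be sketched on a few made-up rows; 70,073 is the North Wales population figure used in the notebook:

```python
import pandas as pd

# Toy rows standing in for n_wales.
df = pd.DataFrame({
    'Month': ['2019-01', '2019-01', '2019-02'],
    'Crime type': ['Burglary', 'Drugs', 'Burglary'],
})
# Count crimes per month, rename the count column, then normalise per 10,000 people.
dist = df[['Crime type', 'Month']].groupby('Month').count()
dist = dist.rename({'Crime type': 'Monthly Crime count'}, axis=1)
dist['Norm Monthly Crime'] = dist['Monthly Crime count'] / 70073 * 10000
dist.reset_index(inplace=True)
print(dist)
```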
#getting the monthly total crime distribution of crime in north wales
#group crime count by month using group method
dist_n_wales = n_wales[['Crime type','Month']].groupby('Month').count()
#rename columns using rename method
dist_n_wales = dist_n_wales.rename({'Crime type': 'Monthly Crime count'}, axis=1)
#assign region to data
dist_n_wales['Region'] = 'North Wales'
#normalise data using population to crime per 10,000 people
dist_n_wales['Norm Monthly Crime']= dist_n_wales['Monthly Crime count'].apply(lambda x:(x/70073) * 10000)
#reset index
dist_n_wales.reset_index(inplace=True)
#split the Month column into year and month name
dist_n_wales[['Year','Month Name']] = dist_n_wales.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
dist_n_wales['Month Name'] = pd.to_datetime(dist_n_wales['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#getting the monthly total crime distribution of crime in south wales
#group crime count by month using group method
dist_s_wales = s_wales[['Crime type','Month']].groupby('Month').count()
#rename columns using rename method
dist_s_wales = dist_s_wales.rename({'Crime type': 'Monthly Crime count'}, axis=1)
#assign region to data
dist_s_wales['Region'] = 'South Wales'
#normalise data using population to crime per 10,000 people
dist_s_wales['Norm Monthly Crime']= dist_s_wales['Monthly Crime count'].apply(lambda x:(x/299239) * 10000)
#reset index
dist_s_wales.reset_index(inplace=True)
#split the Month column into year and month name
dist_s_wales[['Year','Month Name']] = dist_s_wales.Month.str.split("-",expand=True,)
#converting the Month Number to Month name using referenced code above
dist_s_wales['Month Name'] = pd.to_datetime(dist_s_wales['Month Name'], format='%m').dt.month_name()\
.str.slice(stop=3)
#concatenating both dataframes into one table using the Concat method in pandas
Monthly_crime_count_tot = pd.concat([dist_n_wales,dist_s_wales])
Monthly_crime_count_tot.head()
Step 3: Dataset for percentage of a crime type in a particular region
This dataset was created to hold the total crime count per crime type for both South and North Wales; we then create a unified dataframe to hold both datasets.
steps used
- We used the value_counts method of pandas to get the counts of the unique crime types present in our dataset, then used the dict method to convert this into a dictionary
- The pandas DataFrame method was chained with the from_dict method to create a dataframe from the dictionary
- The columns attribute was used to name the columns of our new dataset
- We then derived the crime percentage per crime type, but this was later dropped as it was not used
- We normalised our data with the North and South Wales populations, multiplied by 10,000 to show crimes per 10,000 people
- Finally, the two datasets created were concatenated into one.
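A minimal sketch of the same idea on toy data. Note that chaining rename_axis and reset_index builds the same table without the intermediate dictionary; this is offered as an alternative to the notebook's dict/from_dict route, not a change to it:

```python
import pandas as pd

# Toy crime-type series; value_counts already returns the per-type totals.
s = pd.Series(['Drugs', 'Burglary', 'Drugs', 'Drugs'], name='Crime type')
crime_cnt = (s.value_counts()
               .rename_axis('Crime type')
               .reset_index(name='Tot_cnt'))
# Percentage of each crime type, as in the notebook's Percent column.
crime_cnt['Percent'] = crime_cnt['Tot_cnt'] / crime_cnt['Tot_cnt'].sum() * 100
print(crime_cnt)
```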
#crime type by region
#Creating the total rate of crime in North Wales grouped by the crime types
#get the count of each crime type and then use dict method to convert it into a dictionary
crime_cnt_n_wales = dict(n_wales['Crime type'].value_counts())
#Use pandas to convert the dictionary into a dataframe
crime_cnt_n_wales = pd.DataFrame.from_dict(crime_cnt_n_wales,orient='index').reset_index()
#rename the new columns using the pandas columns attribute
crime_cnt_n_wales.columns = ['Crime type','Tot_cnt']
#use string literal to assign region
crime_cnt_n_wales['Region']='North Wales'
#calculate crime percentage of each crime type
crime_cnt_n_wales['Percent'] = (crime_cnt_n_wales['Tot_cnt'] / crime_cnt_n_wales['Tot_cnt'].sum()) * 100
#normalise data using population
crime_cnt_n_wales['norm']= crime_cnt_n_wales['Tot_cnt'].apply(lambda x:(x/70073) * 10000)
#Creating the total rate of crime in South Wales grouped by the crime types
#get the count of each crime type and then use dict method to convert it into a dictionary
crime_cnt_s_wales = dict(s_wales['Crime type'].value_counts())
#Use pandas to convert the dictionary into a dataframe
crime_cnt_s_wales = pd.DataFrame.from_dict(crime_cnt_s_wales,orient='index').reset_index()
#rename the new columns using the pandas columns attribute
crime_cnt_s_wales.columns = ['Crime type','Tot_cnt']
#use string literal to assign region
crime_cnt_s_wales['Region']='South Wales'
#calculate crime percentage of each crime type
crime_cnt_s_wales['Percent'] = (crime_cnt_s_wales['Tot_cnt'] / crime_cnt_s_wales['Tot_cnt'].sum()) * 100
#normalise data using population
crime_cnt_s_wales['norm']= crime_cnt_s_wales['Tot_cnt'].apply(lambda x:(x/299239) * 10000)
#concatenating the total rate of crime in South and North Wales grouped by the crime types
crime_cnt_tot = pd.concat([crime_cnt_n_wales,crime_cnt_s_wales])
#drop the percentage column
crime_cnt_tot =crime_cnt_tot.drop(['Percent'], axis=1)
crime_cnt_tot.head()
Outputs From Wrangling
- We need the data to be grouped by two variables, the month of the observation and the crime type, and then the count of the crime type taken. The output of the wrangling in Step 1 above produced a dataframe which satisfies this requirement
- A dataframe was needed to hold the distribution of the monthly measurement by the two regions. We have three outputs from this step: the monthly data for South Wales, for North Wales, and a unified table holding all the monthly crimes grouped by region
- The dataset was normalised using the population of each region. The output gives the crime rate per 10,000 people in both regions
Output dataframes (tables)
- n_s_wales_grouped: a combined dataframe of the North and South Wales datasets, with crime counts grouped by crime type
- s_wales_grouped: a dataframe of the South Wales dataset, with crime counts grouped by crime type
- n_wales_grouped: a dataframe of the North Wales dataset, with crime counts grouped by crime type
- dist_s_wales: the monthly crime count in the South Wales police district
- dist_n_wales: the monthly crime count in the North Wales police district
- Monthly_crime_count_tot: the combined monthly crime count in the North and South Wales police districts
- crime_cnt_n_wales: the total crime count grouped by crime type for the North Wales region
- crime_cnt_s_wales: the total crime count grouped by crime type for the South Wales region
- crime_cnt_tot: the total crime count grouped by crime type, combined for the North and South Wales regions
3.0 Data Exploration
In this section we will start exploring our dataset graphically and also produce some summary statistics to understand the data. We will show plots to visualise our data.
To-do Exploring the dataset using the different dataframes created during the wrangling phase
Details
- The libraries used are Pandas, Numpy, Seaborn and Matplotlib.pyplot, which need to be imported before use
- To use this chunk of code you will need the dataframes listed in the wrangling phase above
- Each section states the analysis question being answered, and each section has an output summarising the results
- Ensure the function defined in a section (for example police_patrol) is called, otherwise its plots will not display
- The section starts with summary statistics, which give an overview of the data
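As a reminder of what describe() reports, a tiny made-up series of monthly counts (the real summary statistics follow below):

```python
import pandas as pd

# describe() reports count, mean, std, min, quartiles and max for a series.
counts = pd.Series([10322, 12066, 15541])
summary = counts.describe()
print(summary['mean'], summary['min'], summary['max'])
```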
#this dataset shows the monthly total crime count for South Wales
print(dist_s_wales.describe())
Monthly Crime count Norm Monthly Crime
count 22.000000 22.000000
mean 12215.363636 408.214291
std 1235.897112 41.301338
min 10322.000000 344.941669
25% 11271.250000 376.663804
50% 12066.500000 403.239551
75% 13000.250000 434.443706
max 15541.000000 519.350753
- The standard deviation shows that on average the monthly crime count varies by around ±1,236 crimes from the mean, or around ±41 crimes per 10,000 people once population is factored in
- The month with the least crime had about 10,322 crimes, which is around 345 crimes per 10,000 people
- The month with the most crime had 15,541 crimes, around 519 crimes per 10,000 people
print(dist_n_wales.describe())
Monthly Crime count Norm Monthly Crime
count 22.000000 22.000000
mean 6023.000000 859.532202
std 468.567828 66.868527
min 5348.000000 763.204087
25% 5675.750000 809.976739
50% 5991.500000 855.036890
75% 6380.250000 910.514749
max 6964.000000 993.820730
- The standard deviation shows that on average the monthly crime count varies by around ±469 crimes from the mean, or around ±67 crimes per 10,000 people once population is factored in
- The month with the least crime had about 5,348 crimes, which is around 763 crimes per 10,000 people
- The month with the most crime had 6,964 crimes, around 994 crimes per 10,000 people
- Although the overall crime count in South Wales is double the figures for North Wales, North Wales has a higher crime rate per 10,000 people than South Wales
- South Wales has a smaller normalised standard deviation than North Wales, showing that its monthly values sit closer to its mean than North Wales's values do to theirs
#South Wales Crime count per crime type
crime_cnt_s_wales.value_counts()
Crime type Tot_cnt Region Percent norm
Violence and sexual offences 77116 South Wales 28.695607 2577.070502 1
Vehicle crime 13155 South Wales 4.895102 439.615157 1
Theft from the person 1886 South Wales 0.701799 63.026544 1
Shoplifting 15989 South Wales 5.949661 534.322064 1
Robbery 945 South Wales 0.351644 31.580108 1
Public order 24342 South Wales 9.057893 813.463486 1
Possession of weapons 1350 South Wales 0.502348 45.114440 1
Other theft 15967 South Wales 5.941475 533.586865 1
Other crime 4057 South Wales 1.509649 135.577248 1
Drugs 8815 South Wales 3.280146 294.580586 1
Criminal damage and arson 22278 South Wales 8.289859 744.488519 1
Burglary 10363 South Wales 3.856172 346.311811 1
Bicycle theft 3256 South Wales 1.211589 108.809346 1
Anti-social behaviour 69219 South Wales 25.757057 2313.167735 1
dtype: int64
#North Wales Crime count per crime type
crime_cnt_n_wales.value_counts()
Crime type Tot_cnt Region Percent norm
Violence and sexual offences 49913 North Wales 37.668483 7123.000300 1
Vehicle crime 2960 North Wales 2.233861 422.416623 1
Theft from the person 335 North Wales 0.252819 47.807287 1
Shoplifting 6698 North Wales 5.054865 955.860317 1
Robbery 372 North Wales 0.280742 53.087494 1
Public order 11008 North Wales 8.307548 1570.933170 1
Possession of weapons 603 North Wales 0.455074 86.053116 1
Other theft 7856 North Wales 5.928788 1121.116550 1
Other crime 2171 North Wales 1.638416 309.819759 1
Drugs 2683 North Wales 2.024814 382.886418 1
Criminal damage and arson 12842 North Wales 9.691637 1832.660226 1
Burglary 5280 North Wales 3.984725 753.499922 1
Bicycle theft 737 North Wales 0.556201 105.176031 1
Anti-social behaviour 29048 North Wales 21.922026 4145.391235 1
dtype: int64
- The top 5 crimes in North and South Wales come from similar crime types
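That similarity can be checked directly from the two tables: the sketch below transcribes the Tot_cnt figures printed above and intersects the top-five lists.

```python
import pandas as pd

# totals transcribed from the crime-type tables above (top six of each region)
south = pd.Series({"Violence and sexual offences": 77116, "Anti-social behaviour": 69219,
                   "Public order": 24342, "Criminal damage and arson": 22278,
                   "Shoplifting": 15989, "Other theft": 15967})
north = pd.Series({"Violence and sexual offences": 49913, "Anti-social behaviour": 29048,
                   "Criminal damage and arson": 12842, "Public order": 11008,
                   "Other theft": 7856, "Shoplifting": 6698})

# crime types that appear in both regions' top five
shared = set(south.nlargest(5).index) & set(north.nlargest(5).index)
print(sorted(shared))
```

Four of the five top crime types overlap; only the fifth place differs (Shoplifting in the South, Other theft in the North).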
def police_patrol():
'''This function shows the top 10 locations within the different regions that have the highest crime count;
this is to recommend locations in need of frequent police patrol to curb crime in the area.'''
# define a temporary variable to hold the count of crimes grouped by crime location
n_policing_temp = n_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
n_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
n_policing = n_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= n_policing)\
.set_title("Top ten crime locations in North Wales")
plt.show()
# define a temporary variable to hold the count of crimes grouped by crime location
s_policing_temp = s_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
s_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
s_policing = s_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= s_policing)\
.set_title("Top ten crime locations in South Wales")
plt.show()
#call help method on this function to show what it does to the user
help(police_patrol)
police_patrol()
Help on function police_patrol in module __main__:
police_patrol()
This function shows the top 10 locations within the different regions that have the highest crime count;
this is to recommend locations in need of frequent police patrol to curb crime in the area.
To answer this question we will use the n_s_wales_grouped dataframe. We want to know the most common crime types in South and North Wales and see whether the two regions share similar crime types. We will use Seaborn's FacetGrid method, which plots a bar chart for every month in a grid: the rows show the month while the columns show the region currently being analysed.
def Common_crimes():
'''This function creates a grid plot using seaborn showing a monthly barplot of the crime rate for
each crime type. The rows show the month while the columns show the region.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and this is where we select dataset to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Month', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order )
grid.fig.subplots_adjust(top=0.96)
#adding the title of the plot
_ = grid.fig.suptitle('Most common crime types per 10,000 people in North and South Wales', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Common_crimes)
#show the crime codes
print(crime_mapping)
Common_crimes()
Help on function Common_crimes in module __main__:
Common_crimes()
This function creates a grid plot using seaborn showing a monthly barplot of the crime rate for
each crime type. The rows show the month while the columns show the region.
{'Other theft': 'OT', 'Theft from the person': 'TFP', 'Violence and sexual offences': 'VSO', 'Anti-social behaviour': 'ASB', 'Burglary': 'BUG', 'Criminal damage and arson': 'CDA', 'Drugs': 'DRG', 'Public order': 'PO', 'Other crime': 'OC', 'Possession of weapons': 'POW', 'Robbery': 'ROB', 'Vehicle crime': 'VC', 'Shoplifting': 'SPL', 'Bicycle theft': 'BT'}
- From the grid plot above we can see that across all the months and in both regions Violence and sexual offences(VSO) and Anti-social behaviour(ASB) have the highest crime rate for both South and North Wales. We have similar trends in crime type for both North and South Wales.
- North Wales has more than twice the rate of Violence and sexual offences per 10,000 people as South Wales, although South Wales has the higher total count of these crimes.
We will use the n_s_wales_grouped dataframe to examine the effects of the Covid lockdown on crime and see whether any crime type took a dip during the lockdown. We will use the normalized crime count, so we are looking at crime per 10,000 people in North and South Wales. The seaborn FacetGrid method is used again: the rows show pre-Covid and during Covid while the columns show the region.
def Covid_crimes():
'''This function displays a barplot comparing the crime rate before Covid and during Covid.
It was created using the seaborn FacetGrid method'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Covid', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order2 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order2)
grid.fig.subplots_adjust(top=0.8)
#adding the title of the plot
_ = grid.fig.suptitle('Difference in crime rate before and during COVID 19 per 10,000 people', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Covid_crimes)
#show the crime codes
print(crime_mapping)
Covid_crimes()
Help on function Covid_crimes in module __main__:
Covid_crimes()
This function displays a barplot comparing the crime rate before Covid and during Covid.
It was created using the seaborn FacetGrid method
{'Other theft': 'OT', 'Theft from the person': 'TFP', 'Violence and sexual offences': 'VSO', 'Anti-social behaviour': 'ASB', 'Burglary': 'BUG', 'Criminal damage and arson': 'CDA', 'Drugs': 'DRG', 'Public order': 'PO', 'Other crime': 'OC', 'Possession of weapons': 'POW', 'Robbery': 'ROB', 'Vehicle crime': 'VC', 'Shoplifting': 'SPL', 'Bicycle theft': 'BT'}
Reduced during Covid
- From the plot above we can see that Burglary fell to almost half of its pre-Covid level.
- Criminal damage and arson also fell slightly during Covid compared with the period before.
- Robbery reduced to almost non-existence during Covid.
- Shoplifting and Vehicle crime fell slightly during Covid.
Increased during Covid
- We saw an increase in Anti-social behaviour during Covid, which may be due to the new social regulations around the Covid lockdown.
Remained the same
- Drugs, Other crime, Other theft, Possession of weapons, Theft from the person, Violence and sexual offences and Bicycle theft all remained almost unchanged during Covid.
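The bullet points above can be quantified with a percent-change table. A sketch on a toy frame whose column names mirror n_s_wales_grouped (the values and the 'Pre'/'Covid' labels are invented for illustration):

```python
import pandas as pd

# toy data; column names mirror n_s_wales_grouped, the values are invented
toy = pd.DataFrame({
    "Covid": ["Pre", "Pre", "Covid", "Covid"] * 2,
    "Crime code": ["BUG", "ROB"] * 4,
    "Norm Crime count": [40, 10, 22, 2, 38, 9, 20, 1],
})

# mean rate per crime type before and during Covid, then the percent change
change = toy.groupby(["Crime code", "Covid"])["Norm Crime count"].mean().unstack()
change["pct_change"] = (change["Covid"] - change["Pre"]) / change["Pre"] * 100
print(change.round(1))
```

A strongly negative pct_change corresponds to a bar that shrank in the grid plot, so the table gives a numeric reading of the visual comparison.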
This analysis checks for differences in the Burglary rate between summer and winter. We will use the n_s_wales_grouped dataframe and a FacetGrid: the rows show the season while the columns show the region.
def Seasonal_crimes():
'''This function displays a barplot comparing the crime rates grouped by season.
It was created using the seaborn FacetGrid method'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='season', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order3 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order=crime_order3)
plt.show()
#call help method on this function to show what it does to the user
help(Seasonal_crimes)
#show the crime codes
print(crime_mapping)
Seasonal_crimes()
Help on function Seasonal_crimes in module __main__:
Seasonal_crimes()
This function displays a barplot comparing the crime rates grouped by season.
It was created using the seaborn FacetGrid method
{'Other theft': 'OT', 'Theft from the person': 'TFP', 'Violence and sexual offences': 'VSO', 'Anti-social behaviour': 'ASB', 'Burglary': 'BUG', 'Criminal damage and arson': 'CDA', 'Drugs': 'DRG', 'Public order': 'PO', 'Other crime': 'OC', 'Possession of weapons': 'POW', 'Robbery': 'ROB', 'Vehicle crime': 'VC', 'Shoplifting': 'SPL', 'Bicycle theft': 'BT'}
To find the relationship between the crime rate in North and South Wales we will be using the scatterplot feature of Seaborn.
def crime_corr():
'''This function displays the correlation between the normalized crime rates of North and South Wales,
this was created using the scatterplot method of the seaborn library'''
#splice the required columns of South wales data into a variable
scat_s = dist_s_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_s =scat_s.rename(columns={"Norm Monthly Crime": "s_wales_cnt"})
#splice the required columns of North wales data into a variable
scat_n = dist_n_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_n =scat_n.rename(columns={"Norm Monthly Crime": "n_wales_cnt",'Month':'Month2'})
#concatenate the north and south wales data into one dataframe
scatter_w = pd.concat([scat_s, scat_n], axis=1)
#reset the index to change the current index
scatter_w = scatter_w.reset_index()
#drop one of the duplicated months
scatter_w = scatter_w.drop(['Month2'], axis=1)
#use seaborn to plot a scatterplot for crime rate and set title of plot
_ = sns.scatterplot(data=scatter_w, x="s_wales_cnt", y="n_wales_cnt").\
set_title("Correlation Plot for North and South Wales crime rate")
plt.show()
#call help method on this function to show what it does to the user
help(crime_corr)
crime_corr()
Help on function crime_corr in module __main__:
crime_corr()
This function displays the correlation between the normalized crime rates of North and South Wales,
this was created using the scatterplot method of the seaborn library
To check for the region with the highest crime rate per 10,000 people we will be using the swarmplot feature of seaborn which helps plot the normalized crime rate for each month and separates them into regions.
def Highest_crime():
'''This is a bee swarm plot showing the normalized crime rate in each region, i.e. crime rate per 10,000
people in North and South Wales. This was created using the seaborn library'''
#insert the required columns into the method and select the data to be used
_ = sns.swarmplot(x = 'Region', y = 'Norm Monthly Crime', data= Monthly_crime_count_tot).\
set_title("Swarm Plot for North and South Wales crime rate per 10,000 people")
#create the x axis label
_ = plt.xlabel('Region')
#create the y axis label
_ = plt.ylabel('Crime rate per 10,000 people')
plt.show()
#call help method on this function to show what it does to the user
help(Highest_crime)
Highest_crime()
Help on function Highest_crime in module __main__:
Highest_crime()
This is a bee swarm plot showing the normalized crime rate in each region, i.e. crime rate per 10,000
people in North and South Wales. This was created using the seaborn library
We will use a time series plot to show the evolution of crime rate over time, plotting the normalized crime count from the dist_n_wales and dist_s_wales dataframes with the pandas plot method.
def crime_evo():
'''This function shows the evolution of crime over time in North and South Wales. This function
uses the normalized dataset which means it shows crime per 10,000 people in both regions'''
#splice the required columns from the dataframe and set index for north data
n_plot = dist_n_wales[['Month','Norm Monthly Crime']].set_index('Month')
#splice the required columns from the dataframe and set index for south data
s_plot = dist_s_wales[['Month','Norm Monthly Crime']].set_index('Month')
#assign the northern plot to a variable
ax = n_plot.plot()
#use the northern variable to create the southern plot
s_plot.plot(ax=ax)
#define the legend for the plot
ax.legend(["North Wales", "South Wales"])
plt.show()
#call help method on this function to show what it does to the user
help(crime_evo)
crime_evo()
Help on function crime_evo in module __main__:
crime_evo()
This function shows the evolution of crime over time in North and South Wales. This function
uses the normalized dataset which means it shows crime per 10,000 people in both regions
The plot above shows that crime per 10,000 people in North Wales is taking a dip while the crime rate in South Wales is gradually rising.
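The dip and rise can also be quantified by fitting a straight line to each monthly series; a sketch on synthetic stand-ins (the real inputs would be the Norm Monthly Crime columns of dist_n_wales and dist_s_wales):

```python
import numpy as np

# synthetic stand-ins for 22 months of normalized crime rates
months = np.arange(22)
north = 900.0 - 3.0 * months   # declining, like the North Wales series
south = 700.0 + 2.0 * months   # rising, like the South Wales series

# slope of the least-squares line: negative means a dip, positive a rise
slope_n = np.polyfit(months, north, 1)[0]
slope_s = np.polyfit(months, south, 1)[0]
print(round(slope_n, 2), round(slope_s, 2))  # -3.0 2.0
```

The sign and size of the fitted slope give a single number for each region's trend, which is easier to compare than eyeballing the two lines.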
def pop_crime_type():
'''This function displays pie charts of the most popular crime types for each region. It was created
using the plot.pie method in pandas'''
#plot the crime types of North wales on a piechart
plot = crime_cnt_n_wales.set_index('Crime type').plot.pie(y='Percent',figsize=(15, 15))
#plot the crime types of south wales on a piechart
plot = crime_cnt_s_wales.set_index('Crime type').plot.pie(y='Percent',figsize=(15, 15))
plt.show()
pop_crime_type()
4.0 Statistical Testing
Several statistical tests will be performed in this section using the scipy.stats module; we will use different functions from the module.
Research Question 1.0 Is the monthly crime rate for North and South Wales normally distributed?
To-do We will test whether the crime rates for North and South Wales follow a normal distribution.
Details
- We will be using the Shapiro-Wilk test to check for normality.
- The scipy.stats module is needed and we need to import the shapiro function from it
- The function takes a 1-dimensional dataset as input.
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
#is the Monthly crime rate for north and south wales normal?
#Interpretation
#import shapiro from scipy module
from scipy.stats import shapiro
#assign the required column to a variable
data_s = dist_s_wales['Norm Monthly Crime']
#assign the output of the test to two variables
stat, p = shapiro(data_s)
#print out the stat and p-value
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level we do not have enough evidence to reject the null hypothesis\
that the South wales dataset has a Gaussian distribution')
else:
print('At a 5% significance level we have enough evidence to reject the null hypothesis\
and accept alternative hypothesis that the South wales dataset does not have a Gaussian distribution.')
stat=0.948, p=0.285
At a 5% significance level we do not have enough evidence to reject the null hypothesis that the South wales dataset has a Gaussian distribution
dist_n_wales
from scipy.stats import shapiro
data_n = dist_n_wales['Norm Monthly Crime']
stat, p = shapiro(data_n)
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level we do not have enough evidence to reject the null hypothesis\
that the North Wales dataset has a Gaussian distribution')
else:
print('At a 5% significance level we have enough evidence to reject the null hypothesis\
and accept the alternative hypothesis that the North Wales dataset does not have a Gaussian distribution.')
stat=0.942, p=0.217
At a 5% significance level we do not have enough evidence to reject the null hypothesis that the North Wales dataset has a Gaussian distribution
Research Question 2.0 What type of relationship exists between the North and South Wales crime rates?
To-do We will test the relationship between the North and South Wales crime rates.
Details
- We will be using the Spearman correlation test to check for correlation.
- The scipy.stats module is needed and we need to import the spearmanr function from it
- The spearmanr function takes two arguments: the two datasets to be compared, each as a 1-d array
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
- The correlation will be performed on the normalized crime count, i.e. crime rate per 10,000 people
H0: North and South Wales crime rate are independent. H1: North and South Wales crime rate are dependent.
# Example of the Spearman's Rank Correlation Test
from scipy.stats import spearmanr
#splice the required columns of South wales data into a variable
scat_s = dist_s_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_s =scat_s.rename(columns={"Norm Monthly Crime": "s_wales_cnt"})
#splice the required columns of North wales data into a variable
scat_n = dist_n_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_n =scat_n.rename(columns={"Norm Monthly Crime": "n_wales_cnt",'Month':'Month2'})
#concatenate the north and south wales data into one dataframe
scatter_w = pd.concat([scat_s, scat_n], axis=1)
#reset the index to change the current index
scatter_w = scatter_w.reset_index()
#drop one of the duplicated months
scatter_w = scatter_w.drop(['Month2'], axis=1)
stat, p = spearmanr(scatter_w['s_wales_cnt'], scatter_w['n_wales_cnt'])
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the\
null hypothesis that the crime rate of North and South Wales are independent')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the\
null hypothesis that the crime rate of North and South Wales are independent')
stat=0.333, p=0.130
At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the crime rate of North and South Wales are independent
Research Question 3.0 Is there a difference in the combined crime types observed for North and South Wales?
To-do
In this experiment we have grouped the monthly crime rate across North and South Wales by crime type. We will use a parametric ANOVA test to check whether the mean monthly crime count differs among the different crime types.
Our assumptions are:
- The subjects have been selected at random from the total groups.
- The dependent variable/response is normally distributed in each group.
- The dependent variable has the same variance in each group.
Details
- We will be using the one-way ANOVA test to check for differences between the crime rates.
- The scipy.stats module is needed and we need to import the f_oneway function from it
- The f_oneway function takes multiple arguments: one sample for each group to be compared.
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
The hypotheses in this case are:
H0: the mean crime rate is the same across the different crime types. H1: the mean crime rate differs for at least one of the crime types.
#create a list of all unique crimes
crime_list = list(n_s_wales_grouped['Crime type'].unique())
#create an empty list
crime_type_cnt = []
#create a for loop to append each crime type and their monthly crime to the list crime_type_cnt
for i in crime_list:
crime_temp = s_wales_grouped[s_wales_grouped['Crime type']== i]
crime_temp = crime_temp[['Norm Crime count']]
crime_temp = crime_temp.rename(columns={"Norm Crime count": i})
crime_type_cnt.append(crime_temp)
#import the method needed for this test
from scipy.stats import f_oneway
#apply the f_oneway method, unpacking the list of 14 crime-type samples
stat, p = f_oneway(*crime_type_cnt)
#print results
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the\
null hypothesis that the means of the different crime types are equal')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the null\
hypothesis that the means of the different crime types are equal')
stat=123.629, p=0.000
At a 5% significance level, we have enough statistical evidence to reject the null hypothesis that the means of the different crime types are equal
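A one-way ANOVA only says that at least one group mean differs; it does not say which. A post-hoc pairwise test such as Tukey's HSD (scipy.stats.tukey_hsd, available from SciPy 1.9) identifies the differing pairs; a sketch on synthetic samples, not the notebook's data:

```python
import numpy as np
from scipy.stats import tukey_hsd

# three synthetic monthly-rate samples standing in for three crime types
rng = np.random.default_rng(0)
asb = rng.normal(30, 1, 22)   # e.g. Anti-social behaviour
vso = rng.normal(30, 1, 22)   # e.g. Violence and sexual offences
bug = rng.normal(10, 1, 22)   # e.g. Burglary, clearly lower

# pairwise p-values; rows and columns follow the argument order (asb, vso, bug)
res = tukey_hsd(asb, vso, bug)
print(res.pvalue.round(3))
```

The pairs involving the clearly lower sample come out with near-zero p-values, pinpointing which crime types drive the ANOVA result.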
Research Question 4.0 Is there a difference in Burglary rate during winter and Summer in North Wales?
Details
- We will be using the paired t-test to check for differences between the seasonal crime rates.
- The scipy.stats module is needed and we need to import the ttest_rel function from it
- The ttest_rel function takes two arguments: the 1-d arrays of the paired samples to be compared.
- We will use a significance level of 0.05: when the p-value is below this we reject the null hypothesis; otherwise we fail to reject it
To-do In this experiment the goal is to compare the Burglary rate in the two seasons, Winter and Summer. Because we are comparing two measurements of the same response on the same subjects, we use the paired t-test.
H0: the crime rate in winter and summer are the same. H1: the crime rate in winter and summer are not the same.
# Import the ttest_rel module from scipy
from scipy.stats import ttest_rel
#assign the north wales normed data as pair_1_n_temp
#create a dataframe pair_1_n_temp of all monthly crime rate in Winter
pair_1_n_temp = n_wales_grouped[n_wales_grouped['season']== 'Winter']
pair_1_n_temp = pair_1_n_temp [pair_1_n_temp['Year'] == '2019']
pair_1_n_temp = pair_1_n_temp[pair_1_n_temp['Crime type'] == 'Burglary']
pair_1_n = pair_1_n_temp['Norm Crime count']
#create a dataframe pair_2_n_temp of all monthly crime rate in Summer
pair_2_n_temp = n_wales_grouped[n_wales_grouped['season']== 'Summer']
pair_2_n_temp = pair_2_n_temp [pair_2_n_temp['Year'] == '2019']
pair_2_n_temp = pair_2_n_temp[pair_2_n_temp['Crime type'] == 'Burglary']
pair_2_n = pair_2_n_temp['Norm Crime count']
#assign outputs into stat and p
stat, p = ttest_rel(pair_1_n, pair_2_n)
#print output
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the null hypothesis and accept alternative hypothesis that the burglary crime rate during Winter and summer are different')
stat=-2.312, p=0.147
At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same
Research Question 5.0 Is there a difference in Burglary rate during winter and Summer in South Wales?
To-do In this experiment the goal is to compare the Burglary rate in the two seasons, Winter and Summer. Because we are comparing two measurements of the same response on the same subjects, we use the paired t-test.
H0: the crime rate in winter and summer are the same. H1: the crime rate in winter and summer are not the same.
# Import the ttest_rel module from scipy
from scipy.stats import ttest_rel
#assign the south wales normed data as pair_1_n_temp
#create a dataframe pair_1_s of all monthly crime rate in Winter
pair_1_s_temp = s_wales_grouped[s_wales_grouped['season']== 'Winter']
pair_1_s_temp = pair_1_s_temp [pair_1_s_temp['Year'] == '2019']
pair_1_s_temp = pair_1_s_temp[pair_1_s_temp['Crime type'] == 'Burglary']
pair_1_s = pair_1_s_temp['Norm Crime count']
#create a dataframe pair_2_s of all monthly crime rate in Summer
pair_2_s_temp = s_wales_grouped[s_wales_grouped['season']== 'Summer']
pair_2_s_temp = pair_2_s_temp [pair_2_s_temp['Year'] == '2019']
pair_2_s_temp = pair_2_s_temp[pair_2_s_temp['Crime type'] == 'Burglary']
pair_2_s = pair_2_s_temp['Norm Crime count']
#assign outputs into stat and p
stat, p = ttest_rel(pair_1_s, pair_2_s)
#print output
print('stat=%.3f, p=%.3f' % (stat, p))
if p > 0.05:
print('At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same')
else:
print('At a 5% significance level, we have enough statistical evidence to reject the null hypothesis and accept alternative hypothesis that the burglary crime rate during Winter and summer are different')
stat=-0.453, p=0.695
At a 5% significance level, we do not have enough statistical evidence to reject the null hypothesis that the burglary crime rate during Winter and summer are the same
5.0 Heatmap
This is a simple heat map created using the Folium library to model the crime recorded by both the North and South Wales police districts
To-do In this task we would be drawing a heatmap for crime rate in both North and South Wales
Details
- The library used is Folium, which needs to be imported before use
- To use this chunk of code you will need to import the joined csv files
- The columns 'Context', 'Last outcome category' and 'Crime ID' need to be dropped from both dataframes
- Both dataframes need to be concatenated into a single one
- Use the zip method to combine the values in the Latitude and Longitude columns
- Use the folium.Map method to set the default location of the map
- Set the Latitude and Longitude to be plotted
Warning!!!!! Ensure you insert the paths to the concatenated North and South Wales csv files!
def heat_map(path_n,path_s):
'''This is a function for creating heatmaps. The function takes the paths to the concatenated North
and South Wales files and uses them to produce a heatmap of the crime locations in these regions'''
#read the north wales file from path_n
map_n_wales = pd.read_csv(path_n)
#read the south wales file from path_s
map_s_wales = pd.read_csv(path_s)
#drop the unwanted columns
map_n_wales = map_n_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
#drop the unwanted columns
map_s_wales = map_s_wales.drop(['Context', 'Last outcome category','Crime ID'], axis=1)
#drop rows with missing values
map_s_wales = map_s_wales.dropna()
#drop rows with missing values
map_n_wales = map_n_wales.dropna()
#concatenate the cleaned north and south wales data
map_s_n_wales = pd.concat([map_s_wales,map_n_wales])
#zip two required columns and assign to a variable
crime_locations = list(zip(map_s_n_wales.Latitude, map_s_n_wales.Longitude))
# generate map
base_map = folium.Map(location=[52.2417, -3.3777], zoom_start=6.5)
heatmap = plugins.HeatMap(crime_locations, radius=3, blur=2)
return base_map.add_child(heatmap)
#call the function with two arguments: the paths to the North and South Wales files
heat_map('/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Northwales/North_wales_tot.csv',\
'/home/kp/Documents/David Kidner Assessment/Assessment/Lateef M K 30028762/Southwales/South_wales_tot.csv')
Data Exploration App
This is a simple app created using the Tkinter and Pillow libraries to explore the crime recorded by both the North and South Wales police districts. The dataset used for this application has been normalized, so it shows crime rate per 10,000 people in both regions.
To-do In this task we will create an application for exploring the dataset.
Details
- The libraries used are Tkinter and Pillow, which need to be imported before use
- To use this chunk of code, ensure the data wrangling phase of this document has been completed
- We also need to confirm that the logo file is in the same directory as this document
- This code contains 8 functions, each with its own button on the GUI; clicking a button prompts the app to run the related function and display the output
- The Seaborn module was used extensively in this code to create the plots
- The app displays the output of any button clicked in this console
- On running this block of code the application is initiated and you should see a pop-up window where you can interact with the application
- Always close the application after use to avoid interference when using other parts of this document
import tkinter as tk
from PIL import Image, ImageTk
#set the root of the app
root = tk.Tk()
#define the space the app would cover
canvas = tk.Canvas(root, width=900, height=300)
#define the columns the app would span
canvas.grid(columnspan = 3)
#create a logo using the pillow image library
#create a variable logo to hold the image
logo = Image.open('logo.png')
#use the photoimage method to insert the image as logo
logo = ImageTk.PhotoImage(logo)
#assign the label
logo_label = tk.Label(image = logo)
logo_label.image = logo
#select the row and column image should appear
logo_label.grid(column=1, row =0)
#instructions
instructions = tk.Label(root, text='Data Explorer', font = "Raleway")
instructions.grid(columnspan = 3, column = 0, row=1)
def police_patrol():
'''This function shows the top 10 locations within the different regions that have the highest crime count;
this is to recommend locations in need of frequent police patrol to curb crime in the area.'''
# define a temporary variable to hold the count of crimes grouped by crime location
n_policing_temp = n_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
n_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
n_policing = n_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= n_policing)\
.set_title("Top ten Crime locations in North Wales per 10,000 people")
plt.show()
# define a temporary variable to hold the count of crimes grouped by crime location
s_policing_temp = s_wales[['LSOA name','Crime type']].groupby('LSOA name').count()
#reset index of the variable to change the current index
s_policing_temp.reset_index(inplace=True)
#sort the values in descending order to show top values
s_policing = s_policing_temp.sort_values(by=['Crime type'],ascending=False).head(10)
#use seaborn barplot method to create a bar plot of the top 10 values and set a title
_ = sns.barplot(x = 'Crime type', y = 'LSOA name', data= s_policing)\
.set_title("Top ten Crime locations in South Wales per 10,000 people")
plt.show()
#call help method on this function to show what it does to the user
help(police_patrol)
#switch the button status to report generated
browse_text.set("Report Generated, check console")
def Common_crimes():
    '''This function creates a grid plot using seaborn to show a monthly barplot of the crime rate
    for each crime type. The rows show the month while the columns show the region.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and this is where we select dataset to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Month', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order )
grid.fig.subplots_adjust(top=0.96)
#adding the title of the plot
_ = grid.fig.suptitle('Most common crime types per 10,000 people in North and South Wales', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Common_crimes)
#show the crime codes
print(crime_mapping)
browse_text2.set("Report Generated, check console")
def Covid_crimes():
    '''This function displays a barplot showing a comparison between the crime rate before COVID
    and during COVID. This was created using the seaborn FacetGrid.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='Covid', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order2 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots and add the order
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order = crime_order2)
grid.fig.subplots_adjust(top=0.8)
#adding the title of the plot
_ = grid.fig.suptitle('Difference in crime rate before and during COVID 19 per 10,000 people', fontsize=28)
plt.show()
#call help method on this function to show what it does to the user
help(Covid_crimes)
#show the crime codes
print(crime_mapping)
browse_text3.set("Report Generated, check console")
def Seasonal_crimes():
    '''This function displays a barplot showing a comparison of the crime rate grouped by season.
    This was created using the seaborn FacetGrid.'''
#We will plot a grid plot with a barchart showing the Crime count by crime type monthly for each region
#North Wales to the Left and South Wales to the right
#creating a FacetGrid with Seaborn library and assigning data to be used
grid = sns.FacetGrid(n_s_wales_grouped, row='season', col='Region', height=3.5,aspect = 2.65)
#create a variable crime order to hold the order in which the plots would be displayed
crime_order3 = list(n_s_wales_grouped['Crime code'].unique())
#lets populate the facetGrid with the specific plots
_ = grid.map(sns.barplot, 'Crime code','Norm Crime count',alpha=0.5,order=crime_order3)
plt.show()
#call help method on this function to show what it does to the user
help(Seasonal_crimes)
#show the crime codes
print(crime_mapping)
browse_text4.set("Report Generated, check console")
def crime_corr():
    '''This function displays the correlation between the normalized crime rates of North and South Wales.
    This was created using the scatterplot method of the seaborn library.'''
#splice the required columns of South wales data into a variable
scat_s = dist_s_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_s =scat_s.rename(columns={"Norm Monthly Crime": "s_wales_cnt"})
#splice the required columns of North wales data into a variable
scat_n = dist_n_wales[['Month','Norm Monthly Crime']]
#rename the column names using the rename method of pandas
scat_n =scat_n.rename(columns={"Norm Monthly Crime": "n_wales_cnt",'Month':'Month2'})
#concatenate the north and south wales data into one dataframe
scatter_w = pd.concat([scat_s, scat_n], axis=1)
#reset the index to change the current index
scatter_w = scatter_w.reset_index()
#drop one of the duplicated months
scatter_w = scatter_w.drop(['Month2'], axis=1)
#use seaborn to plot a scatterplot for crime rate and set title of plot
_ = sns.scatterplot(data=scatter_w, x="s_wales_cnt", y="n_wales_cnt").\
set_title("Correlation Plot for North and South Wales crime rate")
plt.show()
#call help method on this function to show what it does to the user
help(crime_corr)
browse_text5.set("Report Generated, check console")
def Highest_crime():
    '''This is a bee swarm plot showing the normalized crime rate in each region, i.e. crime rate
    per 10,000 people in North and South Wales. This was created using the seaborn library.'''
    #insert the required columns into the method and select the data to be used
_ = sns.swarmplot(x = 'Region', y = 'Norm Monthly Crime', data= Monthly_crime_count_tot).\
set_title("Swarm Plot for North and South Wales crime rate per 10,000 people")
#create the x axis label
_ = plt.xlabel('Region')
#create the y axis label
_ = plt.ylabel('Monthly Crime count')
plt.show()
#call help method on this function to show what it does to the user
help(Highest_crime)
browse_text6.set("Report Generated, check console")
def crime_evo():
'''This function shows the evolution of crime over time in North and South Wales. This function
uses the normalized dataset which means it shows crime per 10,000 people in both regions'''
#splice the required columns from the dataframe and set index for north data
n_plot = dist_n_wales[['Month','Norm Monthly Crime']].set_index('Month')
#splice the required columns from the dataframe and set index for south data
s_plot = dist_s_wales[['Month','Norm Monthly Crime']].set_index('Month')
#assign the northern plot to a variable
ax = n_plot.plot()
#use the northern variable to create the southern plot
s_plot.plot(ax=ax)
#define the legend for the plot
ax.legend(["North Wales", "South Wales"])
plt.show()
#call help method on this function to show what it does to the user
help(crime_evo)
browse_text7.set("Report Generated, check console")
def pop_crime_type():
    '''The goal of this function is to display a pie chart of the most popular crime types. This was
    created using the plot.pie method in pandas.'''
    #plot the crime types of North Wales on a pie chart
    _ = crime_cnt_n_wales.set_index('Crime type').plot.pie(y='Percent', figsize=(15, 15))
    #plot the crime types of South Wales on a pie chart
    _ = crime_cnt_s_wales.set_index('Crime type').plot.pie(y='Percent', figsize=(15, 15))
plt.show()
browse_text8.set("Report Generated, check console")
#police_patrol button
#assign the browse text as a string
browse_text = tk.StringVar()
#create a button to call the police_patrol function on click and assign the parameters
browse_btn = tk.Button(root, textvariable=browse_text,command = lambda:police_patrol(), \
font="Raleway", bg="#20bebe", fg="white", height=2, width=50)
#set the name of the button
browse_text.set("Generate areas in need of patrol")
#assign the column and row the button should appear
browse_btn.grid(column=1, row=2)
#Common_crimes button
#assign the browse text as a string
browse_text2 = tk.StringVar()
#create a button to call the Common_crimes function on click and assign the parameters
browse_btn2 = tk.Button(root, textvariable=browse_text2, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Common_crimes(), \
height=2, width=50)
#set the name of the button
browse_text2.set("Common Crimes in North and South Wales")
#assign the column and row the button should appear
browse_btn2.grid(column=1, row=4)
#Covid_crimes button
#assign the browse text as a string
browse_text3 = tk.StringVar()
#create a button to call the Covid_crimes function on click and assign the parameters
browse_btn3 = tk.Button(root, textvariable=browse_text3, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Covid_crimes(), \
height=2, width=50)
#set the name of the button
browse_text3.set("Pre-Covid Crimes Vs Covid Crimes")
#assign the column and row the button should appear
browse_btn3.grid(column=1, row=6)
#Seasonal_crimes button
#assign the browse text as a string
browse_text4 = tk.StringVar()
#create a button to call the Seasonal_crimes function on click and assign the parameters
browse_btn4 = tk.Button(root, textvariable=browse_text4, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Seasonal_crimes(), \
height=2, width=50)
#set the name of the button
browse_text4.set("Crime rate by Season")
#assign the column and row the button should appear
browse_btn4.grid(column=1, row=7)
#crime_corr button
#assign the browse text as a string
browse_text5 = tk.StringVar()
#create a button to call the crime_corr function on click and assign the parameters
browse_btn5 = tk.Button(root, textvariable=browse_text5, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:crime_corr(), \
height=2, width=50)
#set the name of the button
browse_text5.set("North and South Wales Crime correlation")
#assign the column and row the button should appear
browse_btn5.grid(column=1, row=8)
#Highest_crime button
#assign the browse text as a string
browse_text6 = tk.StringVar()
#create a button to call the Highest_crime function on click and assign the parameters
browse_btn6 = tk.Button(root, textvariable=browse_text6, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:Highest_crime(), \
height=2, width=50)
#set the name of the button
browse_text6.set("Region With Highest Crime Rate")
#assign the column and row the button should appear
browse_btn6.grid(column=1, row=9)
#crime_evo button
#assign the browse text as a string
browse_text7 = tk.StringVar()
#create a button to call the crime_evo function on click and assign the parameters
browse_btn7 = tk.Button(root, textvariable=browse_text7, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:crime_evo(), \
height=2, width=50)
#set the name of the button
browse_text7.set("Evolution of crime over time")
#assign the column and row the button should appear
browse_btn7.grid(column=1, row=10)
#pop_crime_type button
#assign the browse text as a string
browse_text8 = tk.StringVar()
#create a button to call the pop_crime_type function on click and assign the parameters
browse_btn8 = tk.Button(root, textvariable=browse_text8, \
font="Raleway", bg="#20bebe", fg="white", command = lambda:pop_crime_type(), \
height=2, width=50)
#set the name of the button
browse_text8.set("Popular Crime type in North and South Wales")
#assign the column and row the button should appear
browse_btn8.grid(column=1, row=12)
canvas = tk.Canvas(root, width=600, height=250)
#set how many columns the app should span
canvas.grid(columnspan=3)
root.mainloop()
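The top-10 logic inside `police_patrol` (count crimes per LSOA, sort descending, keep the head) can be sketched on toy data without pandas using a stdlib `Counter`; the notebook itself uses `groupby('LSOA name').count()` followed by `sort_values(ascending=False).head(10)`. The LSOA names below are illustrative placeholders, not values from the real dataset.

```python
from collections import Counter

#toy stand-ins for the 'LSOA name' column (illustrative, not real data)
lsoa = ["Cardiff 001", "Swansea 002", "Cardiff 001",
        "Newport 003", "Cardiff 001", "Swansea 002"]

#count crimes per location and keep the busiest entries, mirroring
#groupby(...).count() then sort_values(ascending=False).head(n)
top = Counter(lsoa).most_common(2)
print(top)  # [('Cardiff 001', 3), ('Swansea 002', 2)]
```

`most_common(n)` already returns the counts sorted in descending order, so it plays the role of both the `sort_values` and `head` steps in the pandas version.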