Crime Pattern Analysis in Los Angeles (2024)

1. Data source, format, and content

The primary dataset used in this project was obtained from the U.S. Government’s open data portal, specifically the “Crime Data from 2020 to Present” collection provided by the Los Angeles Police Department (LAPD). The dataset records all reported crime incidents in Los Angeles from 2020 to the present.

The data is structured as a flat table in CSV format, containing 1,005,200 rows (including the header) and 28 columns. Each row represents a unique crime report. The columns include details such as:

  • Incident number (DR_NO), reporting and occurrence dates (Date Rptd, DATE OCC), and time (TIME OCC)
  • Location information, including area codes, reporting district, latitude, and longitude
  • Crime classification codes and descriptions (Crm Cd, Crm Cd Desc)
  • Victim information such as age, gender, and ethnicity
  • Weapon and premise details
  • Case status and administrative codes

The dataset is updated daily by the LAPD and is openly accessible to the public. It provides a rich view of crime trends and characteristics in Los Angeles, which makes it suitable for exploratory and statistical analysis.

2. Data Retrieval

The crime data was downloaded manually from https://catalog.data.gov/dataset/crime-data-from-2020-to-present. The dataset is maintained by the Los Angeles Police Department and made publicly available in CSV format.

We downloaded the full dataset and then filtered it to include only crime incidents that occurred during the 2024 calendar year. After filtering, the data was saved and processed locally as an Excel file (Crime_2024_with_Weather_Unemp_Monthly.xlsx) for further cleaning and integration with weather and economic data.
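A minimal sketch of this step with pandas is shown below; the raw CSV file name and the intermediate output name are assumptions (the enriched Excel file referenced above is produced later, in Sections 3 and 4):

```python
import pandas as pd

# Load the full LAPD export (file name assumed from the portal download).
crime = pd.read_csv("Crime_Data_from_2020_to_Present.csv")

# Parse occurrence dates and keep only incidents from calendar year 2024.
crime["DATE OCC"] = pd.to_datetime(crime["DATE OCC"], errors="coerce")
crime_2024 = crime[crime["DATE OCC"].dt.year == 2024].copy()

# Save a hypothetical intermediate file for the cleaning steps below.
crime_2024.to_excel("Crime_2024.xlsx", index=False)
```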

3. Data Cleaning and Transformation

After retrieving the datasets, several cleaning and transformation steps were applied to prepare a tidy, analysis-ready dataframe; a consolidated code sketch follows the list:

  • Drop unnecessary columns
    A number of administrative or redundant fields were removed, including alternate crime codes (Crm Cd 1–4), raw ID columns (DR_NO, Mocodes), and derived fields not used in analysis.

  • Standardize column names
    The original column names were long and inconsistent, such as TIME OCC and Vict Descent. These were renamed into clear, lowercase snake_case format (e.g., time_of_occurrence, victim_descent_code) to improve readability and code maintainability.

  • Encode categorical features
    Victim sex and descent were originally coded as single-letter abbreviations (M, F, W, H, etc.). These were mapped to numeric codes to facilitate statistical analysis and machine learning:

    • Victim sex: M → 0, F → 1, H → 2, unknowns mapped to None
    • Victim descent: mapped to integers by group (e.g., W → 0, B → 1, H → 2, etc.), with unknowns excluded.
  • Handle missing values in victim age
    Missing values in victim_age were filled using a two-step strategy:

    • First, fill with the mean victim age within the same crime type
    • Then, for remaining missing values, fill with the overall mean age
      This eliminated all null values, leaving a fully populated column.
  • Add derived time-based features
    From the date_of_occurrence and time_of_occurrence fields, new variables such as month, day_of_week, and hour were generated for later time-based trend analysis.

The final dataset was a clean, tidy table with consistent naming, properly formatted columns, and minimal missing values—ready for analysis and visualization.
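A consolidated sketch of these cleaning steps is shown below, assuming the raw LAPD column names listed in Section 1; the rename map is abbreviated, and the descent mapping shows only the first few codes:

```python
import pandas as pd

df = pd.read_excel("Crime_2024.xlsx")  # hypothetical intermediate file

# Drop administrative or redundant fields.
df = df.drop(columns=["DR_NO", "Mocodes", "Crm Cd 1", "Crm Cd 2",
                      "Crm Cd 3", "Crm Cd 4"], errors="ignore")

# Standardize names to lowercase snake_case (rename map abbreviated).
df = df.rename(columns={
    "DATE OCC": "date_of_occurrence", "TIME OCC": "time_of_occurrence",
    "AREA NAME": "area_name", "Crm Cd Desc": "crime_description",
    "Vict Age": "victim_age", "Vict Sex": "victim_sex",
    "Vict Descent": "victim_descent_code", "LAT": "latitude", "LON": "longitude",
})

# Encode categorical codes numerically; unmapped values become NaN (None).
df["victim_sex"] = df["victim_sex"].map({"M": 0, "F": 1, "H": 2})
df["victim_descent_code"] = df["victim_descent_code"].map(
    {"W": 0, "B": 1, "H": 2})  # extend with the remaining descent codes

# Two-step imputation for victim_age: crime-type mean, then overall mean.
df["victim_age"] = df.groupby("crime_description")["victim_age"].transform(
    lambda s: s.fillna(s.mean()))
df["victim_age"] = df["victim_age"].fillna(df["victim_age"].mean())

# Derived time-based features (time_of_occurrence is four-digit HHMM).
df["date_of_occurrence"] = pd.to_datetime(df["date_of_occurrence"])
df["month"] = df["date_of_occurrence"].dt.month
df["day_of_week"] = df["date_of_occurrence"].dt.day_name()
df["hour"] = df["time_of_occurrence"] // 100
```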

4. Data Enrichment

To enrich the dataset, we added external information from two sources; a sketch of both retrieval steps follows this list:

  • Climate Data:
    We retrieved historical weather data from the Open-Meteo API based on the latitude and longitude of each grid point. For each date-location pair, we calculated average temperature by combining daily maximum and minimum temperatures, and also included total daily precipitation.

  • Unemployment Rate:
    We used the U.S. Bureau of Labor Statistics API to collect monthly county-level unemployment rates. Using reverse geocoding and FIPS codes, we matched each geographic grid cell to the appropriate county and merged in the unemployment data by year and month.

The final dataset includes daily crime records enriched with contextual climate and socioeconomic variables, allowing for more robust analysis of external influences on crime patterns.
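A sketch of both retrieval steps is shown below. The Open-Meteo archive endpoint and its parameters follow the public API documentation, and the BLS request uses the v2 timeseries endpoint; the LAUS series-ID layout ("LAUCN" + county FIPS + measure code) is an assumption based on the published format:

```python
import requests
import pandas as pd

def fetch_daily_weather(lat: float, lon: float) -> pd.DataFrame:
    """Daily 2024 weather for one grid point from the Open-Meteo archive."""
    resp = requests.get(
        "https://archive-api.open-meteo.com/v1/archive",
        params={
            "latitude": lat, "longitude": lon,
            "start_date": "2024-01-01", "end_date": "2024-12-31",
            "daily": "temperature_2m_max,temperature_2m_min,precipitation_sum",
            "timezone": "America/Los_Angeles",
        },
        timeout=30,
    )
    resp.raise_for_status()
    daily = pd.DataFrame(resp.json()["daily"])
    # Average temperature = mean of daily max and min, as described above.
    daily["daily_avg_temperature_celsius"] = (
        daily["temperature_2m_max"] + daily["temperature_2m_min"]) / 2
    return daily.rename(columns={"time": "date",
                                 "precipitation_sum": "daily_precipitation_mm"})

def fetch_county_unemployment(fips: str = "06037", year: str = "2024") -> pd.DataFrame:
    """Monthly LAUS unemployment rate for one county (06037 = Los Angeles)."""
    series_id = f"LAUCN{fips}0000000003"  # assumed LAUS series-ID layout
    resp = requests.post(
        "https://api.bls.gov/publicAPI/v2/timeseries/data/",
        json={"seriesid": [series_id], "startyear": year, "endyear": year},
        timeout=30,
    )
    resp.raise_for_status()
    rows = resp.json()["Results"]["series"][0]["data"]
    return pd.DataFrame({
        "year": [int(r["year"]) for r in rows],
        "month": [int(r["period"].lstrip("M")) for r in rows],
        "unemployment_rate_pct": [float(r["value"]) for r in rows],
    })
```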

5. Data Validation

To ensure the integrity of the enriched dataset, we used pytest to perform a series of automated quality checks. The tests covered several key areas:

  • Missing or Invalid Values
    Checked that critical fields (e.g., victim age, sex, and crime descriptions) are present and cleaned properly, with values falling within expected categories or ranges.

  • Geographic Validity
    Verified that latitude and longitude coordinates fall within the expected boundaries of Los Angeles.

  • Coded Variables
    Confirmed that categorical variables like victim sex and descent codes use only valid numerical codes.

  • Enriched Data Ranges
    Ensured that added features—such as unemployment rates, average temperatures, and precipitation levels—stay within realistic and interpretable bounds.

All tests passed successfully, giving us confidence in the consistency and reliability of the data before further analysis.
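A few representative checks, written as a minimal pytest module; the file path and the numeric bounds are assumptions based on the cleaning steps above:

```python
import pandas as pd
import pytest

@pytest.fixture(scope="module")
def df():
    return pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

def test_victim_age_populated_and_plausible(df):
    assert df["victim_age"].notna().all()
    assert df["victim_age"].between(0, 120).all()

def test_coordinates_within_los_angeles(df):
    # Rough bounding box around the city (assumed bounds).
    assert df["latitude"].between(33.3, 34.9).all()
    assert df["longitude"].between(-119.0, -117.5).all()

def test_victim_sex_codes_are_valid(df):
    assert df["victim_sex"].dropna().isin([0, 1, 2]).all()

def test_enriched_values_within_realistic_ranges(df):
    assert df["unemployment_rate_pct"].between(0, 30).all()
    assert df["daily_avg_temperature_celsius"].between(-10, 55).all()
    assert (df["daily_precipitation_mm"] >= 0).all()
```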

6. Area Crime Visualization

We visualize the spatial and temporal distribution of crimes across Los Angeles using a time-based heatmap. First, we filter the dataset to include only records with valid latitude and longitude values. For each month, we extract up to 5,000 random crime locations to avoid overcrowding the map. These monthly subsets are compiled into the list of coordinates required by folium’s HeatMapWithTime. We then generate an interactive map centered on Los Angeles, which displays a dynamic heatmap that animates month by month throughout 2024. This visualization highlights how crime density varies spatially and evolves over time: crimes are clearly concentrated in Central LA, northwest LA, Santa Monica, and Long Beach.
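A sketch of the map construction, assuming the cleaned column names from Section 3 (treating 0 as an invalid coordinate is an assumption about the LAPD export):

```python
import pandas as pd
import folium
from folium.plugins import HeatMapWithTime

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# Keep only rows with valid coordinates.
geo = df[(df["latitude"] != 0) & (df["longitude"] != 0)].dropna(
    subset=["latitude", "longitude"])

# One animation frame per month, each capped at 5,000 random points.
frames, labels = [], []
for month, grp in geo.groupby("month"):
    sample = grp.sample(n=min(5000, len(grp)), random_state=42)
    frames.append(sample[["latitude", "longitude"]].values.tolist())
    labels.append(f"2024-{int(month):02d}")

m = folium.Map(location=[34.05, -118.25], zoom_start=10)  # centered on LA
HeatMapWithTime(frames, index=labels, radius=8).add_to(m)
m.save("la_crime_heatmap_2024.html")
```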


7. Data Analysis

(1) Top 10 Areas with Highest Number of Crimes

The bar plot shows the top 10 areas with the most crimes; only these areas are shown to keep the focus on key regions. Central leads by a large margin with over 10,000 cases, followed by Southwest and Pacific, with counts declining gradually across the remaining areas. Areas like Van Nuys, Olympic, and Devonshire still appear in the top 10 but have significantly lower totals, around 5,700–5,900 crimes. This suggests a strong spatial concentration of crime, especially in downtown and coastal regions.
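A plot of this kind can be produced with a simple value count (a sketch, assuming the enriched file and column names from earlier sections):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# Count crimes per area and keep the ten largest.
df["area_name"].value_counts().head(10).plot(kind="bar")
plt.title("Top 10 Areas by Number of Crimes (2024)")
plt.xlabel("Area")
plt.ylabel("Crime count")
plt.tight_layout()
plt.show()
```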

(2) 10 Major Crime Categories

The bar plot displays the top 10 crime types in 2024. Vehicle theft is by far the most frequent, with over 21,000 cases, followed by burglary from vehicles and petty theft. Other frequent categories include vandalism, theft from motor vehicles, identity theft, grand theft, and trespassing. Property-related crimes, especially low-value theft and vehicle-related offenses, therefore make up the majority of the top 10, highlighting their prominence in urban crime patterns.

(3) Monthly Crime Quantity Calculation

The line plot shows monthly crime counts in Los Angeles for 2024. A clear sharp decline starts after March, dropping from over 16,000 crimes to around 8,000 by June. From June onward, crime levels remain relatively stable at a lower rate, with a slight drop again in December. This pattern suggests a seasonal effect, with higher crime rates in early spring and much lower rates during summer and winter months.
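A sketch of the monthly aggregation behind this plot (column names as in Section 3):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# Count crimes per calendar month and plot the trend.
df.groupby("month").size().plot(marker="o")
plt.title("Monthly Crime Counts in Los Angeles (2024)")
plt.xlabel("Month")
plt.ylabel("Crime count")
plt.xticks(range(1, 13))
plt.tight_layout()
plt.show()
```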

(4) Crime Type Distribution by Hour of Day

A stacked bar chart shows the number of misdemeanors and felonies by hour of the day. Crime counts increase throughout the day, peaking in the late afternoon and evening.
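One way to build such a chart is a crosstab of hour against severity (a sketch; the crime_part_category column is taken from the modeling section below):

```python
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# Cross-tabulate hour of day against crime severity, then stack the bars.
by_hour = pd.crosstab(df["hour"], df["crime_part_category"])
by_hour.plot(kind="bar", stacked=True, figsize=(10, 5))
plt.title("Misdemeanors and Felonies by Hour of Day")
plt.xlabel("Hour of day")
plt.ylabel("Crime count")
plt.tight_layout()
plt.show()
```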

(5) Average Temperature vs. Daily Crime Amount

A scatter plot shows the relationship between average daily temperature and daily crime counts. Crime incidents generally decrease as temperatures rise, and a red regression line highlights the negative trend. An annotation on the plot notes that there are no records below 8°C, so the left edge of the temperature range should not be over-interpreted.
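A sketch of the daily aggregation and the fitted line; np.polyfit is used here for the regression line, though the original may have used a different fitting routine:

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# Aggregate to one point per day: average temperature vs. crime count.
daily = df.groupby("date_of_occurrence").agg(
    avg_temp=("daily_avg_temperature_celsius", "mean"),
    crimes=("hour", "size"))

plt.scatter(daily["avg_temp"], daily["crimes"], alpha=0.5)

# Least-squares line to highlight the overall trend.
slope, intercept = np.polyfit(daily["avg_temp"], daily["crimes"], 1)
xs = np.linspace(daily["avg_temp"].min(), daily["avg_temp"].max(), 100)
plt.plot(xs, slope * xs + intercept, color="red")

plt.xlabel("Average daily temperature (°C)")
plt.ylabel("Daily crime count")
plt.title("Average Temperature vs. Daily Crime Count")
plt.tight_layout()
plt.show()
```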

(6) Weekend/Weekday Temperature vs Crime Quantity

This scatter plot compares the relationship between average temperature and daily crime counts on weekdays and weekends. Each point represents the total crime count for a given temperature. Different markers (circles and triangles) were used to distinguish weekdays and weekends more clearly. Regression lines show that crime rates are generally higher on weekdays, but both decrease as temperature rises.

(7) Unemployment Rate vs. Monthly Crime Volume

This plot examines the relationship between the unemployment rate and monthly crime counts. A slight negative trend is observed, suggesting that higher unemployment rates might be associated with lower crime counts. The regression line with its 95% confidence interval provides statistical context for this negative relationship.
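Seaborn’s regplot draws the fitted line together with a confidence band, which matches the description above (a sketch, aggregating to one point per month):

```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# One point per month: mean unemployment rate vs. total crimes.
monthly = df.groupby("month").agg(
    unemployment=("unemployment_rate_pct", "mean"),
    crimes=("hour", "size")).reset_index()

# ci=95 draws the fitted line with a 95% confidence band.
sns.regplot(data=monthly, x="unemployment", y="crimes", ci=95)
plt.xlabel("Unemployment rate (%)")
plt.ylabel("Monthly crime count")
plt.title("Unemployment Rate vs. Monthly Crime Volume")
plt.tight_layout()
plt.show()
```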

8. Data Modeling

(1) Data Pre-processing

We begin by preparing the target variable and extracting key features for our binary classification model. The response variable crime_part_category is recoded so that misdemeanors are labeled as 0 and felonies as 1. To capture temporal patterns in crime, we derive an is_weekend indicator from the date_of_occurrence, assigning a value of 1 if the crime occurred on a Saturday or Sunday, and 0 otherwise. In addition, we extract the hour component from the four-digit time_of_occurrence field to reflect the time of day when the crime took place. These transformations allow us to incorporate both calendar-based and time-of-day information into our model.
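A sketch of these transformations; the string labels in crime_part_category are an assumption (the raw LAPD field distinguishes Part 1 and Part 2 offenses):

```python
import pandas as pd

df = pd.read_excel("Crime_2024_with_Weather_Unemp_Monthly.xlsx")

# Binary target: 0 = misdemeanor, 1 = felony (label values assumed).
y = df["crime_part_category"].map({"misdemeanor": 0, "felony": 1})

# Weekend indicator: Saturday = 5, Sunday = 6 in pandas' dayofweek.
df["date_of_occurrence"] = pd.to_datetime(df["date_of_occurrence"])
df["is_weekend"] = (df["date_of_occurrence"].dt.dayofweek >= 5).astype(int)

# Hour of day from the four-digit HHMM time field.
df["hour"] = df["time_of_occurrence"] // 100
```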

(2) One-hot Coding

We perform one-hot encoding on three categorical variables: area_name, victim_sex, and victim_descent_code, using pd.get_dummies with drop_first=False to preserve all category levels. In addition to the encoded categorical features, we include a set of numerical predictors: month, is_weekend, hour, victim_age, daily_precipitation_mm, daily_avg_temperature_celsius, and unemployment_rate_pct. We then construct the final feature matrix X by combining the numerical columns with the newly generated one-hot encoded columns extracted from the transformed dataframe. This matrix serves as the input for our classification model.
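Continuing the sketch above (df and y as defined there):

```python
import pandas as pd

# One-hot encode the categorical predictors, keeping every level.
encoded = pd.get_dummies(
    df, columns=["area_name", "victim_sex", "victim_descent_code"],
    drop_first=False)

numeric_cols = ["month", "is_weekend", "hour", "victim_age",
                "daily_precipitation_mm", "daily_avg_temperature_celsius",
                "unemployment_rate_pct"]
dummy_cols = [c for c in encoded.columns
              if c.startswith(("area_name_", "victim_sex_",
                               "victim_descent_code_"))]

# Final feature matrix: numerical predictors plus one-hot columns.
X = encoded[numeric_cols + dummy_cols]
```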

(3) Split Train and Test Set

We split the data into training and test sets using a 70/30 split. To ensure that the class distribution (i.e., felony vs. misdemeanor) remains consistent across both sets, we apply stratified sampling based on the target variable y. A fixed random_state is used for reproducibility.
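The corresponding scikit-learn call (the random_state value here is arbitrary):

```python
from sklearn.model_selection import train_test_split

# 70/30 split, stratified on the target to preserve the felony share.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)
```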

(4) Logistic Regression

We train a logistic regression model using the same feature set to predict whether a crime is a felony. After fitting the model on the training data, we evaluate its performance on the test set using accuracy, ROC AUC, precision, and recall. These metrics allow us to assess both the overall classification quality and the model’s ability to correctly identify serious crimes (felonies). The probabilistic output from the model is also used to generate the ROC curve for visual comparison with tree-based models.
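A sketch of the fit-and-evaluate step that produces metrics like those reported below (max_iter is raised as a common convergence safeguard, not a setting stated in the original):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score)

logreg = LogisticRegression(max_iter=1000)
logreg.fit(X_train, y_train)

pred = logreg.predict(X_test)
proba = logreg.predict_proba(X_test)[:, 1]  # P(felony), used for the ROC curve

print("Accuracy: ", accuracy_score(y_test, pred))
print("ROC AUC:  ", roc_auc_score(y_test, proba))
print("Precision:", precision_score(y_test, pred))
print("Recall:   ", recall_score(y_test, pred))
```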

Logistic regression:
Accuracy: 0.7408
ROC AUC: 0.8212
Precision: 0.5916
Recall: 0.5477

(5) Random Forest & XGBoost

We train two classification models—Random Forest and XGBoost—to predict whether a given crime is a felony or a misdemeanor. Both models are trained using the same set of features and evaluated on the test set. For Random Forest, we use 100 trees with a fixed random seed. For XGBoost, we specify logloss as the evaluation metric to align with the binary classification objective.

To assess model performance, we report four standard classification metrics: Accuracy, ROC AUC, Precision, and Recall, focusing on the positive class (felony). These metrics help evaluate not only overall correctness but also the model’s ability to detect serious crimes.
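A sketch of both models under the settings described above (assumes the xgboost package is installed):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import (accuracy_score, roc_auc_score,
                             precision_score, recall_score)
from xgboost import XGBClassifier

models = {
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "XGBoost": XGBClassifier(eval_metric="logloss", random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    proba = model.predict_proba(X_test)[:, 1]
    print(name)
    print("  Accuracy: ", accuracy_score(y_test, pred))
    print("  ROC AUC:  ", roc_auc_score(y_test, proba))
    print("  Precision:", precision_score(y_test, pred))
    print("  Recall:   ", recall_score(y_test, pred))
```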

Random Forest:
Accuracy: 0.7999
ROC AUC: 0.8924
Precision: 0.6770
Recall: 0.6862

XGBoost:
Accuracy: 0.8171
ROC AUC: 0.9101
Precision: 0.6939
Recall: 0.7407

(6) ROC Curve

The ROC curve compares the classification performance of the three models: Random Forest, XGBoost, and Logistic Regression. Random Forest and XGBoost achieve AUCs of roughly 0.89 and 0.91, respectively, indicating strong ability to distinguish between felonies and misdemeanors; Logistic Regression performs slightly lower, at 0.82. Overall, the tree-based models show an edge in predictive power, especially at lower false positive rates, and all three curves sit well above the diagonal representing random guessing.
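The curves can be drawn directly from the fitted estimators (continuing from the model sketches above):

```python
import matplotlib.pyplot as plt
from sklearn.metrics import RocCurveDisplay

ax = plt.gca()
for name, model in [("Random Forest", models["Random Forest"]),
                    ("XGBoost", models["XGBoost"]),
                    ("Logistic Regression", logreg)]:
    RocCurveDisplay.from_estimator(model, X_test, y_test, name=name, ax=ax)
ax.plot([0, 1], [0, 1], linestyle="--", color="gray")  # chance line
ax.set_title("ROC Curves: Felony vs. Misdemeanor Classification")
plt.tight_layout()
plt.show()
```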

(7) Precision-Recall curve

The precision-recall curve compares the performance of Random Forest, XGBoost, and Logistic Regression in identifying felonies. Both tree-based models maintain high precision at low recall levels, indicating strong performance in identifying the most confident felony predictions. In contrast, Logistic Regression starts with a sharp drop in precision at very low recall—an expected behavior when very few positive predictions are made early and one misclassification leads to zero precision. As recall increases, Logistic Regression shows a more stable precision-recall trade-off, while Random Forest and XGBoost gradually decline. Overall, the tree-based models outperform Logistic Regression across most recall levels, particularly in high-precision regions.

(8) Feature Importance

The feature importance plot shows the top 15 predictors used by the Random Forest model to distinguish between felonies and misdemeanors. victim_age and daily_avg_temperature_celsius are the two most influential features, suggesting that both demographic and environmental factors significantly impact the likelihood of a crime being a felony. Time-related variables like hour and month also contribute meaningfully, indicating temporal patterns in serious crimes. Several victim-related characteristics—including sex and descent code—appear among the top features, highlighting the relevance of demographic factors. While socioeconomic indicators such as unemployment_rate_pct and is_weekend show lower importance, they still provide additional context in the model’s decision process.
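A sketch of how the importance plot is obtained from the fitted forest (impurity-based importances; continues from the model sketches above):

```python
import pandas as pd
import matplotlib.pyplot as plt

rf = models["Random Forest"]  # fitted model from the earlier sketch

# Rank features by the forest's impurity-based importances.
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.nlargest(15).sort_values().plot(kind="barh")
plt.title("Top 15 Feature Importances (Random Forest)")
plt.xlabel("Importance")
plt.tight_layout()
plt.show()
```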

9. Conclusion

Our analysis shows that crime in Los Angeles is spatially and temporally patterned. Central, Southwest, and Pacific are consistently high-crime areas. Theft-related offenses, especially vehicle theft, dominate the crime types. We observed a sharp monthly decline after March, with crime levels remaining low during summer and winter. Crimes tend to occur more often in the afternoon and early evening, and weekdays see slightly higher counts than weekends. Additionally, higher unemployment rates are associated with fewer reported crimes. These findings highlight the need to consider location, time, and social context in crime prevention strategies.
In our modeling of crime severity classification, we compared logistic regression, random forest, and XGBoost. Among these, XGBoost performed the best across accuracy, AUC, precision, and recall. The most important features were victim age, average temperature, and hour of the day. These results suggest that crime severity is shaped not only by who the victim is, but also by when and under what conditions the crime occurs.
Understanding these patterns can help law enforcement agencies optimize patrol timing and resource deployment, and support targeted interventions for vulnerable groups.

10. Git Commit History Graph