Impact of Denoising and Image Masking on Classification Performance

Author

Group: Image 1: Brian Wang, Eden Cai, Kyle Chan, Michael Yip, Yuecheng Wang

Published

1 Jun 2025

Show code
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import pickle
import seaborn as sns
import numpy as np
from shinyapp_func import get_heatmap, get_transform
from PIL import Image

pio.renderers.default = "notebook_connected"

1 Executive summary

This report investigates the influence of denoising and image masking on classification model performance in medical imaging. Reliable diagnosis and treatment planning rely on effective preprocessing because positional variation and image noise may significantly reduce diagnostic accuracy.
The first research question investigates how various denoising techniques affect classification models. We evaluate four models, CNN, SVM, Random Forest (RF), and K-Nearest Neighbours (KNN), using traditional denoisers (Gaussian Blur, Median Blur) and deep learning-based techniques (DnCNN, Restormer).
The second research question investigates whether an image’s central areas are more important for classification. We test this using centre, non-centre, and random masking strategies based on the assumption that diagnostic features tend to be central.
This study addresses these questions and provides useful guidance on effective preprocessing techniques. In medical imaging, determining the best denoising and masking strategies can result in higher classification accuracy and more trustworthy diagnostic outcomes.

2 Background

This study grouped cell types into three primary categories. The non-tumorous category comprised B Cells and CD4+ T Cells (13,440 samples), chosen because they are key immune effectors in tumour microenvironments (Elena, L., & Lopera, D. E., 2013). The non-invasive tumour category comprised DCIS 1 and DCIS 2 (24,606 samples), which capture different pre-invasive breast region patterns (Allred, D. C. 2010). The invasive tumour category comprised Invasive Tumours and Proliferative Invasive Tumours (38,149 samples).

In medical imaging, noise is abundant in images produced by digital modalities like X-Ray, Ultrasound and CT. This reduces image quality, complicating the process of correctly diagnosing certain illnesses (Sai, G. V., 2023). Denoising is a crucial component in image pre-processing, helping prevent incorrect disease diagnosis (Alanazi & Mercorelli, 2024), improving model accuracy, preventing overfitting, and contributing to the robustness of predictive models (Tamilselvi, C., 2024).

Beyond denoising, techniques such as masking can shape how machine learning (ML) models learn from specific cellular features. Cell images have characteristics distinct from those of natural images, suggesting that specific image areas may hold greater predictive power for ML models. Exploring masking strategies, such as central vs non-central regions, may provide insight into whether crucial diagnostic features lie in those areas. In medical imaging, disease-related features are often sparse and localised, with remaining regions typically appearing normal and undifferentiated (Xie, Y. 2024), highlighting the potential for masking to focus model attention on critical, yet limited, areas.

3 Methods

3.1 Research Question 1: Denoising Strategies vs. Classification Performance Analysis

To investigate the first research question, we designed a consistent experimental pipeline that compares four classification models on H&E-stained cell images:

  • CNN automatically learns complex visual patterns, such as nuclear shape, texture, and aggregation, which is ideal for classifying cell types directly from pixel data.
  • SVM works well in separating classes using extracted features like colour and texture, which helps detect subtle differences between similar tumour types effectively.
  • RF works well with high-dimensional and flattened images and is robust to noise, making it ideal for assessing the impact of denoising.
  • KNN is a simple and interpretable model relying on feature similarity, which is a useful baseline for evaluating how denoising improves classification performance.

The CNN model was trained with early stopping, a batch size of 128, and a learning rate of 0.001, while traditional models (SVM, RF, KNN) used flattened image features and were assessed using the F1 score.
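The early-stopping rule mentioned above can be sketched as follows. This is a minimal illustration rather than the project's training loop: the `EarlyStopping` class, the `epochs_run` helper, the patience of 3 epochs, and the example losses are all hypothetical, since the report does not state the stopping criterion.

```python
class EarlyStopping:
    """Stop training once validation loss has not improved for `patience` epochs."""

    def __init__(self, patience=3, min_delta=0.0):
        self.patience = patience      # assumed value, not taken from the report
        self.min_delta = min_delta    # minimum decrease that counts as an improvement
        self.best_loss = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss; return True if training should stop."""
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience


def epochs_run(val_losses, patience=3):
    """Return the number of epochs executed before early stopping triggers."""
    stopper = EarlyStopping(patience=patience)
    for epoch, loss in enumerate(val_losses, start=1):
        if stopper.step(loss):
            return epoch
    return len(val_losses)
```

In a real loop, `step` would be called once per epoch with the validation loss, and the best model weights would typically be saved whenever the loss improves.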

In research question 1, the performance of three denoising groups is assessed, including:

  • No Denoiser (Raw Images)
    • Raw H&E images are used directly, considered as a baseline to measure the impact of denoising.
  • Classical Denoisers
    • Gaussian Blur reduces high-frequency noise using a bell-shaped kernel, considered widely used and easy to implement.
    • Median Blur preserves edges (such as cell boundaries) better than Gaussian Blur and is effective against salt-and-pepper noise, which commonly occurs in microscopy images. It also allows non-linear smoothing to be compared with Gaussian Blur.
  • Deep learning-based Denoisers
    • DnCNN (Denoising Convolutional Neural Network) is a deep learning model for removing complex noise in medical images. It uses multiple convolutional layers with batch normalisation and ReLU activation to learn a residual mapping from noisy images to clean images (Zhang et al., 2017).
    • Restormer applies transformer-based self-attention to denoise while keeping fine structural details in H&E images (Liu et al., 2023).

Mid-level denoising settings are used consistently: kernel size 5 for the classical filters, the DnCNN model pretrained with sigma=25, and the Restormer blind model.
See Figure 3
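To make the two classical filters concrete, the sketch below implements a 5×5 Gaussian blur and a 5×5 median filter in plain NumPy. It is an illustration only and not the project's `GaussianBlur` or `MedianBlur` helpers; the function names, the reflect padding, and `sigma=1.0` are assumptions, because the report specifies only the kernel size.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_kernel(size=5, sigma=1.0):
    """Return a normalised size x size Gaussian kernel (sigma is an assumed value)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def _windows(channel, size):
    """All size x size neighbourhoods of a 2-D array, using reflect padding."""
    pad = size // 2
    padded = np.pad(channel, pad, mode="reflect")
    return sliding_window_view(padded, (size, size))  # shape (H, W, size, size)


def gaussian_blur(image, size=5, sigma=1.0):
    """Linear smoothing: each pixel becomes a Gaussian-weighted neighbourhood mean."""
    kernel = gaussian_kernel(size, sigma)
    out = np.empty(image.shape, dtype=np.float64)
    for c in range(image.shape[2]):
        win = _windows(image[..., c].astype(np.float64), size)
        out[..., c] = (win * kernel).sum(axis=(-2, -1))
    return out


def median_blur(image, size=5):
    """Non-linear smoothing: each pixel becomes the median of its neighbourhood."""
    out = np.empty(image.shape, dtype=np.float64)
    for c in range(image.shape[2]):
        out[..., c] = np.median(_windows(image[..., c], size), axis=(-2, -1))
    return out
```

On an image with a single bright outlier pixel, the median filter removes the outlier entirely, while the Gaussian filter only spreads it over its neighbours, which is why the median filter suits salt-and-pepper noise.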

To fairly assess the impact of each denoising method, a consistent preprocessing and evaluation pipeline was utilised:

  • All images were resized to 224×224, converted to tensors, and normalised using ImageNet statistics to ensure compatibility and stable training.
  • The dataset was split into 50% training and 50% testing, with the test set further divided into 10 equal partitions. Models were trained once on the training set and evaluated across all 10 test partitions, improving the computational efficiency while ensuring full test coverage.
  • Performance across all denoisers and models was measured using the F1 score to reflect classification quality.
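The evaluation scheme above (train once, then score on 10 equal test partitions) can be sketched as follows. This is a hedged illustration: `macro_f1` and `partitioned_f1` are hypothetical helpers, and macro averaging is an assumption, since the report does not state which F1 averaging was used.

```python
import numpy as np


def macro_f1(y_true, y_pred, n_classes):
    """Unweighted mean of per-class F1 over the classes present in this partition."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        if denom == 0:
            continue  # class absent from both labels and predictions
        scores.append(2 * tp / denom)
    return float(np.mean(scores)) if scores else 0.0


def partitioned_f1(y_true, y_pred, n_classes, n_parts=10, seed=0):
    """Shuffle the test indices, split them into n_parts, and score each part."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y_true))
    return [macro_f1(y_true[part], y_pred[part], n_classes)
            for part in np.array_split(idx, n_parts)]
```

The ten resulting scores per model and denoiser are what a boxplot such as Figure 1 would summarise.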

3.2 Research Question 2: Spatial Sensitivity Analysis

We use masking-based perturbation techniques to evaluate whether central image regions have a greater influence on classification performance. Additionally, we apply Gradient-weighted Class Activation Mapping (Grad-CAM) to interpret spatial attention in CNNs and highlight influential regions.

3.2.1 Masking Strategies

Three masking strategies were used to examine spatial sensitivity:

  • Centre Mask: The outer areas are masked while a central circle with a radius of 80 pixels remains visible. This mask tests classification using central features only.
  • Non-Centre Mask: The same central circle is blocked instead, which enables an evaluation of the importance of peripheral features.
  • Random Mask: 50% of the image is randomly blocked by a 10x10 grid (with a fixed seed). This acts as a baseline to evaluate robustness to unstructured degradation.

See Figure 4

These masking techniques aim to reveal how models react to different forms of spatial perturbation. By comparing centre and non-centre masks, we evaluate region-specific sensitivity, that is, whether central features are more important than peripheral ones. In contrast, the random mask introduces unstructured feature loss, which enables us to assess the model’s robustness to unpredictable degradation. When combined, these techniques provide insight into how various models extract and rely on spatial information by revealing the robustness and distribution of informative features throughout the spatial field.
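The three masks can be sketched in NumPy as below. This is an illustration under stated assumptions, not the project's `get_transform` implementation: the function names are hypothetical, masked pixels are assumed to be set to zero, and the "10x10 grid" is read as 10 by 10 cells, of which exactly half are blocked.

```python
import numpy as np

IMG_SIZE = 224   # image side length used in the report
RADIUS = 80      # radius of the central circle, in pixels


def centre_visible_mask(size=IMG_SIZE, radius=RADIUS):
    """True inside the central circle (the pixels kept by the Centre Mask)."""
    yy, xx = np.mgrid[:size, :size]
    c = (size - 1) / 2.0
    return (yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2


def random_grid_mask(size=IMG_SIZE, grid=10, keep_fraction=0.5, seed=0):
    """True for kept pixels; a fixed seed makes the blocked cells reproducible."""
    rng = np.random.default_rng(seed)
    n_cells = grid * grid
    keep = np.zeros(n_cells, dtype=bool)
    kept_cells = rng.choice(n_cells, size=int(n_cells * keep_fraction), replace=False)
    keep[kept_cells] = True
    keep = keep.reshape(grid, grid)
    cell = np.arange(size) * grid // size   # grid cell index of each pixel row/column
    return keep[np.ix_(cell, cell)]


def apply_mask(image, visible):
    """Set every pixel of an (H, W, C) image to zero where `visible` is False."""
    return image * visible[..., None]
```

The Non-Centre Mask is simply the complement, `~centre_visible_mask()`, so the centre and non-centre conditions hide disjoint regions of the same image.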

3.2.1.1 Justification of masking strategies

We selected 50% masking based on prior research (He et al., 2021), which shows that heavier masking (e.g., 80%) removes too much information, while lighter masking (e.g., 10%) does not sufficiently challenge the model. A 50% mask achieves a balance between disruption and learning.

3.2.2 Evaluation Approach

Models are trained on the full dataset, as masking is a lightweight preprocessing step. For evaluation, because cross-validating CNN models is computationally expensive and computing resources and time were limited, we perform cross-validation on only 50% of the dataset.

3.2.2.1 Evaluation metrics

Due to class imbalance, the F1-score was employed as the main metric, offering a fair evaluation of recall and precision. Because of figure limits, accuracy was computed but not included in visualisations.

3.2.3 Model Interpretability: Heatmap & Critical Area Contour

We applied Grad-CAM to visualise CNN attention. Red-highlighted regions in the heatmap indicate high model focus. A critical area contour overlay was also extracted to highlight the most influential areas.
See Figure 5
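The core Grad-CAM computation can be sketched in NumPy as follows. It assumes the feature maps of the last convolutional layer and the gradients of the predicted class score with respect to them have already been extracted (for example with framework hooks). The function names and the 0.5 threshold for the critical area are hypothetical and are not taken from the project's `get_heatmap`.

```python
import numpy as np


def grad_cam(activations, gradients):
    """Combine feature maps (K, H, W) and their gradients (K, H, W) into a heatmap in [0, 1]."""
    weights = gradients.mean(axis=(1, 2))              # global-average-pool the gradients per channel
    cam = np.tensordot(weights, activations, axes=1)   # weighted sum over channels -> (H, W)
    cam = np.maximum(cam, 0.0)                         # ReLU keeps only positive evidence
    if cam.max() > 0:
        cam = cam / cam.max()                          # scale to [0, 1] for display
    return cam


def critical_area(cam, threshold=0.5):
    """Boolean mask of the most influential region (threshold is an assumed value)."""
    return cam >= threshold
```

The resulting low-resolution heatmap would normally be upsampled to the input size and overlaid on the image, with the boundary of `critical_area` drawn as the contour.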

4 Results

4.1 Research Question 1 Results

The results for Research Question 1 show that deep learning-based denoisers, especially DnCNN, achieved the highest F1 scores across all models (0.925 in CNN, 0.931 in SVM, 0.932 in RF, and 0.931 in KNN), highlighting their effectiveness in enhancing feature clarity. Restormer followed closely, with strong performance across models (e.g., 0.900 in CNN, 0.880 in RF).
In contrast, traditional denoisers such as Gaussian Blur and Median Blur produced inconsistent results. Gaussian Blur performed poorly in SVM (0.633) and KNN (0.746) but fairly well in CNN (0.898) and Random Forest (0.837). Median Blur performed poorly in SVM (0.617) and KNN (0.660) but well in CNN (0.903).
These results show how sensitive the models are to image quality. For models without internal feature extraction, such as SVM and KNN, deep learning-based denoisers preserve important structures, whereas traditional techniques can oversmooth, removing diagnostic information along with noise.
DnCNN emerged as the most effective denoiser in classifying cell images across all models, with Restormer close behind. These results highlight the importance of deep learning-based denoising in improving model performance, particularly for traditional ML models.
See Figure 1

4.2 Research Question 2 Results

By comparing models under three masking strategies, centre, non-centre, and random masking, to an unmasked baseline, we were able to determine whether the centre of an image has a greater impact on classification performance. The F1 scores for the four models (CNN, KNN, SVM, and RF) are shown in side-by-side boxplots in the figure below.
See Figure 2

Overall, the results indicated that no region is consistently more influential, with only slight performance differences between centre and non-centre masking. Although there was a small drop for CNN and SVM under centre masking, the difference was insignificant, suggesting weak spatial dependency. As expected, random masking resulted in the greatest variability and performance decline, most likely because of the unstructured removal of important features.
For CNN, the unmasked condition produced the highest F1 score (~0.97), and centre and non-centre masking produced similar, slightly lower scores, indicating that important features are dispersed throughout the image rather than concentrated in the centre. SVM and RF showed no apparent preference between centre and non-centre regions (~0.93 and ~0.92 respectively). Finally, KNN does not explicitly model spatial information because it only considers the Euclidean distance between two vectors, which suggests that it relies on global rather than localised features. In summary, models without any masking consistently outperformed those with masking, demonstrating that full image access yields the best classification performance. The minor difference between centre and non-centre masking performance indicates that critical information is spread evenly throughout the image for all classification models, despite our initial expectation that the centre would be more informative (Islam et al., 2024).

4.3 Shiny App

We developed an interactive Shiny app that allows users to upload a cell image, apply a denoising method or masking strategy, and classify it using one of four models. The app displays the processed image alongside a heatmap showing which regions most influenced the model’s prediction, and allows users to download the results.
This tool is especially useful for those in medical imaging and biomedical AI, offering an interactive way to explore how preprocessing choices affect model behaviour and to improve transparency and trust in AI-driven diagnostics.

5 Discussion

5.1 Research Question 1 Discussion

How do different denoising strategies influence the performance of different classification models?

5.1.1 Results Summary & Key Findings

Various denoising strategies, including DnCNN and classical methods, were applied across multiple classification models, and F1 scores were measured to compare the performance of each denoiser across models.

According to Alanazi and Mercorelli (2024), precise denoising is vital for diagnostic reliability, which is supported here by the stable performance improvements seen with deep learning-based methods. This supports our key finding that denoising can improve model performance: the no-denoiser baseline had an average F1 score of about 0.65, and on average models performed better on denoised images than on raw input images.

Our key finding matches the result of Zhang et al. (2017) in the DnCNN study: deep learning-based denoisers work better than classical denoisers. The deep learning-based denoisers (DnCNN, Restormer) achieved average F1 scores of 0.89-0.93, with DnCNN performing best at an average above 0.92 for all models, whereas the classical denoisers (Gaussian Blur, Median Blur) averaged around 0.79.

5.1.2 Findings Across Models and Denoisers

Models differ in their sensitivity to input image quality. For example, SVM and KNN rely heavily on the quality of input images, as their performance varies widely between denoisers. CNN and RF perform more stably across all denoisers, although their performance can still be improved with better denoised inputs; for example, the DnCNN denoiser raises their F1 scores above 0.92. In our experiments, DnCNN consistently yielded the highest F1 scores across all model types, which is consistent with the strong denoising performance reported by Zhang et al. (2017).

5.2 Research Question 2 Discussion

5.2.1 Results Summary & Key Findings

Based on results from RQ1, where DnCNN achieved the highest average F1 score (0.93), this study investigated whether the central region of cell images is more important for classification models, using the DnCNN denoiser and spatial masking techniques. To assess spatial sensitivity, we used three masking strategies (centre, non-centre, and random) to test four classifiers (CNN, SVM, Random Forest, and KNN). Grad-CAM heatmaps and critical area contours further help interpret the model’s focus.
From a high-level perspective, the results showed that no single region was consistently most important across all models, which contradicted our initial hypothesis that central features would dominate due to biological intuition (Islam et al., 2024). The small median F1-score difference between the centre and non-centre masks (less than 1 across all classification models) suggests that informative features are spatially distributed. All models scored above 0.9 in most cases, further indicating that classification in tumour cell images does not strictly rely on central framing, contrary to assumptions often made in medical imaging tasks.

5.2.2 Analysis of Poor Performance with Random Masking

Random masking increased performance variability across models and consistently resulted in lower F1 scores. Unlike the spatially localised and structured centre and non-centre masks, it removes features unpredictably, which raises the likelihood of obscuring important information and disrupting spatial coherence.

5.2.3 Surprising Observations

5.2.3.1 Research Question 1

One notable discovery was that deep learning-based denoisers were most effective for KNN. KNN’s F1 score was only 0.572 without denoising, but it significantly increased to 0.931 with DnCNN and 0.833 with Restormer. KNN is more reliant on high-quality inputs than models like CNN because it uses raw pixel distances, which makes it extremely sensitive to noise.

5.2.3.2 Research Question 2

Surprisingly, KNN outperformed more complex models and obtained the highest F1 score under the unmasked condition. This highlights that under the correct circumstances, straightforward techniques can still produce powerful, reliable results, contradicting the idea that sophisticated architectures or preprocessing are always better.

5.3 Limitations

Due to computational resource constraints and limited project time, only three cell classes were evaluated: Immune cells (13,440), Non-invasive Tumour (24,606), and Invasive Tumour (38,149). This significant class imbalance could bias the model towards the majority class. While F1 scores were used to provide balanced evaluation across classes, additional techniques such as weighted sampling or class-balanced loss functions could have mitigated this issue.
The denoising methodology lacked consistency across experimental setups, employing varying hyperparameters, including different kernel sizes for Gaussian (5×5) and Median (5) filters, and different model configurations for advanced denoisers (DnCNN with the Sigma25 model, Restormer with the Blind model). This inconsistency limits the comparability of the results, making it difficult to determine the true effectiveness of any single denoising approach. Additionally, the evaluation was limited to a small subset of available denoising models, which restricts the generalisability of the findings. Denoising was also treated as a separate preprocessing step before classification, making it a two-step process rather than an integrated model that handles both denoising and classification.
The masking strategy employed a simplified binary classification of image regions into “Centre vs Non-Centre” divisions, which oversimplified feature localisation in medical images. Critical diagnostic features are not always centrally located and can appear in different regions throughout the image. Using a binary approach fails to capture nuances in the distribution of critical cell features, possibly leading to suboptimal model performance.
Other limitations include using data from a single source (H&E-stained images), which could limit generalisability to other datasets, and the reliance on F1 scores for evaluation, which may not capture all aspects of model performance relevant to medical applications.

5.4 Future Work

5.4.1 Expected Calibration Error (ECE)

A direction for further research is to evaluate the models’ Expected Calibration Error (ECE). While accuracy and F1 score measure predictive performance, they do not assess how well a model’s confidence aligns with its actual accuracy, a critical factor in high-risk applications like tumour classification. Overconfident but inaccurate predictions can have significant clinical consequences. Calculating ECE would allow confidence calibration techniques to be used to fine-tune the model, ensuring predicted probabilities are more reliable and consistent with actual uncertainty.
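As a sketch of how this metric could be computed, the function below bins predictions by confidence and sums the gap between accuracy and mean confidence in each bin, weighted by the fraction of samples in that bin. The function name, the choice of 10 equal-width bins, and the use of the top predicted probability as the confidence are assumptions for illustration.

```python
import numpy as np


def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE = sum over bins of (bin fraction) * |bin accuracy - bin mean confidence|."""
    confidences = np.asarray(confidences, dtype=float)  # e.g. top softmax probability
    correct = np.asarray(correct, dtype=float)          # 1.0 if the prediction was right
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        in_bin = (confidences > lo) & (confidences <= hi)
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap                  # weight by share of samples in the bin
    return float(ece)
```

For instance, four predictions made with confidence 0.9, of which three are correct, give an accuracy of 0.75 and therefore an ECE of 0.15; a perfectly calibrated model would score 0.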

5.4.2 Different Denoising Levels and Parameter Configurations

We compared several denoisers in our current analysis, but each one was applied with a fixed noise level. A model that is overly tuned to a particular denoising setting may not generalise well in real-world situations, where noise characteristics vary based on imaging conditions. Future research could systematically change parameters like Restormer input settings, DnCNN training noise levels, or Gaussian blur sigma values and kernel sizes, then assess the subsequent effects on classification accuracy.

6 Conclusion

Our results demonstrate that deep learning-based denoisers, particularly DnCNN, consistently outperform traditional techniques, achieving the highest F1 scores across all classification models. Restormer follows DnCNN closely and also performs well. In contrast, traditional techniques like Gaussian Blur and Median Blur yield inconsistent results. They generally perform poorly in feature-sensitive models like SVM and KNN, but they can occasionally help CNN and Random Forest. These results demonstrate how deep learning denoisers can preserve important diagnostic features.
For image masking, we found that the performance differences between centre and non-centre masking are small across models. The fact that CNN, SVM, and RF did not exhibit a strong preference for central features suggests that the image’s diagnostic information is distributed evenly. Random masking consistently led to degraded performance, highlighting the significance of structured feature preservation.
In summary, we recommend prioritising deep learning-based denoising methods in medical image classification tasks. Additionally, image analysis pipelines should not overly rely on central regions, as incorporating information from the entire image improves diagnostic robustness and accuracy.

7 Student contributions

  • Brian Wang (SID: 520070471)
    • Research question 1 model training and evaluation
    • Research question 2 model training and evaluation
    • Image masking(RQ2) implementation
    • Denoising, masking, prediction, heatmap, and critical area contour functionality in Shiny app
    • Shiny app
    • Research question 2 presentation
    • Research question 2 method, results, and discussion in the report
  • Eden Cai (SID: 500002722)
    • Record meeting minutes
    • Research question 1: Classical denoisers
    • Research question 1: Cross-validation and visualisation for SVM, RF, and KNN
    • Presentation: RQ1 and Q&A
    • Report: methods RQ1, results RQ1, and discussion surprising results
  • Kyle Chan (SID: 460120937)
    • Presentation: background
    • Assisting and draft in Shiny app
    • Report research question 1 discussion
    • Research question 1 argumentation
  • Michael Yip (SID: 530485661)
    • Presentation Slide and Speech
    • Research question 1: DnCNN code
    • Report: Background, Limitations
  • Yuecheng Wang (SID: 530214245)
    • Presentation: Demo and Q&A
    • Report: Introduction and Conclusion
    • Research question 1: Data Augmentation
    • Model Analysis and Assist in Shiny app

8 References

  1. Alanazi, T. M., & Mercorelli, P. (2024). Precision Denoising in Medical Imaging via Generative Adversarial Network-Aided Low-Noise Discriminator Technique. Mathematics, 12(23), 3705. https://doi.org/10.3390/math12233705
  2. Allred, D. C. (2010). Ductal Carcinoma In Situ: Terminology, Classification, and Natural History. JNCI Monographs, 2010(41), 134–138. https://doi.org/10.1093/jncimonographs/lgq035.
  3. Elena, L., & Lopera, D. E. (2013). Introduction to T and B lymphocytes. Nih.gov; El Rosario University Press. https://www.ncbi.nlm.nih.gov/books/NBK459471/.
  4. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners. https://arxiv.org/pdf/2111.06377.
  5. Islam, O., Assaduzzaman, M., & Hasan, M. Z. (2024). An explainable AI-based blood cell classification using optimized convolutional neural network. Journal of Pathology Informatics, 100389. https://doi.org/10.1016/j.jpi.2024.100389
  6. Jebur, M. N., Elhoseny, M., Albahli, S., & Mohammed, F. M. (2023). Deep learning-based image denoising: A comprehensive review. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-023-17468-2
  7. Liu, Y., Ma, X., Li, J., Fan, D., & Lin, Z. (2023). Hyper-Restormer: Lightweight transformer for hyperspectral image restoration. arXiv. https://arxiv.org/abs/2312.07016
  8. Sai, G. V., Chekuri Seshank, Krishna, & Jagjit Singh Dhatterwal. (2023). Reduction of Noise in Medical Imaging Quality. 364–368. https://doi.org/10.1109/icdt57929.2023.10150846
  9. Tamilselvi, C., Yeasin, M., Paul, R. K., & Paul, A. K. (2024). Can Denoising Enhance Prediction Accuracy of Learning Models? A Case of Wavelet Decomposition Approach. Forecasting, 6(1), 81–99. https://doi.org/10.3390/forecast6010005
  10. Xie, Y., Gu, L., Harada, T., Zhang, J., Xia, Y., & Wu, Q. (2024). Rethinking masked image modelling for medical image representation. Medical Image Analysis, 98, 103304. https://doi.org/10.1016/j.media.2024.103304
  11. Zhang, K., Zuo, W., Chen, Y., Meng, D., & Zhang, L. (2017). Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising. IEEE Transactions on Image Processing, 26(7), 3142–3155. https://doi.org/10.1109/tip.2017.2662206

9 Appendix

Show code
with open("denoised_cv_results/models_denoising_f1_scores.pkl", 'rb') as f:
    models_denoising_f1_scores = pickle.load(f)

df_nested = pd.DataFrame(models_denoising_f1_scores).T  # Models become rows
df_nested.reset_index(inplace=True)
df_nested = df_nested.melt(id_vars='index', var_name='Denoising', value_name='F1_Score')
df_nested.columns = ['Model', 'Denoising', 'F1 Score']

df_long = df_nested.explode('F1 Score', ignore_index=True)

fig = px.box(
    df_long,
    x="Model",
    y="F1 Score",
    color="Denoising",
    points=False,  # Hide raw data points
    title="F1 Score by Model and Denoising Method",
    hover_data=["Model", "Denoising"]
)

# Get the color mapping used in the plot
color_map = {
    trace.name: trace.marker.color
    for trace in fig.data
    if isinstance(trace, go.Box)
}

# Calculate means per group
mean_df = df_long.groupby(['Model', 'Denoising'])['F1 Score'].mean().reset_index()

# Add mean markers in matching color
for _, row in mean_df.iterrows():
    fig.add_trace(go.Scatter(
        x=[row['Model']],
        y=[row['F1 Score']],
        mode='markers',
        name=f"Mean ({row['Denoising']})",
        marker=dict(
            symbol='diamond',
            size=8,
            color=color_map.get(row['Denoising'], 'black')
        ),
        legendgroup=row['Denoising'],
        showlegend=False,
        hovertemplate=f"Model: {row['Model']}<br>Denoising: {row['Denoising']}<br>Mean: {row['F1 Score']:.4f}"
    ))

# Final plot styling
fig.update_layout(
    boxmode='group',
    template='plotly_white',
    xaxis_title="Model",
    yaxis_title="F1 Score",
    hoverlabel=dict(bgcolor="white", font_size=12)
)

fig.show()

Figure 1. F1 Score by Model and Denoising Method

Show code
with open("masking_cv_results/models_masks_f1_scores.pkl", 'rb') as f:
    models_masks_f1_scores = pickle.load(f)

df_nested = pd.DataFrame(models_masks_f1_scores).T  # Models become rows
df_nested.reset_index(inplace=True)
df_nested = df_nested.melt(id_vars='index', var_name='Mask', value_name='F1_Score')
df_nested.columns = ['Model', 'Mask', 'F1 Score']

df_long = df_nested.explode('F1 Score', ignore_index=True)

fig = px.box(
    df_long,
    x="Model",
    y="F1 Score",
    color="Mask",
    points=False,  # Hide raw data points
    title="F1 Score by Model and Mask Method",
    hover_data=["Model", "Mask"]
)

# Get the color mapping used in the plot
color_map = {
    trace.name: trace.marker.color
    for trace in fig.data
    if isinstance(trace, go.Box)
}

# Calculate means per group
mean_df = df_long.groupby(['Model', 'Mask'])['F1 Score'].mean().reset_index()

# Add mean markers in matching color
for _, row in mean_df.iterrows():
    fig.add_trace(go.Scatter(
        x=[row['Model']],
        y=[row['F1 Score']],
        mode='markers',
        name=f"Mean ({row['Mask']})",
        marker=dict(
            symbol='diamond',
            size=8,
            color=color_map.get(row['Mask'], 'black')
        ),
        legendgroup=row['Mask'],
        showlegend=False,
        hovertemplate=f"Model: {row['Model']}<br>Mask: {row['Mask']}<br>Mean: {row['F1 Score']:.4f}"
    ))

# Final plot styling
fig.update_layout(
    boxmode='group',
    template='plotly_white',
    xaxis_title="Model",
    yaxis_title="F1 Score",
    hoverlabel=dict(bgcolor="white", font_size=12)
)

fig.show()

Figure 2. F1 Score by Model and Mask Method

Show code
image = Image.open("example.png").convert('RGB')
gaussian = get_transform(image, "Gaussian_Blur")
median = get_transform(image, "Median")
restormer = get_transform(image, "Restormer")
dncnn = get_transform(image, "DnCNN")

fig, axes = plt.subplots(1, 5, figsize=(25, 5))
axes[0].imshow(image)
axes[0].axis('off')
axes[0].set_title("None")

axes[1].imshow(gaussian)
axes[1].axis('off')
axes[1].set_title("Gaussian Blur\n(5,5) kernel size")

axes[2].imshow(median)
axes[2].axis('off')
axes[2].set_title("Median Blur\n5 kernel size")

axes[3].imshow(restormer)
axes[3].axis('off')
axes[3].set_title("Restormer\nblind model")

axes[4].imshow(dncnn)
axes[4].axis('off')
axes[4].set_title("DnCNN\nSigma25 model")

plt.tight_layout()
plt.show()

Figure 3. All Denoisers Results

Show code
centre = get_transform(image, "Centre")
non_centre = get_transform(image, "Non-centre")
random = get_transform(image, "Random")

fig, axes = plt.subplots(1, 4, figsize=(20, 5))
axes[0].imshow(image)
axes[0].axis('off')
axes[0].set_title("No Mask")

axes[1].imshow(centre)
axes[1].axis('off')
axes[1].set_title("Centre")

axes[2].imshow(non_centre)
axes[2].axis('off')
axes[2].set_title("Non-Centre")

axes[3].imshow(random)
axes[3].axis('off')
axes[3].set_title("Random")


plt.tight_layout()
plt.show()

Figure 4. All Masking Results

Show code
heatmap, contour = get_heatmap(image, "None")

fig, axes = plt.subplots(1, 2, figsize=(10, 5))
axes[0].imshow(heatmap)
axes[0].axis('off')
axes[0].set_title("Heatmap")

axes[1].imshow(contour)
axes[1].axis('off')
axes[1].set_title("Critical Area Contour")


plt.tight_layout()
plt.show()

Figure 5. Heatmap and Critical Area Contour

9.4 Technical Details

9.4.1 Denoise Model Training

Show code
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms, models
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder
import time
import numpy as np
import cv2
from PIL import Image
import os
from matplotlib import pyplot as plt
from resotrmer import Restormer_Denoise
from models_DnCNN import DnCNN_Denoiser
from denoise_classical import GaussianBlur, MedianBlur
import shutil
from copy import deepcopy
import pickle
import seaborn as sns
import pandas as pd
import traceback
import glob  # needed by merge_folder below (replaces a duplicate pickle import)

# Fix all random seeds so that every execution produces the same results
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

IMG_SIZE = 224
FOLDER_PATH = "Images/100"

restomer = Restormer_Denoise("blind")
dncnn = DnCNN_Denoiser()

denoise_methods = {
    "None": lambda x:x,
    "Restormer": restomer.denoise_image,
    "Gaussian_Blur": GaussianBlur,
    "Median_Blur": MedianBlur,
    "DnCNN": dncnn.denoise_image
}

classification_models = ["CNN", "KNN", "SVM", "Random Forest"]

models_denoising_accuracies = {model: {denoise_method: -1 for denoise_method in denoise_methods.keys()}  # -1 represent not yet calculated
                               for model in classification_models}
models_denoising_f1_scores = {model: {denoise_method: -1 for denoise_method in denoise_methods.keys()}  # -1 represent not yet calculated
                               for model in classification_models}
models_denoising_classification_times = {model: {denoise_method: -1 for denoise_method in denoise_methods.keys()}  # -1 represent not yet calculated
                               for model in classification_models}
models_denoising_confusion_metrics = {model: {denoise_method: -1 for denoise_method in denoise_methods.keys()}  # -1 represent not yet calculated
                               for model in classification_models}

def build_transform(denoise_method: str) -> transforms.Compose:
    denoise_fn = denoise_methods.get(denoise_method, lambda x:x)

    return transforms.Compose([
        transforms.Lambda(denoise_fn),
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])


# Each list argument names some subfolders; this function merges all images from those subfolders into one folder per class
# list1: immune cells class
# list2: non-invasive tumor
# list3: invasive tumor
def merge_folder(list1, list2, list3):
    # define folder paths
    source_path = './Images/100/'

    immune_cells = list1

    non_invasive_tumor = list2

    invasive_tumor_cell = list3

    folder_dests = ["Immune_Cells", "Non_Invasive_Tumor", "Invasive_Tumor_Set"]

    # make empty folder
    for i in folder_dests:
        dest_path = source_path + i
        os.makedirs(dest_path, exist_ok=True)
        print(f"{dest_path} is created.")

    # start copying file process
    source_categories = [immune_cells, non_invasive_tumor, invasive_tumor_cell]
    extensions = '*.png'

    count = 1
    cate_count = 0
    for source_dirs in source_categories:
        for src in source_dirs:
            pattern = os.path.join(source_path+src, extensions)
            print(pattern)
            for img_path in glob.glob(pattern):
                print(img_path)
                filename = os.path.basename(img_path)
                dest_path = source_path + folder_dests[cate_count]
                file_dest_path = os.path.join(dest_path, filename)
                # copy images from the sub-folder into the merged folder (overwrite if the same image already exists)
                shutil.copy2(img_path, file_dest_path)
                print(f'{count}: Copied {img_path} -> {file_dest_path}')
                count += 1
        cate_count += 1
    return 0

merge_folder(['B_Cells','CD4+_T_Cells'], ['DCIS_1', 'DCIS_2'], ['Invasive_Tumor', 'Prolif_Invasive_Tumor'])

class ImageDataSet(Dataset):
    def __init__(self, image_names, transform):
        self.file_names = []
        self.labels = []
        for numeric_label, names in enumerate(image_names):
            self.labels.extend([numeric_label]*len(names))
            self.file_names.extend(names)

        self.transform = transform

    def __getitem__(self, index):
        img_name = self.file_names[index]
        img = Image.open(img_name).convert('RGB')
        img = self.transform(img)
        label = self.labels[index]
        return img, label
    
    def __len__(self):
        return len(self.file_names)

# labels = ["B_Cells", "CD4+_T_Cells", "DCIS_1", "DCIS_2", "Invasive_Tumor", "Prolif_Invasive_Tumor"]
labels = ["Immune_Cells", "Non_Invasive_Tumor", "Invasive_Tumor_Set"]
le = LabelEncoder()
numeric_labels = le.fit_transform(labels)
image_names = []
for _ in numeric_labels:
    image_names.append([])

for (dir_path, dir_names, file_names) in os.walk(FOLDER_PATH):
    parent_folder = os.path.basename(dir_path)
    if parent_folder in labels: # Read the subset of dataset to reduce training time 
        for file in file_names:
            image = cv2.imread(os.path.join(dir_path, file))
            # Skip unreadable files (cv2.imread returns None) and small images, which carry little information
            if image is None or (image.shape[0] < 100 and image.shape[1] < 100):
                continue
            numeric_label = le.transform([parent_folder])[0]
            image_names[numeric_label].append(os.path.join(dir_path, file))

denoising_datasets = {key : ImageDataSet(image_names, build_transform(key)) for key in denoise_methods.keys()}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #check if the computer has GPU

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, len(image_names))
model = model.to(device)

denoising_cnn_models = {key : deepcopy(model).to(device) for key in denoise_methods.keys()}

# hyper-parameters setting
num_epochs = 100
patience = 10 #for early stopping
batch_size = 128
learning_rate = 0.001

def split_dataset(dataset: ImageDataSet):
    _, subset = train_test_split(list(range(len(dataset))), test_size=0.5, random_state=0)
    # train_idx, temp_idx = train_test_split(list(range(len(dataset))), test_size=0.3, random_state=0)
    train_idx, temp_idx = train_test_split(subset, test_size=0.3, random_state=0)
    val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=0)

    train_subset = Subset(dataset, train_idx)
    val_subset = Subset(dataset, val_idx)
    test_subset = Subset(dataset, test_idx)

    train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=False, num_workers=8, pin_memory=True)
    val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, num_workers=8, pin_memory=True)
    test_loader = DataLoader(test_subset, batch_size=batch_size, shuffle=False, num_workers=8, pin_memory=True)

    return train_loader, val_loader, test_loader

training_loss_curves = {key : [] for key in denoise_methods.keys()}
val_loss_curves = {key : [] for key in denoise_methods.keys()}

for denoise, model in denoising_cnn_models.items():
    print(f"Start training {denoise} model.")
    best_val_loss = float('inf')
    epoch_no_improvement = 0
    best_model_parameters = None
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    train_loader, val_loader, _ = split_dataset(denoising_datasets.get(denoise))
    try:
        scaler = torch.amp.GradScaler()
        for epoch in range(num_epochs):
            # Training phase
            model.train()
            running_loss = 0.0

            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                
                optimizer.zero_grad()

                with torch.amp.autocast("cuda"):
                    outputs = model(images)
                    loss = criterion(outputs, labels)
                # outputs = model(images)
                # loss = criterion(outputs, labels)

                scaler.scale(loss).backward()
                scaler.step(optimizer)
                scaler.update()

                # loss.backward()
                # optimizer.step()

                print(f"Batch loss: {loss.item():.4f}")
                
                running_loss += loss.item()
            
            training_loss = running_loss/len(train_loader)
            training_loss_curves[denoise].append(training_loss)
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {training_loss:.4f}")
            
            # Validation phase
            model.eval()
            val_loss = 0.0
            correct = 0
            total = 0
            
            with torch.no_grad():
                for images, labels in val_loader:
                    images, labels = images.to(device), labels.to(device)
                    outputs = model(images)
                    loss = criterion(outputs, labels)
                    
                    val_loss += loss.item()
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            
            avg_val_loss = val_loss/len(val_loader)
            val_accuracy = 100 * correct / total
            print(f"Validation Loss: {avg_val_loss:.4f}, Accuracy: {val_accuracy:.2f}%")
            val_loss_curves[denoise].append(avg_val_loss)

            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                best_model_parameters = deepcopy(model.state_dict())  # snapshot; state_dict() returns references to the live weights
                epoch_no_improvement = 0
            else:
                epoch_no_improvement += 1
                if epoch_no_improvement == patience:
                    print(f"No improvement for {patience} epochs. Early stopping.")
                    break

        if best_model_parameters is not None:
            model.load_state_dict(best_model_parameters)
    except Exception as e:
        traceback.print_exc()
        # torch.save(model.state_dict(), f'CNN_{denoise}.pth')


os.makedirs("denoised_models", exist_ok=True)  # make sure the output folder exists before saving
for denoise, model in denoising_cnn_models.items():
    torch.save(model.state_dict(), f'denoised_models/CNN_{denoise}.pth')

for denoise, model in denoising_cnn_models.items():
    _, _, test_loader = split_dataset(denoising_datasets.get(denoise))
    model.eval()
    y_true = []
    y_pred = []

    start = time.time()
    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            
            y_true.extend(labels.cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())
    end = time.time()

    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='weighted')  # three classes, so an averaging mode is required
    cm = confusion_matrix(y_true, y_pred)
    elapsed_time = end - start
    
    models_denoising_accuracies["CNN"][denoise] = accuracy
    models_denoising_f1_scores["CNN"][denoise] = f1
    models_denoising_confusion_metrics["CNN"][denoise] = cm
    models_denoising_classification_times["CNN"][denoise] = elapsed_time
    
    print(f"{denoise} accuracy: {accuracy}")
    print(f"{denoise} f1 score: {f1}")
    print(f"{denoise} classification time: {elapsed_time}")

def datasets_feature_extractor(model, dataset):
    model.eval()
    feature_extractor = nn.Sequential(*list(model.children())[:-1]) # remove the last layer
    feature_extractor.eval()
    feature_extractor.to(device)

    train_features = []
    train_labels = []
    test_features = []
    test_labels = []
    train_loader, _, test_loader = split_dataset(dataset)

    with torch.no_grad():
        for images, labels in train_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            train_features.append(output.cpu().numpy())
            train_labels.append(labels.cpu().numpy())

    X_train = np.vstack(train_features)
    y_train = np.hstack(train_labels)

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            test_features.append(output.cpu().numpy())
            test_labels.append(labels.cpu().numpy())

    X_test = np.vstack(test_features)
    y_test = np.hstack(test_labels)

    return X_train, y_train, X_test, y_test

for denoise_method, dataset in denoising_datasets.items():
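    # Extract features with the CNN trained on this denoising method, matching the evaluation script
    X_train, y_train, X_test, y_test = datasets_feature_extractor(denoising_cnn_models[denoise_method], dataset)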

    # SVM
    svm = SVC()
    start = time.time()
    svm.fit(X_train, y_train)
    end = time.time()

    y_pred = svm.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')  # three classes, so an averaging mode is required
    cm = confusion_matrix(y_test, y_pred)
    elapsed_time = end - start

    models_denoising_accuracies["SVM"][denoise_method] = accuracy
    models_denoising_f1_scores["SVM"][denoise_method] = f1
    models_denoising_confusion_metrics["SVM"][denoise_method] = cm
    models_denoising_classification_times["SVM"][denoise_method] = elapsed_time

    print(f"SVM {denoise_method} accuracy: {accuracy}")
    print(f"SVM {denoise_method} f1 score: {f1}")
    print(f"SVM {denoise_method} classification time: {elapsed_time}")

    #RF
    rf = RandomForestClassifier()
    start = time.time()
    rf.fit(X_train, y_train)
    end = time.time()

    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')  # three classes, so an averaging mode is required
    cm = confusion_matrix(y_test, y_pred)
    elapsed_time = end - start

    models_denoising_accuracies["Random Forest"][denoise_method] = accuracy
    models_denoising_f1_scores["Random Forest"][denoise_method] = f1
    models_denoising_confusion_metrics["Random Forest"][denoise_method] = cm
    models_denoising_classification_times["Random Forest"][denoise_method] = elapsed_time

    print(f"Random Forest {denoise_method} accuracy: {accuracy}")
    print(f"Random Forest {denoise_method} f1 score: {f1}")
    print(f"Random Forest {denoise_method} classification time: {elapsed_time}")

    #Find best k for KNN
    knn_models = []
    knn_accuracies = []
    knn_f1_scores = []
    knn_confusion_metrics = []
    knn_classification_times = []
    for k in range(1, 32, 2):  # odd k from 1 to 31
        knn = KNeighborsClassifier(n_neighbors=k)
        start = time.time()
        knn.fit(X_train, y_train)
        end = time.time()

        y_pred = knn.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')  # three classes, so an averaging mode is required
        cm = confusion_matrix(y_test, y_pred)
        elapsed_time = end - start

        knn_models.append(knn)
        knn_accuracies.append(accuracy)
        knn_f1_scores.append(f1)
        knn_confusion_metrics.append(cm)
        knn_classification_times.append(elapsed_time)

    knn_accuracies = np.array(knn_accuracies)
    max_idx = np.argmax(knn_accuracies)
    best_k = 2*max_idx+1

    print(f"Best {best_k}NN {denoise_method} accuracy: {knn_accuracies[max_idx]}")
    print(f"Best {best_k}NN {denoise_method} f1 score: {knn_f1_scores[max_idx]}")
    print(f"Best {best_k}NN {denoise_method} classification time: {knn_classification_times[max_idx]}")

    accuracy = float(knn_accuracies[max_idx])
    f1 = knn_f1_scores[max_idx]
    cm = knn_confusion_metrics[max_idx]
    elapsed_time = knn_classification_times[max_idx]

    models_denoising_accuracies["KNN"][denoise_method] = accuracy
    models_denoising_f1_scores["KNN"][denoise_method] = f1
    models_denoising_confusion_metrics["KNN"][denoise_method] = cm
    models_denoising_classification_times["KNN"][denoise_method] = elapsed_time

    with open(f"denoised_models/SVM_{denoise_method}.pkl", "wb") as f:
        pickle.dump(svm, f)
    with open(f"denoised_models/RF_{denoise_method}.pkl", "wb") as f:
        pickle.dump(rf, f)
    with open(f"denoised_models/{best_k}NN_{denoise_method}.pkl", "wb") as f:
        pickle.dump(knn_models[max_idx], f)

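os.makedirs("denoised_variables", exist_ok=True)  # make sure the output folder exists before saving
with open("denoised_variables/models_denoising_accuracies.pkl", 'wb') as f: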
    pickle.dump(models_denoising_accuracies, f)
with open("denoised_variables/models_denoising_f1_scores.pkl", 'wb') as f:
    pickle.dump(models_denoising_f1_scores, f)
with open("denoised_variables/models_denoising_classification_times.pkl", 'wb') as f:
    pickle.dump(models_denoising_classification_times, f)
with open("denoised_variables/models_denoising_confusion_metrics.pkl", 'wb') as f:
    pickle.dump(models_denoising_confusion_metrics, f)
with open("denoised_variables/training_loss_curves.pkl", 'wb') as f:
    pickle.dump(training_loss_curves, f)
with open("denoised_variables/val_loss_curves.pkl", 'wb') as f:
    pickle.dump(val_loss_curves, f)
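
The three chained train_test_split calls in split_dataset are easy to misread, so the sketch below works through the subset sizes they produce. It is an illustrative calculation rather than part of the pipeline: the helper names split_sizes and denoise_split_sizes are hypothetical, the count of 1000 images is made up, and it assumes scikit-learn rounds the test share up to the next whole sample.

```python
import math

def split_sizes(n_samples, test_fraction):
    """Return (n_train, n_test) for one split, rounding the test share up."""
    n_test = math.ceil(test_fraction * n_samples)
    return n_samples - n_test, n_test

def denoise_split_sizes(n_images):
    # First split: half of the images are set aside, the other half feeds training.
    n_held_out, n_subset = split_sizes(n_images, 0.5)
    # Second split: 70% of that subset is used for training, 30% is kept back.
    n_train, n_temp = split_sizes(n_subset, 0.3)
    # Third split: the 30% is divided equally into validation and test.
    n_val, n_test = split_sizes(n_temp, 0.5)
    return {"held_out": n_held_out, "train": n_train, "val": n_val, "test": n_test}

# 1000 is an illustrative count, not the real dataset size.
sizes = denoise_split_sizes(1000)
print(sizes)
```

Under these assumptions, 1000 images give 500 held out, 350 for training, 75 for validation and 75 for testing. The held-out half appears to be the portion that get_test_loader in 9.4.2 uses for the cross-validated evaluation, since it takes the other side of the same seeded split.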

9.4.2 Denoise Model Evaluation

Show code
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms, models
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder
import time
import numpy as np
import cv2
from PIL import Image
import os
from matplotlib import pyplot as plt
from resotrmer import Restormer_Denoise
from models_DnCNN import DnCNN_Denoiser
from denoise_classical import GaussianBlur, MedianBlur
import shutil
from copy import deepcopy
import pickle
import seaborn as sns
import pandas as pd
import traceback
import pickle

# Fix the random seeds so that results are reproducible across runs
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

IMG_SIZE = 224
FOLDER_PATH = "Images/100"

restomer = Restormer_Denoise("blind")
dncnn = DnCNN_Denoiser()

denoise_methods = {
    "None": lambda x:x,
    "Restormer": restomer.denoise_image,
    "Gaussian_Blur": GaussianBlur,
    "Median_Blur": MedianBlur,
    "DnCNN": dncnn.denoise_image
}

classification_models = ["CNN", "KNN", "SVM", "Random Forest"]

# Each entry is a list that collects one value per cross-validation fold
models_denoising_accuracies = {model: {denoise_method: [] for denoise_method in denoise_methods.keys()}
                               for model in classification_models}
models_denoising_f1_scores = {model: {denoise_method: [] for denoise_method in denoise_methods.keys()}
                               for model in classification_models}
models_denoising_classification_times = {model: {denoise_method: [] for denoise_method in denoise_methods.keys()}
                               for model in classification_models}

def build_transform(denoise_method: str) -> transforms.Compose:
    denoise_fn = denoise_methods.get(denoise_method, lambda x:x)

    return transforms.Compose([
        transforms.Lambda(denoise_fn),
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

class ImageDataSet(Dataset):
    def __init__(self, image_names, transform):
        self.file_names = []
        self.labels = []
        for numeric_label, names in enumerate(image_names):
            self.labels.extend([numeric_label]*len(names))
            self.file_names.extend(names)

        self.transform = transform

    def __getitem__(self, index):
        img_name = self.file_names[index]
        img = Image.open(img_name).convert('RGB')
        img = self.transform(img)
        label = self.labels[index]
        return img, label
    
    def __len__(self):
        return len(self.file_names)

# labels = ["B_Cells", "CD4+_T_Cells", "DCIS_1", "DCIS_2", "Invasive_Tumor", "Prolif_Invasive_Tumor"]
labels = ["Immune_Cells", "Non_Invasive_Tumor", "Invasive_Tumor_Set"]
le = LabelEncoder()
numeric_labels = le.fit_transform(labels)
image_names = []
for _ in numeric_labels:
    image_names.append([])

for (dir_path, dir_names, file_names) in os.walk(FOLDER_PATH):
    parent_folder = os.path.basename(dir_path)
    if parent_folder in labels: # Read the subset of dataset to reduce training time 
        for file in file_names:
            image = cv2.imread(os.path.join(dir_path, file))
            # Skip unreadable files (cv2.imread returns None) and small images, which carry little information
            if image is None or (image.shape[0] < 100 and image.shape[1] < 100):
                continue
            numeric_label = le.transform([parent_folder])[0]
            image_names[numeric_label].append(os.path.join(dir_path, file))

denoising_datasets = {key : ImageDataSet(image_names, build_transform(key)) for key in denoise_methods.keys()}


device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #check if the computer has GPU

def model_loader(path_name: str):
    model = models.resnet18(pretrained=False)
    model.fc = nn.Linear(model.fc.in_features, len(labels))
    model.load_state_dict(torch.load(path_name, map_location=device))  # map_location lets GPU-saved weights load on a CPU-only machine
    model = model.to(device)
    return model

denoising_cnn_models = {key : model_loader(f"denoised_models/CNN_{key}.pth") for key in denoise_methods.keys()}


# hyper-parameters setting
num_epochs = 100
patience = 10 #for early stopping
batch_size = 128
learning_rate = 0.001

def split_dataset(dataset: ImageDataSet):
    _, subset = train_test_split(list(range(len(dataset))), test_size=0.5, random_state=0)
    # train_idx, temp_idx = train_test_split(list(range(len(dataset))), test_size=0.3, random_state=0)
    train_idx, temp_idx = train_test_split(subset, test_size=0.3, random_state=0)
    val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=0)

    train_subset = Subset(dataset, train_idx)
    val_subset = Subset(dataset, val_idx)
    test_subset = Subset(dataset, test_idx)

    train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=False, pin_memory=True)
    val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, pin_memory=True)
    test_loader = DataLoader(test_subset, batch_size=batch_size, shuffle=False, pin_memory=True)

    return train_loader, val_loader, test_loader

def get_test_loader(dataset: ImageDataSet):
    subset, _ = train_test_split(list(range(len(dataset))), test_size=0.5, random_state=0)
    test_loader = DataLoader(Subset(dataset, subset), batch_size=batch_size, shuffle=False, pin_memory=True)
    return test_loader


kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# CNN
for denoise, dataset in denoising_datasets.items():
    print(f"Evaluating CNN with {denoise} denoising")

    test_loader = get_test_loader(dataset)
    all_indices = list(range(len(test_loader.dataset)))
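    # Read labels straight from the dataset; indexing the Subset would load and denoise every image
    all_labels = [dataset.labels[i] for i in test_loader.dataset.indices]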

    model = denoising_cnn_models[denoise]
    model.eval()

    for _, val_idx in kf.split(all_indices, all_labels):
        val_subset = Subset(test_loader.dataset, val_idx)
        val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, pin_memory=True)
        y_true = []
        y_pred = []

        start = time.time()
        with torch.no_grad():
            for image, label in val_loader:
                image = image.to(device)
                label = label.to(device)

                outputs = model(image)
                _, predicted = torch.max(outputs, 1)
                
                y_true.extend(label.cpu().numpy())
                y_pred.extend(predicted.cpu().numpy())
        end = time.time()

        accuracy = accuracy_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred, average='weighted')
        classification_time = end - start
        
        models_denoising_accuracies["CNN"][denoise].append(accuracy)
        models_denoising_f1_scores["CNN"][denoise].append(f1)
        models_denoising_classification_times["CNN"][denoise].append(classification_time)
        print(f"Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}, Classification Time: {classification_time:.4f} seconds")

    


def feature_extractor(model, test_loader):
    model.eval()
    feature_extractor = nn.Sequential(*list(model.children())[:-1]) # remove the last layer
    feature_extractor.eval()
    feature_extractor.to(device)

    test_features = []
    test_labels = []

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            test_features.append(output.cpu().numpy())
            test_labels.append(labels.cpu().numpy())

    X_test = np.vstack(test_features)
    y_test = np.hstack(test_labels)

    return X_test, y_test


for denoise, dataset in denoising_datasets.items():
    print(f"Testing {denoise} denoising")
    test_loader = get_test_loader(dataset)
    all_indices = list(range(len(test_loader.dataset)))
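    # Read labels straight from the dataset; indexing the Subset would load and denoise every image
    all_labels = [dataset.labels[i] for i in test_loader.dataset.indices]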

    model = denoising_cnn_models[denoise]
    model.eval()

    for _, val_idx in kf.split(all_indices, all_labels):
        val_subset = Subset(test_loader.dataset, val_idx)
        val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, pin_memory=True)
        feature_extraction_start = time.time()
        X_test, y_test = feature_extractor(model, val_loader)
        feature_extraction_end = time.time()
        feature_extraction_time = feature_extraction_end - feature_extraction_start

        with open(f"denoised_models/SVM_{denoise}.pkl", "rb") as f:
            svm = pickle.load(f)
        with open(f"denoised_models/RF_{denoise}.pkl", "rb") as f:
            rf = pickle.load(f)

        # Best k found during training for each denoiser; it is part of the saved KNN file name
        best_k_by_denoise = {"None": 9, "Restormer": 27, "Gaussian_Blur": 11, "Median_Blur": 27, "DnCNN": 21}
        with open(f"denoised_models/{best_k_by_denoise[denoise]}NN_{denoise}.pkl", "rb") as f:
            knn = pickle.load(f)

        start = time.time()
        y_pred = svm.predict(X_test)
        end = time.time()

        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        classification_time = end - start + feature_extraction_time

        models_denoising_accuracies["SVM"][denoise].append(accuracy)
        models_denoising_f1_scores["SVM"][denoise].append(f1)
        models_denoising_classification_times["SVM"][denoise].append(classification_time)
        print(f"SVM - Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}, Classification Time: {classification_time:.4f} seconds")

        start = time.time()
        y_pred = rf.predict(X_test)
        end = time.time()

        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        classification_time = end - start + feature_extraction_time

        models_denoising_accuracies["Random Forest"][denoise].append(accuracy)
        models_denoising_f1_scores["Random Forest"][denoise].append(f1)
        models_denoising_classification_times["Random Forest"][denoise].append(classification_time)
        print(f"Random Forest - Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}, Classification Time: {classification_time:.4f} seconds")

        start = time.time()
        y_pred = knn.predict(X_test)
        end = time.time()
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        classification_time = end - start + feature_extraction_time

        models_denoising_accuracies["KNN"][denoise].append(accuracy)
        models_denoising_f1_scores["KNN"][denoise].append(f1)
        models_denoising_classification_times["KNN"][denoise].append(classification_time)
        print(f"KNN - Accuracy: {accuracy:.4f}, F1 Score: {f1:.4f}, Classification Time: {classification_time:.4f} seconds")


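os.makedirs("denoised_cv_results", exist_ok=True)  # make sure the output folder exists before saving
with open("denoised_cv_results/models_denoising_accuracies.pkl", "wb") as f: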
    pickle.dump(models_denoising_accuracies, f)
with open("denoised_cv_results/models_denoising_f1_scores.pkl", "wb") as f:
    pickle.dump(models_denoising_f1_scores, f)
with open("denoised_cv_results/models_denoising_classification_times.pkl", "wb") as f:
    pickle.dump(models_denoising_classification_times, f)
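
The evaluation script stores one score per fold for every model and denoiser, so comparing configurations requires collapsing each list into a summary. The sketch below shows one way this might be done, reporting the mean and sample standard deviation. The function name summarise_folds and the toy scores are hypothetical; real values would come from the pickled files in denoised_cv_results.

```python
from statistics import mean, stdev

def summarise_folds(results):
    """Map {model: {denoiser: [fold scores]}} to {model: {denoiser: (mean, std)}}."""
    summary = {}
    for model_name, per_denoiser in results.items():
        summary[model_name] = {}
        for denoiser, scores in per_denoiser.items():
            # stdev needs at least two values; fall back to 0.0 for a single fold.
            spread = stdev(scores) if len(scores) > 1 else 0.0
            summary[model_name][denoiser] = (mean(scores), spread)
    return summary

# Illustrative values only, not results from the study.
toy_results = {"CNN": {"None": [0.90, 0.92, 0.94], "DnCNN": [0.80, 0.80, 0.80]}}
summary = summarise_folds(toy_results)
for model_name, per_denoiser in summary.items():
    for denoiser, (avg, spread) in per_denoiser.items():
        print(f"{model_name} / {denoiser}: {avg:.3f} +/- {spread:.3f}")
```

The same function can be applied unchanged to the accuracy, F1 and classification-time dictionaries, because all three share the model-then-denoiser nesting.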

9.4.3 Masking Model Training

Show code
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms, models
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder
import numpy as np
import cv2
from PIL import Image
import os
from matplotlib import pyplot as plt
from copy import deepcopy
import pickle
import seaborn as sns
import pandas as pd
from masking import centre_mask, non_centre_mask, random_mask
import traceback

# Fix the random seeds so that results are reproducible across runs
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()




IMG_SIZE = 224
FOLDER_PATH = "Images/100"

masks = {
    "Centre": centre_mask,
    "Non-centre": non_centre_mask,
    "Random": random_mask 
}

classification_models = ["CNN", "KNN", "SVM", "Random Forest"]

models_masks_accuracies = {model: {mask: -1 for mask in masks.keys()}  # -1 represent not yet calculated
                               for model in classification_models}
models_masks_f1_scores = {model: {mask: -1 for mask in masks.keys()}  # -1 represent not yet calculated
                               for model in classification_models}
models_masks_confusion_metrics = {model: {mask: -1 for mask in masks.keys()}  # -1 represent not yet calculated
                               for model in classification_models}

def build_transform(mask: str) -> transforms.Compose:
    mask_fn = masks[mask]  # look up the masking function; raises KeyError on an unknown mask name

    return transforms.Compose([
        transforms.Lambda(mask_fn),
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])

class ImageDataSet(Dataset):
    def __init__(self, image_names, transform):
        self.file_names = []
        self.labels = []
        for numeric_label, names in enumerate(image_names):
            self.labels.extend([numeric_label]*len(names))
            self.file_names.extend(names)

        self.transform = transform

    def __getitem__(self, index):
        img_name = self.file_names[index]
        img = Image.open(img_name).convert('RGB')
        img = self.transform(img)
        label = self.labels[index]
        return img, label
    
    def __len__(self):
        return len(self.file_names)
    

labels = ["Immune_Cells", "Non_Invasive_Tumor", "Invasive_Tumor_Set"]
le = LabelEncoder()
numeric_labels = le.fit_transform(labels)
image_names = []
for _ in numeric_labels:
    image_names.append([])

for (dir_path, dir_names, file_names) in os.walk(FOLDER_PATH):
    parent_folder = os.path.basename(dir_path)
    if parent_folder in labels: # Read the subset of dataset to reduce training time 
        for file in file_names:
            image = cv2.imread(os.path.join(dir_path, file))
            # Skip unreadable files (cv2.imread returns None) and small images, which carry little information
            if image is None or (image.shape[0] < 100 and image.shape[1] < 100):
                continue
            numeric_label = le.transform([parent_folder])[0]
            image_names[numeric_label].append(os.path.join(dir_path, file))



masking_datasets = {key : ImageDataSet(image_names, build_transform(key)) for key in masks.keys()}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #check if the computer has GPU

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, len(image_names))
model = model.to(device)

masking_cnn_models = {key : deepcopy(model).to(device) for key in masks.keys()}

# hyper-parameters setting
num_epochs = 100
patience = 10 #for early stopping
batch_size = 128
learning_rate = 0.001

def split_dataset(dataset: ImageDataSet):
    train_idx, temp_idx = train_test_split(list(range(len(dataset))), test_size=0.3, random_state=0)
    val_idx, test_idx = train_test_split(temp_idx, test_size=0.5, random_state=0)

    train_subset = Subset(dataset, train_idx)
    val_subset = Subset(dataset, val_idx)
    test_subset = Subset(dataset, test_idx)

    train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=False, num_workers=8, pin_memory=True)
    val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, num_workers=8, pin_memory=True)
    test_loader = DataLoader(test_subset, batch_size=batch_size, shuffle=False, num_workers=8, pin_memory=True)

    return train_loader, val_loader, test_loader

training_loss_curves = {key : [] for key in masks.keys()}
val_loss_curves = {key : [] for key in masks.keys()}

for mask, model in masking_cnn_models.items():
    print(f"Start training {mask} model.")
    best_val_loss = float('inf')
    epoch_no_improvement = 0
    best_model_parameters = None
    criterion = nn.CrossEntropyLoss()
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    train_loader, val_loader, _ = split_dataset(masking_datasets.get(mask))
    try:
        for epoch in range(num_epochs):
            # Training phase
            model.train()
            running_loss = 0.0

            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                
                optimizer.zero_grad()

                outputs = model(images)
                loss = criterion(outputs, labels)

                loss.backward()
                optimizer.step()

                print(f"Batch loss: {loss.item():.4f}")
                
                running_loss += loss.item()
            
            training_loss = running_loss/len(train_loader)
            training_loss_curves[mask].append(training_loss)
            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {training_loss:.4f}")
            
            # Validation phase
            model.eval()
            val_loss = 0.0
            correct = 0
            total = 0
            
            with torch.no_grad():
                for images, labels in val_loader:
                    images, labels = images.to(device), labels.to(device)
                    outputs = model(images)
                    loss = criterion(outputs, labels)
                    
                    val_loss += loss.item()
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()
            
            avg_val_loss = val_loss/len(val_loader)
            val_accuracy = 100 * correct / total
            print(f"Validation Loss: {avg_val_loss:.4f}, Accuracy: {val_accuracy:.2f}%")
            val_loss_curves[mask].append(avg_val_loss)

            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
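                # state_dict() returns references to the live tensors, so copy it to keep a real snapshot
                best_model_parameters = deepcopy(model.state_dict())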
                epoch_no_improvement = 0
            else:
                epoch_no_improvement += 1
                if epoch_no_improvement == patience:
                    print(f"No improvement for {patience} epochs. Early stopping.")
                    break

        if best_model_parameters is not None:
            model.load_state_dict(best_model_parameters)
    except Exception as e:
        traceback.print_exc()


for mask, model in masking_cnn_models.items():
    torch.save(model.state_dict(), f'masking_models/CNN_{mask}.pth')

for mask, model in masking_cnn_models.items():
    _, _, test_loader = split_dataset(masking_datasets.get(mask))
    model.eval()
    y_true = []
    y_pred = []

    with torch.no_grad():
        for images, labels in test_loader:
            images, labels = images.to(device), labels.to(device)
            outputs = model(images)
            _, predicted = torch.max(outputs.data, 1)
            
            y_true.extend(labels.cpu().numpy())
            y_pred.extend(predicted.cpu().numpy())

    accuracy = accuracy_score(y_true, y_pred)
    f1 = f1_score(y_true, y_pred, average='weighted')  # three classes, so the default average='binary' would raise an error
    cm = confusion_matrix(y_true, y_pred)
    
    models_masks_accuracies["CNN"][mask] = accuracy
    models_masks_f1_scores["CNN"][mask] = f1
    models_masks_confusion_metrics["CNN"][mask] = cm
    
    print(f"{mask} accuracy: {accuracy}")
    print(f"{mask} f1 score: {f1}")

def datasets_feature_extractor(model, dataset):
    model.eval()
    feature_extractor = nn.Sequential(*list(model.children())[:-1]) # remove the last layer
    feature_extractor.eval()
    feature_extractor.to(device)

    train_features = []
    train_labels = []
    test_features = []
    test_labels = []
    train_loader, _, test_loader = split_dataset(dataset)

    with torch.no_grad():
        for images, labels in train_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            train_features.append(output.cpu().numpy())
            train_labels.append(labels.cpu().numpy())

    X_train = np.vstack(train_features)
    y_train = np.hstack(train_labels)

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            test_features.append(output.cpu().numpy())
            test_labels.append(labels.cpu().numpy())

    X_test = np.vstack(test_features)
    y_test = np.hstack(test_labels)

    return X_train, y_train, X_test, y_test

for denoise_method, dataset in masking_datasets.items():  # keys are mask names
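    # use the CNN trained on this mask, not whichever model the previous loop left in `model`
    X_train, y_train, X_test, y_test = datasets_feature_extractor(masking_cnn_models[denoise_method], dataset)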

    # SVM
    svm = SVC()
    svm.fit(X_train, y_train)

    y_pred = svm.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    cm = confusion_matrix(y_test, y_pred)

    models_masks_accuracies["SVM"][denoise_method] = accuracy
    models_masks_f1_scores["SVM"][denoise_method] = f1
    models_masks_confusion_metrics["SVM"][denoise_method] = cm

    print(f"SVM {denoise_method} accuracy: {accuracy}")
    print(f"SVM {denoise_method} f1 score: {f1}")

    #RF
    rf = RandomForestClassifier()
    rf.fit(X_train, y_train)

    y_pred = rf.predict(X_test)
    accuracy = accuracy_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred, average='weighted')
    cm = confusion_matrix(y_test, y_pred)

    models_masks_accuracies["Random Forest"][denoise_method] = accuracy
    models_masks_f1_scores["Random Forest"][denoise_method] = f1
    models_masks_confusion_metrics["Random Forest"][denoise_method] = cm

    print(f"Random Forest {denoise_method} accuracy: {accuracy}")
    print(f"Random Forest {denoise_method} f1 score: {f1}")

    #Find best k for KNN
    knn_models = []
    knn_accuracies = []
    knn_f1_scores = []
    knn_confusion_metrics = []
    for k in range(1, 32, 2): #k = 1 to 31
        knn = KNeighborsClassifier(n_neighbors=k)
        knn.fit(X_train, y_train)

        y_pred = knn.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        cm = confusion_matrix(y_test, y_pred)

        knn_models.append(knn)
        knn_accuracies.append(accuracy)
        knn_f1_scores.append(f1)
        knn_confusion_metrics.append(cm)

    knn_accuracies = np.array(knn_accuracies)
    max_idx = np.argmax(knn_accuracies)
    best_k = 2*max_idx+1

    print(f"Best {best_k}NN {denoise_method} accuracy: {knn_accuracies[max_idx]}")
    print(f"Best {best_k}NN {denoise_method} f1 score: {knn_f1_scores[max_idx]}")

    accuracy = float(knn_accuracies[max_idx])
    f1 = knn_f1_scores[max_idx]
    cm = knn_confusion_metrics[max_idx]

    models_masks_accuracies["KNN"][denoise_method] = accuracy
    models_masks_f1_scores["KNN"][denoise_method] = f1
    models_masks_confusion_metrics["KNN"][denoise_method] = cm

    # save next to the masking CNN checkpoints so the denoising models are not overwritten
    with open(f"masking_models/SVM_{denoise_method}.pkl", "wb") as f:
        pickle.dump(svm, f)
    with open(f"masking_models/RF_{denoise_method}.pkl", "wb") as f:
        pickle.dump(rf, f)
    with open(f"masking_models/{best_k}NN_{denoise_method}.pkl", "wb") as f:
        pickle.dump(knn_models[max_idx], f)

9.4.4 Masking Model Evaluation

Show code
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset, Subset
from torchvision import transforms, models
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import classification_report, accuracy_score, f1_score, confusion_matrix, ConfusionMatrixDisplay
from sklearn.preprocessing import LabelEncoder
from models_DnCNN import DnCNN_Denoiser
import numpy as np
import cv2
from PIL import Image
import os
from matplotlib import pyplot as plt
from copy import deepcopy
import pickle
import seaborn as sns
import pandas as pd
from masking import centre_mask, non_centre_mask, random_mask
import time

# Fix all random seeds so that every run produces the same results
seed = 0
np.random.seed(seed)
torch.manual_seed(seed)
torch.cuda.manual_seed(seed)

torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

IMG_SIZE = 224
FOLDER_PATH = "Images/100"
dncnn = DnCNN_Denoiser()

masks = {
    "None": lambda x: x,
    "Centre": centre_mask,
    "Non-centre": non_centre_mask,
    "Random": random_mask 
}

classification_models = ["CNN", "KNN", "SVM", "Random Forest"]

models_masks_accuracies = {model: {mask: [] for mask in masks.keys()}
                               for model in classification_models}
models_masks_f1_scores = {model: {mask: [] for mask in masks.keys()}
                               for model in classification_models}
models_masks_classification_times = {model: {mask: [] for mask in masks.keys()}
                               for model in classification_models}

def build_transform(mask: str) -> transforms.Compose:
    mask = masks.get(mask)

    return transforms.Compose([
        transforms.Lambda(dncnn.denoise_image),
        transforms.Lambda(mask),
        transforms.Resize((IMG_SIZE, IMG_SIZE)),
        transforms.ToTensor(),
        transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])


class ImageDataSet(Dataset):
    def __init__(self, image_names, transform):
        self.file_names = []
        self.labels = []
        for numeric_label, names in enumerate(image_names):
            self.labels.extend([numeric_label]*len(names))
            self.file_names.extend(names)

        self.transform = transform

    def __getitem__(self, index):
        img_name = self.file_names[index]
        img = Image.open(img_name).convert('RGB')
        img = self.transform(img)
        label = self.labels[index]
        return img, label
    
    def __len__(self):
        return len(self.file_names)
    

labels = ["Immune_Cells", "Non_Invasive_Tumor", "Invasive_Tumor_Set"]
le = LabelEncoder()
numeric_labels = le.fit_transform(labels)
image_names = []
for _ in numeric_labels:
    image_names.append([])

for (dir_path, dir_names, file_names) in os.walk(FOLDER_PATH):
    parent_folder = os.path.basename(dir_path)
    if parent_folder in labels: # Read the subset of dataset to reduce training time 
        for file in file_names:
            image = cv2.imread(os.path.join(dir_path, file))
            # cv2.imread returns None for unreadable files; also skip images smaller than 100 px in both dimensions
            if image is None or (image.shape[0] < 100 and image.shape[1] < 100):
                continue
            numeric_label = le.transform([parent_folder])[0]
            image_names[numeric_label].append(os.path.join(dir_path, file))



masking_datasets = {key : ImageDataSet(image_names, build_transform(key)) for key in masks.keys()}

device = torch.device("cuda" if torch.cuda.is_available() else "cpu") #check if the computer has GPU

model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, len(image_names))
model = model.to(device)

masking_cnn_models = {key : deepcopy(model).to(device) for key in masks.keys()}

# hyper-parameters setting
num_epochs = 100
patience = 10 #for early stopping
batch_size = 128
learning_rate = 0.001

def feature_extractor(model, train_loader, test_loader):
    model.eval()
    feature_extractor = nn.Sequential(*list(model.children())[:-1]) # remove the last layer
    feature_extractor.eval()
    feature_extractor.to(device)

    train_features = []
    train_labels = []
    test_features = []
    test_labels = []

    with torch.no_grad():
        for images, labels in train_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            train_features.append(output.cpu().numpy())
            train_labels.append(labels.cpu().numpy())

    X_train = np.vstack(train_features)
    y_train = np.hstack(train_labels)

    with torch.no_grad():
        for images, labels in test_loader:
            images = images.to(device)
            output = feature_extractor(images).squeeze()
            test_features.append(output.cpu().numpy())
            test_labels.append(labels.cpu().numpy())

    X_test = np.vstack(test_features)
    y_test = np.hstack(test_labels)

    return X_train, y_train, X_test, y_test

kf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

for mask, dataset in masking_datasets.items():
    print(f"Start training {mask} model.")
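    _, all_indices = train_test_split(list(range(len(dataset))), test_size=0.5, random_state=0) #only use 50% of the dataset
    all_indices = np.array(all_indices)
    all_labels = [dataset.labels[i] for i in all_indices]  # read labels directly instead of loading and transforming every image

    for train_pos, test_pos in kf.split(all_indices, all_labels):
        # kf.split yields positions within all_indices; map them back to dataset indices
        train_idx, test_idx = all_indices[train_pos], all_indices[test_pos]
        train_idx, val_idx = train_test_split(train_idx, test_size=0.2, random_state=0)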

        train_subset = Subset(dataset, train_idx)
        val_subset = Subset(dataset, val_idx)
        test_subset = Subset(dataset, test_idx)

        train_loader = DataLoader(train_subset, batch_size=batch_size, shuffle=False, pin_memory=True)
        val_loader = DataLoader(val_subset, batch_size=batch_size, shuffle=False, pin_memory=True)
        test_loader = DataLoader(test_subset, batch_size=batch_size, shuffle=False, pin_memory=True)

        model = models.resnet18(pretrained=True)
        model.fc = nn.Linear(model.fc.in_features, len(image_names))
        model = model.to(device)

        best_val_loss = float('inf')
        epoch_no_improvement = 0
        best_model_parameters = None
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

        for epoch in range(num_epochs):
            model.train()
            running_loss = 0.0
            for images, labels in train_loader:
                images, labels = images.to(device), labels.to(device)
                optimizer.zero_grad()
                outputs = model(images)
                loss = criterion(outputs, labels)
                loss.backward()
                optimizer.step()
                running_loss += loss.item()

            print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {running_loss/len(train_loader):.4f}")

            model.eval()
            val_loss = 0.0
            correct = 0
            total = 0

            with torch.no_grad():
                for images, labels in val_loader:
                    images, labels = images.to(device), labels.to(device)
                    outputs = model(images)
                    loss = criterion(outputs, labels)

                    val_loss += loss.item()
                    _, predicted = torch.max(outputs.data, 1)
                    total += labels.size(0)
                    correct += (predicted == labels).sum().item()

            avg_val_loss = val_loss / len(val_loader)
            val_accuracy = 100 * correct / total
            print(f"Validation Loss: {avg_val_loss:.4f}, Validation Accuracy: {val_accuracy:.2f}%")

            if avg_val_loss < best_val_loss:
                best_val_loss = avg_val_loss
                epoch_no_improvement = 0
                best_model_parameters = deepcopy(model.state_dict())
            else:
                epoch_no_improvement += 1
                if epoch_no_improvement >= patience:
                    print(f"Early stopping at epoch {epoch+1}")
                    break


        if best_model_parameters is not None:
            model.load_state_dict(best_model_parameters)      
        
        model.eval()
        y_true = []
        y_pred = []

        start = time.time()
        with torch.no_grad():
            for images, labels in test_loader:
                images, labels = images.to(device), labels.to(device)
                outputs = model(images)
                _, predicted = torch.max(outputs.data, 1)
                y_true.extend(labels.cpu().numpy())
                y_pred.extend(predicted.cpu().numpy())
        end = time.time()

        accuracy = accuracy_score(y_true, y_pred)
        f1 = f1_score(y_true, y_pred, average='weighted')
        elapsed_time = end - start

        models_masks_accuracies["CNN"][mask].append(accuracy)
        models_masks_f1_scores["CNN"][mask].append(f1)
        models_masks_classification_times["CNN"][mask].append(elapsed_time)

        print(f"CNN - Test Accuracy: {accuracy:.2%}, Test F1 Score: {f1:.4f}, Classification Time: {elapsed_time:.2f} seconds")

        feature_extraction_start = time.time()
        X_train, y_train, X_test, y_test = feature_extractor(model, train_loader, test_loader)
        feature_extraction_end = time.time()
        feature_extraction_time = feature_extraction_end - feature_extraction_start

        svm = SVC(probability=True)
        svm.fit(X_train, y_train)
        start = time.time()
        y_pred = svm.predict(X_test)
        end = time.time()
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        classification_time = end - start + feature_extraction_time

        models_masks_accuracies["SVM"][mask].append(accuracy)
        models_masks_f1_scores["SVM"][mask].append(f1)
        models_masks_classification_times["SVM"][mask].append(classification_time)
        print(f"SVM - Test Accuracy: {accuracy:.2%}, Test F1 Score: {f1:.4f}, Classification Time: {classification_time:.2f} seconds")

        rf = RandomForestClassifier()
        rf.fit(X_train, y_train)
        start = time.time()
        y_pred = rf.predict(X_test)
        end = time.time()
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred, average='weighted')
        classification_time = end - start + feature_extraction_time

        models_masks_accuracies["Random Forest"][mask].append(accuracy)
        models_masks_f1_scores["Random Forest"][mask].append(f1)
        models_masks_classification_times["Random Forest"][mask].append(classification_time)
        print(f"Random Forest - Test Accuracy: {accuracy:.2%}, Test F1 Score: {f1:.4f}, Classification Time: {classification_time:.2f} seconds")

        knn_accuracies = []
        knn_f1_scores = []
        knn_classification_times = []
        for k in range(1, 32, 2): #k = 1 to 31
            knn = KNeighborsClassifier(n_neighbors=k)
            knn.fit(X_train, y_train)

            start = time.time()
            y_pred = knn.predict(X_test)
            end = time.time()
            accuracy = accuracy_score(y_test, y_pred)
            f1 = f1_score(y_test, y_pred, average='weighted')
            classification_time = end - start + feature_extraction_time

            knn_accuracies.append(accuracy)
            knn_f1_scores.append(f1)
            knn_classification_times.append(classification_time)

        knn_accuracies = np.array(knn_accuracies)
        max_idx = np.argmax(knn_accuracies)
        best_k = 2*max_idx+1

        accuracy = knn_accuracies[max_idx]
        f1 = knn_f1_scores[max_idx]
        classification_time = knn_classification_times[max_idx]

        models_masks_accuracies["KNN"][mask].append(accuracy)
        models_masks_f1_scores["KNN"][mask].append(f1)
        models_masks_classification_times["KNN"][mask].append(classification_time)
        print(f"KNN - Test Accuracy: {accuracy:.2%}, Test F1 Score: {f1:.4f}, Classification Time: {classification_time:.2f} seconds")

with open("masking_cv_results/models_masks_accuracies.pkl", "wb") as f:
    pickle.dump(models_masks_accuracies, f)
with open("masking_cv_results/models_masks_f1_scores.pkl", "wb") as f:
    pickle.dump(models_masks_f1_scores, f)
with open("masking_cv_results/models_masks_classification_times.pkl", "wb") as f:
    pickle.dump(models_masks_classification_times, f)
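
The cross-validation loop above stores one score per fold in nested dictionaries of the form {model: {mask: [scores]}}. A minimal sketch of how such a structure could be collapsed into a mean and standard deviation per model and mask is shown below. The helper name `summarise_cv_results` and the numbers in `example` are hypothetical and only illustrate the data shape; they are not results from this report.

```python
from statistics import mean, stdev


def summarise_cv_results(results):
    """Collapse {model: {mask: [per-fold scores]}} into {(model, mask): (mean, std)}."""
    summary = {}
    for model_name, per_mask in results.items():
        for mask_name, fold_scores in per_mask.items():
            # stdev needs at least two values, so report 0.0 for a single fold
            spread = stdev(fold_scores) if len(fold_scores) > 1 else 0.0
            summary[(model_name, mask_name)] = (mean(fold_scores), spread)
    return summary


# Placeholder numbers for illustration only (assumption, not measured values).
example = {
    "CNN": {"None": [0.5, 0.75, 1.0], "Centre": [0.25, 0.75]},
}

for (model_name, mask_name), (avg, spread) in summarise_cv_results(example).items():
    print(f"{model_name:<15}{mask_name:<12}{avg:.4f} +/- {spread:.4f}")
```

The same helper could be applied to the accuracy, F1 and classification-time dictionaries after loading them back with `pickle.load`, assuming the files were written by the loop above.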