Group: Image 1: Brian Wang, Eden Cai, Kyle Chan, Michael Yip, Yuecheng Wang
Published
1 Jun 2025
Show code
import pandas as pd
from matplotlib import pyplot as plt
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
import pickle
import seaborn as sns
import numpy as np
from shinyapp_func import get_heatmap, get_transform
from PIL import Image

pio.renderers.default = "notebook_connected"
1 Executive summary
This report investigates the influence of denoising and image masking on classification model performance in medical imaging. Reliable diagnosis and treatment planning depend on effective preprocessing, because positional variation and image noise can significantly reduce diagnostic accuracy. The first research question investigates the impact of various denoising techniques on classification models. Using traditional denoisers (Gaussian Blur, Median Filter) and deep learning-based techniques (DnCNN, Restormer), we evaluate four models: CNN, SVM, Random Forest (RF), and K-Nearest Neighbours (KNN). The second research question investigates whether an image’s central areas are more important for classification. We test this using centre, non-centre, and random masking strategies, based on the assumption that diagnostic features tend to be central. This study addresses these questions and provides practical guidance on effective preprocessing techniques. In medical imaging, identifying the best denoising and masking strategies can lead to higher classification accuracy and more trustworthy diagnostic outcomes.
2 Background
This study grouped cell types into three primary categories. The non-tumorous category comprised B Cells and CD4+ T Cells (13,440 samples), chosen because they are key immune effectors in tumour microenvironments (Elena & Lopera, 2013). The non-invasive tumour category comprised DCIS 1 and DCIS 2 (24,606 samples), which capture different pre-invasive breast region patterns (Allred, 2010). The invasive tumour category comprised Invasive Tumours and Proliferative Invasive Tumours (38,149 samples).
In medical imaging, noise is abundant in images produced by digital modalities such as X-Ray, Ultrasound and CT. This reduces image quality and complicates the correct diagnosis of certain illnesses (Sai et al., 2023). Denoising is a crucial component of image pre-processing: it helps prevent incorrect disease diagnosis (Alanazi & Mercorelli, 2024), improves model accuracy, reduces overfitting, and contributes to the robustness of predictive models (Tamilselvi et al., 2024).
Beyond denoising, techniques such as masking can shape how machine learning (ML) models learn from specific cellular features. Cell images have characteristics that differ from those of natural images, suggesting that specific image areas may hold greater predictive power for ML models. Exploring masking strategies, such as central vs non-central regions, may provide insight into whether crucial diagnostic features lie in those areas. In medical imaging, disease-related features are often sparse and localised, with the remaining regions typically appearing normal and undifferentiated (Xie et al., 2024), highlighting the potential for masking to focus model attention on critical, yet limited, areas.
3 Methods
3.1 Research Question 1: Denoising Strategies vs. Classification Performance Analysis
To investigate the first research question, we designed a consistent experimental pipeline that compares four classification models on H&E-stained cell images:
CNN automatically learns complex visual patterns, such as nuclear shape, texture, and aggregation, which is ideal for classifying cell types directly from pixel data.
SVM works well in separating classes using extracted features like colour and texture, which helps detect subtle differences between similar tumour types effectively.
RF works well with high-dimensional and flattened images and is robust to noise, making it ideal for assessing the impact of denoising.
KNN is a simple and interpretable model relying on feature similarity, which is a useful baseline for evaluating how denoising improves classification performance.
The CNN model was trained with early stopping, a batch size of 128, and a learning rate of 0.001, while traditional models (SVM, RF, KNN) used flattened image features and were assessed using the F1 score.
In research question 1, the performance of three denoising groups is assessed, including:
No Denoiser (Raw Images)
Raw H&E images are used directly, considered as a baseline to measure the impact of denoising.
Classical Denoisers
Gaussian Blur reduces high-frequency noise using a bell-shaped kernel, considered widely used and easy to implement.
Median Blur preserves edges such as cell boundaries better than linear smoothing and is effective against salt-and-pepper noise, which commonly occurs in microscopy images. It also allows non-linear smoothing to be compared with Gaussian Blur.
Deep learning-based Denoisers
DnCNN (Denoising Convolutional Neural Network) is a deep learning model for removing complex noise in medical images. It uses multiple convolutional layers with batch normalisation and ReLU activation to learn a residual mapping from noisy images to clean images (Zhang et al., 2017).
Restormer applies transformer-based self-attention to denoise while keeping fine structural details in H&E images (Liu et al., 2023).
Middle-level denoising settings are used consistently: kernel size 5 for the classical filters, DnCNN pretrained with sigma=25, and Restormer using the blind model. See Figure 2.
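The difference between the two classical filters can be illustrated with a minimal NumPy sketch. It is not the pipeline used in this report: it assumes a single-channel image, reflect padding at the borders, and a hypothetical sigma of 1.0 for the Gaussian kernel (the report only fixes the kernel size at 5).

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def gaussian_kernel(size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Return a normalised size x size Gaussian kernel (sigma is an assumption)."""
    ax = np.arange(size) - (size - 1) / 2.0
    g = np.exp(-(ax ** 2) / (2.0 * sigma ** 2))
    kernel = np.outer(g, g)
    return kernel / kernel.sum()


def _windows(image: np.ndarray, size: int) -> np.ndarray:
    """Reflect-pad a 2D image and return every size x size neighbourhood."""
    pad = size // 2
    padded = np.pad(image, pad, mode="reflect")
    return sliding_window_view(padded, (size, size))


def gaussian_blur(image: np.ndarray, size: int = 5, sigma: float = 1.0) -> np.ndarray:
    """Linear smoothing: Gaussian-weighted average of each neighbourhood."""
    return np.einsum("ijkl,kl->ij", _windows(image, size), gaussian_kernel(size, sigma))


def median_blur(image: np.ndarray, size: int = 5) -> np.ndarray:
    """Non-linear smoothing: median of each neighbourhood."""
    return np.median(_windows(image, size), axis=(2, 3))


# Toy image: flat grey background with a single "salt" pixel.
toy = np.full((32, 32), 0.5)
toy[16, 16] = 1.0
median_out = median_blur(toy)
gaussian_out = gaussian_blur(toy)
```

On this toy image the median filter removes the isolated bright pixel entirely, while the Gaussian filter only spreads it over neighbouring pixels, which is the behaviour the list above attributes to each filter.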
To fairly assess the impact of each denoising method, a consistent preprocessing and evaluation pipeline was utilised:
All images were resized to 224×224, converted to tensors, and normalised using ImageNet statistics to ensure compatibility and stable training.
The dataset was split into 50% training and 50% testing, with the test set further divided into 10 equal partitions. Models were trained once on the training set and evaluated across all 10 test partitions, improving the computational efficiency while ensuring full test coverage.
Performance across all denoisers and models was measured using the F1 score to reflect classification quality.
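The split-and-partition protocol described above can be sketched as follows. This is an illustration only: it assumes a plain random shuffle (the report does not say whether the split was stratified), and the function name and seed are hypothetical.

```python
import numpy as np


def split_and_partition(n_samples: int, n_partitions: int = 10, seed: int = 0):
    """Shuffle sample indices, split them 50/50, and cut the test half into equal parts."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(n_samples)
    half = n_samples // 2
    train_idx, test_idx = indices[:half], indices[half:]
    # Ten disjoint test partitions that together cover the whole test half.
    test_parts = np.array_split(test_idx, n_partitions)
    return train_idx, test_parts


train_idx, test_parts = split_and_partition(n_samples=1000)
# A model would be fitted once on train_idx and then scored on each partition,
# giving ten F1 values per (model, denoiser) pair.
```

Training once and scoring on ten partitions gives a spread of scores for each configuration without the cost of refitting the model ten times.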
3.2 Research Question 2: Spatial Sensitivity Analysis
We use masking-based perturbation techniques to evaluate whether central image regions have a greater influence on classification performance. Additionally, we apply Gradient-weighted Class Activation Mapping (Grad-CAM) to interpret spatial attention in CNNs and highlight influential regions.
3.2.1 Masking Strategies
Three masking strategies were used to examine spatial sensitivity:
Centre Mask: While the outer areas are masked, the central circle, which has a radius of 80 pixels, is still visible. This mask only uses central features to test model classification.
Non-Centre Mask: By blocking the same central area, this mask enables an evaluation of the importance of peripheral features.
Random Mask: 50% of the image is randomly blocked by a 10x10 grid (with a fixed seed). This acts as a baseline to evaluate robustness to unstructured degradation.
See Figure 4. These masking techniques aim to show how models react to different forms of spatial perturbation. By comparing centre and non-centre masks, we evaluate region-specific sensitivity, that is, whether central features are more important than peripheral ones. In contrast, the random mask introduces unstructured feature loss, which lets us assess each model’s robustness to unpredictable degradation. Combined, these techniques reveal how robust and how widely distributed the informative features are, and therefore how different models extract and rely on spatial information.
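The three masks can be sketched with boolean arrays, where True marks a visible pixel. This is a minimal illustration under stated assumptions, not the report's implementation: images are taken to be 224×224, the "10x10 grid" is read as a grid of 10 by 10 cells of which half are blocked, and the seed value and function names are hypothetical.

```python
import numpy as np


def centre_mask(size: int = 224, radius: int = 80) -> np.ndarray:
    """Visible disc of the given radius around the image centre."""
    yy, xx = np.ogrid[:size, :size]
    c = (size - 1) / 2.0
    return (yy - c) ** 2 + (xx - c) ** 2 <= radius ** 2


def non_centre_mask(size: int = 224, radius: int = 80) -> np.ndarray:
    """Complement of the centre mask: the same disc is blocked."""
    return ~centre_mask(size, radius)


def random_grid_mask(size: int = 224, grid: int = 10,
                     blocked_fraction: float = 0.5, seed: int = 42) -> np.ndarray:
    """Block a fixed fraction of grid cells, chosen with a fixed seed."""
    rng = np.random.default_rng(seed)
    n_cells = grid * grid
    blocked = np.zeros(n_cells, dtype=bool)
    n_blocked = int(round(n_cells * blocked_fraction))
    blocked[rng.choice(n_cells, size=n_blocked, replace=False)] = True
    blocked = blocked.reshape(grid, grid)
    # Map every pixel coordinate to the index of the grid cell that contains it.
    edges = np.linspace(0, size, grid + 1).astype(int)
    cell = np.searchsorted(edges, np.arange(size), side="right") - 1
    return ~blocked[np.ix_(cell, cell)]


def apply_mask(image: np.ndarray, visible: np.ndarray) -> np.ndarray:
    """Zero out hidden pixels of an (H, W) or (H, W, C) image."""
    return image * (visible[..., None] if image.ndim == 3 else visible)


masked = apply_mask(np.ones((224, 224, 3)), centre_mask())
```

Under these assumptions the visible disc covers roughly 40% of a 224×224 image, so the centre and non-centre masks hide different amounts of the image, which is worth keeping in mind when comparing their scores.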
3.2.1.1 Justification of masking strategies
We selected 50% masking based on prior research (He et al., 2021), which shows that heavier masking (e.g., 80%) removes too much information, while lighter masking (e.g., 10%) does not sufficiently challenge the model. A 50% mask achieves a balance between disruption and learning.
3.2.2 Evaluation Approach
Models are trained on the full dataset, as masking is a lightweight preprocessing step. For evaluation, cross-validation was run on only 50% of the dataset, because cross-validating CNN models is computationally expensive and both computing resources and time were limited.
3.2.2.1 Evaluation metrics
Due to class imbalance, the F1-score was employed as the main metric, offering a fair evaluation of recall and precision. Because of figure limits, accuracy was computed but not included in visualisations.
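A small worked example shows why F1 is more informative than accuracy under class imbalance. The sketch below assumes macro averaging (the report does not state which averaging was used), and the function name is hypothetical.

```python
import numpy as np


def macro_f1(y_true, y_pred, n_classes: int) -> float:
    """Unweighted mean of per-class F1 scores (macro averaging is an assumption)."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    scores = []
    for c in range(n_classes):
        tp = np.sum((y_pred == c) & (y_true == c))
        fp = np.sum((y_pred == c) & (y_true != c))
        fn = np.sum((y_pred != c) & (y_true == c))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return float(np.mean(scores))


# Imbalanced toy labels: a classifier that always predicts the majority class.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10
accuracy = float(np.mean(np.array(y_true) == np.array(y_pred)))
f1 = macro_f1(y_true, y_pred, n_classes=2)
```

Here accuracy is 0.8 even though the minority class is never detected, while the macro F1 drops to about 0.44 because the minority class contributes an F1 of zero.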
3.2.3 Model Interpretability: Heatmap & Critical Area Contour
We applied Grad-CAM to visualise CNN attention. Red-highlighted regions in the heatmap indicate high model focus. A critical area contour overlay was also extracted to highlight the most influential areas. See Figure 5.
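The core Grad-CAM computation can be sketched in NumPy, given the feature maps of the last convolutional layer and the gradients of the class score with respect to them. The arrays below are synthetic, and the thresholded region is only a stand-in for the critical area contour: the threshold of 0.5 and the function names are assumptions, not the settings used in the app.

```python
import numpy as np


def grad_cam(activations: np.ndarray, gradients: np.ndarray) -> np.ndarray:
    """Grad-CAM heatmap from (K, H, W) feature maps and their gradients."""
    # Channel weights: global average of the gradients over the spatial axes.
    weights = gradients.mean(axis=(1, 2))
    # Weighted sum of feature maps, then ReLU to keep positive evidence only.
    cam = np.maximum(np.tensordot(weights, activations, axes=1), 0.0)
    peak = cam.max()
    return cam / peak if peak > 0 else cam


def critical_area(cam: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Boolean region where the normalised heatmap exceeds a hypothetical threshold."""
    return cam >= threshold


# Synthetic example: channel 0 supports the class, channel 1 opposes it.
acts = np.zeros((2, 4, 4))
acts[0, 1, 1] = 1.0
acts[1, 2, 2] = 1.0
grads = np.stack([np.ones((4, 4)), -np.ones((4, 4))])
heatmap = grad_cam(acts, grads)
region = critical_area(heatmap)
```

In the synthetic example only the location activated by the positively weighted channel survives the ReLU, so the heatmap and the thresholded region highlight that single pixel.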
4 Results
4.1 Research Question 1 Results
The results from Research Question 1 show that deep learning-based denoisers, especially DnCNN, achieved the highest F1 scores across all models (0.925 in CNN, 0.931 in SVM, 0.932 in RF, and 0.931 in KNN), highlighting their effectiveness in enhancing feature clarity. Restormer followed closely, with strong performance across models (e.g., 0.900 in CNN, 0.880 in RF). In contrast, traditional denoisers such as Gaussian and Median Blur produced inconsistent results. Gaussian Blur performed poorly in SVM (0.633) and KNN (0.746) but fairly well in CNN (0.898) and Random Forest (0.837). Median Blur performed poorly in SVM (0.617) and KNN (0.660) but well in CNN (0.903). These results show how sensitive the models are to image quality. For models without internal feature extraction, such as SVM and KNN, deep learning-based denoisers preserve important structures, whereas traditional techniques can oversmooth, removing diagnostic information along with the noise. DnCNN emerged as the most effective denoiser for classifying cell images across all models, with Restormer close behind. These results highlight the importance of deep learning-based denoising in improving model performance, particularly for traditional ML models. See Figure 1.
4.2 Research Question 2 Results
By comparing models under three masking strategies, centre, non-centre, and random masking, to an unmasked baseline, we were able to determine whether the centre of an image has a greater impact on classification performance. The F1 scores for the four models (CNN, KNN, SVM, and RF) are shown in side-by-side boxplots in the figure below. See Figure 2
Overall, the results indicated that no region is consistently more influential, with only slight performance differences between centre and non-centre masking. Although CNN and SVM showed a small drop under centre masking, the difference was negligible, suggesting weak spatial dependency. As expected, random masking produced the greatest variability and performance decline, most likely because important features were removed in an unstructured way. For CNN, the unmasked condition produced the highest F1 score (~0.97), and centre and non-centre masking produced similar, slightly lower scores, indicating that important features are dispersed throughout the image rather than concentrated in the centre. SVM and RF (~0.93 and ~0.92 respectively) showed no apparent preference between centre and non-centre regions. Finally, KNN does not explicitly model spatial information because it considers only the Euclidean distance between two vectors, which is consistent with reliance on global rather than localised features. In summary, models without any masking consistently outperformed those with masking, demonstrating that full image access yields the best classification performance. The minor difference between centre and non-centre masking indicates that critical information is spread fairly evenly throughout the image for all classification models, despite our initial expectation that the centre would be more informative (Islam et al., 2024).
4.3 Shiny App
We developed an interactive Shiny app that lets users upload a cell image, apply a denoising method or masking strategy, and classify it using one of four models. The app displays the processed image alongside a heatmap showing which regions most influenced the model’s prediction, and allows users to download the results. This tool is especially useful for those working in medical imaging and biomedical AI, offering an interactive way to explore how preprocessing choices affect model behaviour and to improve transparency and trust in AI-driven diagnostics.
5 Discussion
5.1 Research Question 1 Discussion
How do different denoising strategies influence the performance on different classification models?
5.1.1 Results Summary & Key Findings
Various denoising strategies, including DnCNN and classical methods, were applied across multiple classification models, and F1 scores were used to compare each denoiser’s performance across models.
According to Alanazi and Mercorelli (2024), precise denoising is vital for diagnostic reliability, a claim supported here by the stable performance improvements seen with deep learning-based methods. This supports our key finding that denoising can improve model performance: the no-denoiser baseline averaged an F1 score of about 0.65, and all models performed better with denoised inputs than with raw images.
Our key finding is consistent with the results of Zhang et al. (2017) in the original DnCNN work, namely that deep learning-based denoisers work better than classical denoisers. The deep learning-based denoisers (DnCNN, Restormer) achieved average F1 scores of 0.89-0.93, with DnCNN performing best at an average above 0.92 across all models, whereas the classical denoisers (Gaussian Blur, Median Blur) averaged an F1 score of around 0.79.
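The averages quoted here can be partly cross-checked against the per-model scores listed in Section 4.1. The sketch below covers only DnCNN and Gaussian Blur, because those are the only denoisers with all four per-model scores reported in the text. The dictionary is transcribed by hand from Section 4.1 and is not loaded from the experiment outputs.

```python
from statistics import mean

# F1 scores as quoted in Section 4.1 (order: CNN, SVM, RF, KNN).
reported = {
    "DnCNN": [0.925, 0.931, 0.932, 0.931],
    "Gaussian Blur": [0.898, 0.633, 0.837, 0.746],
}

# Unweighted mean over the four models for each denoiser.
averages = {name: mean(scores) for name, scores in reported.items()}

for name, value in averages.items():
    print(f"{name}: mean F1 = {value:.4f}")
```

The DnCNN mean lands near 0.93 and the Gaussian Blur mean near 0.78, in line with the ranges quoted above.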
5.1.2 Findings in different models over denoiser
Different models differ in how sensitive they are to the input image. SVM and KNN rely heavily on input image quality, as their performance varies widely depending on the denoiser. CNN and RF are more stable across all denoisers, although their performance can still be improved with better-denoised inputs: using DnCNN raised the F1 score above 0.92. In our experiments, DnCNN consistently yielded the highest F1 scores across all model types, which is consistent with the strong denoising performance reported by Zhang et al. (2017).
5.2 Research Question 2 Discussion
5.2.1 Results Summary & Key Findings
Based on the RQ1 results, where DnCNN achieved the highest average F1 score (0.93), this part of the study investigated whether the central region of cell images is more important for classification, using DnCNN-denoised inputs and spatial masking techniques. To assess spatial sensitivity, we tested four classifiers (CNN, SVM, Random Forest, and KNN) under three masking strategies (centre, non-centre, and random). Grad-CAM heatmaps and critical area contours further help interpret the model’s focus. Overall, the results showed that no single region was consistently most important across all models, which contradicted our initial hypothesis that central features would dominate on the basis of biological intuition (Islam et al., 2024). The median F1-score difference between the centre and non-centre masks was less than 1 across all classification models, suggesting that informative features are spatially distributed. All models scored above 0.9 in most cases, further indicating that classification of tumour cell images does not strictly rely on central framing, contrary to assumptions often made in medical imaging tasks.
5.2.2 Analysis of Poor Performance with Random Masking
Random masking increased performance variability across models and consistently resulted in lower F1 scores. Unlike the spatially localised and structured centre and non-centre masks, it removes features unpredictably, which raises the chance of obscuring important information and disrupting spatial coherence.
5.2.3 Surprising Observations
5.2.3.1 Research Question 1
One notable discovery was that deep learning-based denoisers were most effective for KNN. KNN’s F1 score was only 0.572 without denoising, but it significantly increased to 0.931 with DnCNN and 0.833 with Restormer. KNN is more reliant on high-quality inputs than models like CNN because it uses raw pixel distances, which makes it extremely sensitive to noise.
5.2.3.2 Research Question 2
Surprisingly, KNN outperformed more complex models and obtained the highest F1 score under the unmasked condition. This highlights that under the correct circumstances, straightforward techniques can still produce powerful, reliable results, contradicting the idea that sophisticated architectures or preprocessing are always better.
5.3 Limitations
Due to computational resource constraints and limited project time, only three cell classes were evaluated: Immune cells (13,440), Invasive Tumour (24,606), and Non-invasive Tumour (38,149). This significant class imbalance could bias the models towards the majority class. While F1 scores were used to provide a balanced evaluation across classes, additional techniques such as weighted sampling or class-balanced loss functions could have mitigated this issue.
The denoising methodology lacked consistency across experimental setups, employing varying hyperparameters, including different kernel sizes for Gaussian (5×5) and Median (5) filters, and different model configurations for the advanced denoisers (DnCNN with the Sigma25 model, Restormer with the Blind model). This inconsistency limits the comparability of the results, making it difficult to determine the true effectiveness of any single denoising approach. Additionally, the evaluation was limited to a small subset of available denoising models, which restricts the generalisability of the findings. Denoising was also treated as a separate preprocessing step before classification, making it a two-step process, rather than building an integrated model that handles both denoising and classification.
The masking strategy employed a simplified binary division of image regions into “Centre vs Non-Centre”, which oversimplifies feature localisation in medical images. Critical diagnostic features are not always centrally located and can appear in different regions throughout the image. A binary approach fails to capture nuances in the distribution of critical cell features, possibly leading to suboptimal model performance.
Other limitations include using data from a single source (H&E), which could limit generalisability to other datasets, and the reliance on F1 scores for evaluation, which may not capture all aspects of model performance relevant to medical applications.
5.4 Future Work
5.4.1 Expected Calibration Error (ECE)
A direction for further research is to evaluate the models’ Expected Calibration Error (ECE). While accuracy and F1 score measure predictive performance, they do not assess how well a model’s confidence aligns with its actual accuracy, a critical factor in high-risk applications like tumour classification. Overconfident but inaccurate predictions can have significant clinical consequences. Calculating ECE would make it possible to apply confidence calibration techniques to fine-tune the model, so that predicted probabilities are more reliable and consistent with actual uncertainty.
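ECE is commonly computed by binning predictions by confidence and taking the sample-weighted average of the gap between accuracy and mean confidence in each bin. The sketch below assumes ten equal-width bins and uses the top-class probability as the confidence. The function name and toy values are illustrative only.

```python
import numpy as np


def calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Bin-weighted mean gap between accuracy and confidence (ECE)."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a bin (lo, hi]; clip keeps indices inside [0, n_bins - 1].
    bin_ids = np.clip(np.digitize(confidences, edges[1:-1], right=True), 0, n_bins - 1)
    ece = 0.0
    for b in range(n_bins):
        in_bin = bin_ids == b
        if in_bin.any():
            gap = abs(correct[in_bin].mean() - confidences[in_bin].mean())
            ece += in_bin.mean() * gap  # weight by the share of samples in the bin
    return float(ece)


# Overconfident toy model: 95% confidence but only half the predictions are right.
overconfident = calibration_error([0.95] * 10, [1, 0] * 5)
# Well-calibrated toy model: 90% confidence and nine of ten predictions are right.
calibrated = calibration_error([0.9] * 10, [1] * 9 + [0])
```

In the toy cases the overconfident model has an ECE of 0.45, while the calibrated one is close to zero, even though neither number says anything about F1 on its own.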
5.4.2 Different Denoising Levels and Parameter Configurations
We compared several denoisers in our current analysis, but each one was applied with a fixed noise level. A model that is overly tuned to a particular denoising setting may not generalise well in real-world situations, where noise characteristics vary based on imaging conditions. Future research could systematically change parameters like Restormer input settings, DnCNN training noise levels, or Gaussian blur sigma values and kernel sizes, then assess the subsequent effects on classification accuracy.
6 Conclusion
With the highest F1 scores across all classification models, our results demonstrate that deep learning-based denoisers, particularly DnCNN, consistently outperform traditional techniques. Restormer follows DnCNN closely and also performs well. In contrast, traditional techniques like Gaussian blur and median blur yield inconsistent results. They generally perform poorly in feature-sensitive models like SVM and KNN, but they can occasionally help CNN and Random Forest. These results demonstrate how deep learning denoisers can preserve important diagnostic features. For image masking, we found that the performance differences between centre and non-centre masking are small across models. The fact that CNN, SVM, and RF did not exhibit a strong preference for central features suggests that the image’s diagnostic information is distributed evenly. Random masking consistently degraded performance, highlighting the significance of structured feature preservation. In summary, we recommend prioritising deep learning-based denoising methods in medical image classification tasks. Additionally, image analysis pipelines should not rely too heavily on central regions, as incorporating information from the entire image improves diagnostic robustness and accuracy.
7 Student contributions
Brian Wang (SID: 520070471)
Research question 1 model training and evaluation
Research question 2 model training and evaluation
Image masking (RQ2) implementation
Denoising, masking, prediction, heatmap, and critical area contour functionality in Shiny app
Shiny app
Research question 2 presentation
Research question 2 method, results, and discussion in the report
Eden Cai (SID: 500002722)
Record meeting minutes
Research question 1: Classical denoisers
Research question 1: Cross-validation and visualisation for SVM, RF, and KNN
Presentation: RQ1 and Q&A
Report: methods RQ1, results RQ1, and discussion surprising results
Kyle Chan (SID: 460120937)
Presentation: background
Assisting and draft in Shiny app
Report research question 1 discussion
Research question 1 argumentation
Michael Yip (SID: 530485661)
Presentation Slide and Speech
Research question 1: DnCNN code
Report: Background, Limitations
Yuecheng Wang (SID: 530214245)
Presentation: Demo and Q&A
Report: Introduction and Conclusion
Research question 1: Data Augmentation
Model Analysis and Assist in Shiny app
8 References
Alanazi, T. M., & Mercorelli, P. (2024). Precision Denoising in Medical Imaging via Generative Adversarial Network-Aided Low-Noise Discriminator Technique. Mathematics, 12(23), 3705. https://doi.org/10.3390/math12233705
Allred, D. C. (2010). Ductal Carcinoma In Situ: Terminology, Classification, and Natural History. JNCI Monographs, 2010(41), 134–138. https://doi.org/10.1093/jncimonographs/lgq035
Elena, L., & Lopera, D. E. (2013). Introduction to T and B lymphocytes. Nih.gov; El Rosario University Press. https://www.ncbi.nlm.nih.gov/books/NBK459471/
He, K., Chen, X., Xie, S., Li, Y., Dollár, P., & Girshick, R. (2021). Masked Autoencoders Are Scalable Vision Learners. https://arxiv.org/pdf/2111.06377
Islam, O., Assaduzzaman, M., & Hasan, M. Z. (2024). An explainable AI-based blood cell classification using optimized convolutional neural network. Journal of Pathology Informatics, 100389. https://doi.org/10.1016/j.jpi.2024.100389
Jebur, M. N., Elhoseny, M., Albahli, S., & Mohammed, F. M. (2023). Deep learning-based image denoising: A comprehensive review. Multimedia Tools and Applications. https://doi.org/10.1007/s11042-023-17468-2
Liu, Y., Ma, X., Li, J., Fan, D., & Lin, Z. (2023). Hyper-Restormer: Lightweight transformer for hyperspectral image restoration. arXiv. https://arxiv.org/abs/2312.07016
Sai, G. V., Chekuri Seshank, Krishna, & Jagjit Singh Dhatterwal. (2023). Reduction of Noise in Medical Imaging Quality. 364–368. https://doi.org/10.1109/icdt57929.2023.10150846
Tamilselvi, C., Yeasin, M., Paul, R. K., & Paul, A. K. (2024). Can Denoising Enhance Prediction Accuracy of Learning Models? A Case of Wavelet Decomposition Approach. Forecasting, 6(1), 81–99. https://doi.org/10.3390/forecast6010005
Xie, Y., Gu, L., Harada, T., Zhang, J., Xia, Y., & Wu, Q. (2024). Rethinking masked image modelling for medical image representation. Medical Image Analysis, 98, 103304. https://doi.org/10.1016/j.media.2024.103304
Zhang, K., Zuo, W., Chen, Y., Meng, D., & Zhang, L. (2017). Beyond a Gaussian Denoiser: Residual Learning of Deep CNN for Image Denoising. IEEE Transactions on Image Processing, 26(7), 3142–3155. https://doi.org/10.1109/tip.2017.2662206
9 Appendix
Show code
with open("denoised_cv_results/models_denoising_f1_scores.pkl", 'rb') as f:
    models_denoising_f1_scores = pickle.load(f)

df_nested = pd.DataFrame(models_denoising_f1_scores).T  # Models become rows
df_nested.reset_index(inplace=True)
df_nested = df_nested.melt(id_vars='index', var_name='Denoising', value_name='F1_Score')
df_nested.columns = ['Model', 'Denoising', 'F1 Score']
df_long = df_nested.explode('F1 Score', ignore_index=True)

fig = px.box(
    df_long,
    x="Model",
    y="F1 Score",
    color="Denoising",
    points=False,  # Hide raw data points
    title="F1 Score by Model and Denoising Method",
    hover_data=["Model", "Denoising"]
)

# Get the color mapping used in the plot
color_map = {
    trace.name: trace.marker.color
    for trace in fig.data
    if isinstance(trace, go.Box)
}

# Calculate means per group
mean_df = df_long.groupby(['Model', 'Denoising'])['F1 Score'].mean().reset_index()

# Add mean markers in matching color
for _, row in mean_df.iterrows():
    fig.add_trace(go.Scatter(
        x=[row['Model']],
        y=[row['F1 Score']],
        mode='markers',
        name=f"Mean ({row['Denoising']})",
        marker=dict(
            symbol='diamond',
            size=8,
            color=color_map.get(row['Denoising'], 'black')
        ),
        legendgroup=row['Denoising'],
        showlegend=False,
        hovertemplate=f"Model: {row['Model']}<br>Denoising: {row['Denoising']}<br>Mean: {row['F1 Score']:.4f}"
    ))

# Final plot styling
fig.update_layout(
    boxmode='group',
    template='plotly_white',
    xaxis_title="Model",
    yaxis_title="F1 Score",
    hoverlabel=dict(bgcolor="white", font_size=12)
)
fig.show()
Figure 1. F1 Score by Model and Denoising Method
Show code
with open("masking_cv_results/models_masks_f1_scores.pkl", 'rb') as f:
    models_masks_f1_scores = pickle.load(f)

df_nested = pd.DataFrame(models_masks_f1_scores).T  # Models become rows
df_nested.reset_index(inplace=True)
df_nested = df_nested.melt(id_vars='index', var_name='Mask', value_name='F1_Score')
df_nested.columns = ['Model', 'Mask', 'F1 Score']
df_long = df_nested.explode('F1 Score', ignore_index=True)

fig = px.box(
    df_long,
    x="Model",
    y="F1 Score",
    color="Mask",
    points=False,  # Hide raw data points
    title="F1 Score by Model and Mask Method",
    hover_data=["Model", "Mask"]
)

# Get the color mapping used in the plot
color_map = {
    trace.name: trace.marker.color
    for trace in fig.data
    if isinstance(trace, go.Box)
}

# Calculate means per group
mean_df = df_long.groupby(['Model', 'Mask'])['F1 Score'].mean().reset_index()

# Add mean markers in matching color
for _, row in mean_df.iterrows():
    fig.add_trace(go.Scatter(
        x=[row['Model']],
        y=[row['F1 Score']],
        mode='markers',
        name=f"Mean ({row['Mask']})",
        marker=dict(
            symbol='diamond',
            size=8,
            color=color_map.get(row['Mask'], 'black')
        ),
        legendgroup=row['Mask'],
        showlegend=False,
        hovertemplate=f"Model: {row['Model']}<br>Mask: {row['Mask']}<br>Mean: {row['F1 Score']:.4f}"
    ))

# Final plot styling
fig.update_layout(
    boxmode='group',
    template='plotly_white',
    xaxis_title="Model",
    yaxis_title="F1 Score",
    hoverlabel=dict(bgcolor="white", font_size=12)
)
fig.show()