Comparison between Models and Hyper-parameter Tuning

In this experiment, we implemented two models: one based on VGG features with logistic regression, and the other based on ResNet-50, as explained above. The validation accuracies obtained for both models are summarized in Table II.

As shown in Table II, ResNet-50 outperforms the VGG + Logistic Regression model, so we chose it for hyperparameter tuning. The higher validation accuracy of ResNet-50 can be attributed to its ability to capture complex and subtle differences in the images that even the human eye might struggle to distinguish. Its deeper architecture excels at extracting intricate features, making it effective for our binary classification task and leading to a validation accuracy of 93.36%, compared to 77.01% for the VGG + Logistic Regression model.
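As an illustration of the stronger model, the following sketch sets up a ResNet-50 for our binary (sweet vs. savory) task. PyTorch and torchvision are assumed; the pretrained-weight variant and the backbone-freezing choice are illustrative, not necessarily the exact configuration used in our experiments.

```python
import torch.nn as nn
from torchvision import models

def build_resnet50_binary(freeze_backbone: bool = True) -> nn.Module:
    """ImageNet-pretrained ResNet-50 with a 2-way head (sweet vs. savory)."""
    model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
    if freeze_backbone:
        # Optionally freeze the convolutional backbone and train only the head.
        for p in model.parameters():
            p.requires_grad = False
    model.fc = nn.Linear(model.fc.in_features, 2)  # replace the 1000-way head
    return model

model = build_resnet50_binary()
```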

With the ResNet-50 model, we initially experimented with a batch size of 32, 10 epochs, and a learning rate of 0.0001, achieving a validation accuracy of 89.16%. Since it is common practice to vary the batch size when evaluating model performance, we also tested batch sizes of 64 and 128 under the same configuration and observed the highest accuracy with a batch size of 64. This improvement is likely because the moderately larger batch size produces less noisy gradient estimates, allowing the model to generalize better and improving validation accuracy.

Upon analyzing the training and validation loss, we noticed that the model began overfitting after 5 epochs. We therefore limited training to 5 epochs while retaining a batch size of 64. We then tested learning rates of 0.001, 0.0001, and 0.00001 to assess their impact on model convergence and performance. The best performance was achieved with a learning rate of 0.00001, yielding a validation accuracy of 90.31%. Finally, we saved the model weights and used this model to perform binary classification on the test set, achieving a test accuracy of 90.32%.
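The tuning procedure described above can be sketched roughly as follows. PyTorch is assumed; run_config is a hypothetical helper standing in for our actual training code, and the commented call mirrors the final setting from Table III.

```python
import copy

import torch
import torch.nn as nn
from torch.utils.data import DataLoader

def run_config(model, train_ds, val_ds, batch_size, epochs, lr, device="cuda"):
    """Train one hyperparameter configuration and return the best
    validation accuracy together with the corresponding weights."""
    train_loader = DataLoader(train_ds, batch_size=batch_size, shuffle=True)
    val_loader = DataLoader(val_ds, batch_size=batch_size)
    criterion = nn.CrossEntropyLoss()
    # Only parameters that require gradients are optimized.
    optimizer = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    best_acc, best_state = 0.0, None
    model.to(device)
    for _ in range(epochs):
        model.train()
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = criterion(model(images), labels)
            loss.backward()
            optimizer.step()
        # Validation accuracy after each epoch.
        model.eval()
        correct = total = 0
        with torch.no_grad():
            for images, labels in val_loader:
                preds = model(images.to(device)).argmax(dim=1)
                correct += (preds == labels.to(device)).sum().item()
                total += labels.size(0)
        acc = correct / total
        if acc > best_acc:
            best_acc = acc
            best_state = copy.deepcopy(model.state_dict())
    return best_acc, best_state

# Final setting from Table III (batch size 64, 5 epochs, lr 1e-5):
# best_acc, best_state = run_config(model, train_ds, val_ds, 64, 5, 1e-5)
# torch.save(best_state, "resnet50_sweet_savory.pt")
```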

TABLE II: Validation Accuracy with different models

S.No. Model Validation Accuracy
1. VGG + Logistic Regression 77.01%
2. ResNet-50 93.36%

Performance Metrics

a) Accuracy: A traditional way to measure the performance of any machine learning model is to evaluate its accuracy. During training, we therefore tracked both training and validation accuracy to monitor the model's performance. The loss and accuracy plots for our final model are shown in Fig. 8 and Fig. 9. The accuracy obtained on the test set is 90.32%. This test accuracy reflects the model's ability to generalize to unseen data, demonstrating that training captured the underlying patterns of the dataset without significant overfitting.
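A minimal sketch of how such curves can be plotted is given below. Matplotlib is assumed, and train_hist and val_hist are hypothetical per-epoch lists collected during training, not values from our experiments.

```python
import matplotlib.pyplot as plt

def plot_curves(train_hist, val_hist, metric="Loss"):
    """Plot training vs. validation curves per epoch, as in Fig. 8 / Fig. 9."""
    epochs = range(1, len(train_hist) + 1)
    plt.plot(epochs, train_hist, label=f"Training {metric}")
    plt.plot(epochs, val_hist, label=f"Validation {metric}")
    plt.xlabel("Epoch")
    plt.ylabel(metric)
    plt.legend()
    plt.tight_layout()
    plt.show()

# Example: plot_curves(train_losses, val_losses, metric="Loss")
#          plot_curves(train_accs, val_accs, metric="Accuracy")
```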

b) Confusion Matrix and F1-score: Since this is a binary classification task, we also evaluated the model using Precision, Recall, and F1 score to gain a more comprehensive view of its performance. The model achieved precision and recall values of approximately 0.9, resulting in a high F1 score. This balance indicates that the model correctly identifies both classes while keeping false positives and false negatives low. The detailed performance metrics are reported in Table IV, and the confusion matrix is shown in Fig. 10.
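The metrics in Table IV can be computed along the following lines. Scikit-learn is assumed; the label and score arrays below are placeholder values for illustration, not our actual test outputs.

```python
import numpy as np
from sklearn.metrics import (average_precision_score, confusion_matrix,
                             f1_score, precision_score, recall_score,
                             roc_auc_score)

# Placeholder arrays for illustration only; in practice these come from
# the held-out test set (e.g. 0 = savory, 1 = sweet).
y_true = np.array([0, 1, 1, 0, 1, 0])
y_pred = np.array([0, 1, 0, 0, 1, 0])
y_score = np.array([0.10, 0.92, 0.45, 0.20, 0.81, 0.05])  # predicted P(class = 1)

cm = confusion_matrix(y_true, y_pred)                 # as visualized in Fig. 10
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)
roc_auc = roc_auc_score(y_true, y_score)
prc_auc = average_precision_score(y_true, y_score)    # area under the PR curve
```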

The F1 score of 0.9054 indicates that the model achieves a strong balance between precision and recall, effectively minimizing false positives and false negatives, and demonstrates its reliability in classifying images as either sweet or savory. However, there is still room for improvement on some ambiguous samples. For this project, we experimented with two model architectures, as described earlier. The classifier that gave the best result was ResNet-50, provided the hyperparameters were chosen appropriately. As network depth increases, the model's ability to handle edge cases improves, but at the cost of higher computational time and energy consumption; this trade-off must be considered when using deep networks like ResNet. Logistic regression, on the other hand, when provided with high-quality features (such as the top 100 selected features), achieved around 77% accuracy. While it is faster to compute and more efficient to train, it does not perform as well as deep models such as ResNet on more complex tasks.

c) Model Interpretability: Features learned by deep models such as ResNet-50 are difficult to visualize directly. To interpret our model's classifications, we therefore used Local Interpretable Model-Agnostic Explanations (LIME) [11], which provides insight into how the model makes predictions. LIME approximates the complex model locally with an interpretable one, highlighting which features or regions of the image contribute most to the classification decision. Figure 11 shows the output produced by LIME: it outlines a particular texture of the ice cream that the model used to conclude the dish is sweet, while for the pizza image it highlights the cheese and toppings as the regions that contributed most to classifying the image as "savory". LIME thus reveals the specific areas of the image the model attends to when distinguishing between the "Sweet" and "Savory" categories, offering transparency into which aspects of an image the model considers important when making its predictions.
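A hedged sketch of how such a LIME explanation can be produced is given below. The lime and scikit-image packages are assumed, and predict_fn is a hypothetical wrapper that maps a batch of H x W x 3 NumPy images to class probabilities from our ResNet-50.

```python
import numpy as np
from lime import lime_image
from skimage.segmentation import mark_boundaries

def explain_image(image: np.ndarray, predict_fn, num_samples: int = 1000):
    """Return the image with LIME's most influential superpixels outlined."""
    explainer = lime_image.LimeImageExplainer()
    explanation = explainer.explain_instance(
        image, predict_fn, top_labels=2, hide_color=0,
        num_samples=num_samples)
    # Keep only the regions that support the top predicted label.
    temp, mask = explanation.get_image_and_mask(
        explanation.top_labels[0], positive_only=True,
        num_features=5, hide_rest=False)
    return mark_boundaries(temp, mask)
```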

TABLE III: Model Training Details

Batch Size Epochs Learning Rate Validation Accuracy Validation Loss
32 10 0.0001 89.16% 0.3231
64 10 0.0001 90.21% 0.3636
128 10 0.0001 79.90% 0.8041
64 5 0.001 71.19% 0.5659
64 5 0.0001 90.03% 0.3715
64 5 0.00001 91.09% 0.3731

TABLE IV: ResNet-50 Model Performance Metrics

Metric Value
Test Accuracy 0.9032
Precision 0.8976
Recall 0.9134
F1 Score 0.9054
ROC AUC 0.9699
PRC AUC 0.9719


Fig. 8: Training and Validation Loss


Fig. 9: Training and Validation Accuracy


Fig. 10: Confusion Matrix