Application of Stacked Ensemble Techniques on Ensemble Feature Selection Techniques for Classifying Recurrent Head and Neck Squamous Cell Carcinoma Prognosis

This study aimed to identify the optimal combination of the stacked ensemble (SE) and heterogeneous ensemble feature selection (HETR-EFS) techniques for classifying HNSCC recurrence patterns. Four SE classification models were developed on feature subsets provided by various EFS techniques, using a GBM meta-classifier in each case. The results showed that the SE technique consisting of five base classifiers, implemented on the heterogeneous ensemble feature (HETR-EF) subset, achieved better performance than the SE techniques built on the other ensemble feature subsets. Thus, an SE technique with five base classifiers learned on HETR-EFs is clinically appropriate as a prognostic model for classifying and predicting HNSCC patients' recurrence data. The study also highlights the importance of finding the machine learning algorithm that performs best for a given data distribution, as not all algorithms are created equal.
 
 
 
 
 
 
 


INTRODUCTION
For successful and complete destruction of malignant cells in the body, the treatment of recurrent head and neck squamous cell carcinoma (HNSCC) requires a correct prognosis in order to define the kind and extent of therapy. In the interim, many prognostic models based on clinical and histopathologic parameters for recurrent HNSCC have been researched and developed, not from a medical perspective but from a scientific point of view in different fields, using statistical methods, Artificial Intelligence (AI), and ML techniques to address the issue of the patient's disease recurrence [5,9]. In medicine, a patient's disease is determined by its signs and symptoms (called a diagnosis), while the prognosis is the study of how the disease will affect the patient. Cancer has been classified as a heterogeneous disease with various subgroups. The goal of applying ML approaches has been to construct models for the progression and management of cancer subtypes. Various machine learning (ML) techniques, such as but not limited to Artificial Neural Networks (ANNs), Random Forest (RF), GBM, NB, and GLM, have been applied in a wide range of cancer research to build predictive models from complex datasets. These models are known to offer effective and highly accurate decision support, highlighting their importance. Even though the majority of these ML approaches produce useful results, they fall short of the high accuracy requirements for predicting the complex cancer environment, and they also exhibit insufficient generalization ability when predicting the labels of upcoming unobserved data or scenarios. Finding a classifier model that successfully predicts and classifies the labels of future, ambiguous data is the aim of classification. Therefore, a classification model should not overfit the training data; rather, it should be general enough to encompass previously unseen situations.
However, ensemble learning, which turns a number of base classifier models into a strong one by combining them, may produce a strong classifier with good generalization ability. It is extremely difficult to obtain a single classification model with this ability for both feature selection and model training. Among the ensemble techniques of bagging, boosting, and stacking, stacking (or stacked generalization) has been reported in the literature as the most successful at improving a classification model's generalization capacity. This heterogeneous ensemble learning method uses a meta-learning algorithm to aggregate the strongest set of numerous base learners into a strong learner. Indeed, stacking ensemble learning has been observed to yield more accurate findings in the numerous techniques and studies for which it has been used. Integrating the stacked ensemble technique with a heterogeneous ensemble feature selection technique is therefore proposed as the way to overcome these drawbacks and build a classification model with strong generalization ability.
One of the key issues in machine learning is feature selection (FS). Analysing a classifier's ability to predict a label, together with the features it makes use of, can help researchers learn more about numerous applications, such as bioinformatics or neurology. Additionally, efficient feature selection creates classifiers that are parsimonious, need less memory, and train and test more quickly. It can also lower feature extraction costs and improve generalization ability. When looking for linear dependencies between features and labels, linear feature selection algorithms like LARS are quite effective. They fall short, though, when features interact nonlinearly. Nonlinear interactions can be handled by nonlinear homogeneous ensemble feature selection techniques such as random forest, or by recently developed kernel methods [30,24]. But as the size of the training set increases, their computational and memory cost often increases super-linearly, which becomes more of an issue as datasets get bigger. Scalability and nonlinear feature selection must be balanced; however, this is still an unsolved issue. A feature selection method should, in theory, be able to extract pertinent features with high reliability, recognize nonlinear feature interactions, scale linearly with the number of features and dimensions, and take into account known sparsity structures. This study investigated how stacked ensemble techniques with varying numbers of base classifiers can be employed in the prognosis of HNSCC recurrence based on ensemble (heterogeneous or homogeneous) features provided by selected EFS techniques, by building a prognostic model able to classify recurrence for HNSCC prognosis. It evaluated these stacked ensemble models on various ensemble feature subsets of varying difficulty and size, and demonstrated that HETR-EFS tends to match or outperform the accuracy/feature-selection trade-off of random forest FS and of the current state of the art in nonlinear feature selection, gradient boosted feature selection (GBM-EFS). The study also showcased the ability of HETR-EFS to naturally incorporate side information about inter-feature dependencies on a real-world biological classification task.

LITERATURE REVIEW
Several studies in the domain of cancer have been conducted using stacking ensemble learning techniques. Some of these are: [2] proposed a one-layer stacked ensemble-based model (with 10-fold CV) having two single base classifiers, NB and SVM, for the classification of recurrent breast cancer prognosis, where DT was used as a meta-classifier to stack the base classifiers. [4] developed a one-layer stacking ensemble technique having three single base classifiers, KNN, NB, and DT (C4.5), with GLM as a meta-learner, able to predict the types of cancer around the HNC regions (sinonasal, nasopharyngeal, laryngeal, and thyroid). [3] applied the same technique (KNN, NB, and DT (C4.5) as base learners, and GLM as a meta-learner) in the diagnosis of HNC susceptibility to facilitate prompt referral.
To generate a stacked ensemble model, [18] proposed a stacking ensemble-based algorithm that found the optimal weighted average of diverse base learners for the classification of various healthcare datasets (Wisconsin Breast Cancer, Pima Indian Diabetes, and Indian Liver Patient), using GBM, DRF, and DNN as base learners and GLM as a meta-learner. Their techniques consisted of a stacking or super learner having two base classifiers (GBM and DRF) and one having three base classifiers (GBM, DRF, and DNN), and they concluded that the super learner having three base learners outperformed the one having two. Their recommendation was thus that future studies should investigate including diverse base learners and meta-learners in stacking ensembles for various healthcare datasets. Based on this recommendation, [19] proposed a stacking ensemble-based algorithm that found the best meta-learner in a stacking ensemble for classifying breast cancer, using GBM, DRF, DNN, and GLM as base learners, each of which was re-learned as a meta-learner to determine the best meta-learner in the stacking ensemble having four base learners. Their study showed that using specific models as a meta-learner resulted in better performance than single classifiers, and that using GBM or GLM as a meta-learner is appropriate as a supporting tool for classifying breast cancer data. Thus, the overall purpose of the present study was to develop a stacking ensemble classification model as a supporting tool that combines weak/base ensemble classifiers and single base classifiers, needed for robust prognosis for early diagnosis and treatment outcomes, based on the optimal feature subset of clinical, histopathologic (pathologic), and genomic markers, including other risk factors and treatment types associated with HNSCC recurrence in Ghana. There has not yet been any study on recurrent HNSCC prognosis using the same or an adapted stacking ensemble technique in Ghana. Based on the ML algorithms considered by [19] as the most effective for providing an ensemble classification model, all have been employed in this study, with the inclusion of NB, to experiment with a stacked ensemble consisting of five (5) base classifiers, at least one more than the state-of-the-art stacked ensemble model consisting of a maximum of four (4) base classifiers in HNC prognosis. NB was chosen from among the most effective single base classifiers (DT, KNN, NB, and SVM) considered by the previous studies, based on its performance on the experimental data.

Bagging
Bagging, sometimes referred to as Bootstrap Aggregation, is an ensemble machine learning technique that combines weaker base learners into a stronger learner. Bootstrapping is the process of creating datasets by random sampling with replacement, and these varied random subsets of the data are used to train different classifiers. Bagging is the term used to describe the process of using this technique to integrate different classifiers (e.g., decision trees). As a result, bagging simply refers to building each classifier or tree using a unique random subset of the dataset drawn with replacement. To create the final prediction, the predictions from the independent classifiers are averaged (regression) or decided by majority vote (classification). Random Forest (RF) is a frequently utilized algorithm of this kind. A model that overfits the training data will have its complexity reduced using the ensemble-based random forest technique [8,23]. Random forest is explored and employed in the study's feature selection and classification model learning processes.
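The bagging idea above can be sketched as follows. The paper used H2O's DRF in R; this is a hedged, illustrative substitute using scikit-learn's RandomForestClassifier on synthetic data sized like the study's cohort (125 instances, 18 features), not the authors' actual implementation or results.

```python
# Illustrative bagging sketch (assumption: scikit-learn in place of H2O/R).
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small clinical dataset (125 instances, 18 features).
X, y = make_classification(n_samples=125, n_features=18, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each tree is grown on a bootstrap sample drawn with replacement;
# the final class is decided by majority vote across trees.
rf = RandomForestClassifier(n_estimators=200, bootstrap=True, random_state=0)
rf.fit(X_tr, y_tr)
print(round(rf.score(X_te, y_te), 2))
```

The same fitted forest also exposes `feature_importances_`, which is how a bagged model can double as a feature selector, as done later in the paper.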

Boosting
In order to reduce training error, boosting is a homogeneous ensemble learning method that combines a homogeneous group of weak learners into a strong learner. In boosting, a randomly chosen sample of data is selected, fitted with a learner, and learners are then trained consecutively. In other words, each learner seeks to make up for the shortcomings of its predecessor. One strong prediction rule is created by combining the weak rules from each learner during each cycle. The three widely used approaches of adaptive, gradient, and extreme gradient boosting (AdaBoost, GradientBoost, and XGBoost) are the main emphasis of boosting strategies [29]. GradientBoost, or GBM, is discussed and used for feature selection and learning a classification model for the study's purposes.
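A minimal sketch of the gradient boosting idea follows. Again, the paper used H2O's GBM in R; scikit-learn's GradientBoostingClassifier is an assumed substitute on synthetic data, shown only to make the sequential error-correction mechanism concrete.

```python
# Illustrative boosting sketch (assumption: scikit-learn in place of H2O/R).
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a small clinical dataset.
X, y = make_classification(n_samples=125, n_features=18, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

# Each new tree is fitted to the gradients (errors) of the current ensemble,
# so every learner compensates for the shortcomings of its predecessor.
gbm = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1,
                                 random_state=0)
gbm.fit(X_tr, y_tr)
print(round(gbm.score(X_te, y_te), 2))
```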

Stacking
Stacking is a technique that combines multiple heterogeneous weak/base learners into a learner more robust than the individual base learners. This technique combines the predictions of different individual base learners to make a final robust prediction. Where weak or base learning algorithms are rightfully blended, a meta-learner with lower variance and bias can be developed. Stacking uses cross-validation to estimate the performance of multiple base learning algorithms. The output from the base learners, called "level-one" data in the stacking literature, serves as input to the meta-learning algorithm(s). Stacking learns a high-level classifier on top of the base classifiers. It can be viewed as a form of meta-learning where the base classifiers, also known as first-level classifiers, are combined to train a second-level classifier, also known as a meta-classifier. Based on the literature review and the study's objectives, the base classifiers GBM, DRF, DNN, NB, and GLM were employed for feature selection and classification learning.

Step 1: Learn first-level classifiers
  for t = 1 to T do
    Learn a base classifier h_t based on D
  end for
Step 2: Construct new data sets from D
  for i = 1 to n do
    Construct a new data set that contains {(x_i', y_i)}, where x_i' = {h_1(x_i), h_2(x_i), ..., h_T(x_i)}
  end for
Step 3: Learn a second-level classifier
  Learn a new classifier h' based on the newly constructed data set
  return H(x) = h'(h_1(x), h_2(x), ..., h_T(x))
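The three steps of the stacking pseudocode can be translated into Python roughly as follows. This is a hedged sketch using scikit-learn on synthetic data (an assumption; the paper used H2O in R), with the level-one dataset built from out-of-fold predictions so the meta-classifier does not train on leaked base-classifier outputs.

```python
# Illustrative two-level stacking, following the Step 1-3 pseudocode.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=125, n_features=18, random_state=0)
base = [GradientBoostingClassifier(random_state=0),
        RandomForestClassifier(random_state=0),
        GaussianNB()]

# Step 1: learn first-level (base) classifiers on D.
for clf in base:
    clf.fit(X, y)

# Step 2: construct the level-one dataset {(x_i', y_i)} where x_i' collects
# the base classifiers' (out-of-fold) predicted probabilities for instance i.
Z = np.column_stack([
    cross_val_predict(clf, X, y, cv=10, method="predict_proba")[:, 1]
    for clf in base])

# Step 3: learn the second-level (meta) classifier on the new dataset.
meta = GradientBoostingClassifier(random_state=0).fit(Z, y)
print(Z.shape)  # one level-one column per base classifier
```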

Proposed Approach: Multi-Level Stacked Ensemble Learning
For the purpose of this study, a novel approach for classifying and predicting recurrent HNSCC prognosis was proposed in the HNC environment. This approach extended the existing stacking techniques discussed in the literature by improving in the areas of:
- Stronger and more diverse base classifiers, against the small number of classifiers used in previous studies.
- More diverse meta-classifiers, against the small number of meta-classifiers used in previous studies.
- Combining meta-classifier models with heterogeneous ensemble feature selectors, against the previous existing approaches in which no such combination was learned.
The detailed description of the technique is as follows. Using the GBM, DRF, DNN, NB, and GLM techniques for feature selection, ordering the features according to their importance in order to provide an optimal feature subset, consider a labeled dataset D = {(x_i, y_i)} of n patients (instances) with feature vectors x_i. Consider also an ensemble {FS_1, ..., FS_t} consisting of t base feature selectors (FS), where t is the number of feature selectors. Each feature selector provides a feature subset of selected features. To implement the aggregation in the EFS technique, the summation SUM of the subsets generated by the t FS algorithms is computed. For each feature j in the subset SUM, compute an index (weight) of importance w_j, and obtain the weighted feature subset SUMW, where m is the number of features in the summation. The importance of feature j is determined by the ratio of the number of times it is present in the feature subset SUM to the number t of FS algorithms, i.e., w_j = c_j / t. Sort the m features according to their importance. Finally, based on a threshold, select the features that exceed the threshold from the importance-ordered features to obtain the optimal feature subset, so that a new, reduced dataset is formed.
Here w_j is the weight of feature j, and c_j is the number of times feature j is present in the subset SUM.
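The aggregation step above can be sketched in a few lines. The per-selector subsets below are illustrative stand-ins (an assumption, not the paper's exact Table 6 contents), but the weighting rule w_j = c_j / t and the 0.80 threshold follow the description in the text; the six consensus features match those the paper reports as optimal.

```python
# Illustrative ensemble feature selection aggregation: a feature's weight is
# the fraction of base selectors that chose it; keep features with w_j >= 0.80.
from collections import Counter

# Hypothetical top-feature subsets from the five base selectors.
subsets = {
    "GBM": {"Smoke", "Nodes", "TreatCCRT", "TreatRT", "Invasion", "p63", "Age"},
    "DRF": {"Smoke", "Nodes", "TreatCCRT", "TreatRT", "Invasion", "p63", "HPV"},
    "DNN": {"Smoke", "Nodes", "TreatCCRT", "TreatRT", "Invasion", "p63"},
    "GLM": {"Smoke", "Nodes", "TreatCCRT", "TreatRT", "Invasion", "p63", "p16"},
    "NB":  {"Smoke", "Nodes", "TreatCCRT", "TreatRT", "Invasion", "p63", "Drink"},
}

t = len(subsets)                                   # t = 5 base selectors
counts = Counter(f for s in subsets.values() for f in s)
weights = {f: c / t for f, c in counts.items()}    # w_j = c_j / t
optimal = sorted(f for f, w in weights.items() if w >= 0.80)
print(optimal)
# -> ['Invasion', 'Nodes', 'Smoke', 'TreatCCRT', 'TreatRT', 'p63']
```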
This paper presented four different stacked ensemble learning techniques on two different ensemble feature selection techniques: one being the heterogeneous EFS (the combination of GBM, DRF, DNN, GLM, and NB) and the other being homogeneous EFS. The first stacked ensemble used two base classifiers, namely gradient boosting machine (GBM) and distributed random forest (DRF); the second used three base classifiers, namely GBM, DRF, and deep neural network (DNN); the third used four base classifiers, namely GBM, DRF, DNN, and generalized linear model (GLM); and the fourth used five base classifiers, namely GBM, DRF, DNN, GLM, and Naïve Bayes (NB); in each case, a GBM meta-classifier was used. Various cancer data subsets related to HNSCC, provided by the various feature selection techniques used in this study, were used to compare the performance of the stacked ensemble models on these data subsets. The evaluation results confirmed that stacked ensemble techniques built on the gradient boosted feature subset (GBM-FS) can perform better than stacked ensemble techniques built on feature subsets provided by other feature selection techniques. Similarly, the evaluation results confirmed that the stacked ensemble technique consisting of five base classifiers can perform better than the other stacked ensemble techniques considered on the five feature subsets of the HNSCC dataset. To achieve better performance, the base classifiers GBM, DRF, DNN, GLM, and NB from H2O were selected. For the meta-classifier, the GBM model was used, as it was the best performing base classifier among those considered in this study, as shown in Figure 1. To obtain data subsets for learning the stacked ensemble techniques, each base classifier was used to perform feature selection, ranking the features according to their importance; using an 80% threshold, feature subsets were obtained as shown in Tables 4 and 5 respectively. Algorithm 4 shows the learning of stacked ensemble models with 10-fold cross-validation.
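The four stacked ensemble configurations with a shared GBM meta-classifier and 10-fold cross-validation can be sketched as below. This is a hedged approximation using scikit-learn's StackingClassifier on synthetic data (an assumption; the study used H2O's stacked ensembles in R, per Algorithm 4).

```python
# Illustrative sketch: four stacked ensembles with 2, 3, 4, and 5 base
# classifiers, each stacked by a GBM meta-classifier with 10-fold CV.
from sklearn.datasets import make_classification
from sklearn.ensemble import (GradientBoostingClassifier,
                              RandomForestClassifier, StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=125, n_features=18, random_state=0)

# Stand-ins for GBM, DRF, DNN, GLM, and NB (H2O equivalents assumed).
base = [("gbm", GradientBoostingClassifier(random_state=0)),
        ("drf", RandomForestClassifier(random_state=0)),
        ("dnn", MLPClassifier(max_iter=500, random_state=0)),
        ("glm", LogisticRegression(max_iter=1000)),
        ("nb", GaussianNB())]

models = {}
for k in (2, 3, 4, 5):  # SE with 2, 3, 4, then 5 base classifiers
    models[k] = StackingClassifier(
        estimators=base[:k],
        final_estimator=GradientBoostingClassifier(random_state=0),
        cv=10).fit(X, y)
```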

Dataset
To evaluate the performance of the stacked ensemble classification models, a retrospective cohort study sampled 125 HNSCC patients out of a population of 185 patients aged ≥15 years, previously diagnosed with HNSCC subtypes including laryngeal, hypopharyngeal, nasopharyngeal, and oropharyngeal cancer and treated with curative intent at KBTH between 2016 and 2020, whose cancer reached remission but who, over time, had either recurrence or non-recurrence. For each patient, information on gender, age at diagnosis, alcohol drinking habit, smoking habit, quid chewing habit, primary site of tumor, tumor stage at diagnosis, histological grade, tumor size, depth of invasion front, cervical lymph/neck nodes, pathological tumor staging, pathological lymph nodes, family history of cancer, human papillomavirus level, p16 type, p63 type, and type of treatment was taken into consideration. The dataset has a total of 125 instances, 18 attributes (features), and a class label with a binary outcome coded 1 (recurrence) or 0 (non-recurrence). There are 33 female and 92 male records. A summary of this dataset is shown in Table 1. In medical research it takes time to collect sufficient samples, as most patients are usually lost to follow-up before it can be checked whether or not they had a recurrence, so the sample size is usually small. HNSCC is considered recurrent if the patient was treated with curative intent and, after the cancer reached remission, redeveloped HNSCC. Patients who received palliative treatment intent and still had cancer are not considered cancer-recurrent patients. Unfortunately, most patients received palliative-intent treatment and only a few could receive curative intent due to financial difficulties, resulting in a small number of instances. The number of features in the dataset (18 attributes) is considered too many compared to the sample size (125 instances). Thus, a feature selection method is needed to reduce the number of features and select only those that are significant to the classification model. The original dataset was subjected to five feature selection techniques, namely GBM, DRF, DNN, GLM, and NB, each of which provided a feature subset of the data as shown in Table 4. Training data (75%) and test data (25%) were constructed for each data subset. A machine learning library for the R programming language was used. Data augmentation was used to improve model performance. This technique comprises a set of methods used to artificially increase the number of data samples in a dataset; deep learning models generalize well when the number of training samples is large, so in this way state-of-the-art models can be created with fewer data samples available. The data augmentation technique is usually applied in computer vision applications where domain-specific data, such as medical data, is not abundantly available; hence its usage here.

Data Pre-Processing
A normalised predictive-mode approach was used to identify and fill the missing instances. This imputation approach is suitable for categorical (nominal) data; it is therefore feasible in this study, where the number of training examples is very small, without needing to discard or delete cases having missing training instances under any feature. For data discretisation and transformation, one-hot encoding was used for features with more than two levels in order to have a normalised dataset for training, evaluation, and prediction, so that the initial 18 features became 35 features in the dataset to be considered for learning. Table 1 presents a description of the HNSCC dataset and Table 6 presents the feature subsets that were ready for ingestion into the model training and evaluation process.
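The pre-processing steps described above can be sketched as follows: most-frequent (mode-style) imputation of missing categorical values, one-hot encoding of multi-level features, and a 75/25 train/test split. Column names and values are illustrative stand-ins for the HNSCC attributes, and pandas/scikit-learn are assumed substitutes for the R pipeline actually used.

```python
# Illustrative pre-processing sketch (hypothetical toy columns).
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.DataFrame({
    "Smoke": ["yes", "no", None, "yes", "no", "yes", "no", "yes"],
    "Grade": ["G1", "G2", "G3", "G1", None, "G2", "G3", "G1"],
    "Recur": [1, 0, 1, 0, 0, 1, 0, 1],
})

# Fill missing categorical values with the column mode (simple stand-in for
# the predictive-mode imputation described in the text).
for col in ("Smoke", "Grade"):
    df[col] = df[col].fillna(df[col].mode()[0])

# One-hot encode features with more than two levels (Grade here),
# which is how 18 original features can expand to 35 encoded ones.
df = pd.get_dummies(df, columns=["Grade"])

X, y = df.drop(columns="Recur"), df["Recur"]
train_X, test_X, train_y, test_y = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)
print(sorted(df.columns))
```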

Performance Metrics
To measure the performance of a classification model on the ensemble feature subsets of the recurrent HNSCC prognosis dataset, the performance measures most commonly used in cancer prognosis were utilised. These metrics are: accuracy, logloss, recall, specificity, and Area Under the Receiver Operating Characteristic Curve (AUROC).
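The five reported metrics can be computed as below. The labels and probabilities are illustrative, not the paper's results; scikit-learn is an assumed substitute for the H2O metrics used in the study. Specificity is obtained as the recall of the negative class.

```python
# Computing accuracy, logloss, recall, specificity, and AUROC on toy data.
from sklearn.metrics import (accuracy_score, log_loss, recall_score,
                             roc_auc_score)

y_true = [1, 0, 1, 1, 0, 0, 1, 0]          # illustrative ground truth
y_prob = [0.9, 0.2, 0.8, 0.6, 0.3, 0.4, 0.7, 0.1]  # predicted P(recurrence)
y_pred = [1 if p >= 0.5 else 0 for p in y_prob]

accuracy = accuracy_score(y_true, y_pred)
logloss = log_loss(y_true, y_prob)
recall = recall_score(y_true, y_pred)                 # sensitivity
specificity = recall_score(y_true, y_pred, pos_label=0)  # recall of class 0
auroc = roc_auc_score(y_true, y_prob)
print(accuracy, recall, specificity, round(auroc, 2))
```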

DISCUSSION
This study compared the performance of stacked ensemble techniques implemented on the various feature subsets of the HNSCC dataset provided by the ensemble feature selection techniques used in this study. The stacked ensemble techniques were trained on the training set and evaluated on the test set for each data subset. Table 7 shows the performance of the proposed stacked ensemble technique having two base classifiers (GBM and DRF) on the test set for the different feature subsets of the data; Table 8 shows the performance of the technique having three base classifiers (GBM, DRF, and DNN); Table 9 shows the performance of the technique having four base classifiers (GBM, DRF, DNN, and GLM); and Table 10 shows the performance of the technique having five base classifiers (GBM, DRF, DNN, GLM, and NB), in each case on the test set for all the data subsets used in this study. Table 8 also shows the top 20 most important features (out of 35) by FS method. The base feature selectors were learned in an ensemble on the overall dataset, each ranking the features according to their importance to the class label. Out of 35 features, by default, each base feature selector ranked the top 20 features considered important and ignored the remaining 15. To obtain the feature subset for heterogeneous ensemble feature selection (HETR-EFS), the top 20 features from all base selectors were aggregated in an ensemble. Features whose ranking ranged between 80% and 100% were considered potentially important and so could be included in the feature subsets summation for the optimal feature subset. Then, to determine the optimal feature subset, based on feature importance in the feature subsets summation, the frequency of each feature was divided by the number of base feature selectors, five (5). A feature was therefore considered important or significant in the feature subsets summation if its ratio was at least 0.80 (80%). Table 6 shows the features considered important by each base selector as well as by their ensemble. Based on these feature subsets, the classification models were learned to measure the robustness and effectiveness of each ensemble feature selection technique as well as of each stacked ensemble technique, using accuracy, logloss, recall, specificity, and AUC as evaluation metrics.
Table 9: HPV, TreatCCRT, Nodes, GradeG3, Drink, Smoke, PlNN2, Age, TreatRT, Invasion, p16, p63, StageIV, Nodes, TreatCCRT, TreatRT, Invasion, p63.
Table 10 shows the optimal features obtained by the various EFS techniques; these features are: smoking habit (Smoke), cervical lymph/neck nodes (Nodes), treatment with concurrent chemoradiotherapy (TreatCCRT), treatment with radiotherapy (TreatRT), depth of invasion front (Invasion), and p63 type, as the most accurate prognostic factors for HNSCC recurrence based on the available HNSCC dataset. In addition, Tables 15, 16, and 17 show the performance comparison of the various stacked ensemble techniques implemented on each ensemble feature subset of the data used in this study. For the data subsets provided by each ensemble feature selection technique, the best results are obtained using stacked ensemble learning. Table 15 shows the performance comparison of the various stacked ensemble techniques implemented on the test set of the HETR-EFS data subset. It can be observed that the stacked ensemble technique having five base classifiers performed better than the other techniques implemented on the same HETR-EFS subset. For this data subset, the best accuracy (93.55%), logloss (0.2038), recall (90.91%), specificity (94.37%), and AUC (0.9671) are obtained using the stacked ensemble technique having five base classifiers, followed by the technique having four base classifiers with accuracy (90.32%), logloss (0.2993), and recall (82.61%). The best specificity (96.83%) is obtained by the stacked ensemble technique having three base classifiers.
In Table 16, the best accuracy (90.63%), logloss (0.2959), specificity (100%), and AUC (0.9251) are obtained using the stacked ensemble technique having five base classifiers, followed by the technique having four base classifiers with accuracy (88.17%) and logloss (0.3042), for the GBM-EFS feature subset of the data. The best recall (92.06%) is obtained by the stacked ensemble technique having three base classifiers. For the DRF-EFS data subset, the best accuracy (88.17%) and recall (92.65%), with the highest logloss (0.3041), are obtained using the stacked ensemble technique consisting of five base classifiers, followed by the technique having four base classifiers with accuracy (84.38%), logloss (0.4141), and specificity (95.00%). The best AUC (86.23%) is obtained by the stacked ensemble technique having three base classifiers. The information in Tables 11, 12, and 13 is plotted in Figure 2. Even though the individual ensemble feature selectors GBM and DRF performed well under the various stacked ensemble models, the performance

Figure 1 :
Figure 1: Architecture of Stacked Ensemble Model

Figure 2 :
Figure 2: Performance plots of Stacked ensemble techniques on various ensemble feature subsets

Table 4 .
Stacking with K-fold (K=10) Cross-Validation on Ensemble Feature Selection

Table 8 .
Top 20 Most Important Features (out of 35) by Ensemble Feature Selection . Feature Subset Selected

Table 11 .
Performance of Stacked Ensemble Model (Model-GBM2) consisting of Two Base Classifiers (GBM and DRF) on Test Data

Table 15 .
Performance comparison of stacked ensemble models on HETR-EFS test set