What Machine Learning Model Would You Use To Predict Medical Diagnoses

Introduction

There are many concerns about increasing reliance on technology. Nevertheless, as a society, we continue to push technology to new heights. From how we order food to how we provide healthcare, machine learning and artificial intelligence continue to help us surpass our wildest dreams. These are some advantages:

1. Improve the diagnosis procedure

This is very important, especially if early detection and treatment can bring the best results. Using such an algorithm can literally save lives.

Some universities have found, through the creation of databases, that their artificial intelligence can do as well as doctors in the diagnosis process and can also do better in early detection. Artificial Intelligence can also give diagnosis suggestions based on the structured data entered for symptoms, give medication suggestions based on the diagnosis code, and predict adverse drug reactions based on other medications taken.

2. Prevention and rapid treatment of infection

Organizations such as Health Catalyst are using artificial intelligence to reduce hospital-acquired infections (HAI). If we can find these dangerous infections early, we can reduce the mortality and morbidity associated with them.

3. Crowdsourcing research

With the help of machine learning, companies are working to better understand medical problems through crowdsourcing. Having a larger and more diverse database can help the research be more accurate. In addition, these research methods become more accessible to people from marginalized communities who might otherwise not be able to participate. Finally, participating in research helps patients feel more capable while providing meaningful feedback.

4. Medications

Machine learning has been used to improve medication, from anesthesia to breast cancer treatment to daily medication. The famous IBM supercomputer Watson is working with companies such as Pfizer to enhance drug discovery, especially for immune diseases and cancer. Google has been involved for several years and found that machine learning has impressive potential in guiding and improving treatment ideas. Personalized medicine is also the key to treating patients in the future, and machine learning is helping to personalize patients' feedback on their medications.

Objective

This article aims to contribute to the development of technologies related to Machine Learning applied to medicine by building a project where a neural network model can give a diagnosis from a chest x-ray image of a patient, and to explain its operation as simply as possible. As we are going to deal mainly with lung-related issues, it's appropriate to know more about what they really are, in addition to their causes, effects, and history.

Respiratory illnesses are the ones that affect the respiratory system, which is responsible for supplying oxygen to the whole body. These illnesses are produced by infections, tobacco, smoke inhalation, and exposure to substances such as radon or asbestos. This group includes illnesses such as asthma, pneumonia, tuberculosis, and Covid-19.

1. Respiratory illnesses

The first big epidemic belonging to this group of illnesses was the one produced by tuberculosis, which affected the lungs. This epidemic was driven by the unacceptable labor conditions of the Industrial Revolution. The disease had been known for many centuries, but it was at that moment when it was first considered a huge health problem that caused plenty of deaths and remarkable losses. Respiratory illnesses started being treated at the start of the nineteenth century with the invention of the stethoscope by the French doctor René Théophile Hyacinthe Laennec. Since that moment, the measures against this kind of illness have been divided into prevention (vaccines) and medical assistance to ill people.

2. Covid-19

Covid-19, which has had a great impact on our world in the last 2 years, is a respiratory disease caused by the virus SARS-CoV-2, which affects the respiratory tract. It's transmitted through the air and the droplets emitted when talking, sneezing, or coughing. It appeared for the first time in December 2019 in Wuhan, China, and it rapidly spread across the whole world until, on 11 March 2020, it was declared a pandemic by the WHO.

Previous attempts

Several decades ago, Artificial Intelligence (AI) became a paradigm that was the basis of a lot of computing projects applied to very different fields of our life. One of them was health, where the influence of AI is growing every day. Moreover, nowadays nobody knows the limits in this area. Due to the present worldwide pandemic situation, AI has also been applied to Covid-19 disease treatment. In this work, a chest X-ray image classification system is proposed using Machine Learning. In particular, a Deep Learning prototype was implemented to carry out the corresponding image recognition. More precisely, it is made up of several layers of convolutional artificial neurons, as well as a set of dense neuron layers (Multilayer Perceptron).

The classification accuracy obtained was greater than 95% using images never input into our system. In addition, a recent image interpretation technique from computer vision has been tested, namely Grad-CAM, which tries to highlight the most influential image areas used by a Convolutional Neural Network in a classification problem. As of now, it has not been verified whether the areas obtained by Grad-CAM are similar to the ones lung specialists consider for the pneumonia diagnosis.

Although several studies have shown that the accuracy of neural networks is over 95%, some people still think that the human eye is more efficient than AI. Due to the rapid increase of Covid-19 in less than three months, we haven't got a stock of annotated images. Nevertheless, some researchers have developed a mechanism called "DeTraC" that can transfer knowledge from generic object recognition to the specific task. Many researchers consider this method to be easy and quick to use. The reported accuracy of this method is 95.12%, with a sensitivity of 97.91%, a specificity of 91.87%, and a precision of 93.36%.

There are other models, like the one from Alibaba, which works on computed tomography images. This Chinese AI algorithm achieved 96% effectiveness in the detection of Covid-19 pneumonia.

"COVID-Net" is an open-access model that lets researchers improve the AI tool that detects SARS-CoV-2. The goal of this AI network is to promote the development of highly accurate and practical deep learning solutions to detect COVID-19 cases and speed up the treatment of those most in need. In conclusion, due to the lack of laboratory test kits, in addition to delayed results and limited resources, AI would be a valuable tool for health systems and patients.

How does it work?

To solve the problem of classifying a patient's chest x-ray image and providing an accurate diagnosis, as we discussed before, we will use an Artificial Intelligence technique called Convolutional Neural Networks. This kind of network consists of an algorithm that takes an image as input and detects a set of visual features for which it has been trained. The goal in this particular project is to classify the input image into one of the 5 possible diagnoses that our model can give. The network will use the recognized image characteristics to feed a classical Fully-Connected Neural Network and get, as output, a probability prediction for the five different classes to which the image may belong. The class with the highest activation (probability) is the one the model selects as the correct diagnosis for the x-ray image.

Convolutional Neural Networks can also be helpful in similar computer vision tasks like object and face recognition, image segmentation, self-driving cars, and video game automation, as well as in weather forecasting, natural language processing (NLP), and even fighting climate change, like this initiative that uses modern Machine Learning algorithms to classify waste.

To simplify and better visualize what the model does, let's consider this example:

Image extracted from https://www.siemens-healthineers.com/

For this image, the model would output the following predictions:

[ 0.001 99.925  0.005  0.067  0.001]
['Bacterial Pneumonia', 'COVID', 'Normal', 'Tuberculosis', 'Viral Pneumonia']

The previous set of values means that the processed image has a 0.001% probability of belonging to the 'Bacterial Pneumonia' class, a 99.925% probability of belonging to the 'COVID' class, a 0.005% probability for the 'Normal' class, a 0.067% probability for the 'Tuberculosis' class, and a 0.001% probability for the 'Viral Pneumonia' class.

As you can see, the class with the highest probability (activation) is the second. Therefore, it indicates that the most likely diagnosis is 'COVID'. Now, you might be thinking, is the highest value of the prediction set always chosen for the final classification? Well, it depends on the type of problem you are facing. A threshold may determine the value from which a class has to be considered active or not. In this way, dealing with activations is a pretty good artificial approximation of how our brain works.
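The selection step described above can be sketched in a few lines of plain Python. The class names and probabilities are the ones from the example output; the helper names and the default threshold of 50 are assumptions made only for illustration and are not taken from the project.

```python
# Class names and probabilities (in percent) taken from the example output above.
CLASSES = ['Bacterial Pneumonia', 'COVID', 'Normal', 'Tuberculosis', 'Viral Pneumonia']
PREDICTIONS = [0.001, 99.925, 0.005, 0.067, 0.001]


def pick_top_class(predictions, classes):
    """Return the class with the highest activation together with its value."""
    best_index = max(range(len(predictions)), key=lambda i: predictions[i])
    return classes[best_index], predictions[best_index]


def active_classes(predictions, classes, threshold=50.0):
    """Return every class whose activation reaches the (hypothetical) threshold."""
    return [name for name, value in zip(classes, predictions) if value >= threshold]


label, confidence = pick_top_class(PREDICTIONS, CLASSES)
print(f"{label}: {confidence}%")
print(active_classes(PREDICTIONS, CLASSES))
```

Picking the maximum always returns exactly one class, while a threshold may return none or several, which is why the choice depends on the problem.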

Convolutional networks in depth

First of all, before building the Convolutional Neural Network and the Fully Connected Network of the model that we are going to use to solve the problem, we need to collect a large amount of data in the form of labeled images. 'Labeled' means that experts in radiology and medicine have already classified the pictures that we get from public datasets.

These are some of the best places to get large amounts of data for artificial intelligence and machine learning projects:

In this case, we built a dataset with a total of 23472 image files extracted from the following sites:

– COVID, Pneumonia and Normal

– Tuberculosis

Preprocessing data

Data preprocessing is crucial in machine learning, where missing, unlabeled, mislabeled, or inconsistently sized data can ruin the training of the model that will learn features from that data. So it's vital to apply the corresponding technique to ensure the data is ready to be fed into the training process.

Data preprocessing is typically applied using Python libraries, like Keras in this case. For example, if an image file is broken or empty, it should be removed from the dataset to prevent errors. In another case, if a file has a different size than the rest, it should be resized to the correct dimensions. In addition, there is a data preprocessing technique called 'data augmentation,' which can sometimes help improve the performance of our model. By rotating, rescaling, moving, and applying a set of transformations to the input images, it can increase the capability of the model to generalize the features it learns.

In the case of diagnosing chest x-ray images, we don't need to 'augment' our dataset due to the small variety of positions that x-ray pictures will have. Nevertheless, we still need to preprocess our dataset to make sure the images are ready to be used in the training process.
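As a rough illustration of the cleaning step mentioned above, the following sketch flags files that are empty or do not start with a JPEG or PNG signature. It is not taken from the project: the helper names are hypothetical, it assumes the dataset only contains those two formats, and a matching signature does not guarantee that the rest of the file is intact.

```python
import os

# Leading bytes ("magic numbers") of the two formats assumed here.
JPEG_MAGIC = b"\xff\xd8"
PNG_MAGIC = b"\x89PNG"


def find_bad_images(root):
    """Return paths under `root` that are empty or lack a JPEG/PNG signature."""
    bad = []
    for folder, _subfolders, files in os.walk(root):
        for name in sorted(files):
            path = os.path.join(folder, name)
            with open(path, "rb") as handle:
                header = handle.read(4)  # an empty file yields b""
            if not (header.startswith(JPEG_MAGIC) or header.startswith(PNG_MAGIC)):
                bad.append(path)
    return bad


def remove_bad_images(root):
    """Delete the files reported by find_bad_images and return how many were removed."""
    bad = find_bad_images(root)
    for path in bad:
        os.remove(path)
    return len(bad)
```

Calling find_bad_images first and inspecting the list before deleting anything is the safer order when working on a real dataset.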

Here you have an example of how the data is preprocessed in this project:

1. First, we import the libraries that we are going to need to develop the project.

import tensorflow as tf
import numpy as np
import matplotlib.pyplot as plt
import itertools
import math
import os

2. The dataset is hosted on GitHub, so we have to download it by cloning the repository containing it.

!git clone https://github.com/cardstdani/covid-classification-ml.git

3. After cloning the repository, the dataset will be located in the path indicated in the 'DIR' variable. The following code splits the main dataset into a training and a test set with a ratio of 90/10, respectively.

This data splitting aims to train the model with the training set and test its accuracy and generalization capability with the test set, so the model will never 'see' any data from the test set during training. Random images are selected from all the dataset classes in the splitting process. Also, the 'seed' parameter sets the seed of the random number generator involved in selecting the images that go into each set, so using the same value in both calls keeps the two sets from overlapping.
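The idea of a seeded 90/10 split can be sketched with the standard library alone. This is only an illustration of the principle, not how Keras implements it internally; the function name and the file names are hypothetical.

```python
import random


def split_dataset(file_names, validation_split=0.1, seed=42):
    """Shuffle with a fixed seed and return (training, test) lists."""
    shuffled = sorted(file_names)          # fixed starting order
    random.Random(seed).shuffle(shuffled)  # same seed -> same shuffle
    test_size = int(len(shuffled) * validation_split)
    return shuffled[test_size:], shuffled[:test_size]


# Hypothetical file names standing in for the real images.
files = [f"image_{i:03d}.png" for i in range(100)]
train_files, test_files = split_dataset(files)
print(len(train_files), len(test_files))
```

Because both subsets come from the same seeded shuffle, running the split twice with the same seed yields the same partition, and no file can land in both sets.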

As you can see below, the 'image_dataset_from_directory' function from the Keras API has a parameter called 'image_size,' which normalizes the size of all images within the dataset directory. If some sample has an inconsistent size, the 'smart_resize' option resizes it to the dimensions given in 'image_size' without distorting its aspect ratio.

DIR = "/content/covid-classification-ml/Covid19_Dataset"

# Note: smart_resize is the name used by older TensorFlow releases; recent ones
# appear to expose this option as crop_to_aspect_ratio, so check your version.
train_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DIR,
    validation_split=0.1,
    subset="training",
    seed=42,
    batch_size=32,
    smart_resize=True,
    image_size=(256, 256),
)

test_dataset = tf.keras.preprocessing.image_dataset_from_directory(
    DIR,
    validation_split=0.1,
    subset="validation",
    seed=42,
    batch_size=32,
    smart_resize=True,
    image_size=(256, 256),
)

4. The number of classes is determined automatically by the number of subfolders inside the main dataset folder, as you can see in this image.

Next, all the dataset's classes are contained within the 'class_names' property of the 'train_dataset' object. Finally, the rest of the code improves the performance of the training process. You can find more information about how 'tf.data.AUTOTUNE' optimizes it at the following link.

classes = train_dataset.class_names
numClasses = len(train_dataset.class_names)
print(classes)

AUTOTUNE = tf.data.AUTOTUNE
train_dataset = train_dataset.prefetch(buffer_size=AUTOTUNE)
test_dataset = test_dataset.prefetch(buffer_size=AUTOTUNE)

Artificial Neural Networks (ANN)

As we are working with 'labeled' data, it's essential to know that we face a supervised learning problem. This learning process refers to a set of problems in which the input data is tagged with the result that the algorithm should come up with on its own. That's the meaning of 'labeled' data. Usually, these problems are mainly classification and regression. Still, there are other learning methods in Artificial Intelligence, like unsupervised learning. This last process consists of an algorithm that learns patterns from an unlabeled dataset. That means it doesn't have clear instructions on what it should give as output. Hence, the model automatically finds correlations in the data by extracting useful features and analyzing its structure, all without labeled data. To understand the different learning methods used in machine learning in more depth, you can read the following resources:

But before explaining its structure and functioning, let's first learn a little more about Artificial Neural Networks (ANN). In general, artificial neural networks are computational models inspired by the biological brain that constitute the core of deep learning algorithms. Their main components are neurons, named for their similarity with the most basic unit that composes a biological neural network. Furthermore, these neurons are arranged in a series of layers in which each neuron is connected with all the neurons in the next layer. This connection is sometimes called a synapse. With that initial structure, the information can propagate from the input to the network's output with the aim of making the whole network learn.

In the above image, you have a representation of a neuron, which is the node that takes n input values x0, x1, x2, ..., xn and multiplies each value by a specific weight assigned to each input w0, w1, w2, ..., wn. After performing the weighted sum, it adds a bias value to the result and feeds it into an activation function, whose output will finally be the neuron's output.
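The computation of a single neuron can be sketched as follows. The input values, weights, and bias are hypothetical numbers chosen to make the arithmetic easy to follow, and ReLU is assumed as the activation function purely for illustration.

```python
def relu(value):
    """Rectified Linear Unit: negative inputs become 0."""
    return max(0.0, value)


def neuron(inputs, weights, bias, activation=relu):
    """Weighted sum of the inputs plus a bias, passed through an activation."""
    weighted_sum = sum(x * w for x, w in zip(inputs, weights))
    return activation(weighted_sum + bias)


# 1.0*0.5 + 2.0*(-1.0) + 3.0*2.0 + 0.5
print(neuron([1.0, 2.0, 3.0], [0.5, -1.0, 2.0], bias=0.5))
# A negative weighted sum is clipped by ReLU.
print(neuron([1.0, 2.0], [-1.0, -1.0], bias=0.0))
```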

Here you have some of the most used activation functions in Machine Learning. Usually, the two most common are ReLU (Rectified Linear Unit) and Softmax, the latter used to calculate probabilities. In general, their main purpose is to help the whole network learn more complex patterns in the data that is input into the network. This concept is sometimes referred to as nonlinearity. You can learn more about it in the following resource:

To visualize what this process is doing, let's see one of the most famous cases for an Artificial Neural Network: linear regression.

In the above image, you can see a two-dimensional dataset represented as blue points on a chart. Also, you can observe a red line that is the model's approximation to the data trend. As the formula for a line is y = mx + b, a single neuron with one input value x (with its respective weight m) and a bias b (the intercept) could be enough to build the model for this particular case due to the linearity of the dataset.

But with randomly selected parameter values, we can't fit the model's trend to the dataset's trend. So we need to train the model to make it learn the pattern that models the data. And the most basic and widely used way to do this is to modify the neuron's parameters until it gives the best result at fitting the data.

The indicator used to evaluate the model's performance is a loss function that can vary depending on the problem we face. Still, a Mean Squared Error or Mean Absolute Error could help us solve the problem. If you don't fully understand these terms, ask yourself the following question: How can I calculate how well the model performed on the dataset?

The most intuitive solution would be computing the average of all the absolute differences between the predictions of the model and the actual values. That's the Mean Absolute Error. But if you square all the differences before you add them to get the average, you will get the Mean Squared Error. There are many more ways to calculate the loss that has to be minimized to get an accurate model. If you want to know more, please take a look at this resource:
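The two losses described above can be written directly from their definitions. The sample values are hypothetical and chosen so that the averages are easy to check by hand.

```python
def mean_absolute_error(actual, predicted):
    """Average of the absolute differences between targets and predictions."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mean_squared_error(actual, predicted):
    """Average of the squared differences between targets and predictions."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


actual = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.5, 2.0, 6.0]
print(mean_absolute_error(actual, predicted))  # (0 + 0.5 + 1 + 2) / 4
print(mean_squared_error(actual, predicted))   # (0 + 0.25 + 1 + 4) / 4
```

Squaring makes the largest error (2.0) dominate the Mean Squared Error, which is why that loss penalizes outliers more strongly than the Mean Absolute Error.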

So with a loss function already selected to calculate the model's performance, we need a method to change the model's parameters to make it learn while minimizing the loss function. This method is called the optimizer of the model. To intuitively understand how these optimization algorithms work, let's look at the following animation:

As you can see on the left side, we have our model trying to fit the provided dataset. And on the right, there is a graph with the main parameter of the model on the x-axis, which in this case refers to the slope of the line, and the respective loss function value for the model's output on the y-axis. So to optimize (fit) the model parameters and minimize the loss simultaneously, we apply an algorithm called Gradient Descent, based on a simple set of steps.

To explain it with an analogy, let's imagine you are in a pool where the water temperature is different in specific zones. Suppose that your goal is to reach the location in that pool where the water temperature is highest, but you only have a thermometer. A first approach to reach that point would be an algorithm made of the following steps:

First, start at a random point and evaluate in which direction the temperature increases the most.
Go forward one step in the direction you found suitable.
Repeat this process until you think you have reached your goal.

Similarly, the Gradient Descent algorithm can be defined with the following steps:

First, start at a random point and calculate the gradient (derivative) of the function that relates the parameters to the loss of the model at that point.
Tweak the parameters by a certain amount (learning rate) to follow the opposite direction of the gradient (slope of the derivative). If the learning rate is too high, it will have difficulties reaching the minimum value. In contrast, it will take a lot of unnecessary time to finish the process if it's too small.
Repeat this process until it finds the parameters that minimize the loss.
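The steps above can be sketched for the simplest possible model, a line y = slope * x fitted with the Mean Squared Error. Everything here is an illustrative assumption: the dataset, the fixed starting point (instead of a random one, to keep the run reproducible), the learning rate, and the number of steps.

```python
def loss(slope, xs, ys):
    """Mean Squared Error of the line y = slope * x on the dataset."""
    return sum((slope * x - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def gradient(slope, xs, ys):
    """Derivative of the loss with respect to the slope."""
    return sum(2 * x * (slope * x - y) for x, y in zip(xs, ys)) / len(xs)


def gradient_descent(xs, ys, start=0.0, learning_rate=0.05, steps=200):
    """Repeatedly move the slope against the gradient of the loss."""
    slope = start
    for _ in range(steps):
        slope -= learning_rate * gradient(slope, xs, ys)
    return slope


# Hypothetical dataset generated by the line y = 2x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]
fitted = gradient_descent(xs, ys)
print(round(fitted, 4), round(loss(fitted, xs, ys), 6))
```

On this dataset a learning rate of 0.05 converges towards a slope of 2, while a value such as 0.15 overshoots further on every step and diverges, which illustrates the warning about learning rates that are too high.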

As you can imagine, optimizing a neural network is not so simple. Furthermore, in more complex models, local minima might limit the potential of the architecture used to solve the problem. So more advanced techniques and parameters like momentum or Nesterov momentum are added to improve the results of these optimization algorithms. To learn more about optimizers in Machine Learning, visit the following resources:

After understanding the possibilities of an Artificial Neural Network made of one simple neuron, it's important to visualize how it can be scaled to solve large and complex problems. The image below shows a list of the main network architectures used nowadays in a broad range of real-life applications.

To keep this article from growing too long, you can learn more about Neural Networks in detail with the following video series:

Detecting features in images using Convolutions

In this section, we will focus on the problem of making a machine able to 'see' things in images like we humans do. First, let's look at what a feature is and why it's essential in computer vision with the following example.

As the clearest example, imagine that we want to detect white pixels in a 2×2 grayscale picture like this:

In this example, white pixels (1) are our feature, which we want to detect in the image, and 0 refers to black pixels. A first approach would be comparing all input values with the pixel value of the feature we are searching for and storing the results in a 2×2 matrix full of zeros and ones representing how much our feature is present in each pixel. So when a pixel value of the image is black (0), a zero is stored in the result matrix, also known as a feature map. That approach is called Convolution in computer vision. But to understand it better, let's learn how it works on a larger scale with a bigger image and a more complex feature.

In this case, you can see how the feature we want to detect looks on the right side. While performing the convolution operation (sliding the filter through all possible positions of the image) on the shown 4×4 image, we will use a filter equal to the feature to know how similar each zone of the image is to the filter. This process is the key to understanding how a machine can emulate the human sense of sight.

In the first example, we were using a filter of 1×1 size because we wanted to go pixel by pixel detecting a white one, but that isn't enough when we have pixel dependencies and complex features in images like edges, crosses, shapes, faces, wheels, and any other property that humans can recognize instantly in a picture. That's why in the second example, a 2×2 filter is used, and in larger or more complex images, larger filters with more specific values in them are applied.

To visually understand what's going on inside a convolution operation, let's take a look at this GIF:

As you can see, the filter colored in yellow is slid across all the image pixels colored in green, starting from the upper left corner and finishing in the bottom right corner of the input image. At the same time, all the coefficients inside it are multiplied by the values of the pixels that it covers. The multiplication operation turns out to be the best way to compute the 'presence' of the feature defined by the filter in the set of image pixels it is compared to at each step. Once these calculations are done, the resulting numbers are added and divided by the number of pixels the filter has, as if it were an average. Finally, the resulting value is stored in the corresponding position in the feature map, colored in light red.
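The sliding, multiplying, and averaging described above can be sketched in plain Python. The 4×4 image and the 2×2 diagonal filter are hypothetical, and the function assumes a square image and a square kernel without padding.

```python
def convolve(image, kernel, stride=1):
    """Slide `kernel` over `image` and return the feature map (no padding)."""
    k = len(kernel)
    out_size = (len(image) - k) // stride + 1
    feature_map = []
    for row in range(0, out_size * stride, stride):
        out_row = []
        for col in range(0, out_size * stride, stride):
            total = sum(
                image[row + i][col + j] * kernel[i][j]
                for i in range(k)
                for j in range(k)
            )
            out_row.append(total / (k * k))  # averaged, as described above
        feature_map.append(out_row)
    return feature_map


# Hypothetical binary image: a diagonal at the top left, a filled block at the bottom right.
image = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
# Filter that looks for a diagonal going from top left to bottom right.
kernel = [
    [1, 0],
    [0, 1],
]
for row in convolve(image, kernel):
    print(row)
```

The highest values of the feature map appear where both diagonal pixels of the filter fall on white pixels of the image.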

Here you have a more complex example where the filter, also called 'kernel' and colored in blue, is used to apply a convolution transformation to the input image represented as the matrix 'I.'

When processing an image in RGB color mode, which is usual, three filters are applied in the convolution process, one for each color channel representing the picture. Strictly speaking, they are not three separate filters. There is only one filter, but it has a depth dimension of three, which lets it reach every color channel in the convolution operation.

At this stage, you may notice that the size of the feature map provided by the Convolution as output is smaller than the original image size. This is due to the nature of the algorithm: a kernel larger than 1×1 cannot be centered on the border pixels, and the stride property of the kernel, which determines how many pixels it moves horizontally or vertically in each step, shrinks the output further. For example, in the following animation, you have a convolution with a stride of 2. Notice that the kernel moves 2 pixels each step, causing the feature map to be much smaller than before.

The solution to the 'problem' of the output size relies on adding 'padding' to the input image. The padding adds one or several outer frames, commonly of zero values, to the matrix representing the picture. So the kernel will have more space to cover the entire image and produce a feature map with the same dimensions as the provided input.
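The effect of kernel size, stride, and padding on the output can be checked with the usual size formula, floor((input + 2 * padding - kernel) / stride) + 1, applied per dimension. The function name and the sample sizes below are chosen only for illustration.

```python
def conv_output_size(input_size, kernel_size, stride=1, padding=0):
    """Length of one side of the feature map for a square input and kernel."""
    return (input_size + 2 * padding - kernel_size) // stride + 1


print(conv_output_size(5, 3))              # no padding, stride 1
print(conv_output_size(5, 3, stride=2))    # a larger stride shrinks the output
print(conv_output_size(5, 3, padding=1))   # one frame of padding keeps the size
```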

To summarize, Convolution is an operation that we can apply to images to find specific patterns and features inside them. The most significant advantage of using Convolutions in this kind of task is that a feature can be recognized wherever it appears in the image, and a trained network can become fairly robust to variations in the object's appearance such as translations, small rotations, changes in light, and size. As you can see below, we have sets of kernels that each detect a different feature, which would be helpful when building a convolutional neural network that detects human faces.

In the left set, we can see that the filters seem too simple to find faces. Indeed, they can only detect straight lines, edges, and perhaps basic shapes. But, conversely, in the second and third sets of filters, we can observe a significant increase in their complexity, which now lets them detect eyes, noses, mouths, ears, hair, whole faces, and so on. That's because they can be seen as a combination of the initial set. So with a mixture of kernels that detect basic shapes, we can achieve better results at detecting more elaborate image characteristics.

After this explanation of the convolution process, you may now better understand how a machine can extract patterns and features from images to solve a classification or detection problem. If you want to know more about it, you can use the following resources to answer any remaining doubts:

Convolutional Neural Networks Architecture (CNN)

Now, let's look at and implement a Convolutional Neural Network that will solve our original problem of classifying chest x-ray images in Python, using the TensorFlow and Keras libraries.

As you can see in the above illustration of the Convolutional Network architecture, it splits into Feature Learning for pattern extraction and Classification for translating the activations of the patterns into the activations of the result classes. That's because the main objective is to reduce the images into a shape that is easier to process while maintaining all the necessary information about their pixel dependencies. But sometimes, when you try to build a vast network with a very complex feature learning part, it takes too much time to train and a lot of computing resources that might not be available. So additionally, we will be using a technique called Transfer Learning, in which we replace the convolutional part of our model with an already trained feature extraction model.

Feature Learning

In the feature learning section of the network, we are converting our original problem of classifying images into a more straightforward problem of classifying a set of activations, where each of them refers to a detected feature in the input image. To adequately reduce the dimensionality of the inputs, we will stack a set of Convolutional layers that apply the Convolution operation over the input image using multiple filters and run an activation function (normally ReLU) on the output feature map to introduce nonlinearity. After each convolution layer is added to the network, a Pooling layer is usually stacked on top of it. The Pooling layers are the key to downsampling the output of the Convolutional layers as much as possible without losing essential information about the detected patterns. Their basic operation is the same as Convolution, except that Pooling doesn't perform a weighted sum of the elements covered by the kernel in each step. Instead, it directly selects the maximum, the minimum, or the average of these elements. These variations of the Pooling operation are called Max, Min, and Average Pooling, respectively, but in Convolutional Neural Networks, Max Pooling is the most used due to its performance.

In the above image, you have a representation of a max-pooling operation performed on a 4×4 matrix with a 2×2 kernel using a horizontal and vertical stride of 2. As you can see, the operation downsamples the original image without breaking its pixel/feature dependency. In this case, you can follow it through the input and output colors.
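A pooling operation like the one described above can be sketched as follows. The 4×4 matrix is a hypothetical example, and the same function covers Max, Min, and Average Pooling by swapping the reducer.

```python
def average(window):
    """Mean of the values covered by the kernel."""
    return sum(window) / len(window)


def pool(matrix, size=2, stride=2, reducer=max):
    """Slide a size x size window over the matrix and reduce each window."""
    pooled = []
    for row in range(0, len(matrix) - size + 1, stride):
        pooled_row = []
        for col in range(0, len(matrix[0]) - size + 1, stride):
            window = [
                matrix[row + i][col + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(reducer(window))
        pooled.append(pooled_row)
    return pooled


matrix = [
    [1, 3, 2, 1],
    [4, 6, 5, 0],
    [7, 2, 9, 8],
    [1, 0, 3, 4],
]
print(pool(matrix))                   # max pooling
print(pool(matrix, reducer=min))      # min pooling
print(pool(matrix, reducer=average))  # average pooling
```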

Classification

Once the input image features have been detected, the network must transform the shape of the feature maps from the Convolution part and feed a Fully-Connected Network to perform Classification from the feature activations. To do that, the most common way to proceed is using a Flatten layer, which takes all the feature maps and returns a one-dimensional list with all their elements stacked, as you can see in the image below.

However, this is not the only way to chain the Convolution and Classification parts of the network, and it can sometimes cause overfitting problems or simply not be the appropriate way to proceed. Methods like GlobalAveragePooling or GlobalMaxPooling usually solve the overfitting issues caused by the Flatten layer. They work very much like the Pooling that we saw before: they perform a Pooling operation over all the feature maps, but this time using a kernel with the same dimensions as the feature map.
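The difference between the two approaches can be seen on a tiny example. The two 2×2 feature maps are hypothetical, and they are stored one after another for readability; Keras layers keep the channels in the last dimension, so the exact element order of its Flatten output differs from this sketch.

```python
# Two hypothetical 2x2 feature maps produced by a convolutional part.
feature_maps = [
    [[1, 2], [3, 4]],
    [[0, 5], [1, 1]],
]


def flatten(maps):
    """Stack every element of every feature map into one list."""
    return [value for fmap in maps for row in fmap for value in row]


def global_max_pooling(maps):
    """Keep only the maximum of each feature map."""
    return [max(value for row in fmap for value in row) for fmap in maps]


def global_average_pooling(maps):
    """Keep only the average of each feature map."""
    return [
        sum(value for row in fmap for value in row) / sum(len(row) for row in fmap)
        for fmap in maps
    ]


print(flatten(feature_maps))
print(global_max_pooling(feature_maps))
print(global_average_pooling(feature_maps))
```

Flatten passes eight values to the dense network here, while the global pooling variants pass only one value per feature map, which means fewer weights in the next layer and less room for overfitting.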

After the feature maps are reshaped, they are received by a classification network made of stacked Dense layers, in which all the nodes are connected with the nodes of the next layer. Using a Softmax activation function, it returns the set of activations for each class that we are expecting.
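A dense layer followed by Softmax can be sketched in plain Python. The feature values, weights, and biases below are hypothetical and untrained; they only show how the activations become probabilities that add up to 1.

```python
import math


def softmax(values):
    """Turn raw activations into probabilities that sum to 1."""
    highest = max(values)  # subtracted for numerical stability
    exps = [math.exp(v - highest) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dense(inputs, weights, biases):
    """One output per row of weights: weighted sum of all inputs plus a bias."""
    return [
        sum(x * w for x, w in zip(inputs, row)) + b
        for row, b in zip(weights, biases)
    ]


features = [4.0, 5.0]  # e.g. the output of a global pooling step
weights = [[0.2, 0.1], [0.0, 0.3], [-0.1, 0.0]]  # three output classes
biases = [0.0, 0.0, 0.0]

probabilities = softmax(dense(features, weights, biases))
print(probabilities)
```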

In the above image, you have a representation of the classification part of the network, but if you want to know in more depth how this entire process works, you can take a look at the following lectures:

Implementation Details

The first thing that we have to do to build the model is to load the pre-trained Convolution part from the Keras API as the base model, without its final layers, on top of which we will stack the rest of our layers. There are different pre-trained models that we can choose for this kind of feature extraction task, but the one that gives the best performance and takes the least space to work is MobileNetV3Small. You can see a list of available models at the following link. These are trained with imagenet, one of the world's largest datasets, composed of millions of images and focused on improving the performance of Convolutional Networks.

baseModel = tf.keras.applications.MobileNetV3Large(input_shape=(256, 256, 3), weights='imagenet', include_top=False, classes=numClasses)

After having a base convolutional model, we have to set up the Classification part, which is crucial to take advantage of the power of the base model architecture. In this case, we are using a GlobalMaxPooling2D layer to reduce the dimensionality of the feature maps and feed the result into the Dense network, made of a hidden layer of 256 neurons using the ReLU activation function and an output layer with the same number of neurons as classes in our problem (5). This part of the network also has a Batch Normalization layer that increases the resulting accuracy of the model, and several techniques to reduce overfitting, like Dropout and L2 regularizers.

last_output = baseModel.layers[-1].output
# Note: the next pooling line reads last_output directly, so this Dropout is not part of the final graph
x = tf.keras.layers.Dropout(0.5)(last_output)
x = tf.keras.layers.GlobalMaxPooling2D()(last_output)
x = tf.keras.layers.Dense(256, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.02), activity_regularizer=tf.keras.regularizers.l2(0.02), kernel_initializer='he_normal')(x)
x = tf.keras.layers.BatchNormalization()(x)
x = tf.keras.layers.Dropout(0.45)(x)
x = tf.keras.layers.Dense(numClasses, activation='softmax')(x)

model = tf.keras.Model(inputs=baseModel.input, outputs=x)

Once the model architecture is built, it needs to be trained and tested to fit the input dataset. So now, we will define the loss function used to evaluate the performance during training and the optimizer algorithm that will tune the model parameters to minimize the loss function. In this case, we use Stochastic Gradient Descent with an initial learning rate of 0.1 as the optimizer and Sparse Categorical Crossentropy, one of the most usual losses in multiclass classification tasks like this. In addition to the loss, we also add the 'accuracy' metric, which, as its name indicates, shows the model's performance in the form of a percentage. We could use the loss function as the metric itself, but the main reason for not doing that is the purpose each one serves. The main goal of the loss function is to be minimized during training to optimize the model, whereas the metric is an indicator of how well the model is performing, not just in training but in testing and inference as well.

model.compile(optimizer=tf.keras.optimizers.SGD(learning_rate=0.1), loss=tf.keras.losses.SparseCategoricalCrossentropy(), metrics=['accuracy'])
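The difference between the loss and the metric can be seen with a small hand-computed example. The functions below are a simplified pure-Python sketch of what Sparse Categorical Crossentropy and accuracy measure (the Keras versions also clip probabilities and work on batched tensors), and both sets of predictions are invented for illustration.

```python
import math


def sparse_categorical_crossentropy(true_labels, predicted_probs):
    """Mean of -log(probability assigned to the true class)."""
    losses = [-math.log(probs[label])
              for label, probs in zip(true_labels, predicted_probs)]
    return sum(losses) / len(losses)


def accuracy(true_labels, predicted_probs):
    """Fraction of samples whose most probable class is the true one."""
    hits = sum(1 for label, probs in zip(true_labels, predicted_probs)
               if probs.index(max(probs)) == label)
    return hits / len(true_labels)


# Integer class labels, as expected by the "sparse" variant of the loss
true_labels = [0, 2]

# Two hypothetical models: both are right, but one is far more confident
confident = [[0.90, 0.05, 0.05], [0.05, 0.05, 0.90]]
unsure = [[0.40, 0.30, 0.30], [0.30, 0.30, 0.40]]

loss_confident = sparse_categorical_crossentropy(true_labels, confident)
loss_unsure = sparse_categorical_crossentropy(true_labels, unsure)

print(accuracy(true_labels, confident), accuracy(true_labels, unsure))
print(round(loss_confident, 3), round(loss_unsure, 3))
```

Both sets of predictions reach the same accuracy, yet the loss is much lower for the confident one. That is why the loss is the quantity the optimizer minimizes, while accuracy is kept as a human-readable indicator.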

Finally, we define the number of epochs that the model will be training. In each epoch, the training set is divided into a series of batches with as many images as the batch size (commonly 32) and fed into the network to fit its parameters. Thus, the model will train during a single epoch in as many steps as there are batches in the training set. The objective of that data splitting is to control how stably the learning algorithm trains the network, as you can see in the following image:

And last but not least, we define a callback of the type 'LearningRateScheduler' to decrease the learning rate stepwise during training so that it will be divided by ten every six epochs.

import math
epochs = 40
stepDecay = tf.keras.callbacks.LearningRateScheduler(lambda epoch: 0.1 * 0.1**math.floor(epoch / 6))

history = model.fit(train_dataset, validation_data=test_dataset, epochs=epochs, callbacks=[stepDecay])
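To get a feeling for the numbers involved in this training setup, the sketch below reproduces the step-decay formula used by the scheduler and the steps-per-epoch arithmetic in plain Python. The function names and the training-set size of 1000 images are assumptions made for the example; the notebook's real dataset size is not used here.

```python
import math


def step_decay(epoch, initial_lr=0.1, factor=0.1, step=6):
    """Same formula as the scheduler lambda: divide the rate by 10 every 6 epochs."""
    return initial_lr * factor ** math.floor(epoch / step)


def steps_per_epoch(num_images, batch_size=32):
    """Number of batches needed to see every image once (last batch may be smaller)."""
    return math.ceil(num_images / batch_size)


# Learning rate at a few selected epochs of the 40-epoch run
for epoch in (0, 5, 6, 12, 39):
    print(epoch, step_decay(epoch))

# Hypothetical training set of 1000 images with the usual batch size of 32
print(steps_per_epoch(1000))  # 32 (31 full batches plus one batch of 8 images)
```

With this schedule the learning rate stays at 0.1 for epochs 0 to 5, drops to about 0.01 at epoch 6, and is around 1e-7 during the last epochs, so most of the learning happens early and the final epochs only make very small adjustments.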

Now we have a 'history' variable created with the fit() function of the model, which contains all the information generated during training, such as loss and accuracy over time, steps per epoch, etc. That data is beneficial when visualizing how the model evolved from a random initialization to a state where it minimizes the loss function and gives a decent accuracy for the task to solve.

Results

In the above image, you have a graph that shows the loss value of both training and test datasets on the y axis and the current epoch on the x axis. In the beginning, as the model parameters are randomly assigned, the overall loss takes a high value and quickly decreases during the 2–3 subsequent epochs. Then, after it keeps dropping and stabilizing, it finally ends with a fixed value of approximately 0.15. It's a good sign that both training and test losses end up stabilizing with very similar values. For example, if the training loss had been considerably smaller than the test loss at the end of the training, the model would have overfitted the training data. Thus its generalization capability and performance on unseen data would have been inferior. To understand in more depth how to interpret this kind of graph:

After completing the training process, the model ends with an overall accuracy of 95.3%, which is a good value for a classification neural network. Still, it's not enough to be fully implemented in the healthcare system as one more tool, because approximately 5 out of 100 people would be misdiagnosed. Then, to visualize what accuracy means, let's build a Confusion Matrix and test the final model by making some predictions:

In this confusion matrix, the objective is to visualize the relation between the true labels and the predicted labels of the tested data by plotting it in a matrix graph. So on the y axis, we represent the true labels as rows, and on the x axis, we place the same labels as before, but this time the columns represent the predicted labels that the model gave as output. The resulting graph for a model that perfectly fits the data should look like an identity matrix, with values only on the diagonal, where all the predicted labels of the x axis match the respective actual labels of the y axis. In this case, we can see a strong tendency to fit the identity matrix of a perfect model. However, we can see some mispredicted values, especially on the two types of pneumonia, as the accuracy doesn't reach 100%. That confusion is causing the most problems when fitting the data by reducing the overall accuracy.
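The way such a matrix is built can be sketched in plain Python. The labels below are invented and use only three classes to keep the output small, so they do not correspond to the model's real predictions; the function names are also chosen just for this example.

```python
def confusion_matrix(true_labels, predicted_labels, num_classes):
    """Rows are true classes, columns are predicted classes."""
    matrix = [[0] * num_classes for _ in range(num_classes)]
    for true, predicted in zip(true_labels, predicted_labels):
        matrix[true][predicted] += 1
    return matrix


def accuracy_from_matrix(matrix):
    """Correct predictions sit on the diagonal of the matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


# Hypothetical labels for 10 test images and 3 classes
true_labels      = [0, 0, 1, 1, 1, 2, 2, 2, 2, 2]
predicted_labels = [0, 0, 1, 2, 1, 2, 2, 1, 2, 2]

matrix = confusion_matrix(true_labels, predicted_labels, num_classes=3)
for row in matrix:
    print(row)
# [2, 0, 0]
# [0, 2, 1]
# [0, 1, 4]

print(accuracy_from_matrix(matrix))  # 0.8
```

In this toy example every error falls between classes 1 and 2, which shows up as the two off-diagonal cells. This is the same kind of pattern the article describes for the two types of pneumonia.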

In addition to classifying an image into a class, we can also visualize the activation map that the input image generates on the network neurons. This map represents a significant activation over the features that the network learned to extract with a more yellow color. In contrast, the dark blue color indicates a weak or null activation. In simple terms, the activation map contains the zones of the image where the model thinks the features will be located.

Conclusion

Having reached 95% accuracy, we can conclude that the model makes a high number of correct guesses, but there is also much work to be done in order to improve the results of this kind of Machine Learning algorithm on such complex tasks. However, articles like this, which attempt to explain the power of modern technology applied to the resolution of large-scale problems that couldn't have been solved before, are useful to provide a better general comprehension of artificial intelligence. Also, knowing more about how the technology we use works improves our ability to use it correctly and contributes to the development of critical and creative thinking.

Resources

Link to the Colab Notebook with the full implementation: https://github.com/cardstdani/covid-classification-ml/blob/main/ModelNotebook.ipynb

With the collaboration of: Alejandro Pascual, Javier Nieto, Quetzal Gómez, Alberto Ruiz.