Predicting The Onset Of Diabetes Mellitus Pt. 1 (Exploring The Data)

9 min read · Mar 20, 2021

Selection of Data-Set & Identification of Problem

The dataset used here covers cases of diabetes mellitus amongst Pima Indian women. It originates from a database maintained by the National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK). NIDDK has continually observed the Pima Indian population since 1965, with participants undergoing standardised testing for diabetes every two years. This testing, amongst many other things, involved an oral glucose tolerance test. This method of testing is in line with the World Health Organisation’s (WHO) definition of diabetes mellitus: WHO states that a subject has diabetes mellitus if the 2-hour post-load plasma glucose is at least 200 mg/dl. The database was donated to the UCI Machine Learning Repository on 9 May 1990.

The dataset in its current form was assembled by Smith, Everhart & Dickson (1988), who used the database to test whether the ADAP learning algorithm could forecast the onset of diabetes mellitus within this population. For their analysis, Smith, Everhart & Dickson (1988) selected eight indicators from this database. These eight indicators are:

· Number of times Pregnant

· Plasma Glucose concentration at 2 hours in a fasting oral glucose tolerance test (GTT)

· Diastolic Blood Pressure

· Triceps Skin Fold Thickness

· 2-hour serum Insulin

· Body Mass Index (BMI)

· Diabetes Pedigree Function

· Age (in years)

To relate these indicators to a diagnosis, Smith, Everhart & Dickson (1988) also created a class variable, labelled ‘Outcome’ in this dataset, which takes the value 1 or 0, with 1 being interpreted as “tested positive for diabetes” and 0 as “tested negative for diabetes”.

The subjects chosen for their analysis were female, of Pima Indian descent and at least 21 years old at the time of examination. Using these criteria, 768 examinations were selected from the original database to form this dataset.

The question that this project will explore is: using this dataset, can we extract any meaningful insight that can be used to adequately predict the onset of diabetes mellitus?

Brief Evaluation Of The Data Structure

Before we dive deep into the data, it’s important to first take a quick look at its overall structure. Looking at the top 5 rows (see Figure 1), we can see that each row represents one patient that was tested for diabetes mellitus.

Now, the very first thing that stands out in the first 5 rows is that 3 patients have a 2-hour serum Insulin level of 0. This is odd, as a 2-hour serum Insulin level of 0 is not physically plausible. Upon further investigation of the source of the dataset, it looks as though the data has most likely gone through a crude form of imputation, whereby missing values have been replaced with 0. This is not unusual in medical research data; however, it does mean that we must be mindful that the data has already been subjected to changes.
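As a minimal sketch of this first look, the snippet below loads the first five rows and counts the zero Insulin readings. The column names follow the commonly distributed `diabetes.csv` version of the dataset, which is an assumption here, and the rows are embedded inline only so that the example runs on its own.

```python
import io

import pandas as pd

# First five rows of the dataset, embedded so the sketch is self-contained.
# Column names are assumed to match the commonly distributed "diabetes.csv".
SAMPLE_CSV = """Pregnancies,Glucose,BloodPressure,SkinThickness,Insulin,BMI,DiabetesPedigreeFunction,Age,Outcome
6,148,72,35,0,33.6,0.627,50,1
1,85,66,29,0,26.6,0.351,31,0
8,183,64,0,0,23.3,0.672,32,1
1,89,66,23,94,28.1,0.167,21,0
0,137,40,35,168,43.1,2.288,33,1
"""

# With the full file on disk, this would be pd.read_csv("diabetes.csv") instead.
df = pd.read_csv(io.StringIO(SAMPLE_CSV))

# Each row is one examined patient.
print(df.head())

# Count how many of these rows report an Insulin level of exactly 0.
zero_insulin = int((df["Insulin"] == 0).sum())
print(f"Rows with Insulin == 0: {zero_insulin}")
```

Running the same check on the full file gives a sense of how widespread the zero readings are.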

To check whether this imputation technique has indeed been applied, we can get a summary of the number of non-null values (see Figure 2).

Figure 2: Brief Overview Of Datatypes & Non-Null Values

As we can see in Figure 2, there don’t seem to be any null values, which is consistent with imputation having already been applied to the dataset. Furthermore, we can see that all the attributes, as well as the target, are numerical datatypes. This means that, later on, we don’t need to consider the encoding techniques that are normally necessary in the presence of categorical datatypes.
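One way to make the hidden missing values visible is to turn the implausible zeros back into proper nulls. The sketch below does this on a few columns from the first five rows; the list of columns where 0 is treated as "missing" is an assumption made for illustration, not something the dataset documents.

```python
import numpy as np
import pandas as pd

# A few columns from the first five rows of the dataset.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "SkinThickness": [35, 29, 0, 23, 35],
    "Insulin": [0, 0, 0, 94, 168],
    "Outcome": [1, 0, 1, 0, 1],
})

# No nulls are reported, because missing readings were stored as 0.
print(df.isnull().sum())

# Assumption: a value of 0 in these columns means "not measured".
ZERO_MEANS_MISSING = ["Glucose", "SkinThickness", "Insulin"]

# Work on a copy so the original frame is left untouched.
cleaned = df.copy()
cleaned[ZERO_MEANS_MISSING] = cleaned[ZERO_MEANS_MISSING].replace(0, np.nan)

# Now the missing readings show up in the null counts.
hidden_missing = cleaned.isnull().sum()
print(hidden_missing)
```

The target column is deliberately left out of the list, since 0 is a legitimate value for it.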

To summarise this brief overview of the dataset, we now know that:

There are 8 attributes, and a clearly defined target column (“Outcome”).
All the datatypes are numerical.
There are only 768 instances in the dataset, meaning it’s very small relative to the norm in Machine Learning.
While there are no null values, this is not because the dataset has no missing values; rather, imputation has already been applied to the dataset beforehand. Specifically, missing values have been replaced with 0.

Initial Exploratory Data Analysis Pt.1

Now that we have an understanding of the general structure of the dataset, it’s time to explore a little bit deeper into each attribute. More specifically, since all the attributes are numerical, it’d be very useful to know more about the mean, standard deviation, minimum, maximum, and quartiles of each attribute (see Figure 3).

Figure 3: Descriptive Statistics of Dataset

At an initial glance, it looks like there is a lot going on here. But as you take a closer look, you’ll notice that there are some incredibly useful nuggets of information here. Let’s first take a look at the average of each attribute.

Average number of pregnancies is 3.85 (roughly 4)
Average level of GTT is 120.89 (mg/dl)
Average Blood Pressure is 69.11 (mm Hg)
Average Triceps Skin Fold Thickness is 20.54 (mm)
Average 2-Hour Serum Insulin level is 79.80 (μU/ml)
Average BMI is 31.99 (weight in kg / (height in m)²)
Average Diabetes Pedigree Function is 0.47
Average Age is 33 (years)
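Figures like these come straight out of a descriptive-statistics table. As a rough sketch, assuming the data sits in a pandas DataFrame, `describe()` returns the count, mean, standard deviation, minimum, quartiles and maximum in one go. Only the first five rows are used below, so the printed numbers will not match the full-dataset averages listed above.

```python
import pandas as pd

# A few columns from the first five rows of the dataset (not the full 768 rows).
df = pd.DataFrame({
    "Pregnancies": [6, 1, 8, 1, 0],
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
})

# One row per statistic, one column per attribute.
summary = df.describe()
print(summary.round(2))

# Pull out a single statistic, for example the mean of every attribute.
means = summary.loc["mean"]
print(means.round(2))
```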

Relative to other populations such as the rest of the US (as Pima Indians reside in Arizona), there are several key points regarding the descriptive statistics amongst this sample.

First, the average number of pregnancies (roughly 4) is almost twice the average US fertility rate of 2.1 around 1990. The comparison is only indicative, since a fertility rate counts births rather than pregnancies. It is still important because, according to the National Health Service (NHS), gestational diabetes can develop during pregnancy when

“your body cannot produce enough insulin… to meet the extra needs in pregnancy.” [1]

And so, with a relatively higher number of pregnancies, the risk of diabetes mellitus rises. We see further evidence of abnormally high pregnancy counts within this dataset: looking at the maximum value, we see a case where a patient has been pregnant 17 times! And 75% of cases have been pregnant 6 times or fewer, a threshold that is almost triple the national average for that period.

Looking next to the GTT levels, we again see some abnormal results. First, let’s get a better understanding of GTT as a test for diabetes mellitus. According to the National Institute for Health and Care Excellence (NICE), GTT is a method that can help to diagnose instances of diabetes mellitus or insulin resistance. And as per the guidance information provided by NICE on this subject[2], the GTT results show that:

At 2hrs, people without diabetes will be under 140 mg/dL
At 2hrs, people at risk of diabetes (prediabetes) will be between 140 and 199 mg/dL
At 2hrs, people at diabetic levels will be 200 mg/dL or over
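The bands above can be written as a small helper. This is only a sketch: the function name is made up for illustration, and a reading of exactly 200 mg/dL is treated as diabetic, in line with the WHO criterion of "at least 200 mg/dl" quoted earlier.

```python
def classify_gtt(glucose_mg_dl: float) -> str:
    """Map a 2-hour GTT reading (mg/dL) to one of the three bands.

    Hypothetical helper for illustration; not a diagnostic tool.
    """
    if glucose_mg_dl < 140:
        return "non-diabetic"
    if glucose_mg_dl < 200:
        return "prediabetes"
    # 200 mg/dL or over.
    return "diabetic"


# A few example readings, one from each band plus the boundaries.
for reading in (121, 140, 199, 200):
    print(reading, classify_gtt(reading))
```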

And so, as we can see from both the average (121 mg/dl) and 75th percentile (140 mg/dl) GTT results from the dataset, these patients are largely exhibiting nondiabetic levels. Why is this?

Well, upon further investigation into the case selection methodology by Smith, Everhart & Dickson (1988), the reason becomes apparent. Cases were drawn from a pool of examinations whereby, amongst other criteria, the examination,

“revealed a nondiabetic GTT and met one of the following two criteria: a. Diabetes was diagnosed within five years of the examination OR b. A GTT performed five or more years later failed to reveal diabetes mellitus.”

In short, the low GTT levels stem from the case selection criteria, which were designed to mitigate the impact of GTT on the prediction analysis later on.

Furthermore, according to a paper published by the US National Heart, Lung and Blood Institute and the US National Institutes of Health (2000)[3], a BMI of 25 or over is considered “overweight or obese”. Relative to this, the subjects’ BMIs are high: the sample average is 31.99, with a maximum value as high as 67! The standard deviation of 7.9 shows that the BMI values are widely dispersed.

While we could dive further into the other attributes, it’s already clear that many of the cases in this dataset show signs of serious health issues and are much less healthy than the rest of the US population. This is an unfortunate indication of the socio-economic situation the Pima Indian tribe were in, as we know historically these tribes have faced many hardships.

But, in short, by observing some basic descriptive statistics we have learned a lot more about the data itself and each of the attributes. The story of the dataset is becoming much clearer, and in turn, will be incredibly useful in the later stages of this project.

Initial Exploratory Data Analysis Pt.2

Alongside descriptive statistics, another great way to better understand the story behind the data we’re investigating is to visualise it. Let’s first plot histograms (Figure 4) and boxplots (Figure 5) for each of the attributes.

From the histograms, there are several important points to note:

1. As mentioned in previous sections, a number of attributes have a high proportion of 0 values. From further investigation we know that this is because of imputation; however, the extent of missing data in this already small dataset will most likely present an issue for our machine learning models later on.

2. Many of the histograms are tail-heavy. This may make it harder for some machine learning models to detect patterns, so we will likely need to transform these attributes so that their distributions are more symmetric and bell-shaped.

3. These attributes all have different scales, and so feature scaling should be applied before training any machine learning models that are sensitive to scale.

4. Looking at the Age histogram we see that most of the subjects are under 35 years old, and yet the Blood Pressure and BMI histograms peak at surprisingly high values, further showcasing the poor health of these women.
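To make points 2 and 3 a little more concrete, here is a hedged sketch of one common approach: a `log1p` transform to pull in a long right tail, followed by standardisation to zero mean and unit variance. Both choices are assumptions for illustration rather than the transformations used later in this project, and the values are made up, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Made-up, right-skewed values for illustration (not taken from the dataset).
values = pd.Series([1, 2, 2, 3, 3, 4, 5, 8, 20, 60], dtype=float, name="feature")

# Skewness well above 0 indicates a long right tail.
skew_before = values.skew()

# log1p computes log(1 + x), so it is also safe for values of 0.
log_values = np.log1p(values)
skew_after = log_values.skew()
print(f"Skew before: {skew_before:.2f}, after log1p: {skew_after:.2f}")

# Standardise: subtract the mean and divide by the (population) standard deviation.
standardised = (log_values - log_values.mean()) / log_values.std(ddof=0)
print(standardised.round(2).tolist())
```

In a real pipeline the mean and standard deviation would be computed on the training set only and then reused for the test set.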

And as is apparent in Figure 5 below, several attributes exhibit many outliers. To give our Machine Learning models the best chance of predicting diabetes mellitus effectively, it’s likely that we’ll need to clip these features later on.
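As a sketch of what clipping could look like, the helper below caps values at the boxplot whiskers, that is 1.5 times the interquartile range beyond the quartiles. The function name, the 1.5 multiplier and the sample values are assumptions for illustration; the clipping rule actually used later may differ.

```python
import pandas as pd


def clip_outliers_iqr(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Cap values lying more than k * IQR outside the quartiles.

    Hypothetical helper; k = 1.5 matches the usual boxplot whiskers.
    """
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    iqr = q3 - q1
    return series.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Made-up values with one obvious outlier (not taken from the dataset).
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 90], dtype=float)
clipped = clip_outliers_iqr(values)

# The outlier is pulled down to the upper whisker; other values are unchanged.
print(clipped.tolist())
```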

The last part of this dataset that we’ll take a look at is the target variable and its relation to the features in this dataset. As is evident in Figure 6, we are dealing with an imbalanced class distribution problem.

With ~65% of cases in this dataset being non-diabetic, this poses a challenge for our Machine Learning algorithms, as classification models are typically designed assuming an equal number of instances for each class. Since the dataset is already quite small, our options for rebalancing the classes are limited: undersampling the majority class, for example, would discard data we cannot afford to lose. We will therefore keep the original class distribution for the training and testing sets.

And so, due to this imbalance, there is a risk that the models will favour the majority class and return poor predictive performance on the diabetic cases. Nonetheless, we should still try to get the best possible predictive results.
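Checking the class balance takes one line in pandas. The sketch below wraps it in a small, hypothetical helper and runs it on the Outcome values of the first five rows only, so the shares it prints are not the ~65% / ~35% split of the full dataset.

```python
import pandas as pd


def class_shares(target: pd.Series) -> pd.Series:
    """Return the share of each class, ordered by class label."""
    return target.value_counts(normalize=True).sort_index()


# Outcome values of the first five rows of the dataset.
outcome = pd.Series([1, 0, 1, 0, 1], name="Outcome")

shares = class_shares(outcome)
print(shares)
```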

Since the dataset is quite small, it’s easy to assess the correlation coefficient between every pair of attributes (see Figure 7). As we can see from this heatmap, GTT is the most correlated with our target variable, followed by BMI and Age. This is unsurprising since we already know that GTT is a direct testing method for diabetes mellitus, and BMI is a direct indicator of obesity, which is highly correlated with diabetes mellitus.
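The ranking behind a heatmap like this can also be read off numerically. The following sketch uses a hypothetical helper that takes the Pearson correlation of every attribute with the target and sorts by absolute strength. It runs on a few columns of the first five rows only, so the printed coefficients are purely illustrative and say nothing about the full dataset.

```python
import pandas as pd


def correlation_with_target(df: pd.DataFrame, target: str = "Outcome") -> pd.Series:
    """Pearson correlation of each attribute with the target, strongest first."""
    corr = df.corr()[target].drop(target)
    # Order by absolute value so strong negative correlations rank highly too.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# A few columns from the first five rows of the dataset.
df = pd.DataFrame({
    "Glucose": [148, 85, 183, 89, 137],
    "BMI": [33.6, 26.6, 23.3, 28.1, 43.1],
    "Age": [50, 31, 32, 21, 33],
    "Outcome": [1, 0, 1, 0, 1],
})

ranked = correlation_with_target(df)
print(ranked.round(2))
```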

End of Part 1.

That is the end of Part 1 of this blog. Stay tuned for Part 2, where I clean and prepare the data for Machine Learning, and then build, fine-tune and finally assess key Machine Learning models to best predict diabetes mellitus.

References

[1] https://www.nhs.uk/conditions/gestational-diabetes/

[2] https://www.nice.org.uk/guidance/ng17/ifp/chapter/testing-your-own-blood-glucose-and-target-levels

[3] U.S. National Heart, Lung and Blood Institute; U.S. National Institutes of Health (2000). The Practical Guide: Identification, Evaluation, and Treatment of Overweight and Obesity in Adults (NIH Publication No. 00-4084). Available online: http://www.nhlbi.nih.gov/guidelines/obesity/prctgd_c.pdf