Analyzing Behavior Analysis Sessions for Autistic Patients using Machine Learning Techniques

Souvik Mazumdar
6 min read · Jun 20, 2021

Introduction

Autism spectrum disorder (ASD) is a broad term used to describe a group of neurodevelopmental disorders.

These disorders are characterized by problems with communication and social interaction. People with ASD often demonstrate restricted, repetitive, and stereotyped interests or patterns of behavior.

ASD is found in individuals around the world, regardless of race, culture, or economic background. According to the Centers for Disease Control and Prevention (CDC), autism occurs more often in boys than in girls, with a 4-to-1 male-to-female ratio.

This blog analyzes anonymized therapy data from a repository of Behavior Analysis sessions and cases to draw insights that can be used to understand the effectiveness of the programs.

Dataset:

The data covers approximately 1,000 anonymized clients over approximately 550,000 targeted therapy sessions. The dataset includes three behavioral domains: Adaptive, Communication, and Language.

The original data had approximately 2.5 million observations of target skill trials within the unique therapy sessions. Each targeted trial observation was collected using a trial-by-trial methodology. The binary successes were summed, and a percent-correct ratio was calculated for each session as (total successes / total trials).

The outcome variable is the percent-correct ratio or a derivative of it. A goal of 80% is featured in the data to represent a positive outcome, as that threshold is commonly used in studies. There are also lags of the ratio that are sometimes used for goal thresholds. The independent variables featured in the dataset were targeted to help discover insights. Added features include aggregated Trial Group counts by other variables across the dataset. The data also includes time periods to explore insights related to the patient (Client), the therapist (Author), and care.
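As a minimal sketch of this derivation (the column names session_id and success are hypothetical, since the real schema is not shown), the per-session ratio and the goal label could be computed like this:

import pandas as pd

# Hypothetical trial-level data: one row per target skill trial,
# with a session identifier and a binary success flag (0/1).
trials = pd.DataFrame({
    "session_id": [1, 1, 1, 2, 2],
    "success":    [1, 0, 1, 1, 1],
})

# Sum binary successes and count trials per session, then compute
# percent correct = total successes / total trials.
sessions = trials.groupby("session_id")["success"].agg(
    total_successes="sum", total_trials="count"
)
sessions["pct_correct"] = sessions["total_successes"] / sessions["total_trials"]

# An 80% threshold marks a positive outcome ("goal met").
sessions["goal_met"] = (sessions["pct_correct"] >= 0.80).astype(int)
print(sessions)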

Data Visualization:

Let's start by visualizing the dataset and finding the correlation of every feature with the outcome.

Below are some interesting comparisons from our dataset.

Gender Goal Met Comparison

We cannot see much difference between the genders in achieving the goal ratio.

Age-based comparison

Most of the clients in this dataset are children, largely in the 6–15 year age group.

We can also see a large number of outliers in the dataset.

Box Plot for Sessions Count vs Goal Met

We can see that the success rate is higher when more sessions are held, so more sessions should be organized per day/month/year. The highest concentration of successes is in the initial sessions, so if regular sessions are held in the initial days, the success rate should be higher.
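A minimal sketch of such a box plot, assuming a hypothetical client-level frame df with goal_met and session_count columns:

import seaborn as sns
import matplotlib.pyplot as plt

# df: hypothetical frame with one row per client, a binary goal_met
# column, and a session_count column.
sns.boxplot(x="goal_met", y="session_count", data=df)
plt.xlabel("Goal Met")
plt.ylabel("Number of Sessions")
plt.show()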

Using multiple iterations of correlation calculations, we arrived at the features to be used for the final predictions. We removed all features that had a correlation coefficient above 0.9 with another feature; a sketch of this filtering is shown after the plot below.

Correlation Coefficient Plot
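One common way to implement this filtering (a sketch, not necessarily the exact script used here; input_ds stands in for the hypothetical full feature frame) is to scan the upper triangle of the correlation matrix and drop one feature from each highly correlated pair:

import numpy as np

# input_ds: hypothetical frame holding all candidate numeric features.
corr = input_ds.corr().abs()

# Keep only the upper triangle so each feature pair is checked once.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# Drop every feature whose correlation with an earlier feature exceeds 0.9.
to_drop = [col for col in upper.columns if (upper[col] > 0.9).any()]
input_ds_corr = input_ds.drop(columns=to_drop)

The resulting frame, input_ds_corr, is the one used in the pre-processing steps below.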

Data Pre-Processing:

Before normalization, we first identify and remove any noisy or incomplete data points.

We have a lot of missing values. We impute them using the mean or median for the final set of columns that we got from the previous step.
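A minimal sketch of mean imputation with scikit-learn (median imputation works the same way with strategy='median'):

from sklearn.impute import SimpleImputer
import pandas as pd

# Replace missing values in each numeric column with that column's mean.
imputer = SimpleImputer(strategy="mean")
input_ds_corr = pd.DataFrame(
    imputer.fit_transform(input_ds_corr),
    columns=input_ds_corr.columns,
)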

We now need to encode the columns that are in string format. There are multiple ways to do this; in this case we used label encoding to encode our string columns.
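A sketch of label encoding every string column; detecting the string columns by dtype is an assumption here:

from sklearn.preprocessing import LabelEncoder

# Map each distinct string value in a column to an integer code.
for col in input_ds_corr.select_dtypes(include="object").columns:
    input_ds_corr[col] = LabelEncoder().fit_transform(input_ds_corr[col])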

Once all columns are in a similar numeric format, we need to standardize them.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Scale every feature to zero mean and unit variance
scaler = StandardScaler()
X = scaler.fit_transform(input_ds_corr)
data_x = pd.DataFrame(X, columns=input_ds_corr.columns)

Now that we have cleaned our data, let's divide it into training and testing sets (here, two-thirds training and one-third testing) using scikit-learn's train_test_split.

from sklearn.model_selection import train_test_split

# Y holds the outcome labels; hold out one-third of the data for testing
X_train, X_test, y_train, y_test = train_test_split(data_x, Y, test_size=1/3, random_state=45)

Classification Models:

The data are processed in a standardized way using a Python script that prepares the data for the machine learning classifiers.

Models to be analysed for best prediction

We can now train our models. We will be using five different classification algorithms. Since these models are readily available in sklearn, the training process is quite easy and takes only a few lines of code.

K-Nearest Neighbor

from sklearn.neighbors import KNeighborsClassifier

test_scores = []
train_scores = []

# Try k from 1 to 14 and record train/test accuracy for each value
for i in range(1, 15):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    train_scores.append(knn.score(X_train, y_train))
    test_scores.append(knn.score(X_test, y_test))

Max test score 97.46% at k = 1
Test Score Vs Train Score

Logistic Regression

from sklearn.linear_model import LogisticRegression

# Fit a logistic regression baseline and evaluate on the test set
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred = logreg.predict(X_test)
print('Accuracy of logistic regression classifier on test set: {:.2f}'.format(logreg.score(X_test, y_test)))

Accuracy of logistic regression classifier on test set: 0.73

Random Forest Classifier

from sklearn.ensemble import RandomForestClassifier

# Ensemble of 300 bootstrapped trees, each split considering sqrt(n_features)
model = RandomForestClassifier(n_estimators=300, bootstrap=True, max_features='sqrt')
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print('Accuracy of Random Forest on test set: {:.2f}'.format(model.score(X_test, y_test)))

Out of all the algorithms, Random Forest performed the best, as we can see in the table below.

Validation:

We use a confusion matrix to analyze the performance of our model.

What is a confusion matrix and why do you need it?

Well, it is a performance measurement for machine learning classification problems where the output can be two or more classes. For a binary problem it is a table with four different combinations of predicted and actual values (true positives, false positives, true negatives, and false negatives), as shown in the diagram below.

From the confusion matrix we can see that our model achieves high accuracy.
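For reference, a sketch of computing the matrix for the random forest predictions above with scikit-learn:

from sklearn.metrics import confusion_matrix

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm)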

An ROC curve (receiver operating characteristic curve) is a graph showing the performance of a classification model at all classification thresholds.

ROC curve of our model
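A sketch of how such a curve and its AUC can be produced from the random forest's predicted probabilities (assuming the positive class sits in column 1 of predict_proba):

from sklearn.metrics import roc_curve, roc_auc_score
import matplotlib.pyplot as plt

# Use the predicted probability of the positive class, not hard labels.
y_proba = model.predict_proba(X_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_proba)
auc = roc_auc_score(y_test, y_proba)

plt.plot(fpr, tpr, label=f"ROC curve (AUC = {auc:.3f})")
plt.plot([0, 1], [0, 1], linestyle="--", label="Chance")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.legend()
plt.show()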

The ROC curve gave an AUC of 97.6% for our model, as shown above. This indicates that the model separates the two classes well, so our predictions should be reliable.

Conclusion:

We get the highest accuracy for Random Forest, with the score reaching 99%. This implies our model classified observations correctly 99% of the time. The precision stood at 0.99, implying that 99% of the sessions predicted as goal-met actually met the goal. The recall stood at 0.99, meaning the model found 99% of the sessions that truly met the goal. We also have an F1 score of 0.99. The F1 score is the harmonic mean of precision and recall and assigns equal weight to both metrics. However, for our analysis it is relatively more important for the model to have few false negatives.
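All of these metrics can be read off in a single call; a minimal sketch:

from sklearn.metrics import classification_report

# Per-class precision, recall, F1 score, and support in one table.
print(classification_report(y_test, y_pred))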

Now, to understand which factors are important for the treatment of autistic children, we need a feature importance plot.
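A sketch of such a plot using the trained random forest's impurity-based importances:

import pandas as pd
import matplotlib.pyplot as plt

# Impurity-based importances from the trained random forest,
# indexed by the feature names of the training frame.
importances = pd.Series(model.feature_importances_, index=data_x.columns)
importances.sort_values().plot(kind="barh")
plt.xlabel("Feature Importance")
plt.tight_layout()
plt.show()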

All the analysis that we did led us to the conclusion that changing Authors (tutors) impacts the children's success rate. Our study suggests that children should have fewer tutors and more sessions for better success.

This dataset mostly covered sessions and tutors. Autism is a vast field that varies with every individual, so we need more studies using machine learning tools for further analysis and treatment.
