Personality in Four Letters or Less

An analysis of MBTI data performed by UC Berkeley Students and Data Science Society members: Angeline Lee, Jessica Yu, Juno Lee, and Shani Lyubomirsky. Project mentored by Claudea Jennefer.

5 min read · Nov 24, 2020

Ross, McKenzie. “A Guide To MBTI: How To Use It As An Educator.” *HonorsGradU*, 10 Oct. 2019, www.honorsgradu.com/a-guide-to-mbti-how-to-use-it-as-an-educator/.

Personality is something each person holds very close, yet does not entirely understand. An innate drive to understand ourselves pushes us to dissect our personalities: we take quizzes, consult friends, and seek validation by identifying our personality traits. The MBTI (Myers-Briggs Type Indicator) personality test has exploded in popularity around the world and is often regarded as one of the most comprehensive and accurate personality assessments ever made. However, it hurts the ego to consider that our seemingly nuanced and carefully nursed personalities could be boiled down to one of sixteen types. In this data exploration, we analyzed a data set of individuals labeled by MBTI type, together with the last 50 things each of them posted online, to ask whether even the most well-regarded personality assessment holds up when put through a classifier.

Our dataset was sourced from Kaggle. As seen in the image below, the data frame has an ID column for each ‘participant’, a type column with one of the 16 MBTI types for each individual, and a posts column that holds a single string containing that individual's last 50 posts. To clean this dataset, we split each long string into individual words or symbols and converted all text to lowercase.
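A minimal sketch of this cleaning step, for illustration only. The `|||` separator between posts and the exact tokenizing rule are assumptions made for this example. The notebook may split the text differently.

```python
import re

# Assumption: posts inside one string are separated by "|||".
POST_SEPARATOR = "|||"

def tokenize(posts):
    """Lowercase the text and split it into words and standalone symbols."""
    text = posts.replace(POST_SEPARATOR, " ").lower()
    # First alternative: a word, optionally with a contraction ("i'm").
    # Second alternative: any single character that is neither a word
    # character nor whitespace, i.e. a symbol such as ":" or "?".
    return re.findall(r"[a-z0-9]+(?:'[a-z]+)?|[^\w\s]", text)

tokens = tokenize("I'm HAPPY :)|||Really?")
```

Keeping symbols as separate tokens at this stage makes it easy to filter them out later.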

We performed one-hot encoding to put our categorical data (the individual words) into a form that is easier to work with. We then used words-in-text and the bag-of-words method to create a vector of words for each participant. We cleaned our bag of words by keeping only English words and filtering out any symbols.
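The idea of a bag-of-words vector can be sketched in plain Python. The tiny `ENGLISH_WORDS` set and the two token lists are hypothetical stand-ins for a real dictionary and the real posts. The `binary` flag is an assumption that shows how a one-hot style (present or absent) vector differs from raw counts.

```python
from collections import Counter

def build_vocabulary(token_lists, english_words):
    """Sorted list of distinct tokens that are in the English word set."""
    vocab = set()
    for tokens in token_lists:
        vocab.update(t for t in tokens if t in english_words)
    return sorted(vocab)

def bag_of_words(tokens, vocabulary, binary=False):
    """One number per vocabulary word: its count, or 1/0 if binary."""
    counts = Counter(tokens)
    if binary:
        return [1 if counts[word] > 0 else 0 for word in vocabulary]
    return [counts[word] for word in vocabulary]

# Hypothetical stand-in for a real English dictionary.
ENGLISH_WORDS = {"i", "love", "plans", "people", "and"}

# Hypothetical tokenized posts for two participants.
participants = [
    ["i", "love", "people", "!", "love", "xd"],
    ["i", "love", "plans", "and", "plans", ":"],
]

vocabulary = build_vocabulary(participants, ENGLISH_WORDS)
vectors = [bag_of_words(tokens, vocabulary) for tokens in participants]
```

Symbols such as `!` and non-dictionary tokens such as `xd` never enter the vocabulary, so they are dropped from every vector.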

With our data cleaned and organized into vectors, we built a classifier to predict a person’s MBTI type from the one-hot encoded words in their posts. Specifically, we used a random forest classifier from the sklearn library. Our raw code is available at https://github.com/angelinelykk/dss2020/blob/main/mbti_analytics.ipynb.
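For readers who have not used sklearn, a rough sketch of this kind of pipeline follows. It is not the notebook's code: the eight toy posts, the train/test split, and all parameter values are assumptions chosen only to make the example run.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for the Kaggle posts and labels.
posts = [
    "i love meeting new people and sharing ideas",
    "so excited about this party with friends",
    "people and ideas make me happy",
    "friends and fun ideas all day",
    "i follow the plan and check the schedule",
    "the schedule and the rules keep order",
    "plan the work and follow the rules",
    "check the facts and keep the plan",
]
types = ["ENFP"] * 4 + ["ISTJ"] * 4

# Turn each post string into a bag-of-words count vector.
vectorizer = CountVectorizer(lowercase=True)
X = vectorizer.fit_transform(posts)

# Hold out a quarter of the rows, keeping both types in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, types, test_size=0.25, random_state=0, stratify=types
)

# n_estimators is the number of trees in the forest.
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X_train, y_train)

# Fraction of held-out posts whose full four-letter type was predicted.
accuracy = clf.score(X_test, y_test)
```

With data this small the accuracy value means nothing. The point is only the shape of the workflow: vectorize, split, fit, score.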

Regardless of the number of estimators in our classifier, the accuracy stayed at roughly 45–50%, which is relatively low.

There are a number of factors that could have contributed to this low accuracy. The bag-of-words method discards word order and sentence structure, so the words on their own lose much of their contextual meaning. Additionally, the classifier has no ‘partial credit’ system. Each MBTI type is made of four letters, each indicating a different aspect of personality. If our classifier correctly predicted three of the four letters, it still received no credit. In such a system, it is very difficult to reach completely accurate predictions, which makes an accuracy of roughly 50% actually seem rather high. For comparison, guessing uniformly at random among sixteen types would be correct only about 6% of the time.
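To make the ‘partial credit’ point concrete, here is an illustrative comparison of all-or-nothing scoring with letter-by-letter scoring. The function names and the four example predictions are hypothetical and are not taken from the project's results.

```python
def exact_match_accuracy(true_types, predicted_types):
    """Fraction of predictions where all four letters are correct."""
    matches = sum(t == p for t, p in zip(true_types, predicted_types))
    return matches / len(true_types)

def letterwise_accuracy(true_types, predicted_types):
    """Average fraction of the four letters predicted correctly."""
    total = 0.0
    for t, p in zip(true_types, predicted_types):
        total += sum(a == b for a, b in zip(t, p)) / 4
    return total / len(true_types)

# Hypothetical labels and predictions, for illustration only.
true_types = ["INFP", "ENTJ", "ISTJ", "ENFP"]
predicted = ["INFP", "ENTP", "ESTJ", "ISTJ"]

exact = exact_match_accuracy(true_types, predicted)
letterwise = letterwise_accuracy(true_types, predicted)
```

In this toy case two predictions miss by a single letter, so exact-match scoring reports 0.25 while letter-by-letter scoring reports 0.625.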

We therefore implemented a partial credit system: instead of predicting the full type at once, a classifier predicted one of the two possible letters in each of the four categories. This produced much higher accuracies, with the classifier for ‘N’ (intuition) versus ‘S’ (sensing) yielding an accuracy over 90%! While this could be due to unbalanced data, it might suggest that certain personality traits are more easily predictable from what a person says.
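One way to probe the unbalanced-data caveat is to compare each per-letter accuracy with a majority-class baseline, that is, the accuracy of always guessing the more common letter. The sketch below uses a hypothetical list of eight types, not the real label distribution.

```python
from collections import Counter

# Names for the four letter positions of an MBTI type.
AXIS_NAMES = ["I/E", "N/S", "T/F", "J/P"]

def majority_baseline(types, position):
    """Accuracy of always guessing the most common letter at a position."""
    counts = Counter(t[position] for t in types)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(types)

# Hypothetical labels, for illustration only.
types = ["INFP", "INTJ", "ENFP", "INTP", "ISTJ", "INFJ", "ENTP", "INFP"]

baselines = {
    name: majority_baseline(types, position)
    for position, name in enumerate(AXIS_NAMES)
}
```

If a letter's classifier scores only slightly above its baseline, most of that accuracy comes from class imbalance rather than from the words themselves.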

To visualize how the most common words compare across MBTI types, we created word clouds for each type. Below are the word clouds for ENFP and ISTJ, two types that differ in all four letters.
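A word cloud is driven by word frequencies, which can be sketched as follows. The `FILLER_WORDS` set is a hypothetical example of very common words one might choose to exclude, and the token list is made up.

```python
from collections import Counter

# Hypothetical list of very common words to leave out of the cloud.
FILLER_WORDS = {"i", "i'm", "think", "really", "the", "a", "and", "to"}

def top_words(tokens, n=3, stop_words=FILLER_WORDS):
    """The n most frequent tokens that are not in stop_words."""
    counts = Counter(t for t in tokens if t not in stop_words)
    return counts.most_common(n)

# Hypothetical tokens pooled from one type's posts.
tokens = ["i", "really", "love", "people", "love", "ideas", "people", "love"]
most_frequent = top_words(tokens, n=2)
```

Frequency pairs like these could then be handed to whichever word-cloud renderer is in use, with larger counts drawn as larger words.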

We were also curious whether MBTI types are correlated with other tangible behavioral tendencies, such as aggression, which has clear societal implications as well as ramifications for the workforce. To build a metric for aggression, we first found another data set of tweets labeled by tone (‘1’: aggressive, ‘0’: non-aggressive).

After cleaning our data, we removed words that were not common to both the aggression data set and the MBTI data set. Then, as with our Myers-Briggs type predictor, we paired bag-of-words with random forest classification to create a classifier that predicts the aggressiveness of a given tweet (about 83% accurate when tested on the aggression data set). After partitioning the MBTI data set to match the format expected by our new aggression classifier, we ran the classifier on the MBTI-labeled posts and obtained the following results:

Notably, the aggression percentages of the different personality types are fairly close, which is expected — an average person, regardless of MBTI type, is unlikely to be aggressive in every single tweet. However, we did find some slight variance in aggression between the types, which may indicate that certain MBTI personalities are more prone to verbal aggression on the Internet.
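The two bookkeeping steps in this analysis, restricting both data sets to a shared vocabulary and turning per-post predictions into a percentage per type, can be sketched as below. The word lists, types, and 0/1 predictions are hypothetical, and the classifier that would produce the predictions is left out.

```python
from collections import defaultdict

def shared_vocabulary(mbti_words, aggression_words):
    """Words that appear in both data sets, in sorted order."""
    return sorted(set(mbti_words) & set(aggression_words))

def aggression_rate_by_type(types, predictions):
    """Percentage of posts predicted aggressive (label 1) per MBTI type."""
    totals = defaultdict(int)
    aggressive = defaultdict(int)
    for mbti_type, label in zip(types, predictions):
        totals[mbti_type] += 1
        aggressive[mbti_type] += label
    return {t: 100 * aggressive[t] / totals[t] for t in totals}

# Hypothetical vocabularies from the two data sets.
vocab = shared_vocabulary(["love", "hate", "plans"], ["hate", "stupid", "love"])

# Hypothetical per-post types and classifier outputs (1 = aggressive).
types = ["ENFP", "ENFP", "ISTJ", "ENFP", "ISTJ", "ENFP"]
predictions = [0, 1, 0, 0, 1, 0]
rates = aggression_rate_by_type(types, predictions)
```

Because each type contributes a different number of posts, comparing percentages rather than raw counts keeps the types on the same scale.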

As technology expands, people are able to gain more insight into personality, developing a deeper understanding of each other and of themselves. However, it still holds true that personality is something unique that belongs to each person alone. Through data, we are able to gain valuable insights and stoke the flames of curiosity, but the truth of personality will remain an enigma to humankind.

Written by Shani Lyubomirsky