Abstract:
In qualitative studies, Natural Language Processing (NLP) can be used to enhance and streamline several steps of the investigation process. The main goal of this study is to determine whether employing NLP techniques in qualitative research can improve the efficiency and rigor of the research process by enabling researchers to process, analyze, and extract meaningful information from qualitative data. It also aims to determine whether NLP can support researchers in identifying patterns and key themes and in deriving important insights from textual data, facilitating a deeper understanding of the phenomena under investigation.
To achieve these goals, this study compares the results of manual coding with those of NLP methods. NLP was used in qualitative data analysis activities such as creating a codebook and finding relevant responses. These activities relied on NLP topic modelling techniques, which involve measuring how similar responses are and grouping similar responses together. We used transcripts obtained from the AHDP project as the case study. The manual analysis of those transcripts (the creation of the codebook, quotes, and themes) was done by four people from the Center for Impact, Innovation, and Capacity Building for Health Information Systems and Nutrition (CIICHIN). We turned all of the transcripts into a Pandas dataframe, which is a “two dimensional labelled data structure with columns of potentially different types” and is made up of a column of questions and a column of answers. We extracted the collection of responses using the programming language Python and its regular expression library, then summarized them using spaCy, which is an “open-source Python library for advanced natural language processing.” After grouping the responses to each question and generating a dataset of questions and answers, we built topics from each group of responses and compared them with the codes created in manual coding.
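The extraction step described above can be sketched as follows. This is a minimal illustration and not the study's actual code: the `Q:`/`A:` line prefixes, the sample text, and the names `QA_PATTERN`, `transcript_to_rows`, and `group_by_question` are all assumptions made for the example, since the layout of the AHDP transcripts is not given here.

```python
import re

# Assumed transcript layout (hypothetical): question lines start with "Q:"
# and answer lines start with "A:". Real transcripts need their own pattern.
QA_PATTERN = re.compile(r"^(Q|A):\s*(.+)$")


def transcript_to_rows(text):
    """Return one {"question", "answer"} dict per answer line in the text."""
    rows = []
    current_question = None
    for line in text.splitlines():
        match = QA_PATTERN.match(line.strip())
        if match is None:
            continue  # skip blank lines and anything that is not Q/A
        tag, content = match.groups()
        if tag == "Q":
            current_question = content
        elif current_question is not None:
            rows.append({"question": current_question, "answer": content})
    return rows


def group_by_question(rows):
    """Collect all answers given to the same question."""
    grouped = {}
    for row in rows:
        grouped.setdefault(row["question"], []).append(row["answer"])
    return grouped


# Made-up sample text, used only to exercise the functions.
SAMPLE = """\
Q: What services does the clinic offer?
A: Mostly counselling and check-ups.
A: They give advice on nutrition.
Q: What barriers do you face?
A: The clinic is far from my home.
"""

rows = transcript_to_rows(SAMPLE)
grouped = group_by_question(rows)
```

Because `rows` is a list of dictionaries with the same two keys, it can be passed directly to `pandas.DataFrame(rows)` to obtain a frame with a question column and an answer column.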
Using concepts of paraphrase mining, we created clusters of responses based on conceptual similarity, which helped measure the relevance of responses from the number of responses in each cluster. Agglomerative clustering was used because it does not require the number of clusters as a hyperparameter. This is an important feature in qualitative data analysis, since it is difficult to know in advance how many clusters a transcript will produce, and fixing that number beforehand can bias the results.
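To illustrate why no cluster count is needed, the sketch below merges the two closest clusters repeatedly and stops once the closest pair is farther apart than a distance threshold. It is a toy under stated assumptions, not the study's pipeline: bag-of-words cosine distance stands in for the sentence embeddings normally used in paraphrase mining, and average linkage, the threshold of 0.5, and the sample responses are choices made only for this example.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Toy stand-in for a sentence embedding: lower-cased word counts."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def cosine_distance(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return 1.0 if norm == 0 else 1.0 - dot / norm


def agglomerative_clusters(texts, distance_threshold):
    """Average-linkage clustering; returns lists of indices, largest first."""
    vectors = [bag_of_words(t) for t in texts]
    clusters = [[i] for i in range(len(texts))]

    def linkage(c1, c2):
        # Average pairwise distance between the members of two clusters.
        total = sum(cosine_distance(vectors[i], vectors[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best = None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                d = linkage(clusters[x], clusters[y])
                if best is None or d < best[0]:
                    best = (d, x, y)
        if best[0] > distance_threshold:
            break  # nothing left that is similar enough to merge
        _, x, y = best
        clusters[x] = clusters[x] + clusters[y]
        del clusters[y]
    return sorted(clusters, key=len, reverse=True)


# Made-up responses, used only to exercise the function.
responses = [
    "The clinic is too far from my home",
    "The clinic is far from home",
    "My home is far from the clinic",
    "We need more nutrition advice",
    "More nutrition advice is needed",
    "I like football",
]

clusters = agglomerative_clusters(responses, distance_threshold=0.5)
for members in clusters:
    print(len(members), [responses[i] for i in members])
```

The number of clusters comes out of the data and the threshold instead of being set beforehand, and cluster size gives a rough relevance signal: a response left alone in its own cluster is a candidate for being off-topic. In practice a library implementation would usually replace the hand-written loop, for example scikit-learn's `AgglomerativeClustering` with `n_clusters=None` and a `distance_threshold`.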
The NLP text summarization result (Fig. 7) produces word occurrences that closely match those of the original transcripts. This is good news for social researchers, because it can help them gain insight into the whole set of transcripts before the actual analysis starts.
NLP topic modelling produces results conceptually similar to the codebook produced by manual coding. This implies that social researchers can use topic modelling before or after codebook creation to assess whether the codebook is relevant.
With the paraphrase mining concept, NLP was able to identify which responses are relevant to the question and which are not. This can save time during analysis, since analysts do not lose time analyzing irrelevant responses.