The AI Book
    AI Language processing (NLP)

    Research on gender equality with NLP and Elicit

    6 July 2023 · 11 Mins Read


    Introduction

    NLP (Natural Language Processing) can help us understand large amounts of textual data. Instead of skimming and reading documents manually, we can use this technique to speed up our understanding and get to the key messages quickly. In this blog post, we explore the possibility of using pandas data frames and NLP tools in Python to gain insight into what people were writing about gender equality research in Afghanistan, using Elicit. These insights can help us understand what has worked and what hasn’t to advance gender equality in recent decades in a country considered one of the most difficult places for women and girls (World Economic Forum, 2023).

    Learning objectives

    • Gain knowledge of text analysis on CSV files.
    • Learn how to do natural language processing in Python.
    • Develop effective data visualization skills for communication.
    • Learn how research on gender equality in Afghanistan has evolved over time.

    This article was published as part of the Data Science Blogathon.

    Using Elicit for literature review

    To generate the underlying data, I use Elicit, an AI-powered tool for literature reviews (Elicit). I ask the tool to generate a list of papers related to the question: Why has gender equality failed in Afghanistan? I then download the resulting list of papers (168 in total) in CSV format. What does this data look like? Let’s take a look!

    Parsing CSV data from Elicit in Python

    We will first read in the CSV file as a pandas dataframe:

    import pandas as pd
    
    #Identify path and csv file
    file_path="./elicit.csv"
    
    #Read in CSV file
    df = pd.read_csv(file_path) 
    
    #Shape of CSV
    df.shape
    
    #Output: (168, 15)
    
    #Show first rows of dataframe
    df.head()

    The df.head() command displays the first rows of the resulting pandas dataframe, and the df.shape command tells us that the dataframe consists of 168 rows and 15 columns. Let us first examine the years in which these studies were published. To explore this, we can use the column that shows the year each article was published. There are several tools for generating figures in Python, but let’s rely on the seaborn and matplotlib libraries. To analyze in which years the papers were mostly published, we can use seaborn’s so-called countplot, and also make the axis labels and axis ticks look nice:

    Timely distribution analysis of published papers

    import matplotlib.pyplot as plt
    import seaborn as sns
    
    %matplotlib inline
    
    #Set figure size
    plt.figure(figsize=(10,5))
    
    #Produce a countplot
    chart = sns.countplot(x=df["Year"], color="blue")
    
    #Set labels
    chart.set_xlabel('Year')
    chart.set_ylabel('Number of published papers')
    
    #Change size of xticks
    # get label text
    _, xlabels = plt.xticks()
    
    # set the x-labels with
    chart.set_xticklabels(xlabels, size=5)
    
    plt.show()

    The data show that the number of papers has increased over time, probably also due to the availability of more data and better opportunities to conduct research in Afghanistan since the Taliban fell from power in 2001.
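The visual trend can also be checked numerically. A minimal sketch, using a hypothetical mini-sample in place of the real Elicit export, counts papers per year with pandas value_counts:

```python
import pandas as pd

# Hypothetical sample standing in for the real "Year" column
years_df = pd.DataFrame({"Year": [2001, 2005, 2005, 2012, 2012, 2012, 2020]})

# Number of papers per year, sorted chronologically
papers_per_year = years_df["Year"].value_counts().sort_index()
print(papers_per_year)  # 2001: 1, 2005: 2, 2012: 3, 2020: 1
```

On the real dataframe the same two calls would reproduce the counts behind the countplot above.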

    [Figure: bar chart of the number of papers published per year]

    Content analysis of papers

    Number of words written

    While this gives us a first glimpse of research on gender equality in Afghanistan, we are mostly interested in what the researchers actually wrote about. To get an idea of the content of these papers, we can use the abstracts that Elicit kindly included in the CSV file it created. To do this, we can follow standard text analysis procedures, such as those outlined by Jan Kirenz in one of his blog posts. We start by simply counting the number of words in each abstract using a lambda function:

    #Split text of abstracts into a list of words and calculate the length of the list
    df["Number of Words"] = df["Abstract"].apply(lambda n: len(n.split()))
    
    #Print first rows
    print(df[["Abstract", "Number of Words"]].head())
    
    #Output: 
    
                                                Abstract  Number of Words
    0  As a traditional society, Afghanistan has alwa...              122
    1  The Afghanistan gender inequality index shows ...              203
    2  Cultural and religious practices are critical ...              142
    3  ABSTRACT Gender equity can be a neglected issu...              193
    4  The collapse of the Taliban regime in the latt...              357
    
    #Describe the column with the number of words
    df["Number of Words"].describe()
    
    count     168.000000
    mean      213.654762
    std       178.254746
    min        15.000000
    25%       126.000000
    50%       168.000000
    75%       230.000000
    max      1541.000000

    Great. Most abstracts seem to be rich in words, with an average of 213.7 words. The shortest abstract consists of only 15 words, while the longest contains 1,541 words.
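Given this wide spread (15 to 1,541 words), one could optionally filter out very short abstracts before further analysis. A small sketch with a hypothetical threshold of 50 words (both the threshold and the sample counts are illustrative):

```python
import pandas as pd

# Hypothetical word counts echoing the describe() output above
counts_df = pd.DataFrame({"Number of Words": [15, 122, 203, 357, 1541]})

# Keep only abstracts with a reasonable amount of text (threshold is illustrative)
rich = counts_df[counts_df["Number of Words"] >= 50]
print(len(rich))  # 4 of the 5 toy abstracts survive the filter
```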

    What do the researchers write?

    Now that we know that most abstracts are rich in information, let’s ask what they mainly write about. We can do this by computing a frequency distribution of the words used. However, we are not interested in certain words, such as stopwords. Accordingly, we first need to process the text:

    # First, transform all to lower case
    df['Abstract_lower'] = df['Abstract'].astype(str).str.lower()
    df.head(3)
    
    # Let's tokenize the column
    from nltk.tokenize import RegexpTokenizer
    
    regexp = RegexpTokenizer(r'\w+')
    
    df['text_token']=df['Abstract_lower'].apply(regexp.tokenize)
    
    #Show the first rows of the new dataset
    df.head(3)
    
    # Remove stopwords
    import nltk
    
    nltk.download('stopwords')
    
    from nltk.corpus import stopwords
    
    # Make a list of english stopwords
    stopwords = nltk.corpus.stopwords.words("english")
    
    # Extend the list with your own custom stopwords
    my_stopwords = ['https']
    stopwords.extend(my_stopwords)
    
    # Remove stopwords with lambda function
    df['text_token'] = df['text_token'].apply(lambda x: [item for item in x if item not in stopwords])
    
    #Show the first rows of the dataframe
    df.head(3)
    
    # Remove short words (words of two letters or fewer)
    df['text_string'] = df['text_token'].apply(lambda x: ' '.join([item for item in x if len(item)>2]))
    
    #Show the first rows of the dataframe
    df[['Abstract_lower', 'text_token', 'text_string']].head()

    What we do here is convert all words to lower case and then tokenize them using natural language processing tools. Word tokenization is a crucial step in natural language processing and involves dividing text into individual words (tokens). We use RegexpTokenizer and tokenize the text of our abstracts based on alphanumeric characters (referred to as ‘\w+’), saving the resulting tokens in the column text_token. We then remove stopwords from this list using the stopword dictionary of nltk, the Python NLTK (Natural Language Toolkit) library, and also delete words of two letters or fewer. This type of text processing helps us focus our analysis on the more important terms.
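To make the pipeline concrete, here is a minimal sketch of the same three steps applied to a single sentence, using re.findall(r'\w+') (which is what RegexpTokenizer does under the hood) and a tiny illustrative stopword list instead of NLTK’s full English one:

```python
import re

sentence = "Gender equality in Afghanistan: a review of women's education."

# Lowercase, then tokenize on alphanumeric runs (equivalent to RegexpTokenizer(r'\w+'))
tokens = re.findall(r"\w+", sentence.lower())

# Remove stopwords (tiny illustrative list; the post uses NLTK's full English list)
stopwords = {"in", "a", "of"}
tokens = [t for t in tokens if t not in stopwords]

# Drop words of two letters or fewer (this also removes the stray "s" from "women's")
tokens = [t for t in tokens if len(t) > 2]
print(tokens)  # → ['gender', 'equality', 'afghanistan', 'review', 'women', 'education']
```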

    Create a Word Cloud

    To visually analyze the resulting list of words, we join the processed and tokenized text into a single string and then create a word cloud:

    from wordcloud import WordCloud
    
    # Create a list of words
    all_words=" ".join([word for word in df['text_string']])
    
    # Word Cloud
    wordcloud = WordCloud(width=600, 
                         height=400, 
                         random_state=2, 
                         max_font_size=100).generate(all_words)
    
    plt.figure(figsize=(10, 7))
    plt.imshow(wordcloud, interpolation='bilinear')
    plt.axis('off');
    [Figure: word cloud of the most frequent words in the abstracts]

    The word cloud shows that the most frequently mentioned words are mostly those that are also part of our search query: Afghanistan, gender, gender equality. However, some other words that were not part of the query are also among the most mentioned: women and men. These words themselves are not very informative, but some others are: within the framework of research on gender equality in Afghanistan, researchers seem to be very concerned with education, human rights, society and the state. Surprisingly, Pakistan is also part of the list. This may mean that the results generated by the search query are inaccurate and also include research on gender equality in Pakistan, although we did not ask for it. Alternatively, it could mean that the gender equality of Afghan women is also an important research topic in Pakistan, perhaps because many Afghans have settled in Pakistan as a result of the difficult situation in their country.
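The word cloud is impressionistic; exact counts can back it up. A sketch with collections.Counter on a hypothetical cleaned string (in practice one would count over the joined text_string column):

```python
from collections import Counter

# Hypothetical cleaned text in the spirit of the text_string column
text = "gender equality afghanistan women education gender afghanistan women gender"

# Frequency of each remaining word
counts = Counter(text.split())
print(counts.most_common(3))  # [('gender', 3), ('afghanistan', 2), ('women', 2)]
```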

    Analyzing the sentiments of the authors

    Ideally, research should be neutral and free from emotions and opinions. However, it is in our human nature to have opinions and sentiments. To investigate the extent to which researchers reflect their own sentiments in what they write, we can do a sentiment analysis. Sentiment analysis is a method to classify a piece of text as positive, neutral or negative. In our example, we will use the VADER sentiment analysis tool. VADER stands for Valence Aware Dictionary and sEntiment Reasoner, and it is a lexicon- and rule-based sentiment analysis tool.

    The VADER sentiment analysis tool works by using a pre-built sentiment lexicon consisting of a large number of words with associated sentiment scores. It also applies grammatical rules to determine the sentiment polarity (positive, neutral or negative) of short texts. The tool produces a sentiment score (also called a compound score) based on the sentiment of each word and the grammatical rules of the text. This score ranges from -1 to 1: values above zero are positive and values below zero are negative. Since the tool relies on a pre-built sentiment lexicon, it does not require complex machine learning models or extensive training data.
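Before applying it to the abstracts, the lexicon idea can be illustrated with a toy example. This is a minimal sketch, not VADER itself: a hypothetical four-word lexicon combined with the normalization formula VADER uses to squash a raw valence sum into [-1, 1], x / sqrt(x² + 15):

```python
import math

# Toy valence lexicon (values are illustrative; VADER ships thousands of rated words)
lexicon = {"good": 1.9, "great": 3.1, "bad": -2.5, "terrible": -2.1}

def toy_compound(text, alpha=15):
    """Sum word valences, then normalize into [-1, 1] via x / sqrt(x^2 + alpha)."""
    total = sum(lexicon.get(w, 0.0) for w in text.lower().split())
    return total / math.sqrt(total**2 + alpha)

print(round(toy_compound("a great and good result"), 3))  # positive, close to +1
print(round(toy_compound("a terrible bad outcome"), 3))   # negative, close to -1
```

Real VADER additionally handles negation, intensifiers, punctuation and capitalization, which this sketch omits.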

    # Access to the required lexicon containing sentiment scores for words
    nltk.download('vader_lexicon')
    
    # Initializes the sentiment analyzer object
    from nltk.sentiment import SentimentIntensityAnalyzer
    
    #Calculate the sentiment polarity scores with analyzer
    analyzer = SentimentIntensityAnalyzer()
    
    # Polarity Score Method - Assign results to Polarity Column
    df['polarity'] = df['text_string'].apply(lambda x: analyzer.polarity_scores(x))
    df.tail(3)
    
    # Change data structure - Concat original dataset with new columns
    df = pd.concat(
        [df, 
         df['polarity'].apply(pd.Series)], axis=1)
    
    #Show structure of new column
    df.head(3)
    
    #Calculate mean value of compound score
    df.compound.mean()
    
    #Output: 0.20964702380952382

    The code above generates a polarity score between -1 and 1 for each abstract, referred to here as the compound score. The mean value is greater than zero, so most of the research has a positive connotation. How has this changed over time? We can simply plot the sentiment by year:

    # Lineplot
    g = sns.lineplot(x='Year', y='compound', data=df)
    
    #Adjust labels and title
    g.set(title="Sentiment of Abstract")
    g.set(xlabel="Year")
    g.set(ylabel="Sentiment")
    
    #Add a grey line to indicate zero (the neutral score) to divide positive and negative scores
    g.axhline(0, ls="--", c="grey")
    [Figure: line plot of mean abstract sentiment by year]

    Interesting. Most research has been positive since 2003. Before that, sentiment fluctuated more sharply and was more negative on average, perhaps due to the plight of women in Afghanistan.
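Behind the line plot is a simple aggregation: the mean compound score per publication year. A sketch on hypothetical scores (values chosen for illustration):

```python
import pandas as pd

# Hypothetical compound scores per paper
scores_df = pd.DataFrame({
    "Year": [2001, 2001, 2010, 2010],
    "compound": [-0.5, 0.25, 0.5, 0.25],
})

# Mean sentiment per publication year (what sns.lineplot visualizes)
yearly = scores_df.groupby("Year")["compound"].mean()
print(yearly.loc[2001])  # -0.125 (negative on average)
print(yearly.loc[2010])  # 0.375 (positive on average)
```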

    Conclusion

    Natural language processing can help us extract valuable information from large amounts of text. What we learned here from the nearly 170 papers is that education and human rights were the most important topics in the research papers collected by Elicit, and that researchers began to write more positively about gender equality in Afghanistan after the fall of the Taliban regime in 2001.

    Key Takeaways

    • We can use natural language processing tools to get a quick insight into the main topics studied in a particular research area.
    • A word cloud is a great visualization tool for understanding the most frequently used words in a text.
    • Sentiment analysis shows that the research may not be as neutral as expected.

    I hope this article was informative. Feel free to connect with me on LinkedIn. Let’s connect and work on using data for good!

    Frequently Asked Questions

    Q1. How does Elicit work?

    A. Elicit is an online platform designed to help researchers find papers and research data using AI. By simply asking a research question, Elicit searches a huge database of 175 million articles to uncover relevant answers. It also provides functionality for analyzing your own papers. In addition, Elicit offers a user-friendly interface that ensures effortless navigation and accessibility.

    Q2. What is natural language processing?

    A. Natural Language Processing (NLP) is a specialized branch in the field of Artificial Intelligence (AI). Its primary goal is to enable machines to understand and analyze human language, allowing them to automate a variety of repetitive tasks. Some common NLP applications include machine translation, summarization, ticket classification, and spell checking.

    Q3. How do you do sentiment analysis?

    A. There are several approaches to calculating a sentiment score, but a widely used method involves using a dictionary of words classified as negative, neutral, or positive. The text is further examined to determine the presence of negative and positive words, which allows an assessment of the overall mood conveyed by the text.
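The dictionary approach described in this answer can be sketched in a few lines. The word lists here are tiny and purely illustrative; real tools use large curated lexicons:

```python
# Tiny illustrative word lists (real tools use large curated lexicons)
positive = {"progress", "improved", "success"}
negative = {"failed", "decline", "crisis"}

def simple_sentiment(text):
    """Classify text by counting positive vs. negative dictionary hits."""
    words = text.lower().split()
    score = sum(w in positive for w in words) - sum(w in negative for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

print(simple_sentiment("Education access improved and showed progress"))  # positive
print(simple_sentiment("The program failed amid a crisis"))               # negative
```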

    Q4. What is the most commonly used metric in VADER sentiment analysis?

    A. The compound score is obtained by summing the valence scores of the individual words in the lexicon, adjusting them according to the applicable rules, and then normalizing the score to the range of -1 (extremely negative) to +1 (extremely positive). This metric is especially valuable when looking for a single, one-dimensional measure of sentiment.

    References

    Media displayed in this article is not owned by Analytics Vidhya and is used at the discretion of the author.
