Closing the Gender Data Gap: Using AI and Computational Methods to Build a Data Set that Measures the Influence of Gender in Social Responses to COVID

Authors: Laura Fernanda Cely

Tuesday, 01 November 2022

One of the core objectives of the COLEV project is to analyse social media platforms to identify common reactions to COVID-19 policies. In a health event that requires social coordination and cooperation to be addressed effectively, it is important to monitor whether public opinion is in favour of or against policies. More specifically, in a country where 29% of men and 15% of women believe that women are less suited to be politicians, we wanted to identify whether the public’s perception of policies is influenced by the policymaker’s gender. However, to produce gender-sensitive research, it is necessary to have gender-sensitive data, and in this case, as in many others, that data did not exist.

In this blog, we do not attempt to answer the research question. Instead, we want to highlight the methodological steps taken to find the answer, starting with building a model to collect gender-sensitive data with the help of AI and other computational methods.


Analysing data when there is no data

One of the key obstacles to understanding whether reactions to COVID-19 policies were affected by the policymaker’s gender is the lack of public access to gender information for all 1,102 Colombian mayors. Although the “Federación Colombiana de Municipios”, an institution that gathers information on all of the country’s mayors, has published a directory with their names and contact information, it contains no information on their gender. Unfortunately, this kind of absence is common in gender analysis. The Gender Data Gap[i] refers to the absence of information that specifically acknowledges the different experiences of women and men, and to the tendency for gathered data to be inherently male-biased, with male experiences generalised to the whole population. The directory had a second gap: although 72% of the Colombian population uses the internet daily and Colombian mayors are known to use social media to communicate with their citizens, it did not provide social media details.

Thus, to begin identifying the differences in public response to policies, we needed to address these gaps and find the mayors’ social media accounts and gender information.


Using AI and computational methods to fill the data gap

1. Gender data imputation

Many AI algorithms are used to automatically detect gender based on attributes such as first names[ii], speech audio[iii], or facial imagery[iv]. Regardless of their accuracy, all these imputation methods carry the risk of misgendering people or reinforcing gender stereotypes[v]. In our case, as we had the mayors’ registered names, we used Global Name Data to detect their gender. This is a simple classifier trained on official birth records from at least four countries. We used gender_detector, a Python implementation of Global Name Data, with the classifier trained on United States data because of its large size and its larger proportion of Latinx population data.
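As a minimal sketch of what name-based imputation looks like, the snippet below labels each registered name from its first given name and reports the share of names that were recognised. The toy lookup table, the function names, and the choice to use only the first token of the name are assumptions made for illustration; they are not the project's code or the gender_detector API.

```python
import unicodedata
from typing import Callable, Dict, Iterable

# Hypothetical toy table standing in for a real name-to-gender dataset.
TOY_NAME_TABLE: Dict[str, str] = {
    "maria": "female",
    "claudia": "female",
    "juan": "male",
    "carlos": "male",
}


def normalise(name: str) -> str:
    """Lowercase a name and strip accents so 'María' matches 'maria'."""
    decomposed = unicodedata.normalize("NFKD", name.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def toy_guess(first_name: str) -> str:
    """Return 'male', 'female', or 'unknown' from the toy table."""
    return TOY_NAME_TABLE.get(normalise(first_name), "unknown")


def impute_gender(full_names: Iterable[str],
                  guess: Callable[[str], str] = toy_guess) -> Dict[str, str]:
    """Label each registered name using its first given name only (an assumption)."""
    labels = {}
    for full_name in full_names:
        tokens = full_name.split()
        labels[full_name] = guess(tokens[0]) if tokens else "unknown"
    return labels


def recognition_rate(labels: Dict[str, str]) -> float:
    """Share of names that received a label other than 'unknown'."""
    if not labels:
        raise ValueError("no names were labelled")
    recognised = sum(1 for gender in labels.values() if gender != "unknown")
    return recognised / len(labels)
```

Because `guess` is just a callable, a real classifier can be swapped in for the toy table. The gender_detector README describes a `GenderDetector('us').guess(name)` call that could play this role, but treat that signature as something to check against the installed version.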

As with every AI algorithm, our use of gender_detector has implications that must be addressed. For starters, there is a risk of misgendering, as the mechanism takes a name as input and outputs the gender usually assigned at birth to people with that name. However, the risk of misgendering in our case is low, as most names in Spanish tend to be clearly gendered and trans people in Colombia have the legal right to change their name. Another implication of using this database is that it only accurately recognises names that are common in birth records within the United States. In our case, it recognised 93% of the names we submitted, which may be due to the large Latinx population, and thus the large number of registered Hispanic names, in the US. It is possible that the algorithm does not work as well with less represented populations. We addressed these issues with manual qualitative validation, reviewing all the people classified as women or unknown and 30% of those classified as men. From this validation, we know the algorithm has an accuracy of 97% on our database, meaning that 97% of the reviewed data points were correctly assigned.
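The validation step can be sketched as follows: every name labelled female or unknown goes to manual review, together with a random 30% of the names labelled male, and accuracy is the share of reviewed names whose automatic label matches the manual one. The function names, the fixed random seed, and the data layout (dictionaries keyed by name) are illustrative assumptions rather than the project's actual code.

```python
import random
from typing import Dict, List


def select_for_review(labels: Dict[str, str],
                      male_fraction: float = 0.30,
                      seed: int = 0) -> List[str]:
    """Pick all non-male labels plus a random fraction of the male labels."""
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    always_reviewed = [name for name, g in labels.items() if g != "male"]
    males = [name for name, g in labels.items() if g == "male"]
    sample_size = round(len(males) * male_fraction)
    return always_reviewed + rng.sample(males, sample_size)


def accuracy(predicted: Dict[str, str], verified: Dict[str, str]) -> float:
    """Share of manually reviewed names whose automatic label was correct."""
    reviewed = [name for name in verified if name in predicted]
    if not reviewed:
        raise ValueError("no reviewed names to score")
    correct = sum(predicted[name] == verified[name] for name in reviewed)
    return correct / len(reviewed)
```

Under this definition an "unknown" prediction counts as incorrect whenever the reviewer assigns a gender, which is one possible convention; the figure is also computed only over the reviewed subset, not over all mayors.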

2. Social media user imputation

Imputing social media accounts from names is a less common problem in the computational science literature, and with no clear reference for the task, our approach was heuristic. Since we were looking for the Twitter accounts of well-known public figures, we believed the easiest way to find their handles was to run Google searches and to automate them with computational methods. We used GoogleSearch to make requests such as [First Name + Last Name + Mayor + Municipality_name + inurl:Twitter] to generate the results we needed. Following this method, we were able to find a username for 59% of the mayors, and we manually validated these results. We also had to consider the possibility that the remaining 41% of the mayors in the country do not have a Twitter account, as Twitter penetration in Colombia is only 15%.
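To make the query pattern concrete, here is a hedged sketch of two steps around the search itself: composing the query string for one mayor and pulling a candidate handle out of a result URL. The search call is deliberately left out, since it depends on the client library used. The function names, the list of non-profile paths, and the use of the English keyword "Mayor" (taken from the pattern above) are assumptions for illustration.

```python
import re
from typing import Optional
from urllib.parse import urlparse

TWITTER_HOSTS = {"twitter.com", "www.twitter.com", "mobile.twitter.com"}
# Path prefixes that are site pages rather than user profiles (not exhaustive).
NON_PROFILE_PATHS = {"search", "hashtag", "i", "intent", "home", "share", "explore"}
# Twitter handles use letters, digits and underscores, up to 15 characters.
HANDLE_PATTERN = re.compile(r"^[A-Za-z0-9_]{1,15}$")


def build_query(first_name: str, last_name: str, municipality: str,
                role: str = "Mayor") -> str:
    """Compose the search string for one mayor, following the pattern in the text."""
    return f"{first_name} {last_name} {role} {municipality} inurl:Twitter"


def extract_handle(url: str) -> Optional[str]:
    """Return the handle from a Twitter profile or tweet URL, or None."""
    parsed = urlparse(url)
    if (parsed.hostname or "").lower() not in TWITTER_HOSTS:
        return None
    segments = [segment for segment in parsed.path.split("/") if segment]
    if not segments:
        return None
    candidate = segments[0]
    if candidate.lower() in NON_PROFILE_PATHS:
        return None
    return candidate if HANDLE_PATTERN.match(candidate) else None
```

For example, `build_query("Ana", "Gómez", "Tunja")` yields `Ana Gómez Mayor Tunja inurl:Twitter`, and a result URL such as `https://twitter.com/AnaGomez/status/123` would give the candidate handle `AnaGomez`, which still needs manual validation.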

In this situation, the ethical concerns are low, as we can only collect data from public accounts and there is no risk of infringing on privacy rights. Nevertheless, the precision of this method is highly dependent on the profile of the person whose Twitter account is searched for. Where names are very common or the municipality is small, the automated Google search is likely to return the account of a namesake. Again, this approach may help to fill the data gap, but it is not 100% accurate.
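One inexpensive check that could help prioritise manual validation, and which is not described as part of the original workflow, is to look for handles that the search assigned to more than one mayor, since a shared handle is a strong sign of a namesake match. The sketch below assumes the search results are stored as a dictionary from a hypothetical (name, municipality) key to the handle found, or None when nothing was found.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Mayor = Tuple[str, str]  # (full name, municipality), a hypothetical key


def find_shared_handles(
    matches: Dict[Mayor, Optional[str]]
) -> Dict[str, List[Mayor]]:
    """Return handles assigned to more than one mayor, compared case-insensitively."""
    by_handle = defaultdict(list)
    for mayor, handle in matches.items():
        if handle:  # skip mayors for whom no account was found
            by_handle[handle.lower()].append(mayor)
    return {
        handle: sorted(mayors)
        for handle, mayors in by_handle.items()
        if len(mayors) > 1
    }
```

A handle that appears only once can still belong to a namesake, so this check narrows the review rather than replacing it.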


Reflections on the process

AI and computational methods are useful tools to help fill data gaps, but the results from these approaches have limitations. Asking authorities to collect and open data of public interest, such as the data we needed in this situation, should be a priority. Automatic gender detection is also useful but particularly problematic. Although filling the gender data gap is a prerequisite for gender-sensitive research, it is important to recognise that automatic approaches will not provide information about identities and that most of these generic approaches tend to perpetuate the invisibility of trans and non-binary individuals.



This study was funded through the mixed-methods study on the design of AI and data science-based strategies to inform public health responses to COVID-19 in different local health ecosystems within Colombia (COLEV) project, which is funded by the International Development Research Centre (IDRC) and the Swedish International Development Cooperation Agency (Sida) [109582]. I am thankful to Juan José Corredor Ojeda for his assistance in the manual validation of data for this project, and to Natalia Galvis, Estefanía Hernández, Diana Higuera, Sandra Martínez, Natalia Niño, Nicolas Yáñez, Johana Trujillo, and Natali Valdez for their valuable insights and feedback, with the aim of including a gender perspective in the research.



[i] Caroline Criado Perez, Invisible Women: Data Bias in a World Designed for Men (2019).

[ii] Kamil Wais, “Gender Prediction Methods Based on First Names with GenderizeR,” The R Journal 8, no. 1 (2016): 17; Nicolas Bérubé et al., “Wiki-Gendersort: Automatic Gender Detection Using First Names in Wikipedia” (SocArXiv, March 14, 2020).

[iii] T. Jayasankar, K. Vinothkumar, and Arputha Vijayaselvi, “Automatic Gender Identification in Speech Recognition by Genetic Algorithm,” Applied Mathematics & Information Sciences 11, no. 3 (May 1, 2017): 907–13.

[iv] “A Review of Facial Gender Recognition,” SpringerLink, accessed October 19, 2022.

[v] Os Keyes, “The Misgendering Machines: Trans/HCI Implications of Automatic Gender Recognition,” Proceedings of the ACM on Human-Computer Interaction 2, no. CSCW (November 1, 2018): 88:1–88:22; Foad Hamidi, Morgan Klaus Scheuerman, and Stacy M. Branham, “Gender Recognition or Gender Reductionism? The Social Implications of Embedded Gender Recognition Systems,” in Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems, CHI ’18 (New York, NY, USA: Association for Computing Machinery, 2018), 1–13.

Tags: COVID-19, AI4COVID, Women, Social Media