Abstract

Global Gender Estimation From Distribution of First Names

Manolis Antonoyiannakis,1,2 Hugues Chaté,3,4 Serena Dalena,1 Jessica Thomas,1 Alessandro S. Villar1

Objective

By construction, current methods of gender estimation portray gender-skewed populations as more gender-balanced than they truly are.1,2 This systematic bias consistently understates underrepresentation, in which one gender makes up less than 50% of a population.
A global method that is free of systematic errors was introduced to estimate the gender composition of a population from correlations with first names.3 The method will improve our understanding of the review process and enhance analytics tools used in science.

Design

Determining the gender composition of a group from first names requires prior knowledge of name-gender correlations from a reference population. Current gender-estimation methods assume that name-gender conditional probabilities can be directly transferred from a reference population to a target population. This strong assumption means that one population must be a fair sample of the other, particularly in gender composition, so conventional methods fail when the target population is strongly gender-skewed. A global gender estimator method (gGEM) was derived that instead quantifies how the reference conditional probabilities must transform to best describe the observed list of names. The transformation is based on a process that morphs one population into another and seeks a self-consistent solution using the complete list of names. It frees the estimation process from the fair-sampling assumption while also quantifying the strength of the otherwise hidden gender-dependent social process. Public data containing more than 200,000 names from 3 countries (40% from the US, 35% from Brazil, and 25% from France) were used as reference populations, from which prescribed fractions of men or women were removed to construct test populations of various gender compositions. The estimation method was compared with conventional approaches using these well-controlled test populations.
A limitation is that the method is only as accurate as the correlation between names and gender given by the reference data.
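
The contrast between transferring reference conditional probabilities directly and solving self-consistently for the gender mix can be illustrated with a toy sketch. The code below is not the gGEM algorithm of reference 3. It is a generic expectation-maximization correction for a shifted gender mix, used here only as a stand-in for a self-consistent estimator. The reference counts and all function names are hypothetical, and the toy names are deliberately more ambiguous than real data. The test population is built as in the design above, by removing a prescribed share of women from the reference.

```python
from collections import Counter

# Hypothetical toy reference counts, name -> (women, men). Invented for
# illustration only; deliberately more ambiguous than real name data.
REFERENCE = {
    "maria": (980, 20),
    "john": (10, 990),
    "alex": (300, 700),
    "sam": (400, 600),
    "anna": (950, 50),
    "paul": (5, 995),
}


def make_test_population(reference, keep_one_in):
    """Keep every man but only one woman in `keep_one_in` for each name.

    Returns the list of names and the true fraction of women in it.
    """
    names, women = [], 0
    for name, (w, m) in reference.items():
        kept = w // keep_one_in
        women += kept
        names.extend([name] * (kept + m))
    return names, women / len(names)


def conventional_estimate(names, reference):
    """Average the reference P(woman | name) over the target list."""
    total = 0.0
    for name in names:
        w, m = reference[name]
        total += w / (w + m)
    return total / len(names)


def self_consistent_estimate(names, reference, tol=1e-10, max_iter=10_000):
    """Iterate the fraction of women until it reproduces itself.

    Assumption: P(name | gender) is taken from the reference, and only the
    gender mix of the target is unknown. This is a generic EM fixed-point
    iteration, not the published gGEM procedure.
    """
    tot_w = sum(w for w, _ in reference.values())
    tot_m = sum(m for _, m in reference.values())
    counts = Counter(names)
    pi = 0.5  # initial guess for the fraction of women
    for _ in range(max_iter):
        expected_women = 0.0
        for name, c in counts.items():
            w, m = reference[name]
            pw = pi * w / tot_w          # joint weight: woman with this name
            pm = (1.0 - pi) * m / tot_m  # joint weight: man with this name
            expected_women += c * pw / (pw + pm)
        new_pi = expected_women / len(names)
        if abs(new_pi - pi) < tol:
            return new_pi
        pi = new_pi
    return pi


if __name__ == "__main__":
    names, true_fraction = make_test_population(REFERENCE, keep_one_in=20)
    print("true fraction of women:  ", round(true_fraction, 4))
    print("conventional estimate:   ", round(conventional_estimate(names, REFERENCE), 4))
    print("self-consistent estimate:", round(self_consistent_estimate(names, REFERENCE), 4))
```

In this toy setting the conventional average is pulled toward the gender mix of the reference, whereas the iterated estimate stays close to the constructed fraction. The size of the gap depends entirely on the invented counts and should not be read as a result of the study.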

Results

gGEM provided accurate estimates irrespective of gender composition. In contrast, estimates from previous methods deviated linearly from the correct values as the gender mix departed from gender balance. In the extreme case of a highly skewed test population composed of 1% women (correctly estimated by gGEM), previous methods estimated a prevalence of women of 3% when names with unclear gender were considered and 2% when they were not, a systematic error of at least 100% of the correct prevalence. gGEM showed no observable systematic error for any gender mix tested. Typically, the systematic inaccuracy of conventional methods grows quickly once the fraction of the underrepresented gender falls below 20%.
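
The linear deviation reported for conventional methods can be motivated with a short derivation. It is a sketch under assumptions not stated in the abstract: the name distribution within each gender, p(n|g), is the same in the reference and test populations (as when a test population is built by removing a fraction of one gender), every name is counted, and the symbols π, a, and b are introduced here for illustration.

```latex
% \pi: true fraction of women in the test population
% P_{\mathrm{ref}}(F \mid n): reference probability that a person named n is a woman
\begin{align}
  f(n) &= \pi\, p(n \mid F) + (1-\pi)\, p(n \mid M), \\
  \hat{\pi}_{\mathrm{conv}} &= \sum_n f(n)\, P_{\mathrm{ref}}(F \mid n)
      = \pi\, a + (1-\pi)\, b, \\
  a &= \sum_n p(n \mid F)\, P_{\mathrm{ref}}(F \mid n), \qquad
  b = \sum_n p(n \mid M)\, P_{\mathrm{ref}}(F \mid n), \\
  \hat{\pi}_{\mathrm{conv}} - \pi &= b - \pi\,(1 - a + b)
      = (1 - a + b)\,\bigl(\pi_{\mathrm{ref}} - \pi\bigr),
  \qquad \pi_{\mathrm{ref}} = \frac{b}{1 - a + b}.
\end{align}
```

Under these assumptions the error is linear in π, vanishes when the test population has the same gender mix as the reference, and always points toward the reference mix. It disappears only if names identify gender perfectly (a = 1, b = 0).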

Conclusions

When estimating the gender profile of a population from first names, the global estimation method proposed here, which is easily implemented, should become the method of choice. Furthermore, it is argued that merging available reference populations with little overlap is a good strategy to mitigate errors stemming from population mismatch.

References

1. Ross CO, Gupta A, Mehrabi N, Muric G, Lerman K. The leaky pipeline in physics publishing. arXiv. Preprint posted online October 18, 2020. doi:10.48550/arXiv.2010.08912

2. Squazzoni F, Bravo G, Farjam M, et al. Peer review and gender bias: a study on 145 scholarly journals. Sci Adv. 2021;7(2):eabd0299. doi:10.1126/sciadv.abd0299

3. gGEM. Home page. https://www.ggem.app

1American Physical Society, College Park, MD, USA, villar@aps.org; 2Department of Applied Physics & Applied Mathematics, Columbia University, New York, NY, USA; 3Service de Physique de l’Etat Condensé, CEA, CNRS, Université Paris-Saclay, CEA-Saclay, Gif-sur-Yvette, France; 4Computational Science Research Center, Beijing, China

Conflict of Interest Disclosures

None reported.

Additional Information

Alessandro S. Villar and Hugues Chaté are co–corresponding authors.
