For scalability and privacy reasons, the Internet's designers intentionally left identity out of the architecture. Today identity (and trust) is provisioned at the application layer by apps like Facebook that require users to reveal their true identity. At a minimum, this includes the user’s true name (the situation is a little different on Twitter). Interestingly, names carry (and unintentionally leak) a surprising amount of information, including the gender, age, ethnicity and even birthplace of the individual. So the next time you tag a friend’s photo on Facebook with their name, think about what information you are revealing (yes, there are image recognition algorithms out there that use the name to help with the recognition task).
The problem we wanted to solve can be stated simply: “given a name, design an algorithm that can predict the gender, ethnicity and age of the individual”. In this post we describe the path we took to solve the first (simpler) part of the problem: predicting the gender. Future posts will describe the other predictions.
The problem sounds simple enough: just do a dictionary lookup. However, there are some corner cases that add complexity. For instance, researchers at CMU, using the United States Social Security Baby Name DB as their data source, found that:
there is nearly twice the diversity in the names selected for females (entropy of first names, given female H(p|g = female)=9.20 bits) than for males (H(p|g = male)=8.22 bits). The majority of first names are strongly associated with one gender or the other. The entropy of gender is nearly one bit (0.998) but the conditional entropy of gender given first name is only H(g|p = n)=0.055. However, some names are surprisingly gender-neutral. For example, the names “Peyton”, “Finley”, “Kris”, “Kerry” and “Avery” all have near equal probability of being assigned to either a boy or girl.
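To make the entropy figures in that quote more concrete, here is a minimal Python sketch of how the entropy of gender and the conditional entropy of gender given first name can be computed from per-name counts. The counts below are made up for illustration and are not taken from the Social Security data, and the function names are our own.

```python
import math

# Toy (male, female) counts per first name.
# These numbers are invented for illustration only.
COUNTS = {
    "james": (9800, 200),
    "mary": (100, 9900),
    "avery": (500, 500),
}


def entropy(probs):
    """Shannon entropy in bits, skipping zero-probability outcomes."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def gender_entropy(counts):
    """H(g): entropy of gender over the whole population."""
    male = sum(m for m, _ in counts.values())
    female = sum(f for _, f in counts.values())
    total = male + female
    return entropy([male / total, female / total])


def conditional_gender_entropy(counts):
    """H(g | p): per-name gender entropy, weighted by how common each name is."""
    total = sum(m + f for m, f in counts.values())
    h = 0.0
    for m, f in counts.values():
        n = m + f
        h += (n / total) * entropy([m / n, f / n])
    return h


print(f"H(g)     = {gender_entropy(COUNTS):.3f} bits")
print(f"H(g | p) = {conditional_gender_entropy(COUNTS):.3f} bits")
```

A name split evenly between boys and girls, like "avery" in the toy table, contributes a full bit of uncertainty on its own, while a strongly gendered name contributes almost none. That is why knowing the name brings the overall uncertainty well below one bit.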
We aggregated a series of name data sources (including the US Social Security Baby Name DB mentioned above) and constructed a distributional model of names from all sources. Then, given a name, an expectation-maximization algorithm estimates the likelihood of each gender for that name. You can access this service as an API via this URL
The output is a JSON object whose “gender” key can take the values “male”, “female” or “unknown”.
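As a rough illustration of the pipeline described above, the following Python sketch merges counts from several sources, makes a prediction, and returns a JSON object with a "gender" key. It is not our production code: it uses a simple relative-frequency estimate with a cutoff in place of the expectation-maximization step, and the function names, the 0.8 threshold, the toy counts, and the choice to report ambiguous names as "unknown" are all assumptions made for the example.

```python
import json
from collections import Counter


def aggregate(sources):
    """Merge per-source {name: (male_count, female_count)} tables into one."""
    male, female = Counter(), Counter()
    for table in sources:
        for name, (m, f) in table.items():
            key = name.strip().lower()  # normalize so "James" and "james" merge
            male[key] += m
            female[key] += f
    return male, female


def predict_gender(name, male, female, threshold=0.8):
    """Relative-frequency guess; the threshold is an assumed cutoff."""
    key = name.strip().lower()
    m, f = male[key], female[key]  # Counter returns 0 for unseen names
    if m + f == 0:
        return "unknown"
    p_male = m / (m + f)
    if p_male >= threshold:
        return "male"
    if p_male <= 1 - threshold:
        return "female"
    return "unknown"  # too evenly split to call (assumed behavior)


def to_json(name, male, female):
    """Serialize the prediction as a JSON object with a "gender" key."""
    return json.dumps({"gender": predict_gender(name, male, female)})


# Invented counts standing in for two separate name data sources.
source_a = {"James": (980, 20), "Avery": (50, 50)}
source_b = {"james": (900, 10), "Mary": (5, 995)}

male, female = aggregate([source_a, source_b])
for name in ["James", "Mary", "Avery", "Zzyzx"]:
    print(name, to_json(name, male, female))
```

With a cutoff like this, the gender-neutral names from the CMU study would tend to fall into the "unknown" bucket rather than being forced to one side.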
We hope you find this API useful. Comments and feedback are, as always, welcome.