RobustLinks has been developing computer vision algorithms for clients in agricultural applications. To demonstrate some of the potential of convolutional neural networks, we built a system that learns to estimate the density of vegetation in an image at any scale. This is joint work with Sagar Waghmare, an expert in deep learning and computer vision.
We are happy to announce that a team of us is entering Kaggle’s Nature Conservancy Fisheries Monitoring competition. We will be updating this page with our team and submission information in due course.
“We should be ashamed of big data, because it means our algorithms are bad” – Bud Mishra, Professor, Computer Science, NYU.
Longer unedited version
“We should be as proud of big data as one should be of a big tumor. Just as an uncontrolled tumor is a sign of a failed somatic surveillance, an oblivious immune system and unstable genomes, big data points to our failure to design better algorithms and architecture for the internet.”
For scalability and privacy reasons, the Internet’s designers intentionally left identity out of the architecture. Today identity (and trust) is provisioned at the application layer by apps like Facebook that require users to reveal their true identity. At a minimum, this includes the user’s true name (the situation is a little different on Twitter). What is interesting is that names carry (and unintentionally leak) a surprising amount of information, including the gender, age, ethnicity and even birth location of the individual. So next time you tag a friend’s photo on Facebook with their name, think about what information you are revealing (yes, there are image recognition algorithms out there that use the name to help in the recognition task).
The problem we wanted to solve can be stated simply as: “given a name, design an algorithm that can predict the gender, ethnicity and age of the individual”. In this post we describe the path we took to solve the first (simpler) part of the problem: predicting the gender. Future posts will describe the other predictions.
The problem sounds simple enough: just do a dictionary lookup. However, there are some corner cases that add complexity. For instance, researchers at CMU, using the United States Social Security Baby Name DB as their data source, found that:
there is nearly twice the diversity in the names selected for females (entropy of first names, given female H(p|g = female)=9.20 bits) than for males (H(p|g = male)=8.22 bits). The majority of first names are strongly associated with one gender or the other. The entropy of gender is nearly one bit (0.998) but the conditional entropy of gender given first name is only H(g|p = n)=0.055. However, some names are surprisingly gender-neutral. For example, the names “Peyton”, “Finley”, “Kris”, “Kerry” and “Avery” all have near equal probability of being assigned to either a boy or girl.
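To make the entropy figures in the quote concrete, here is a minimal Python sketch of how the entropy of gender, H(g), and the conditional entropy of gender given first name, H(g|p), could be computed from a table of birth counts. The function names and the toy counts are assumptions made for illustration; they are not taken from the CMU work or from the Social Security data.

```python
import math
from collections import defaultdict


def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)


def gender_entropies(counts):
    """Return (H(g), H(g|p)) in bits.

    counts maps (name, gender) -> number of births.
    """
    total = sum(counts.values())
    by_gender = defaultdict(int)
    by_name = defaultdict(lambda: defaultdict(int))
    for (name, gender), n in counts.items():
        by_gender[gender] += n
        by_name[name][gender] += n

    # Entropy of gender, ignoring the name.
    h_gender = entropy(n / total for n in by_gender.values())

    # Conditional entropy: per-name entropy weighted by how common the name is.
    h_gender_given_name = 0.0
    for genders in by_name.values():
        name_total = sum(genders.values())
        h_gender_given_name += (name_total / total) * entropy(
            n / name_total for n in genders.values()
        )
    return h_gender, h_gender_given_name


# Hypothetical toy counts, for illustration only.
toy_counts = {
    ("mary", "female"): 100,
    ("john", "male"): 100,
    ("avery", "female"): 50,
    ("avery", "male"): 50,
}
h_g, h_g_given_p = gender_entropies(toy_counts)
```

In this toy table the two genders are equally common, so H(g) is one bit, while only the gender-neutral name contributes any uncertainty once the name is known, so H(g|p) is much smaller.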
We aggregated a series of name data sources (including the US Social Security baby names DB mentioned above) and constructed a distributional model of names from all sources. Then, given a name, an expectation maximization algorithm predicts the likelihood of each gender for that name. You can access this service as an API via this URL
The output is a JSON object whose “gender” key can take the values “male”, “female”, or “unknown”.
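As a rough sketch of the idea, the snippet below merges several name sources into per-name counts and returns a JSON object with a “gender” key. It is a deliberately simplified stand-in: it uses a plain majority share with a cutoff rather than the expectation maximization step used by the service, and the function names, the 0.9 threshold and the sample counts are all assumptions made for illustration.

```python
import json
from collections import Counter


def build_model(sources):
    """Merge several lists of (name, gender, count) into per-name counts."""
    model = {}
    for source in sources:
        for name, gender, count in source:
            model.setdefault(name.lower(), Counter())[gender] += count
    return model


def predict_gender(model, name, threshold=0.9):
    """Return a JSON string whose "gender" is "male", "female" or "unknown".

    threshold is a hypothetical cutoff: names whose majority share falls
    below it are treated as gender-neutral and reported as "unknown".
    """
    counts = model.get(name.strip().lower())
    if not counts:
        return json.dumps({"gender": "unknown"})
    gender, top = counts.most_common(1)[0]
    share = top / sum(counts.values())
    return json.dumps({"gender": gender if share >= threshold else "unknown"})


# Hypothetical sample sources, for illustration only.
source_a = [("Mary", "female", 990), ("Mary", "male", 10),
            ("Avery", "female", 55), ("Avery", "male", 45)]
source_b = [("John", "male", 500)]
model = build_model([source_a, source_b])

result = json.loads(predict_gender(model, "Avery"))
```

A name that appears in none of the sources, or one that is split almost evenly between genders, comes back as “unknown” rather than as a forced guess.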
We hope you find this API useful. Comments and feedback are welcome, as always.
You may have read about or even used our Unit Predict API, which allows one to predict the likely demographic makeup of an individual standing at some geographic coordinates.
While the usefulness of inferring user data from a single data point should not be underestimated, we realized there was potential for improvement, as well as for an expanded number of use cases!
An example – suppose you got a set of data points from an app user’s mobile device. Maybe something like this:
2:13 AM EST 06/01/2013 (lower east side)
2:03 PM EST 06/01/2013 (rockaway beach)
7:45 AM EST 06/01/2013 (murray hill)
10:45 PM EST 06/06/2013 (bushwick)
7:15 AM EST 06/07/2013 (murray hill)
8:01 PM EST 06/09/2013 (atlantic city NJ)
7:23 PM EST 06/11/2013 (midtown)
3:30 PM EST 06/13/2013 (midtown)
11:45 AM EST 06/14/2013 (midtown)
7:15 PM EST 06/15/2013 (miami FL)
From the times, dates, and locations, one might at a glance assume something like:
“User X lives in Murray Hill, works in midtown, and likes beaches.”
High-level inferences of this nature are left up to the API user. You know your data and your user base better than we do!
And note further that although you may have guessed something about the user’s behavior, you still don’t know whether said user is black or white, old or young, and so on. (I didn’t say “he” or “she” because you don’t know that either!)
So we’ve built new services on top of those available in UnitPredict to allow API users to programmatically deal with aggregate data sets like the above in order to more easily draw intelligent conclusions or filter data based on their domain knowledge. The basic flow is:
- Give us your batch of data points
- Give us some guidelines about what you’re looking for
- We’ll send back a predicted demographics profile, based only on those data points you told us were relevant. No more “eyeballing” of data – you don’t need to look at a map. So simple, even a computer could do it.
Let’s see how it works, using data from the above example.
The basic batchPredict query
Let’s see what the query looks like without filtering. In effect, we are aggregating all points in our data set as being “equally valid” in our attempt to build a likely demographic profile for a user.
Getting ready: compiling your data series
We’re guessing you are most likely collecting data programmatically, in which case you should be able to generate an API call programmatically from your stored data. To send us a data series, generate a string of data points from your data set, each in the format
A GET request allows you to string data points together, i.e.:
UTC (Coordinated Universal Time), often referred to as GMT (Greenwich Mean Time), is a globally accepted standard. (Given the number of time zones out there, some offset by whole hours and some by half hours, and given that daylight saving time rules can vary even from county to county within a state, there has to be some kind of standard.)
You’ll get to specify time zone where appropriate, as described immediately following.
Filtering by time of day
Perhaps you’ve decided to assume that your app user is near home if it’s night or early morning. Along with the data set above, you can specify a relevance parameter (“timeOfDay”) and a timeOfDay parameter, with choices of MORNING (5-9 AM), WORKDAY (9 AM-5 PM), and so on (see our API documentation for the complete list). You’ll also need to tell us the time zone as an offset from GMT (in May, and also as of this writing, New York is GMT-04:00 thanks to daylight saving time).
Here we take the above query and see what a user’s demographic profile might look like if we only consider location data collected during the MORNING, local NYC time:
Given this filtered data set, the Globeskimmer algorithms now have the confidence to predict additional demographic dimensions: the user’s likely education level (Bachelor’s degree), gender (female), and age (25-34).
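To illustrate what the time-of-day filter does conceptually, here is a small client-side sketch in Python. It is not the service’s implementation: the function name, the data layout, the half-open treatment of window boundaries and the UTC timestamps (converted by hand from a few of the sample points above) are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

# Windows named in the post, as half-open [start_hour, end_hour) local hours.
# Treating the 9 AM boundary as belonging to WORKDAY is an assumption.
TIME_WINDOWS = {"MORNING": (5, 9), "WORKDAY": (9, 17)}


def filter_by_time_of_day(points, window, utc_offset_hours):
    """Keep (utc_time, place) points whose local hour falls in the window."""
    start, end = TIME_WINDOWS[window]
    local_zone = timezone(timedelta(hours=utc_offset_hours))
    kept = []
    for utc_time, place in points:
        local_hour = utc_time.astimezone(local_zone).hour
        if start <= local_hour < end:
            kept.append((utc_time, place))
    return kept


# A few sample points stored in UTC; comments give the local time at GMT-04:00.
points = [
    (datetime(2013, 6, 1, 6, 13, tzinfo=timezone.utc), "lower east side"),  # 2:13 AM
    (datetime(2013, 6, 1, 18, 3, tzinfo=timezone.utc), "rockaway beach"),   # 2:03 PM
    (datetime(2013, 6, 7, 11, 15, tzinfo=timezone.utc), "murray hill"),     # 7:15 AM
    (datetime(2013, 6, 14, 15, 45, tzinfo=timezone.utc), "midtown"),        # 11:45 AM
]

morning = filter_by_time_of_day(points, "MORNING", -4)
workday = filter_by_time_of_day(points, "WORKDAY", -4)
```

Storing timestamps in UTC and applying the offset only at filter time keeps the data set unambiguous when a user crosses time zones.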
Filtering by location
You might also be wondering where your app user is generally based. As is clear from the above data set, people can get around like never before. Is your user an airline pilot, a secret bigamist, or simply taking a lot of vacations?
Whatever your take on user behavior, we make it easy to filter your data set by location. Continuing with the above example, perhaps you’ve decided the user probably lives in the New York Metropolitan Area, and you want to remove any outliers from the prediction data. To accomplish this you need to specify a bounding box (bbox for short) and a relevance parameter (“space”).
Here we take the original query and see what a user’s demographic profile might look like if we only consider location data collected within a generous bounding box around the New York area:
But suppose you’ve determined by other means, such as your mobile app’s internal data, that your user lives somewhere in Brooklyn. We can set the bounding box accordingly:
…and find that your user may be in a different line of work than previously predicted (clerical and labor).
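The effect of a bounding box can be sketched client-side in a few lines of Python. Everything specific here is an assumption for illustration: the (min_lat, min_lon, max_lat, max_lon) ordering is not necessarily the format the API expects, and the coordinates and boxes are rough approximations chosen by hand.

```python
def in_bbox(lat, lon, bbox):
    """True if (lat, lon) lies inside bbox = (min_lat, min_lon, max_lat, max_lon).

    This simple check does not handle boxes that cross the antimeridian.
    """
    min_lat, min_lon, max_lat, max_lon = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def filter_by_bbox(points, bbox):
    """Keep only the (place, lat, lon) points that fall inside the box."""
    return [p for p in points if in_bbox(p[1], p[2], bbox)]


# Approximate, illustrative coordinates for some of the sample locations.
points = [
    ("murray hill", 40.748, -73.978),
    ("midtown", 40.755, -73.984),
    ("bushwick", 40.694, -73.921),
    ("atlantic city NJ", 39.364, -74.423),
    ("miami FL", 25.762, -80.192),
]

# Hypothetical boxes: a generous one around New York and a tighter one for Brooklyn.
NYC_AREA = (40.4, -74.3, 41.0, -73.6)
BROOKLYN = (40.57, -74.05, 40.74, -73.83)

nyc_points = filter_by_bbox(points, NYC_AREA)
brooklyn_points = filter_by_bbox(points, BROOKLYN)
```

With the generous box the out-of-town points drop away, and tightening it to Brooklyn leaves only the Bushwick point, which is why the predicted profile can change so much between the two queries.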
These are a few of the interesting features of batchPredict. Sign up for a free API key to learn more!
In the 60s the US Army had a program called “remote viewing”, in which it gathered civilians who purportedly had the ability to sense and imagine objects and events at a distant location (read: USSR silos) given only a location. Fascinating work. Now imagine you have to do the same in software. Given only one location data point, your task is to write an algorithm that solves a simpler problem than the one our Army colleagues faced: what are the characteristics of the individual at that location? Why is this an interesting problem to solve in the first place? Well, the inventory of location data is growing fast. It is being generated at all layers, from the physical layer to the network, application, and content layers (e.g. EXIF metadata in images, or location data attached to tweets). But what does it really mean to have access to someone’s location? Location-based services (LBS) today often use location as a key to retrieve or match data relevant to the user’s current location: recommending bars or restaurants, for example.
But can we do more than simple retrieval and matching keyed on location? Can we infer something else from location data, something close to, but not as hard as, what our Army colleagues attempted back in the 60s? Our mission at RobustLinks is to turn data into knowledge, and the problem seems like a good fit. The history behind it is long, convoluted and informative, but the short of it is that, after scratching our heads, what we came up with is a suite of algorithms that, given only a single geolocation data point for a user, transforms that data into a prediction of the user’s demographics along several variables (age, gender, education, profession, income, ethnicity, number of children, home ownership, etc). It is important to note that the algorithm is given only one piece of information and asked to predict the profile of the person who generated it.
How we do it is mum right now. But we have recently opened up the APIs to the algorithms to the public. This article gives an overview of the first API, unit predict. The next article will describe batch predict, which improves its prediction given a time series of geocodes as well as a radius, aggregation rules, etc.
Unit predict is simple to use. After you’ve registered for an API key you simply provide:
- your APIKey, and
- a single location (latlong)
There are some advanced optional parameters that allow you to constrain the algorithm. I’ll cover those in another post.
The API returns the most likely predicted demographic profile for a person at that location, along the dimensions mentioned above. You can see the API doc at:
and call it via
This work started while we were trying to build an encryption service for mobile devices, artifacts that are increasingly “leaking” a lot of personal data. Talking to folks in the marketing industry, we discovered that a lot of media provisioning and allocation is done by Nielsen-style DMA (Designated Market Area): “Buy and distribute this type of media for the Ohio region because the demographics are X, Y, Z”. When we started designing the demographic APIs we were also thinking about application developers who do not have authentication logic in their apps. Can they send us their users’ locations and have us provide a profile of those users? Or how about ad networks? They are data aggregators, and transforming that data into knowledge about users would be valuable.
Join the Partner Network
We designed these APIs with some potential use cases in mind. As the saying goes, “man plans, god laughs”. We’ve already seen API users use the service in unanticipated ways. So if you see a use, feel free to give it a go. Due to resource limitations (and the potentially large volume of incoming data) we’ve had to cap the free service at 100 calls per day. Joining our partner network opens up greater access.
How is Google connected to Iran? Querying Wikipedia links from page “Google” to page “Iran” and running Dijkstra’s shortest path algorithm on the weighted directed cyclic graph gives us this (15 hop) shortest path
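As a rough illustration of the technique (not the actual Wikipedia link graph), Dijkstra’s algorithm on a small weighted directed graph with a cycle can be sketched in Python as follows. The page names, links and edge weights in the toy graph are made up for the example, and the function name is likewise an assumption.

```python
import heapq


def shortest_path(graph, source, target):
    """Dijkstra's algorithm on graph = {node: {neighbor: weight}}.

    Weights must be non-negative. Returns (path, cost), or (None, inf)
    if the target cannot be reached from the source.
    """
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    visited = set()
    while heap:
        d, node = heapq.heappop(heap)
        if node in visited:
            continue  # stale heap entry for a node that is already settled
        visited.add(node)
        if node == target:
            break
        for neighbor, weight in graph.get(node, {}).items():
            candidate = d + weight
            if candidate < dist.get(neighbor, float("inf")):
                dist[neighbor] = candidate
                prev[neighbor] = node
                heapq.heappush(heap, (candidate, neighbor))
    if target not in dist:
        return None, float("inf")
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return path, dist[target]


# Made-up toy link graph; the Search engine -> Google edge forms a cycle.
toy_graph = {
    "Google": {"Search engine": 1, "California": 4},
    "Search engine": {"Internet": 2, "Google": 1},
    "California": {"United States": 1},
    "Internet": {"Iran": 5, "United States": 1},
    "United States": {"Iran": 2},
}
path, cost = shortest_path(toy_graph, "Google", "Iran")
```

Cycles are harmless here because a node is never expanded again once it has been settled, which is what allows the algorithm to run on a link graph where pages routinely link back to each other.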
We’ve been indexing the (cumulative) page view counts of Wikipedia for a while now. The top pages tend to be entertainment related (movies in particular). But to my surprise, today I noticed that Alan Turing was 18th! Above Justin Bieber (26th) and Tom Cruise (27th)!!
- Main Page 125764992
- Undefined 5916363
- UEFA Euro 2012 2463627
- Fifty Shades of Grey 2400574
- Prometheus (film) 2187157
- 404 error 1710697
- Wiki 1329682
- Higgs boson 1327427
- Facebook 1315548
- The Amazing Spider-Man (2012 film) 1200449
- Scientology 1137067
- One Direction 1136127
- Deaths in 2012 1115729
- UEFA Euro 2012 schedule 1108055
- Elizabeth II 1102307
- The Avengers (2012 film) 1050659
- Game of Thrones (TV series) 1044779
- Alan Turing 1001883
- Andy Griffith 910641
- Independence Day (United States) 905349
- Moody chart 877351
- Mario Balotelli 859874
- The Legend of Korra 767574
- 2012 Summer Olympics 763270
- United States 759609
- Justin Bieber 745252
- Tom Cruise 731923
Long live Alan Turing.
Recently I needed to run a MoreLikeThis (MLT) search, but not through a MoreLikeThisHandler searchHandler or as a searchComponent (which returns MLT results for each search result, and is therefore expensive). What I wanted was to execute a standard search for a given query and then use the top result as the input to an MLT search. My final requirement was to provide this functionality inside a searchHandler itself so that I could add my own logic.
So with a bit of work I arrived at the following design. Note that the code below is cobbled together for the benefit of this blog entry. It is not tested and is only meant to share the lessons I learnt from the exercise. The crux of the solution is to use MoreLikeThisHelper, a helper class for MoreLikeThis that can be called from other request handlers.
First you need to register your handler (called /test below) in solrconfig.xml
Next, we define the actual handler by extending the SearchComponent (public class Classname extends SearchComponent)
and define (in either overridden prepare() or process() methods) the following handler logic (see here for how MoreLikeThis handler implements its logic)
Came across Ms Meeker’s slides today. Then and Now?
I have known at least 3 companies trying to solve this problem, but the fact remains that the old-fashioned way is still the best. It is not a technological problem. Supply and demand are known to both parties. On Greenwich and Washington Streets you step out of your door and there is a cab right away. Why? Because cabbies know there are people there who can afford cabs and likely want to go to the airport.