Learning to Estimate Vegetation Density from Images: A Deep Learning Approach

RobustLinks has been developing computer vision algorithms for clients in agricultural applications. To demonstrate some of the potential of Convolutional Neural Networks (CNNs), we built a system that learns to estimate the density of vegetation in an image at any scale. This is joint work with Sagar Waghmare, an expert in deep learning and computer vision.

We used the Plant Phenotyping Dataset to train a CNN for semantic and instance segmentation tasks; see the Plant Phenotyping project page for how the data was collected. The CNN code can be found on GitHub.
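
As a toy illustration of the post-processing step (not the CNN itself, whose code is on GitHub): once the segmentation network has labeled each pixel as vegetation or not, density at any scale can be read off as the fraction of vegetation pixels inside a window of that scale. A minimal sketch, with the boolean mask representation as our assumption:

public class VegetationDensity {
    /** Fraction of vegetation pixels in a window of side `window`
     *  anchored at (row, col); the mask is a per-pixel CNN output. */
    public static double density(boolean[][] mask, int row, int col, int window) {
        int count = 0, total = 0;
        for (int r = row; r < Math.min(row + window, mask.length); r++) {
            for (int c = col; c < Math.min(col + window, mask[r].length); c++) {
                if (mask[r][c]) count++;
                total++;
            }
        }
        return total == 0 ? 0.0 : (double) count / total;
    }
}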

Big Data


Paraphrasing here:

“We should be ashamed of big data, because it means our algorithms are bad.” – Bud Mishra, Professor of Computer Science, NYU

The longer, unedited version:

We should be as proud of big data as one should be of a big tumor. Just as an uncontrolled tumor is a sign of a failed somatic surveillance, an oblivious immune system and unstable genomes, big data points to our failure to design better algorithms and architecture for the internet.


Predicting Gender from First Names

For scalability and privacy reasons, the Internet’s designers intentionally left identity out of the architecture. Today identity (and trust) is provisioned at the application layer by apps like Facebook that require users to reveal their true identity. This revelation minimally includes the user’s real name (the situation is a little different on Twitter). What is interesting is that names carry (and unintentionally leak) a surprising amount of information, including the gender, age, ethnicity and even birth location of the individual. So the next time you tag a friend’s photo on Facebook with their name, think about what information you are revealing (yes, there are image recognition algorithms out there that use the name to help in the recognition task).

The problem we wanted to solve can be stated simply: “given a name, design an algorithm that can predict the gender, ethnicity and age of the individual”. In this post we will describe the path we took to solve the first (simpler) part of the problem, predicting the gender. A future post will describe the other predictions.

The problem sounds simple enough: just do a dictionary lookup. However, there are some corner cases that add complexity. For instance, researchers at CMU using the United States Social Security baby name database as their data source found that:


there is nearly twice the diversity in the names selected for females (entropy of first names, given female H(p|g = female)=9.20 bits) than for males (H(p|g = male)=8.22 bits). The majority of first names are strongly associated with one gender or the other. The entropy of gender is nearly one bit (0.998) but the conditional entropy of gender given first name is only H(g|p = n)=0.055. However, some names are surprisingly gender-neutral. For example, the names “Peyton”, “Finley”, “Kris”, “Kerry” and “Avery” all have near equal probability of being assigned to either a boy or girl.
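
To unpack the notation: these are Shannon entropies computed from name and gender counts. A minimal sketch of the computation, with made-up counts for a gender-neutral name (the CMU data itself is not reproduced here):

import java.util.Map;

public class NameEntropy {
    // Shannon entropy in bits of a discrete distribution given raw counts
    static double entropy(Map<String, Long> counts) {
        double total = counts.values().stream().mapToLong(Long::longValue).sum();
        double h = 0.0;
        for (long c : counts.values()) {
            if (c == 0) continue;
            double p = c / total;
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    public static void main(String[] args) {
        // Hypothetical male/female counts for a name like “Peyton”
        System.out.println(entropy(Map.of("male", 5100L, "female", 4900L))); // ~1 bit
    }
}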


We aggregated a series of name data sources (including the US Social Security baby names database mentioned above) and constructed a distributional model of names from all sources. Then, given a name, an expectation-maximization algorithm predicts the likelihood of each gender for that name. You can access this service as an API via this URL:

http://robustlinks.dyndns.org/api/globeskimmer/genderPredict/?apikey=&name=andrea jones

The output is a JSON object whose “gender” key can take the values “male”, “female” or “unknown”.

{"input": "andrea jones", "calls_remaining_24h": 86, "version": "0.1", "gender": "female"}


We hope you find this API useful. Comments and feedback are, as always, welcome.

Unit Predict – RL’s Demographics API

In the 60s the US Army had a program called “remote viewing”, in which it gathered together civilians who purportedly had the ability to sense and imagine objects and events at a distant location (read: USSR silos) given only that location. Fascinating work. Now imagine you have to do the same in software. Your task, given only one location data point, is to write an algorithm that solves a simpler problem than the one our Army colleagues faced: what are the characteristics of the individual at that location? Why is this an interesting problem to solve in the first place? Well, the inventory of location data is growing fast, generated at every layer: physical, network, application and content (e.g. EXIF metadata in images, or location data attached to tweets). But what does it really mean to have access to someone’s location? Location-Based Services (LBS) today often use location information only as a key to retrieve and/or match some data that is relevant to the current user location: being recommended nearby bars or restaurants, and so on.

But can we do more than simple retrieval and matching keyed on location? Can we infer something else from location data, something close to, but not as hard as, what our Army colleagues attempted back in the 60s? Our mission at RobustLinks is to turn data into knowledge, and this problem seems like a good fit. The history behind it is long, convoluted and informative, but the short of it is that after scratching our heads we came up with a suite of algorithms that, given only a single geolocation data point from a user, transform that data into a prediction of the demographics of that user along several variables (age, gender, education, profession, income, ethnicity, number of children, home ownership, etc). It is important to note that the algorithm is given only one data point and asked to predict the profile of the person that generated it.

How we do it is mum for now. But we have recently opened up the APIs to these algorithms to the public. This article will give an overview of the first API, unit predict. The next article will describe batch predict, which improves its predictions given a time series of geocodes (along with a radius, aggregation rules, etc.).

Unit predict is simple to use. After you’ve registered for an API key you simply provide:

  • your APIKey, and
  • a single location (latlong)

There are also some advanced optional parameters you can set to constrain the algorithm; I’ll cover those in another posting.

The API returns the most likely demographic profile for a person at that location, along the dimensions mentioned above. You can see the API doc at:


and call it via

http://robustlinks.dyndns.org/api/globeskimmer/unitPredict/?apikey=<your API Key>&lat=31.7&lng=-78


  "homeownership": {"own": 0, "rent": 100}, 
  "gender": {"male": 49, "female": 50}, 
  "age": {"35-44": 4, "18-24": 3, "25-34": 2, "45-64": 64, "65+": 25}, 
  "numberofchildren": {"5 or more": 0, "1": 4, "0": 93, "3": 0, "2": 1, "4": 0}, 
  "income": {"45,000 to 49,999": 1, "40,000 to 44,999": 2, "75,000 to 99,999": 7, "15,000 to 19,999": 1, "200,000 or more": 26, "10,000 to 14,999": 1, "50,000 to 59,999": 2, "30,000 
 to 34,999": 1, "60,000 to 74,999": 8, "100,000 to 124,999": 11, "20,000 to 24,999": 9, "150,000 to 199,999": 2, "25,000 to 29,999": 1, "Less than 10,000": 15, "125,000 to 149,999": 3, "35,000 to 39,999": 2}, 
  "education": {"BD": 26, "AD": 1, "GP": 15, "HS": 43, "SC": 9, "L9": 3}, 
  "employment": {"professional": 20, "media and entertainment": 0, "clerical and labor": 71, "service": 8}, 
  "ethnicity": {"hispanic": 0, "white": 61, "black": 38, "asian": 0}
"prediction": {
  "homeownership": "rent", 
  "gender": "female", 
  "age": "45-64", 
  "numberofchildren": "0", 
  "income": "200,000 or more", 
  "education": "HS", 
  "employment": "clerical and labor", 
  "ethnicity": "white"
 "weighting": "confidence", 
 "version": "0.1", 
 "radius": 5, 
 "calls_remaining_24h": 89, 
 "filterset": 5


Use Cases

This work started while we were trying to build an encryption service on mobile devices, artifacts that are increasingly “leaking” a lot of personal data. Talking to folks in the marketing industry, we discovered that a lot of media provisioning and allocation is done at the level of Nielsen-style DMAs (Designated Market Areas): “buy and distribute this type of media for the Ohio region because the demographics are X, Y, Z”. When we started designing the demographic APIs we were also thinking about application developers who have no authentication logic in their apps. Could they send us their users’ locations and have us provide them with profiles of their users? Or how about ad networks? They are data aggregators, and transforming that data into knowledge of their users would be valuable.

Join the Partner Network

We designed these APIs with some potential use cases in mind, but as the saying goes, “man plans, god laughs”: we’ve already seen API users use the service in unanticipated ways. So if you see a use, feel free to give it a go. Due to resource limitations (and the potentially large volume of incoming data) we’ve had to cap the free service at 100 calls/day. Joining our partner network opens up access beyond that.


Google’s connection to Iran (on English Wikipedia)

How is Google connected to Iran? Querying Wikipedia’s links from the page “Google” to the page “Iran” and running Dijkstra’s shortest path algorithm on the weighted, directed, cyclic link graph gives us this (15 hop) shortest path (a sketch of the search appears after the list):

list of google domains
list of google products
world day against cyber censorship
reporters without borders
concerns and controversies over the 2008 summer olympics
concerns and controversies over the 2010 winter olympics
2010 canada anti-prorogation protests
timeline of the canadian afghan detainee issue
opposition to the war in afghanistan (2001–present)
blowback (intelligence)
abdul qadeer khan
iran–pakistan relations
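
For the curious, here is a minimal sketch of the search step itself: Dijkstra’s algorithm over a link graph stored as an adjacency map. The toy graph and unit weights are illustrative assumptions, not our production index:

import java.util.*;

public class WikiShortestPath {
    // Dijkstra over an adjacency map: page -> (linked page -> edge weight)
    static List<String> shortestPath(Map<String, Map<String, Double>> graph,
                                     String source, String target) {
        Map<String, Double> dist = new HashMap<>();
        Map<String, String> prev = new HashMap<>();
        PriorityQueue<String> pq = new PriorityQueue<>(
                Comparator.comparingDouble(p -> dist.getOrDefault(p, Double.MAX_VALUE)));
        dist.put(source, 0.0);
        pq.add(source);
        while (!pq.isEmpty()) {
            String u = pq.poll();
            if (u.equals(target)) break;
            for (Map.Entry<String, Double> e : graph.getOrDefault(u, Map.of()).entrySet()) {
                double alt = dist.get(u) + e.getValue();
                if (alt < dist.getOrDefault(e.getKey(), Double.MAX_VALUE)) {
                    dist.put(e.getKey(), alt);
                    prev.put(e.getKey(), u);
                    pq.remove(e.getKey()); // crude decrease-key: remove and re-add
                    pq.add(e.getKey());
                }
            }
        }
        // Walk the predecessor chain back from the target to recover the path
        LinkedList<String> path = new LinkedList<>();
        for (String at = target; at != null; at = prev.get(at)) path.addFirst(at);
        return path.getFirst().equals(source) ? path : Collections.emptyList();
    }

    public static void main(String[] args) {
        // Tiny made-up graph; the real one is built from Wikipedia's link dump
        Map<String, Map<String, Double>> g = Map.of(
                "google", Map.of("list of google products", 1.0),
                "list of google products", Map.of("iran", 1.0),
                "iran", Map.of());
        System.out.println(shortestPath(g, "google", "iran"));
    }
}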

Alan Turing


We’ve been indexing the (cumulative) page view counts of Wikipedia for a while now. The usual pattern is that entertainment-related pages (movies in particular) dominate. But to my surprise, today I noticed that Alan Turing was 18th, above Justin Bieber (at 26th) and Tom Cruise (at 27th)!



  1. Main Page 125764992
  2. Undefined 5916363
  3. UEFA Euro 2012 2463627
  4. Fifty Shades of Grey 2400574
  5. Prometheus (film) 2187157
  6. 404 error 1710697
  7. Wiki 1329682
  8. Higgs boson 1327427
  9. Facebook 1315548
  10. The Amazing Spider-Man (2012 film) 1200449
  11. Scientology 1137067
  12. One Direction 1136127
  13. Deaths in 2012 1115729
  14. UEFA Euro 2012 schedule 1108055
  15. Elizabeth II 1102307
  16. The Avengers (2012 film) 1050659
  17. Game of Thrones (TV series) 1044779
  18. Alan Turing 1001883
  19. Andy Griffith 910641
  20. Independence Day (United States) 905349
  21. Moody chart 877351
  22. Mario Balotelli 859874
  23. The Legend of Korra 767574
  24. 2012 Summer Olympics 763270
  25. United States 759609
  26. Justin Bieber 745252
  27. Tom Cruise 731923
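
For those who want to reproduce this at home, a rough sketch of the aggregation over Wikimedia’s public hourly pagecounts dumps might look like the following; it is illustrative, not our indexing pipeline:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;
import java.util.Map;

public class PagecountAggregator {
    public static void main(String[] args) throws Exception {
        // Each line of a pagecounts dump looks like:
        //   en Alan_Turing 42 102400   (project, title, hourly views, bytes)
        // args[0] is the path to a downloaded dump file.
        Map<String, Long> totals = new HashMap<>();
        try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
            String line;
            while ((line = in.readLine()) != null) {
                String[] f = line.split(" ");
                if (f.length < 3 || !f[0].equals("en")) continue; // English Wikipedia only
                totals.merge(f[1], Long.parseLong(f[2]), Long::sum);
            }
        }
        totals.entrySet().stream()
              .sorted((a, b) -> Long.compare(b.getValue(), a.getValue()))
              .limit(25)
              .forEach(e -> System.out.println(e.getKey() + " " + e.getValue()));
    }
}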

Long live Alan Turing.


getMoreLikeThis logic in a SearchComponent (with Solrj)

Recently I needed to search using MoreLikeThis, but not via the MoreLikeThisHandler request handler or the MoreLikeThis search component (which returns an mlt list for each result, which is expensive). What I wanted was to execute a standard search for a query and then use the top result as the input to an mlt search. My final requirement was to provide this functionality inside a search handler itself so that I could add my own logic.

So with a bit of work I managed to get to the following design. Note: the code below is cobbled together for the benefit of this blog entry; it is not tested and is only meant to share the lessons I learnt from the exercise. The crux of the solution is to use MoreLikeThisHelper, a helper class for MoreLikeThis that can be called from other request handlers.

First you need to register your handler (called /test below) in solrconfig.xml:

<requestHandler name="/test" class="solr.SearchHandler">
  <lst name="defaults">
    <str name="defType">dismax</str>
    <str name="q.alt">*:*</str>
    <int name="start">0</int>
    <int name="rows">2000</int>
    <str name="echoParams">all</str>
    <str name="fl">id score</str>
    <str name="qf">content</str>

    <!-- MoreLikeThis defaults read by the component below -->
    <str name="mlt.match.include">true</str>
    <str name="mlt.fl">content</str>
    <int name="mlt.mintf">3</int>
    <int name="mlt.mindf">1</int>
  </lst>
  <arr name="last-components">
    <str>customComponent</str>
  </arr>
</requestHandler>

<!-- ...and register the component itself -->
<searchComponent name="customComponent" class="com.abc.CustomComponent"/>

Next, we define the actual component by extending SearchComponent and putting the handler logic in the overridden process() method (see the MoreLikeThisHandler source for how the stock handler implements its logic):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.search.Query;
import org.apache.solr.common.SolrException;
import org.apache.solr.common.params.CommonParams;
import org.apache.solr.common.params.MoreLikeThisParams;
import org.apache.solr.common.params.MoreLikeThisParams.TermStyle;
import org.apache.solr.common.params.SolrParams;
import org.apache.solr.handler.MoreLikeThisHandler;
import org.apache.solr.handler.component.ResponseBuilder;
import org.apache.solr.handler.component.SearchComponent;
import org.apache.solr.search.*;
import org.apache.solr.util.SolrPluginUtils;

public class CustomComponent extends SearchComponent {

  @Override
  public void prepare(ResponseBuilder rb) { /* nothing to prepare */ }

  @Override
  public void process(ResponseBuilder rb) throws IOException {
    SolrParams params = rb.req.getParams();
    String q = params.get(CommonParams.Q);
    SolrIndexSearcher searcher = rb.req.getSearcher();
    List<Query> filters = rb.getFilters();

    // Number of mlt results to return; falls back to the standard rows param
    String defVectorSize = params.get(CommonParams.ROWS, "10");
    int vectorSize = Integer.parseInt(params.get("vectorSize", defVectorSize));

    String defType = params.get(QueryParsing.DEFTYPE);
    defType = defType == null ? QParserPlugin.DEFAULT_QTYPE : defType;

    String fl = params.get(CommonParams.FL);
    int start = params.getInt(CommonParams.START, 0);
    int flags = 0;
    if (fl != null) {
      flags |= SolrPluginUtils.setReturnFields(fl, rb.rsp);
    }

    // Hold on to the interesting terms if relevant
    TermStyle termStyle = TermStyle.get(params.get(MoreLikeThisParams.INTERESTING_TERMS));
    List<MoreLikeThisHandler.InterestingTerm> interesting = (termStyle == TermStyle.NONE)
        ? null : new ArrayList<MoreLikeThisHandler.InterestingTerm>();

    MoreLikeThisHandler.MoreLikeThisHelper mlt =
        new MoreLikeThisHandler.MoreLikeThisHelper(params, searcher);

    // Matching options
    boolean includeMatch = params.getBool(MoreLikeThisParams.MATCH_INCLUDE, true);
    int matchOffset = params.getInt(MoreLikeThisParams.MATCH_OFFSET, 0);

    DocListAndSet mltDocs = null;
    try {
      // Run the standard query first, keeping only the single top hit
      Query query = QParser.getParser(q, defType, rb.req).parse();
      DocList tophit = searcher.getDocList(query, filters, null, matchOffset, 1, flags);
      if (includeMatch) {
        rb.rsp.add("match", tophit);
      }
      // This is an iterator, but we only use the first match
      DocIterator iterator = tophit.iterator();
      if (iterator.hasNext()) {
        // ...and feed that top document into a MoreLikeThis query
        int id = iterator.nextDoc();
        mltDocs = mlt.getMoreLikeThis(id, start, vectorSize, filters, interesting, flags);
      } else {
        throw new SolrException(SolrException.ErrorCode.BAD_REQUEST,
            "MoreLikeThis requires either a query (?q=) or text to find similar documents.");
      }
    } catch (ParseException e) {
      // handle error logic: a malformed query is the caller's fault
      throw new SolrException(SolrException.ErrorCode.BAD_REQUEST, e);
    }

    if (mltDocs == null) {
      mltDocs = new DocListAndSet(); // avoid an NPE downstream
    }
    rb.rsp.add("response", mltDocs.docList);
  }

  @Override public String getDescription() { return "mlt on the top hit of a standard query"; }
  @Override public String getSource() { return null; }
  @Override public String getSourceId() { return null; }
  @Override public String getVersion() { return "1.0"; }
}
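
Finally, since the title promised SolrJ, here is a hedged sketch of calling the /test handler from a SolrJ 3.x client (host, core URL and query text are placeholders):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class TestHandlerClient {
    public static void main(String[] args) throws Exception {
        // Assumes a Solr 3.x core at this URL with the /test handler registered
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery query = new SolrQuery("climate change");
        // In SolrJ 3.x a leading "/" in qt routes the request to that handler path
        query.setQueryType("/test");
        System.out.println(server.query(query).getResults()); // docs similar to the top hit
    }
}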

Comments welcomed.

Ms Meeker does not live in Manhattan

Came across Ms Meeker’s slides today. Then and Now?



I have known at least three companies trying to solve this problem, but the fact remains: the old-fashioned way is still the best. It is not a technological problem. Demand and supply are known by both parties. At Greenwich and Washington Streets you step out of your door and there is a cab right away. Why? Because cabbies know there are people there who can afford cabs and likely want to go to the airport.

Context matters.

Solr’s namespaces

I have to admit, moving from Lucene (3.2) to Solr (3.5) has been very painful. The parameter namespaces are incredibly cognitively taxing (qf, fq, fl, {!tag=..}, …), colorfully said to be more of a communication language than a programming language. After painfully committing all the acronyms to memory, the next problem is to map the SolrJ client namespace onto the same URL parameter namespace. And method names like this really don’t help:

setGetFieldStatistics(boolean v) 

For the benefit of Solr newbies, here is an (incomplete) list of the parameter namespaces I gathered (mainly from LucidWorks). I also found this very useful request-processing pipeline flow from LucidWorks:

Flow from LucidWorks

“Common Query” Parameters

The table below summarizes Solr’s common query parameters, which are supported by the Standard, DisMax, and eDisMax Request Handlers. Lucid Imagination strongly recommends that any future SolrRequestHandlers support these parameters, as well.

Parameter Description
defType Selects the query parser to be used to process the query.
sort Sorts the response to a query in either ascending or descending order based on the response’s score or another specified characteristic.
start  Specifies an offset (by default, 0) into the responses at which Solr should begin displaying content.
rows  Controls how many rows of responses are displayed at a time (default value: 10)
fq  Applies a filter query to the search results.
fl  Limits the query’s responses to a listed set of fields.
debugQuery Causes Solr to include additional debugging information in the response, including “explain” information for each of the documents returned. Note that this parameter takes effect if it is present, regardless of its setting.
explainOther Allows clients to specify a Lucene query to identify a set of documents. If non-blank, the explain info of each document which matches this query, relative to the main query (specified by the q parameter) will be returned along with the rest of the debugging information.
timeAllowed Defines the time allowed for the query to be processed. If the time elapses before the query response is complete, partial information may be returned.
omitHeader Excludes the header from the returned results, if set to true. The header contains information about the request, such as the time the request took to complete. The default is false.
wt Specifies the Response Writer to be used to format the query response.
cache=false By default, Solr caches the results of all queries and filter queries. Set cache=false to disable caching of the results of a query.
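
To make the table concrete, a typical request combining several of these parameters might look like the following (host, core and field names are illustrative):

http://localhost:8983/solr/select?q=ipod&fq=inStock:true&fl=id,name,score&sort=price asc&start=0&rows=10&debugQuery=true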

DisMax Parameters

In addition to the common request parameters, highlighting parameters, and simple facet parameters, the DisMax query parser supports the parameters described below. Like the standard query parser, the DisMax query parser allows default parameter values to be specified in solrconfig.xml, or overridden by query-time values in the request.

Parameter Description
q  Defines the raw input strings for the query.
q.alt Calls the standard query parser and defines query input strings, when the q parameter is not used.
qf Query Fields: specifies the fields in the index on which to perform the query.
mm Minimum “Should” Match: specifies a minimum number of fields that must match in a query.
pf  Phrase Fields: boosts the score of documents in cases where all of the terms in the q parameter appear in close proximity.

ps Phrase Slop: specifies the number of positions two terms can be apart in order to match the specified phrase.

qs Query Phrase Slop: specifies the number of positions two terms can be apart in order to match the specified phrase. Used specifically with the qf parameter.

tie Tie Breaker: specifies a float value (which should be something much less than 1) to use as tiebreaker in DisMax queries.

bq Boost Query: specifies a factor by which a term or phrase should be “boosted” in importance when considering a match.

bf Boost Functions: specifies functions to be applied to boosts. (See the function query documentation for details.)
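
To close, here is a hedged SolrJ sketch that sets several of the DisMax parameters above; the field names and boosts are made up for illustration:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.CommonsHttpSolrServer;

public class DisMaxExample {
    public static void main(String[] args) throws Exception {
        CommonsHttpSolrServer server = new CommonsHttpSolrServer("http://localhost:8983/solr");
        SolrQuery q = new SolrQuery("solr namespaces");
        q.set("defType", "dismax");
        q.set("qf", "title^2 content"); // query fields, title boosted
        q.set("mm", "2");               // at least two clauses must match
        q.set("pf", "content");        // phrase-boost documents matching in content
        q.set("ps", "3");              // phrase slop for pf
        q.set("tie", "0.1");           // tiebreaker between field scores
        q.setStart(0);
        q.setRows(10);
        System.out.println(server.query(q).getResults());
    }
}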