Thursday, May 22, 2014

didYouMean() Function: Using Google to correct errors in Strings

A function that takes a string as input and returns Google's "Did you mean..." or "Showing results for..." suggestion from google.com. Good for misspelled names or locations.


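library(RCurl)
## If on Windows, you might need:
## options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))
didYouMean <- function(input) {
  input <- gsub(" ", "+", input, fixed = TRUE)
  ## The trailing "/" matters: the parsing below looks for "/&" to find the end of the suggestion.
  doc <- getURL(paste("https://www.google.com/search?q=", input, "/", sep = ""))

  ## Positions of each phrase in the page; gregexpr() returns -1 when there is no match.
  dym <- gregexpr("Did you mean", doc, fixed = TRUE)
  srf <- gregexpr("Showing results for", doc, fixed = TRUE)

  ## Pull the suggested query out of the first link at or after position `start`.
  extract <- function(start) {
    doc2 <- substring(doc, start, start + 1000)
    ## Literal matches; the original pattern "?q=" began with a regex quantifier.
    s1 <- regexpr("q=", doc2, fixed = TRUE)
    s2 <- regexpr("/&", doc2, fixed = TRUE)
    gsub("+", " ", substring(doc2, s1 + 2, s2 - 1), fixed = TRUE)
  }

  if (length(dym[[1]]) > 1) {            ## "Did you mean" occurs more than once
    extract(dym[[1]][1])
  } else if (srf[[1]][1] != -1) {        ## "Showing results for" occurs at least once
    extract(srf[[1]][1])
  } else {
    gsub("+", " ", input, fixed = TRUE)  ## no suggestion found: return the input
  }
}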
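
To see what the parsing step is doing without hitting the network, here is a minimal sketch run against a made-up fragment of HTML. The fragment and the helper name extractSuggestion are assumptions for illustration only; Google's real markup may look different and can change at any time.

```r
# Hypothetical helper mirroring the parsing inside didYouMean():
# find the marker phrase, then read the text between "q=" and "/&".
extractSuggestion <- function(doc, marker) {
  start <- regexpr(marker, doc, fixed = TRUE)[1]
  if (start == -1) return(NA_character_)      # marker phrase not present
  window <- substring(doc, start, start + 1000)
  s1 <- regexpr("q=", window, fixed = TRUE)[1]
  s2 <- regexpr("/&", window, fixed = TRUE)[1]
  gsub("+", " ", substring(window, s1 + 2, s2 - 1), fixed = TRUE)
}

# Made-up page fragment (an assumption, not captured from Google).
html <- 'Did you mean: <a href="/search?q=george+washington/&amp;spell=1">george washington</a>'

extractSuggestion(html, "Did you mean")
```

On this fragment the helper returns "george washington": the "+" signs from the link are turned back into spaces, and the "/&" left over from the trailing slash marks where the suggestion ends.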

So didYouMean("gorecge washington") returns "george washington"


Works well with misspelled companies, nouns, or phrases. For example, say you're doing text analysis on Twitter and a customer raves about Carlsberg beer. The only problem is that he's enjoying their product while tweeting (something that happens only rarely, I'm sure) and wrote "clarsburg gprou". Not to worry!

> didYouMean("clarsburg gprou")
[1] "carlsberg group"

Or suppose you have a three-phase plan for profits. This can help you get there!

> didYouMean("clletc nuderpants")
[1] "collect underpants"
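
If you are cleaning a whole column of names, many values will repeat, so it may be worth querying each distinct string only once. Here is a rough sketch of that idea. Both correctAll and fakeCorrector are hypothetical names; in practice you would pass didYouMean as the corrector, and the stand-in below exists only so the example runs offline.

```r
# Hypothetical helper: apply a corrector to each distinct string once,
# then map the results back onto the original vector.
correctAll <- function(x, corrector) {
  u <- unique(x)
  fixed <- vapply(u, corrector, character(1), USE.NAMES = FALSE)
  fixed[match(x, u)]
}

# Offline stand-in for didYouMean(); it only knows one correction.
fakeCorrector <- function(s) {
  if (s == "clarsburg gprou") "carlsberg group" else s
}

tweets <- c("clarsburg gprou", "george washington", "clarsburg gprou")
correctAll(tweets, fakeCorrector)
```

With the stand-in, the corrector is called twice rather than three times, and both copies of "clarsburg gprou" come back as "carlsberg group".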

Saturday, May 17, 2014

Modelling This Time is Different: Corrected

New PDF
New Code


I made two errors in my previous post.

The first is that I put a probability inside the utility function. Generally this is a no-no, because expected utility already weights each outcome by its probability: E[utility] = sum over i of P(outcome i)*U(outcome i). I therefore changed it to a simpler maximization problem (no Lagrange multipliers necessary) where the individual maximizes E[Profits].

The second problem has to do with maximizing when the parameters are uncertain. Since I sampled from the joint posterior distribution of the unknown parameters, I had a number of draws from that distribution. What I did in the previous analysis was maximize for each pair of simulated draws individually, and then average over these maximized results to get what I thought was the optimal result. In general, this method does not give the optimal value. I should have maximized over all pairs simultaneously. Basically, I computed E[max over s of f(s,p)] instead of max over s of E[f(s,p)].
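
The difference between the two orderings is easy to see on a toy problem. The sketch below is not the model from the post: the objective f, the lognormal draws standing in for posterior samples, and the search interval are all assumptions chosen so that the answer is known in closed form.

```r
set.seed(1)
p <- rlnorm(1000)                     # stand-in for posterior draws (assumption)
f <- function(s, p) p * s - exp(s)    # toy objective, not the model from the post

# Previous approach: maximize over s for each draw separately, then average.
s_each <- sapply(p, function(draw) {
  optimize(function(s) f(s, draw), interval = c(-10, 10), maximum = TRUE)$maximum
})
s_avg <- mean(s_each)

# Corrected approach: choose one s that maximizes the average over all draws.
s_joint <- optimize(function(s) mean(f(s, p)), interval = c(-10, 10),
                    maximum = TRUE)$maximum

c(average_of_optimizers = s_avg, optimizer_of_average = s_joint)
```

For this toy objective the best s for a single draw is log(p), so the first number is the sample mean of log(p) while the second is the log of the sample mean of p. By Jensen's inequality the two differ whenever the draws vary, which is the mistake described above in miniature.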

In general, the results are superficially similar to my original analysis. Even if the results are largely the same, it's best to describe my mistakes upfront and avoid awkward questions later.