Thursday, May 12, 2016

When the Predictions are more accurate than the Response

The methodology I propose is to use the original document-level classification to build a model and then use that model to re-classify the individual paragraphs. The code below suggests this method increases classification accuracy.
##Assume there are two topic classes. Each paragraph has a numeric feature - in this case a random normal draw - and the higher the value, the higher the probability that the paragraph belongs to one particular topic. The feature could be thought of as a word frequency

##create documents - each document consists of 100 paragraphs 
documents=lapply(1:100, function(x) rnorm(100))

##create topics for each paragraph - 100 documents with 100 paragraphs each
paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))

## a document consists of many paragraphs, but the document as a whole is assigned the most common paragraph topic (an exact 50-50 split counts as the FALSE class here)
document_classes=(sapply(paragraph_classes, function(x) sum(x>0) )>50)

unlisted_documents=unlist(documents)
unlisted_class=rep(document_classes, each=100)

##the original correct classification rate - how often a paragraph's own topic matches its document's label - is around 54%
table(unlist(paragraph_classes),unlisted_class)
##    unlisted_class
##     FALSE TRUE
##   0  2643 2353
##   1  2257 2747
sum(diag(table(unlist(paragraph_classes),unlisted_class)))/length(unlisted_class)
## [1] 0.539
##build a model. each paragraph's response variable is the class originally assigned to its document
fit=glm(unlisted_class~unlisted_documents, family="binomial")

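As a quick sanity check on the fit, the slope on the paragraph feature should come out positive, and predict() with type="response" returns each paragraph's fitted probability of the TRUE class. A minimal sketch along those lines (the name fitted_probs is only for illustration):

##coefficient table - the slope on unlisted_documents should be positive
summary(fit)$coefficients

##fitted per-paragraph probabilities of the TRUE class
fitted_probs=predict(fit, type="response")
summary(fitted_probs)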

##the predicted classifications from applying the model back to the data - the accuracy is roughly 13 percentage points higher than the original classifications (0.674 vs. 0.539)
predicted_values=scale(predict(fit))
table(predicted_values>0,unlist(paragraph_classes)>.5)
##        
##         FALSE TRUE
##   FALSE  3386 1647
##   TRUE   1610 3357
sum(diag(table(predicted_values>0,unlist(paragraph_classes)>.5)))/length(unlisted_class)
## [1] 0.6743
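
To check that the gap is not an artifact of this particular draw, the whole simulation can be wrapped in a function and repeated. A minimal sketch along those lines - the helper name run_sim, the variable names inside it, and the 20 repetitions are arbitrary choices:

##wrap the simulation above in a function so it can be rerun
run_sim=function() {
  documents=lapply(1:100, function(x) rnorm(100))
  paragraph_classes=lapply(documents, function(x) rbinom(100,size=1,prob=1/(1+exp(-x))))
  document_classes=(sapply(paragraph_classes, function(x) sum(x>0))>50)
  feature=unlist(documents)
  label=rep(document_classes, each=100)
  truth=unlist(paragraph_classes)
  fit=glm(label~feature, family="binomial")
  ##label accuracy: how often the document label matches the paragraph topic
  ##model accuracy: how often the model's prediction matches the paragraph topic
  c(label=mean(truth==label),
    model=mean((as.vector(scale(predict(fit)))>0)==(truth>0.5)))
}

##average both accuracy figures over repeated simulations
rowMeans(replicate(20, run_sim()))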