Stan version of multiple logistic regression


Francis Campos
I was doing logistic regression with a data set large enough (~25,000 rows) that the JAGS code was annoyingly slow. Since data sets larger than this are common for me, I decided to look into Stan and found that the translation of MultipleLogisticRegressionJags.R wasn't very difficult. The main changes are:
- the model code
- not using as.matrix to read in the y values from the CSV
- the initialization of the chains
- the structure of the array of posterior samples

I had made other changes to the code before I came across this forum and decided to post it. Some variable names are different because they made more sense to me, and I changed the display of the autocorrelation graphs to one window per chain.

I was mostly interested in speeding up the calculation and found that, with an equal number of burn-in (warmup) and final saved steps, a run that took 54 minutes with JAGS took 3.5 minutes with Stan. I do not claim that is a perfectly fair or complete comparison: it includes the compile time of the C++ code in Stan and 500 adaptation steps in JAGS.
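
For a cleaner comparison, the one-time C++ compilation can be separated from the sampling. A minimal sketch using rstan's stan_model() and sampling() functions (compiledModel and fit are just illustrative names; dataList and the sampler settings are defined in the code below):

compiledModel <- stan_model(file = "model.txt")   # pays the C++ compile cost once
fit <- sampling(compiledModel, data = dataList, iter = iter,
                chains = nChains, thin = thin, warmup = warm)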

I am new to R, JAGS, and Stan, so my code may be filled with barbaric methods of data manipulation.

One impediment to selling Bayesian methods to my coworkers has been the long calculation times. Stan seems like a big step in the right direction from that point of view.

Francis Campos

CODE

# Adapted to Stan by Francis Campos. Originally MultipleLogisticRegressionJags.R,
# from Kruschke, J. K. (2011). Doing Bayesian Data Analysis:
# A Tutorial with R and BUGS. Academic Press / Elsevier.
graphics.off()
rm(list=ls(all=TRUE))
#fileNameRoot="MultipleLogisticRegr" # for constructing output filenames
if ( .Platform$OS.type != "windows" ) {
  windows <- function( ... ) X11( ... )
}
require(rstan)
require(rjags)  # loaded only for coda's autocorr.plot() function
#------------------------------------------------------------------------------
# THE MODEL.
modelstring = "
data {
  int<lower=0> nPredictors;
  int<lower=0> nData;
  int<lower=0,upper=1> y[nData];
  matrix[nData,nPredictors] x;
}
parameters {
  real z0;
  vector[nPredictors] z;
}
model {
  z0 ~ normal(0,3.2);
  for (d in 1:nPredictors)
      z[d] ~ normal(0,2);
  for (n in 1:nData)
    y[n] ~ bernoulli(inv_logit(z0 + x[n] * z));
}
" # close quote for modelstring
writeLines(modelstring,con="model.txt")
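
# Side note: Stan also provides a bernoulli_logit distribution that folds the
# inv_logit into the likelihood; it is numerically more stable and lets both
# loops be vectorized. A sketch of an equivalent model block:
#   model {
#     z0 ~ normal(0,3.2);
#     z ~ normal(0,2);
#     y ~ bernoulli_logit(z0 + x * z);
#   }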

#------------------------------------------------------------------------------
# THE DATA.

fileNameRoot = "StanLearn"
dataFile = "My.csv"
dataMat = read.csv(file=dataFile)
predictedName = "P_F"
predictorNames = c( "B", "H", "L", "A")
nData = NROW( dataMat )
#y = as.matrix( dataMat[,predictedName] )
y = dataMat[,predictedName]
x = as.matrix( dataMat[,predictorNames] )
nPredictors = NCOL( x )

nChains = 2
thin = 3
warm = 500
iter = 2000
# Re-center data at mean, to reduce autocorrelation in MCMC sampling.
# Standardize (divide by SD) to make initialization easier.
standardizeCols = function( dataMat ) {
    zDataMat = dataMat
    for ( colIdx in 1:NCOL( dataMat ) ) {
        mCol = mean( dataMat[,colIdx] )
        sdCol = sd( dataMat[,colIdx] )
        zDataMat[,colIdx] = ( dataMat[,colIdx] - mCol ) / sdCol
    }
    return( zDataMat )
}
zx = standardizeCols( x )
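# Note: base R's scale() does the same column-wise standardization,
# e.g. zx <- scale(x), and also stores the column means and SDs as attributes.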
#zy = y  # y is not standardized; it must remain 0/1

dataList = list(nPredictors = nPredictors, nData = nData, y = y, x = zx)
#------------------------------------------------------------------------------
# INITIALIZE THE CHAINS.

glmInfo = glm( dataList$y ~ dataList$x , family=binomial(logit) ) # R func.
show( glmInfo ) ; flush.console() # display in case glm() has troubles
z0Init = glmInfo$coef[1]
zInit = glmInfo$coef[-1]
initsList =  list( z0 = z0Init , z = zInit )
initsChains <- list()
for (i in 1:nChains) {
  initsChains[[i]] <- initsList
}

#Read the model either directly from modelstring or from a file.
#logistic_out2 <- stan(model_code = modelstring, data = dataList, iter = iter,
#                      chains = nChains, thin=thin, warmup = warm)
logistic_out2 <- stan(file = 'model.txt', data = dataList, iter = iter,
                      chains = nChains, thin=thin, warmup = warm, init = initsChains)

# extract() is an rstan function. With permuted=FALSE it returns an array with
# dimensions [iteration, chain, parameter]. The last parameter is lp__.
chainData <- extract(logistic_out2, permuted=FALSE, inc_warmup=FALSE)
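# With the settings above, dim(chainData) is c((iter-warm)/thin, nChains,
# nPredictors+2) = c(500, 2, 6); the parameters are z0, z[1..nPredictors], lp__.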
checkConvergence = T
if ( checkConvergence ) {
  print( logistic_out2 )
#  windows()
#  plot( codaSamples , ask=F )
  for (j in 1:nChains) {
    windows()
    autocorr.plot( chainData[,j,] , ask=TRUE ) # autocorr.plot() is from coda (loaded via rjags)
  }
}
# Extract and merge chain data by parameter
z0 <- chainData[,1,1]
lastPredIndx = nPredictors + 1
zb <- chainData[,1,2:lastPredIndx]
if (nChains > 1) {
  for (j in 2:nChains){
    # Concatenate z0 with c(), not rbind(): linear indexing of an rbind-ed
    # matrix would interleave the chains and mismatch the rows of zb below.
    z0 <- c(z0,chainData[,j,1])
    zb <- rbind(zb,chainData[,j,2:lastPredIndx])
  }
}
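# An equivalent one-step merge (a sketch assuming rstan's as.matrix() method for
# stanfit objects, which returns all post-warmup draws with named columns):
#   allDraws <- as.matrix(logistic_out2)   # [iterations*chains, parameters]
#   z0 <- allDraws[,"z0"]
#   zb <- allDraws[, paste("z[", 1:nPredictors, "]", sep="")]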
chainLength <- length(z0)
# Convert to original scale:
x = dataMat[,predictorNames]
y = dataMat[,predictedName]
Mx = apply(x,2,mean)
SDx = apply(x,2,sd)
b0 = 0 * z0
b = 0 * zb
for ( stepIdx in 1:chainLength ) {
  b0[stepIdx] = ( z0[stepIdx]
                        - sum( Mx / SDx * zb[stepIdx,] ) )
  for ( j in 1:nPredictors ) {                      
    b[stepIdx,j] = zb[stepIdx,j] / SDx[j]
  }
}
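# Why this works: on the standardized scale the model is
#   logit(p) = z0 + sum_j z_j * (x_j - Mx_j) / SDx_j
#            = ( z0 - sum_j z_j*Mx_j/SDx_j ) + sum_j ( z_j/SDx_j ) * x_j ,
# so matching coefficients on the original scale gives
#   b0 = z0 - sum( Mx/SDx * zb )   and   b_j = z_j / SDx_j .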

source("plotPost.R") #must also hae HDI0fMCMC.R in directory

# Examine sampled values, z scale:
windows()
thinIdx = ceiling(seq(1,chainLength,length=700))
pairs(  cbind( z0[thinIdx] , zb[thinIdx,] )  ,
        labels=c( "z0", paste("zb",predictorNames,sep="") ) )
# Examine sampled values, original scale:
windows()
pairs( cbind( b0[thinIdx] , b[thinIdx,] ) ,
       labels=c( "b0", paste("b_",predictorNames,sep="") ) )
savePlot(file=paste(fileNameRoot,"PostPairs.jpg",sep=""),type="jpg")

# Display the posterior in a window with 2 rows of panels.
if ((nPredictors %% 2 )==1) {nCells = nPredictors + 1} else {nCells = nPredictors + 2}
windows(3.5*(nCells / 2),5.5)
layout( matrix(1:(nCells),nrow=2) )
histInfo = plotPost( b0 , credMass = 0.99, xlab="b0 Value" , compVal=NULL , breaks=30 ,
                     main=paste( "logit(p(", predictedName ,
                                 "=1)) when predictors = zero" , sep="" ) )
for ( bIdx in 1:nPredictors ) {
  histInfo = plotPost( b[,bIdx] , credMass = 0.99, xlab=paste("b",bIdx," Value",sep="") ,
                       compVal=0.0 , breaks=30 ,
                       main=paste(predictorNames[bIdx]) )
}
savePlot(file=paste(fileNameRoot,"PostHist.jpg",sep=""),type="jpg")

# Plot data with .5 level contours of believable logistic surfaces.
# The contour lines are best interpreted when there are only two predictors.
showContours = FALSE
if (showContours) {
for ( p1idx in 1:(nPredictors-1) ) {
  for ( p2idx in (p1idx+1):nPredictors ) {
    windows()
    xRange = range(x[,p1idx])
    yRange = range(x[,p2idx])
    # make empty plot
    plot( NULL , NULL , main=predictedName , xlim=xRange , ylim=yRange ,
          xlab=predictorNames[p1idx] , ylab=predictorNames[p2idx] )
    # Some of the 50% level contours from the posterior sample.
    for ( chainIdx in ceiling(seq( 1 , chainLength , length=20 )) ) {
      abline( -( b0[chainIdx]
                 + if (nPredictors>2) {
                   # sum() so that abline() gets a scalar intercept; the
                   # other predictors are held at their means.
                   sum( b[chainIdx,c(-p1idx,-p2idx)]*Mx[c(-p1idx,-p2idx)] )
                 } else { 0 } )
              / b[chainIdx,p2idx] ,
              -b[chainIdx,p1idx]/b[chainIdx,p2idx] ,
              col="grey" , lwd = 2 )
    }
    # The data points:
    for ( yVal in 0:1 ) {
      rowIdx = ( y == yVal )
      points( x[rowIdx,p1idx] , x[rowIdx,p2idx] , pch=as.character(yVal) ,
              cex=1.75 )
    }
    #    savePlot(file=paste(fileNameRoot,"PostContours",p1idx,p2idx,".jpg",sep=""),type="jpg")
  }
}
}

Re: Stan version of multiple logistic regression

John K. Kruschke
This is terrific. Thanks very much for posting the Stan code. Translations into Stan are on my to-do list, and it's a big help to see examples from you and others. I hope others will follow your lead and post Stan versions of the programs.

Also, the runjags package is a nice speed-up for JAGS because it can run the chains on parallel processors; with four chains on four processors, the run time is about 1/4 of the original. That's not nearly the speed-up you report with Stan, but it's still nice because of the minimal changes needed in coding and learning.
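
For instance, a minimal sketch (assuming a JAGS model string jagsModelString and a data list like the one in your code; run.jags() and as.mcmc.list() are from the runjags package):

library(runjags)
runjagsOut <- run.jags( model = jagsModelString , monitor = c("z0","z") ,
                        data = dataList , n.chains = 4 , method = "parallel" ,
                        burnin = 500 , sample = 500 , thin = 3 )
codaSamples <- as.mcmc.list( runjagsOut )  # convert for coda-based diagnostics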

Thanks again for your post.

John K. Kruschke, Professor
Doing Bayesian Data Analysis
The book: http://www.indiana.edu/~kruschke/DoingBayesianDataAnalysis/

The blog: http://doingbayesiandataanalysis.blogspot.com/