In this session, we’ll see how to carry out supervised classification with a Naive Bayes classifier and Support Vector Machines. The packages we use are the same as before; the only new ones to install are e1071 and quanteda.textmodels.

install.packages("e1071")
install.packages("quanteda.textmodels")
library(readr)
library(dplyr)
library(quanteda)
library(quanteda.textmodels)
library(e1071)

We will use random sampling, so to get reproducible results let’s set our random seed.

set.seed(042)

We will replicate the spam/ham classification exercise with real data. The data comes from Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset/data

We have 5572 text messages, around 87% of which are ham and the rest spam. The objective: predict whether a text is ham or spam!

spam_df <- read_csv("data/spam.csv") %>% 
    select(1:2) %>% 
    rename(type = v1, sms = v2)
#> Warning: Missing column names filled in: 'X3' [3], 'X4' [4], 'X5' [5]



spam_df %>% 
    group_by(type) %>% 
    summarise(n = n()) %>% 
    mutate(freq = n / sum(n))
#> # A tibble: 2 x 3
#>   type      n  freq
#>   <chr> <int> <dbl>
#> 1 ham    4825 0.866
#> 2 spam    747 0.134

1 Using Naive Bayes

Put our data frame into quanteda and create a corpus.

spam_corpus <- corpus(spam_df, text_field = "sms")

summary(spam_corpus, 5)
#> Corpus consisting of 5572 documents, showing 5 documents:
#> 
#>   Text Types Tokens Sentences type
#>  text1    22     29         3  ham
#>  text2     7     12         2  ham
#>  text3    33     37         2 spam
#>  text4    10     17         2  ham
#>  text5    13     14         1  ham

Before we do anything else, we should separate our data into a training and a test set. There are no hard rules for the ratio, so we will simply put a random 80% of our data into the training set and the remaining 20% into the test set. The process:

  1. Create a numeric id for our documents (simple row numbers)
  2. Generate 4457 random numbers (between 1 and 5572) without replacement
  3. Subset the corpus according to the random numbers.

docvars(spam_corpus, "id") <- 1:ndoc(spam_corpus)

summary(spam_corpus, 10)
#> Corpus consisting of 5572 documents, showing 10 documents:
#> 
#>    Text Types Tokens Sentences type id
#>   text1    22     29         3  ham  1
#>   text2     7     12         2  ham  2
#>   text3    33     37         2 spam  3
#>   text4    10     17         2  ham  4
#>   text5    13     14         1  ham  5
#>   text6    35     37         4 spam  6
#>   text7    15     18         2  ham  7
#>   text8    28     32         2  ham  8
#>   text9    29     33         5 spam  9
#>  text10    29     31         3 spam 10

Now generate the random numbers:

id_train <- sample(1:ndoc(spam_corpus), 0.8 * ndoc(spam_corpus), replace = FALSE)

head(id_train)
#> [1]  634 2097 5248 5423 3911  356

Subset our corpus into training and test sets and create the dfms.

training <- corpus_subset(spam_corpus, id %in% id_train)

test <- corpus_subset(spam_corpus, !id %in% id_train)

training_dfm <- dfm(tokens(training))

test_dfm <- dfm(tokens(test))


# check if the subset happened as we wanted it.
cat("The training set has", ndoc(training), "documents")
#> The training set has 4457 documents

cat("The test set has", ndoc(test), "documents")
#> The test set has 1115 documents

IMPORTANT: always separate your data into training and test sets before running your model, so the classifier does not have any knowledge about your test set.

Without further ado, let’s train our Naive Bayes model. We’ll use textmodel_nb() from the quanteda.textmodels package.

nb_train <- textmodel_nb(training_dfm, y = docvars(training_dfm, "type"), smooth = 1)

summary(nb_train)
#> 
#> Call:
#> textmodel_nb.dfm(x = training_dfm, y = docvars(training_dfm, 
#>     "type"), smooth = 1)
#> 
#> Class Priors:
#> (showing first 2 elements)
#>  ham spam 
#>  0.5  0.5 
#> 
#> Estimated Feature Scores:
#>            go     until    jurong     point       ,     crazy       . available
#> ham  0.002512 0.0002473 2.603e-05 1.562e-04 0.01481 0.0001432 0.09523 0.0001692
#> spam 0.001022 0.0001893 3.786e-05 3.786e-05 0.01124 0.0001514 0.03699 0.0001514
#>         only       in     bugis         n     great     world        la
#> ham  0.00138 0.008409 1.041e-04 0.0013667 0.0010413 3.775e-04 9.112e-05
#> spam 0.00284 0.002234 3.786e-05 0.0003408 0.0003408 7.572e-05 3.786e-05
#>              e    buffet      cine     there       got     amore       wat
#> ham  0.0008070 3.905e-05 1.041e-04 0.0020566 0.0024081 2.603e-05 1.067e-03
#> spam 0.0003029 3.786e-05 3.786e-05 0.0004165 0.0003029 3.786e-05 7.572e-05
#>             ok       lar    joking       wif        u       oni      free
#> ham  0.0027856 4.165e-04 6.508e-05 3.124e-04 0.010374 6.508e-05 0.0005988
#> spam 0.0001893 3.786e-05 3.786e-05 3.786e-05 0.005301 3.786e-05 0.0073073
#>          entry
#> ham  1.302e-05
#> spam 9.465e-04
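
The class priors above are 0.5/0.5 because textmodel_nb() defaults to uniform priors. If we instead wanted priors that reflect the class distribution of the training documents, we could set prior = "docfreq". A quick sketch (not run here):

# same model, but with document-frequency class priors instead of uniform ones
nb_train_docfreq <- textmodel_nb(training_dfm,
                                 y = docvars(training_dfm, "type"),
                                 smooth = 1, prior = "docfreq")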

Now we are ready to predict labels for our test set. We should also create a confusion matrix to quickly see how well our model did. (Naive Bayes can only take features into consideration that occur in both the training set and the test set, so we first match the test dfm to the features of the training dfm.)

matched_dfm <- dfm_match(test_dfm, features = featnames(training_dfm))


spam_pred <- predict(nb_train, newdata = matched_dfm, type = "class")

table(spam_pred, docvars(test_dfm, "type")) %>% 
    print()
#>          
#> spam_pred ham spam
#>      ham  969    9
#>      spam  12  125

Let’s compute precision, recall, and accuracy for our model. We’ll write a small helper function for it.

model_eval <- function(conf_matrix) {
    # rows are predicted labels, columns are true labels;
    # ham (the first class) is treated as the positive class
    TP <- conf_matrix[1,1]
    FP <- conf_matrix[1,2]
    TN <- conf_matrix[2,2]
    FN <- conf_matrix[2,1]
    
    prec <- TP / sum(TP, FP)
    rec <- TP / sum(TP, FN)
    acc <- sum(diag(conf_matrix)) / sum(conf_matrix)
    
    print(conf_matrix)
    
    cat("\n precision = ", round(prec, 2),
        "\n recall = ", round(rec, 2),
        "\n accuracy = ", round(acc, 2))
}

Time to check out our model. Pretty good!

confusion_matrix <- table(spam_pred, docvars(test_dfm, "type"))

model_eval(conf_matrix = confusion_matrix)
#>          
#> spam_pred ham spam
#>      ham  969    9
#>      spam  12  125
#> 
#>  precision =  0.99 
#>  recall =  0.99 
#>  accuracy =  0.98
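
Since model_eval() treats ham as the positive class, the precision and recall above describe how well we identify ham. If we were more interested in spam as the positive class, we could simply reorder the rows and columns of the confusion matrix before passing it in (output not shown):

# spam as the positive class: put the spam row and column first
model_eval(confusion_matrix[c("spam", "ham"), c("spam", "ham")])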

Time to do some digging! We can access the estimated feature probabilities within each class (the probability of a word given the class) by checking the param part of our trained model.

posterior <- nb_train$param %>% 
    as.matrix() %>% 
    t() %>% 
    as.data.frame() %>% 
    mutate(feature = rownames(.),
           ham = round(ham, 5),
           spam = round(spam, 5))

head(posterior)
#>       ham    spam feature
#> 1 0.00251 0.00102      go
#> 2 0.00025 0.00019   until
#> 3 0.00003 0.00004  jurong
#> 4 0.00016 0.00004   point
#> 5 0.01481 0.01124       ,
#> 6 0.00014 0.00015   crazy
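
If we instead wanted the probability of a class given a word, we could get it from these estimates with Bayes’ rule. A minimal sketch, assuming the uniform 0.5/0.5 class priors used in the model above (so the priors cancel and we just normalise per word):

# P(class | word) is proportional to P(word | class) * P(class)
class_given_word <- posterior %>% 
    mutate(p_ham = ham / (ham + spam),
           p_spam = spam / (ham + spam))

head(class_given_word)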

What are the features that are most likely to be ham or spam?

# ham
posterior %>% 
    arrange(desc(ham)) %>% 
    head(15)
#>        ham    spam feature
#> 1  0.09523 0.03699       .
#> 2  0.02357 0.00151       i
#> 3  0.01938 0.00920     you
#> 4  0.01627 0.02105      to
#> 5  0.01481 0.01124       ,
#> 6  0.01446 0.00515       ?
#> 7  0.01162 0.00636     the
#> 8  0.01089 0.01132       a
#> 9  0.01037 0.00530       u
#> 10 0.00856 0.01696       !
#> 11 0.00855 0.00398     and
#> 12 0.00841 0.00223      in
#> 13 0.00808 0.00106      me
#> 14 0.00799 0.00011       ;
#> 15 0.00790 0.00038      my
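
The same can be done for the spam class (output not shown):

# spam
posterior %>% 
    arrange(desc(spam)) %>% 
    head(15)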

2 Support Vector Machine

Let’s stick to our training and test sets and see how well an SVM performs. We use the factor() function to wrap our docvars, since svm() needs a factor response to perform classification (a numeric y would make it run a regression instead). For now we specify the cost parameter by hand (and just guess, really).

svm_spam <- svm(x = training_dfm, y = factor(docvars(training_dfm, "type")),
                kernel = "linear", cost = 5, probability = TRUE)

pred_svm <- predict(svm_spam, matched_dfm)

svm_spam
#> 
#> Call:
#> svm.default(x = training_dfm, y = factor(docvars(training_dfm, "type")), 
#>     kernel = "linear", cost = 5, probability = TRUE)
#> 
#> 
#> Parameters:
#>    SVM-Type:  C-classification 
#>  SVM-Kernel:  linear 
#>        cost:  5 
#> 
#> Number of Support Vectors:  575
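
Since we set probability = TRUE when fitting, we could also extract predicted class probabilities for the test set. A quick sketch (output not shown):

# class probabilities are attached as an attribute to the predictions
pred_svm_prob <- predict(svm_spam, matched_dfm, probability = TRUE)

head(attr(pred_svm_prob, "probabilities"))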

How well did we do? Extremely well, I would say.

svm_cm <- table(pred_svm, docvars(test_dfm, "type"))

model_eval(svm_cm)
#>         
#> pred_svm ham spam
#>     ham  978   12
#>     spam   3  122
#> 
#>  precision =  0.99 
#>  recall =  1 
#>  accuracy =  0.99

To find a better value for the cost parameter, we can use the tune() function, which runs 10-fold cross-validation over a grid of candidate values. This might be a little slow…

best_cost <- tune(svm, train.x = training_dfm, 
                  train.y = factor(docvars(training_dfm, "type")),
                  kernel = "linear",
                  ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
                  

summary(best_cost)               
#> 
#> Parameter tuning of 'svm':
#> 
#> - sampling method: 10-fold cross validation 
#> 
#> - best parameters:
#>  cost
#>   0.1
#> 
#> - best performance: 0.01660352 
#> 
#> - Detailed performance results:
#>    cost      error  dispersion
#> 1 1e-03 0.10611982 0.015472115
#> 2 1e-02 0.02333552 0.006277848
#> 3 1e-01 0.01660352 0.005999181
#> 4 1e+00 0.01660402 0.005413992
#> 5 5e+00 0.01682773 0.004987850
#> 6 1e+01 0.01682773 0.004987850
#> 7 1e+02 0.01682773 0.004987850

To extract the best model (with cost = 0.1):

best_svm <- best_cost$best.model

summary(best_svm)
#> 
#> Call:
#> best.tune(method = svm, train.x = training_dfm, train.y = factor(docvars(training_dfm, 
#>     "type")), ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 
#>     100)), kernel = "linear")
#> 
#> 
#> Parameters:
#>    SVM-Type:  C-classification 
#>  SVM-Kernel:  linear 
#>        cost:  0.1 
#> 
#> Number of Support Vectors:  576
#> 
#>  ( 365 211 )
#> 
#> 
#> Number of Classes:  2 
#> 
#> Levels: 
#>  ham spam
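
As a final check, we could see how the tuned model does on our held-out test set, reusing model_eval() from above (output not shown):

pred_best <- predict(best_svm, matched_dfm)

best_svm_cm <- table(pred_best, docvars(test_dfm, "type"))

model_eval(best_svm_cm)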