In this session, we’ll see how to carry out supervised classification with the Naive Bayes classifier and Support Vector Machines. The packages we use are the same as before; the only new one we need to install is e1071.
Since we will use random sampling, let’s set our random seed so that the results are reproducible.
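A minimal setup sketch (the seed value below is arbitrary; depending on your quanteda version, textmodel_nb() may live in the separate quanteda.textmodels package):
library(tidyverse)  # read_csv() and the dplyr verbs
library(quanteda)   # corpus(), dfm(), textmodel_nb()
library(e1071)      # svm(), tune()
set.seed(1234)      # arbitrary value; any fixed seed makes the sampling reproducible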
We will replicate the spam/ham classification exercise with real data. The data comes from Kaggle: https://www.kaggle.com/uciml/sms-spam-collection-dataset/data
We have 5572 observations of text messages, around 87% of which are ham and the rest spam. The objective: predict whether a text is ham or spam!
spam_df <- read_csv("data/spam.csv") %>%
  select(1:2) %>%   # keep only the label and text columns; the raw csv carries empty extras (see warning)
  rename(type = v1, sms = v2)
#> Warning: Missing column names filled in: 'X3' [3], 'X4' [4], 'X5' [5]
spam_df %>%
  group_by(type) %>%
  summarise(n = n()) %>%
  mutate(freq = n / sum(n))
#> # A tibble: 2 x 3
#> type n freq
#> <chr> <int> <dbl>
#> 1 ham 4825 0.866
#> 2 spam 747 0.134
Put our data frame into quanteda and create a corpus.
spam_corpus <- corpus(spam_df, text_field = "sms")
summary(spam_corpus, 5)
#> Corpus consisting of 5572 documents, showing 5 documents:
#>
#> Text Types Tokens Sentences type
#> text1 22 29 3 ham
#> text2 7 12 2 ham
#> text3 33 37 2 spam
#> text4 10 17 2 ham
#> text5 13 14 1 ham
Before we do anything else, we should split our data into a training and a test set. There are no hard rules for the ratio, so we will just randomly put 80% of our data into the training set and the remaining 20% into the test set. The process:
docvars(spam_corpus, "id") <- 1:ndoc(spam_corpus)
summary(spam_corpus, 10)
#> Corpus consisting of 5572 documents, showing 10 documents:
#>
#> Text Types Tokens Sentences type id
#> text1 22 29 3 ham 1
#> text2 7 12 2 ham 2
#> text3 33 37 2 spam 3
#> text4 10 17 2 ham 4
#> text5 13 14 1 ham 5
#> text6 35 37 4 spam 6
#> text7 15 18 2 ham 7
#> text8 28 32 2 ham 8
#> text9 29 33 5 spam 9
#> text10 29 31 3 spam 10
Then we draw the random document ids for the training set.
id_train <- sample(1:ndoc(spam_corpus), 0.8 * ndoc(spam_corpus), replace = FALSE)
head(id_train)
#> [1] 634 2097 5248 5423 3911 356
Subset our corpus into training and test sets and create the dfms.
training <- corpus_subset(spam_corpus, id %in% id_train)
test <- corpus_subset(spam_corpus, !id %in% id_train)
training_dfm <- dfm(training)
test_dfm <- dfm(test)
# check that the split happened as we wanted
cat("The training set has", ndoc(training), "documents")
#> The training set has 4457 documents
cat("The test set has", ndoc(test), "documents")
#> The test set has 1115 documents
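As a sanity check (a small sketch; output omitted), we can also verify that the ham/spam ratio is roughly preserved in both splits:
prop.table(table(docvars(training, "type")))
prop.table(table(docvars(test, "type")))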
IMPORTANT: always split your data into training and test sets before running your model, so the classifier does not have any knowledge of your test set.
Without further ado, let’s train our Naive Bayes model. We’ll use the textmodel_nb() function from quanteda.
nb_train <- textmodel_nb(training_dfm, y = docvars(training_dfm, "type"), smooth = 1)
summary(nb_train)
#>
#> Call:
#> textmodel_nb.dfm(x = training_dfm, y = docvars(training_dfm,
#> "type"), smooth = 1)
#>
#> Class Priors:
#> (showing first 2 elements)
#> ham spam
#> 0.5 0.5
#>
#> Estimated Feature Scores:
#> go until jurong point , crazy . available
#> ham 0.002512 0.0002473 2.603e-05 1.562e-04 0.01481 0.0001432 0.09523 0.0001692
#> spam 0.001022 0.0001893 3.786e-05 3.786e-05 0.01124 0.0001514 0.03699 0.0001514
#> only in bugis n great world la
#> ham 0.00138 0.008409 1.041e-04 0.0013667 0.0010413 3.775e-04 9.112e-05
#> spam 0.00284 0.002234 3.786e-05 0.0003408 0.0003408 7.572e-05 3.786e-05
#> e buffet cine there got amore wat
#> ham 0.0008070 3.905e-05 1.041e-04 0.0020566 0.0024081 2.603e-05 1.067e-03
#> spam 0.0003029 3.786e-05 3.786e-05 0.0004165 0.0003029 3.786e-05 7.572e-05
#> ok lar joking wif u oni free
#> ham 0.0027856 4.165e-04 6.508e-05 3.124e-04 0.010374 6.508e-05 0.0005988
#> spam 0.0001893 3.786e-05 3.786e-05 3.786e-05 0.005301 3.786e-05 0.0073073
#> entry
#> ham 1.302e-05
#> spam 9.465e-04
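Note the uniform class priors (0.5/0.5): by default textmodel_nb() ignores the 87/13 ham/spam split we saw in the data. If we wanted the priors to reflect the observed class frequencies, the function also accepts a prior argument (a sketch; output omitted, and the nb_train_docfreq name is just for illustration):
nb_train_docfreq <- textmodel_nb(training_dfm, y = docvars(training_dfm, "type"),
                                 smooth = 1, prior = "docfreq")  # priors from observed class shares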
Now we are ready to predict labels for our test set. We should also create a confusion matrix to quickly see how well our model did. (Naive Bayes can only take features into consideration that occur in both the training set and the test set, so we first match the features of the test dfm to those of the training dfm.)
matched_dfm <- dfm_match(test_dfm, features = featnames(training_dfm))
spam_pred <- predict(nb_train, newdata = matched_dfm, type = "class")
table(spam_pred, docvars(test_dfm, "type"))
#>
#> spam_pred ham spam
#> ham 969 9
#> spam 12 125
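If we want the posterior class probabilities instead of hard labels, predict() also takes type = "probability" (a sketch; output omitted, and the exact options can depend on your quanteda version):
spam_prob <- predict(nb_train, newdata = matched_dfm, type = "probability")
head(spam_prob)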
Let’s compute the usual metrics for our model: precision, recall, and accuracy. We’ll write a function for it. Note that the rows of our confusion matrix are the predictions and the columns the true labels, and we treat ham (the first row and column) as the positive class.
model_eval <- function(conf_matrix) {
  # rows are predictions, columns are true labels; ham is the positive class
  TP <- conf_matrix[1, 1]  # predicted ham, actually ham
  FP <- conf_matrix[1, 2]  # predicted ham, actually spam
  TN <- conf_matrix[2, 2]  # predicted spam, actually spam
  FN <- conf_matrix[2, 1]  # predicted spam, actually ham
  prec <- TP / sum(TP, FP)
  rec <- TP / sum(TP, FN)
  acc <- sum(diag(conf_matrix)) / sum(conf_matrix)
  print(conf_matrix)
  cat("\n precision = ", round(prec, 2),
      "\n recall = ", round(rec, 2),
      "\n accuracy = ", round(acc, 2))
}
Time to check out our model. Pretty good!
confusion_matrix <- table(spam_pred, docvars(test_dfm, "type"))
model_eval(conf_matrix = confusion_matrix)
#>
#> spam_pred ham spam
#> ham 969 9
#> spam 12 125
#>
#> precision = 0.99
#> recall = 0.99
#> accuracy = 0.98
Time to do some digging! We can access the estimated feature scores by checking the param part of our trained model: these are the word likelihoods, i.e. the probability of a word given the class.
posterior <- nb_train$param %>%
  as.matrix() %>%
  t() %>%
  as.data.frame() %>%
  mutate(feature = rownames(.),
         ham = round(ham, 5),
         spam = round(spam, 5))
head(posterior)
#> ham spam feature
#> 1 0.00251 0.00102 go
#> 2 0.00025 0.00019 until
#> 3 0.00003 0.00004 jurong
#> 4 0.00016 0.00004 point
#> 5 0.01481 0.01124 ,
#> 6 0.00014 0.00015 crazy
Which features have the highest probability within each class? Let’s look at ham first.
# ham
posterior %>%
  arrange(desc(ham)) %>%
  head(15)
#> ham spam feature
#> 1 0.09523 0.03699 .
#> 2 0.02357 0.00151 i
#> 3 0.01938 0.00920 you
#> 4 0.01627 0.02105 to
#> 5 0.01481 0.01124 ,
#> 6 0.01446 0.00515 ?
#> 7 0.01162 0.00636 the
#> 8 0.01089 0.01132 a
#> 9 0.01037 0.00530 u
#> 10 0.00856 0.01696 !
#> 11 0.00855 0.00398 and
#> 12 0.00841 0.00223 in
#> 13 0.00808 0.00106 me
#> 14 0.00799 0.00011 ;
#> 15 0.00790 0.00038 my
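Sorting raw likelihoods mostly surfaces tokens that are frequent overall (punctuation, stopwords). The mirrored query for spam looks like this (output omitted):
# spam
posterior %>%
  arrange(desc(spam)) %>%
  head(15)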
Let’s stick to our training and test sets and see how well an SVM performs. We wrap our docvars with the factor() function, as svm() needs a factor response to perform classification (with a numeric y it would fit a regression instead). For now we specify the cost parameter by hand (really just guessing).
svm_spam <- svm(x = training_dfm, y = factor(docvars(training_dfm, "type")),
                kernel = "linear", cost = 5, probability = TRUE)
pred_svm <- predict(svm_spam, matched_dfm)
svm_spam
#>
#> Call:
#> svm.default(x = training_dfm, y = factor(docvars(training_dfm, "type")),
#> kernel = "linear", cost = 5, probability = TRUE)
#>
#>
#> Parameters:
#> SVM-Type: C-classification
#> SVM-Kernel: linear
#> cost: 5
#>
#> Number of Support Vectors: 575
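Because we trained with probability = TRUE, we can also ask predict() for class probabilities and read them off with attr() (a sketch; output omitted):
pred_svm_prob <- predict(svm_spam, matched_dfm, probability = TRUE)
head(attr(pred_svm_prob, "probabilities"))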
How well did we do? Extremely well, I would say.
svm_cm <- table(pred_svm, docvars(test_dfm, "type"))
model_eval(svm_cm)
#>
#> pred_svm ham spam
#> ham 978 12
#> spam 3 122
#>
#> precision = 0.99
#> recall = 1
#> accuracy = 0.99
To find a better value for the cost parameter, we can use the tune
function, which cross-validates over a grid of candidate values. This might be a little slow…
best_cost <- tune(svm, train.x = training_dfm,
                  train.y = factor(docvars(training_dfm, "type")),
                  kernel = "linear",
                  ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10, 100)))
summary(best_cost)
#>
#> Parameter tuning of 'svm':
#>
#> - sampling method: 10-fold cross validation
#>
#> - best parameters:
#> cost
#> 0.1
#>
#> - best performance: 0.01660352
#>
#> - Detailed performance results:
#> cost error dispersion
#> 1 1e-03 0.10611982 0.015472115
#> 2 1e-02 0.02333552 0.006277848
#> 3 1e-01 0.01660352 0.005999181
#> 4 1e+00 0.01660402 0.005413992
#> 5 5e+00 0.01682773 0.004987850
#> 6 1e+01 0.01682773 0.004987850
#> 7 1e+02 0.01682773 0.004987850
To extract the best model (with cost = 0.1):
best_svm <- best_cost$best.model
summary(best_svm)
#>
#> Call:
#> best.tune(method = svm, train.x = training_dfm, train.y = factor(docvars(training_dfm,
#> "type")), ranges = list(cost = c(0.001, 0.01, 0.1, 1, 5, 10,
#> 100)), kernel = "linear")
#>
#>
#> Parameters:
#> SVM-Type: C-classification
#> SVM-Kernel: linear
#> cost: 0.1
#>
#> Number of Support Vectors: 576
#>
#> ( 365 211 )
#>
#>
#> Number of Classes: 2
#>
#> Levels:
#> ham spam
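Finally, we can evaluate the tuned model on our held-out test set the same way as before (a sketch; output omitted):
pred_best <- predict(best_svm, matched_dfm)
model_eval(table(pred_best, docvars(test_dfm, "type")))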