by Jack Weyer and Madison Neyland
Introduction
Efficient and organized communication is a vital aspect of our professional and personal lives, and email is arguably the most important form of technology-facilitated communication. However, the efficiency and organization that email provides are jeopardized by illegitimate usage in the form of spam. Spam is such a widespread and infamous problem that nearly every email user has received it at least once. A spam email is an unsolicited message sent to a user, and it can be both annoying and dangerous.
The article “Spam!” by communication technology specialists Lorrie Cranor and Brian LaMacchia outlines the problems that a continuous stream of spam imposes on email users. These problems range from moderate inconveniences to serious cybersafety concerns. For instance, manually filtering through and deleting spam mixed in with important emails costs time and undermines the value of having all important messages accessible in one place. At the other extreme, spam can infect a computer with embedded viruses, which activate when the user opens an attachment or link.
Of course, countermeasures against spam have been developed, such as built-in spam filters and legislative bans on its distribution. However, the fact that this article was written in 1998 shows how relevant and persistent spam email has been over the last two decades, and it points to the importance of staying informed and alert, as the problem has yet to be entirely resolved.
In this project we aim to address this need by building a model that predicts whether an email should be classified as spam, trained on emails that are known to be spam. Ideally, the model will reveal which characteristics are determining factors of spam, informing us of what to watch for in order to avoid or lessen the risks and inconveniences associated with spam emails.
Methods
We will use the SPAM dataset from the University of California-Irvine's database to create a model that predicts the likelihood of an email being spam using various predictors contained in the set. Based on the predictors deemed relevant through variable selection, this model can help inform users what to look for when sorting through emails and how to stay alert to unsafe content. It can also be used by web services to sort emails automatically.
First, to get a better idea of the data, we counted the number of TRUE and FALSE observations of the “spam” variable, finding 1,813 spam and 2,788 non-spam emails in the data set. Then the complete.cases() command was used to remove any rows with a missing value for “spam”, and the “testid” column was removed because it does not pertain to our model.
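A minimal sketch of this cleaning step is shown below. The data frame name spam and the loading step are assumptions, since the report does not show its code.

load("spam.RData")                           # loading step assumed; not shown in the report
table(spam$spam)                             # FALSE: 2788 non-spam, TRUE: 1813 spam
spam <- spam[complete.cases(spam$spam), ]    # drop rows missing the response
spam$testid <- NULL                          # drop the test-id column, irrelevant to the model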
Next, correlations among the predictor variables were computed using the cor() function. A few pairs of variables showed moderately strong correlation; for example, crl.ave and crl.long have a correlation of 0.49. However, logistic regression, used in this report, can adjust for numerous correlated predictors given a large enough sample size, and since our sample size is 4,601, we assume the model adequately accounts for this correlation. Thus, no variables were removed from the model at this stage of our procedure.
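A brief sketch of this correlation check, assuming the cleaned data frame from the sketch above:

cors <- cor(spam[, setdiff(names(spam), "spam")])   # correlations among the numeric predictors
cors["crl.ave", "crl.long"]                         # roughly 0.49, per the text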
Then the data was split into a training set (“train.RData”) and a test set (“test.RData”) using the createDataPartition() function, with a fixed seed for reproducible results. Since 1,500 observations were to go into the training set, a proportion of 0.3257 (about 1,500 of the 4,601 rows) was used. We then selected 300 of the remaining non-training observations at random to form the test set, which is used to evaluate our model's performance.
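A sketch of the split using the caret package; the seed value here is hypothetical, since the report does not state which seed it used:

library(caret)
set.seed(101)                                  # hypothetical seed for reproducibility
inTrain <- createDataPartition(spam$spam, p = 0.3257, list = FALSE)
train   <- spam[inTrain, ]                     # about 1,500 rows
rest    <- spam[-inTrain, ]
test    <- rest[sample(nrow(rest), 300), ]     # 300 random non-training rows
save(train, file = "train.RData")
save(test,  file = "test.RData")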
The next step is to perform the Lasso regression. This is done with the cv.glmnet() function on the training set, setting the “family” argument to “binomial” and the “type.measure” argument to “class”, because we use misclassification error to select the optimal lambda tuning parameter for coefficient shrinkage. The “alpha” argument does not need to be manually set to 1 to indicate Lasso regression because that is the default. We then plotted the cv.glmnet() output and found the lambda value that minimizes misclassification error; this is the lambda used for our Lasso model.
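The cross-validated fit might look like the following sketch, where the model-matrix construction and the object name cvfit are assumptions:

library(glmnet)
x <- model.matrix(spam ~ ., data = train)[, -1]    # predictor matrix, intercept column dropped
y <- factor(train$spam)
cvfit <- cv.glmnet(x, y, family = "binomial", type.measure = "class")  # alpha = 1 (Lasso) by default
plot(cvfit)                                        # misclassification error versus log(lambda)
cvfit$lambda.min                                   # lambda minimizing cross-validated error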
Next, we predicted responses on the test set using the predict() function on the fitted model. These responses are the probability that each email is spam: a number between 0 and 1. Probabilities above 0.5 were set to 1 and probabilities below 0.5 were set to 0, classifying each email as spam or not spam. We then compared these predicted labels to the actual responses of the test set, which were likewise coded as 0 or 1. Finally, we tabulated the predicted labels against the actual labels to display the model's accuracy in predicting whether an email is spam.
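A sketch of this prediction and comparison step, continuing from the cvfit object above and assuming the response is stored as TRUE/FALSE:

xtest  <- model.matrix(spam ~ ., data = test)[, -1]
probs  <- predict(cvfit, newx = xtest, s = "lambda.min", type = "response")  # P(spam)
pred   <- ifelse(probs > 0.5, 1, 0)               # threshold at 0.5
actual <- as.numeric(test$spam)                   # TRUE/FALSE coded as 1/0
table(predicted = pred, actual = actual)          # confusion table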
Results
Using our LASSO logistic regression model, a lambda value of 0.001409308 yielded the lowest cross-validated misclassification error. Of our 57 potential predictors, seven were shrunk to zero, leaving our model with 50 predictors plus an intercept. The coefficients of these predictors are shown below:
coefEst[coefEst!=0]
[1] -1.9443491334 -0.0499005987 -0.1224171588 0.2226877468
[5] 0.0353612981 0.5090775383 0.8020992924 2.9043532562
[9] 0.3615053943 0.2490693326 0.3878490022 -0.3759202609
[13] 0.3056264641 0.8187318452 0.8982449848 0.2852182572
[17] 0.0294308676 0.6280911006 0.1727308326 0.2271655064
[21] 2.1384322642 0.2900899874 -1.1540501918 -0.8396749372
[25] -0.8185534371 0.3414869707 -0.8782020118 -0.7895493498
[29] -0.2671314901 -0.5841471131 -1.2862076455 -0.1249331619
[33] 0.6967906273 -0.4326721485 -0.1352582597 -0.7496237264
[37] -0.9653456085 -1.6960148526 -0.6929635667 -0.7991954959
[41] -0.6040260166 -1.2349432960 -0.3752114713 -1.5744082039
[45] -1.1192038312 -1.3429571372 1.4727497557 6.5165126267
[49] 1.1810443643 0.0146366699 0.0005796719
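The printed vector omits predictor names; the following sketch, assuming the fitted object cvfit from the Methods sketches, shows how the named nonzero coefficients can be recovered:

coefMat <- coef(cvfit, s = "lambda.min")          # sparse column matrix of coefficients
coefEst <- setNames(as.vector(coefMat), rownames(coefMat))
coefEst[coefEst != 0]                             # nonzero coefficients, labeled by predictor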
These coefficients can help us interpret which factors affect whether or not an email is spam. For instance, consider the 50th and 51st coefficients in the list. These belong to the “crl.long” (length of the longest uninterrupted sequence of capital letters) and “crl.tot” (total number of capital letters in the email) variables, respectively. The coefficient for “crl.long” is positive (0.0146366699), indicating that, on average, a longer maximum run of capital letters increases the probability that an email is spam. Likewise, the coefficient for “crl.tot” is positive (0.0005796719), indicating that, on average, more capital letters overall increase the probability that an email is spam. Because these are logistic regression coefficients on the log-odds scale, a one-unit increase in “crl.long” multiplies the odds of spam by exp(0.0146) ≈ 1.015, holding the other predictors fixed.
Using these values we obtained predicted probabilities for each of the 300 emails in the test set. The graph below depicts the 300 emails by index with their corresponding predicted probability of being spam. Keep in mind that indices 1 through 118 are in fact spam emails, while indices 119 through 300 are not.
It may still be a bit unclear how our model performed, so we now classify each email as either ‘spam’ or ‘not spam’ according to whether its predicted probability exceeds 0.5. This classification is depicted below and gives a better visual sense of the model's effectiveness.
Out of 118 spam emails, our model correctly identified 105, an accuracy rate of 89%. Conversely, the model correctly identified 96% of non-spam emails, or 174 out of 182. Overall, the model mislabeled only 7% of emails (21 out of 300). This low test error indicates that our model is a good predictor of spam emails.
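As a quick sanity check on these figures, the overall error rate follows directly from the confusion table built earlier, assuming the pred and actual objects from the Methods sketch:

conf <- table(predicted = pred, actual = actual)
sum(diag(conf)) / sum(conf)                       # (105 + 174) / 300 = 0.93, i.e., 7% test error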
Discussion
This model identifies several predictors of an email being classified as spam and can serve as a tool for avoiding such emails, or as a starting point for an algorithm that filters spam out of users' folders. It also suggests that reliable results can come from sorting spam emails without any human intervention. However, as with any technology, spam emails have evolved over time in topic, content, construction, and complexity. This data set was created in 1997 and includes emails written over twenty years ago, so the time difference must be taken into account when judging our model's applicability in today's world. We would advise re-assessing this model with ‘new-age’ spam emails before applying any of its findings. A possible improvement to our model would be an updated data set that includes predictors relevant to today's technology environment.