
Friday, March 24, 2023

Python code for an email spam filter - Naive Bayes algorithm

Here is example Python code to implement an email spam filter using the Naive Bayes algorithm:



import os
import numpy as np
import nltk
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix

# Set the path of the dataset directory
data_dir = "data/"

# Read the emails from the dataset directory
emails = []
labels = []
for folder in os.listdir(data_dir):
    if folder == "ham":
        label = 0
    elif folder == "spam":
        label = 1
    else:
        continue
    folder_path = os.path.join(data_dir, folder)
    for file in os.listdir(folder_path):
        file_path = os.path.join(folder_path, file)
        with open(file_path, "r", encoding="utf8", errors="ignore") as f:
            email = f.read()
        emails.append(email)
        labels.append(label)

# Preprocess the emails
nltk.download("wordnet")
lemmatizer = WordNetLemmatizer()
tokenizer = CountVectorizer().build_tokenizer()
preprocessed_emails = []
for email in emails:
    tokens = tokenizer(email)
    lemmatized_tokens = [lemmatizer.lemmatize(token) for token in tokens]
    preprocessed_email = " ".join(lemmatized_tokens)
    preprocessed_emails.append(preprocessed_email)

# Split the data into training and testing sets
X = preprocessed_emails
y = np.array(labels)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Vectorize the emails
vectorizer = CountVectorizer()
X_train_vect = vectorizer.fit_transform(X_train)
X_test_vect = vectorizer.transform(X_test)

# Train the Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(X_train_vect, y_train)

# Evaluate the classifier on the testing set
y_pred = classifier.predict(X_test_vect)
accuracy = accuracy_score(y_test, y_pred)
confusion = confusion_matrix(y_test, y_pred)
print("Accuracy:", accuracy)
print("Confusion matrix:\n", confusion)


This code reads the emails from a directory and preprocesses them by lemmatizing each token with NLTK's WordNetLemmatizer (tokenization uses scikit-learn's default tokenizer). It then splits the data into training and testing sets and vectorizes the emails using the CountVectorizer from scikit-learn. Finally, it trains a Naive Bayes classifier on the training set and evaluates its performance on the testing set using accuracy and a confusion matrix.
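Once trained, the same vectorizer and classifier can score new messages. Here is a minimal sketch that reuses the objects defined above; the sample message text is just an illustration:

# Classify a new, unseen email with the trained pipeline above
new_email = "Congratulations! You have won a free prize. Click here to claim it."
tokens = tokenizer(new_email)
lemmatized = " ".join(lemmatizer.lemmatize(token) for token in tokens)

# Reuse the fitted vectorizer; do not fit it again on new data
new_vect = vectorizer.transform([lemmatized])
prediction = classifier.predict(new_vect)[0]
spam_probability = classifier.predict_proba(new_vect)[0][1]
print("Prediction:", "spam" if prediction == 1 else "ham")
print("Spam probability:", spam_probability)

Note that the new text must go through exactly the same preprocessing steps as the training data before it is vectorized.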

The requirements.txt file lists the Python packages required to run the email spam filter code. Here is an example requirements.txt file:

nltk==3.6.3
numpy==1.21.4
scikit-learn==1.0.2

This file specifies the version numbers of the nltk, numpy, and scikit-learn packages that the code requires. You can create a file like this by running the following command in your command prompt or terminal:

pip freeze > requirements.txt

This command writes all currently installed Python packages and their versions to the requirements.txt file. You can then edit this file to remove any unnecessary packages and specify the exact versions required by your code.
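To recreate the environment on another machine (ideally inside a virtual environment), install the listed packages from the file:

pip install -r requirements.txt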

How Can You Choose a Classifier Based on Training Set Size?

Choosing a classifier based on training set size depends on several factors, such as the complexity of the data, the number of features, and the required accuracy of the model. Here are some general guidelines:

  1. Small training set (less than 10,000 samples): In this case, simple classifiers such as Naive Bayes, Logistic Regression, or Decision Trees can be effective. These classifiers are computationally efficient and can handle small datasets well.

  2. Medium training set (between 10,000 and 100,000 samples): Here, more complex classifiers such as Random Forests, Support Vector Machines (SVMs), and Gradient Boosting can be considered. These classifiers can handle larger datasets and capture more complex patterns in the data.

  3. Large training set (more than 100,000 samples): In this case, deep learning models such as Convolutional Neural Networks (CNNs), Recurrent Neural Networks (RNNs), or Transformer-based models can be considered. These models can handle large amounts of data and learn complex representations of the input data.

It is important to note that the number of features in the data can also affect the choice of classifier.

For example, if the number of features is very high, then feature selection or dimensionality reduction techniques may need to be applied before training the model.
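For instance, scikit-learn's chi-squared filter can reduce a high-dimensional bag-of-words matrix to its most informative terms. Here is a minimal sketch, assuming the vectorized training data from the spam-filter code above (the choice of k=1000 is arbitrary and assumes the vocabulary has at least 1,000 terms):

from sklearn.feature_selection import SelectKBest, chi2

# Keep the 1,000 terms most strongly associated with the spam/ham labels
selector = SelectKBest(chi2, k=1000)
X_train_sel = selector.fit_transform(X_train_vect, y_train)
X_test_sel = selector.transform(X_test_vect)  # apply the same selection to the test set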

Ultimately, the choice of classifier should be based on the specific characteristics of the data, the available computational resources, and the required accuracy of the model.

It is often a good idea to try multiple classifiers and compare their performance on a validation set before selecting the best one for the task.
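scikit-learn makes such a comparison straightforward with cross-validation. Here is a minimal sketch, assuming the vectorized training data (X_train_vect, y_train) from the spam-filter code above; the candidate models and their parameters are just illustrative choices:

from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

# Compare candidate classifiers with 5-fold cross-validation on the training set
candidates = {
    "Naive Bayes": MultinomialNB(),
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
}
for name, model in candidates.items():
    scores = cross_val_score(model, X_train_vect, y_train, cv=5)
    print(f"{name}: mean accuracy {scores.mean():.3f} (+/- {scores.std():.3f})")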