Clean, process and tokenize texts in milliseconds using built-in Polars string expressions
With the widespread adoption of large language models (LLMs), it may seem that we are past the stage where we have to clean and process textual data by hand. Unfortunately, I and other NLP practitioners can attest that this is not the case. Clean text data is needed at every level of NLP sophistication – from basic text analytics to machine learning and LLMs. This post shows how Polars can greatly speed up this time-consuming and tedious process.
Polars is a blazingly fast DataFrame library written in Rust that is incredibly efficient at handling strings. It stores strings in the Arrow Utf8 format, which makes string traversals cache-efficient and predictable. It also exposes a lot of built-in string operations under the str namespace, and these operations are parallelised. Both of these factors make working with strings very easy and fast.
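To give you a feel for what these str expressions look like before we dive into the email dataset, here is a minimal, made-up sketch (the column name and strings are purely illustrative):

import polars as pl

df = pl.DataFrame({"text": ["Hello WORLD!!", "Polars is FAST"]})
df = df.with_columns(
    # Lowercase the column using a built-in string expression
    pl.col("text").str.to_lowercase().alias("text_lower"),
    # Count the number of capital letters with a regex
    pl.col("text").str.count_match(r"[A-Z]").alias("n_capitals"),
)
print(df)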
The library shares a lot of syntax with pandas, but there are also a lot of quirks that you’ll have to get used to. This post will introduce you to working with strings, but for a comprehensive overview, I recommend this “getting started” guide as it will give you a good overview of the library.
You can find all the code in this GitHub repo, so be sure to check it out if you want to code along (and don’t forget the ⭐). To make this post more practical and fun, I’ll show you how we can clean a small dataset of fraudulent emails that you can find on Kaggle (License CC BY-SA 4.0). Polars can be installed using pip – pip install polars – and the recommended Python version is 3.10.
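For reference, the code throughout this post assumes roughly the following imports. This exact list is my assumption based on the libraries used later (Polars, pandas, NumPy, scikit-learn, NLTK, wordcloud, matplotlib), so trim it to what you actually need:

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import polars as pl
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from wordcloud import WordCloud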
The purpose of this pipeline is to parse a raw text file into a DataFrame that can be used for further analytics/modelling. Here are the general steps that will be taken:
- Read in the text data
- Extract the relevant fields (e.g. sender email, subject, text, etc.)
- Extract useful features from these fields (e.g. length, % of digits, etc.)
- Pre-process the text for further analysis
- Do some basic text analytics
Without further ado, let’s get started!
Reading the data
Assuming that the text file with the emails is saved as fraudulent_emails.txt, here is the function used to read them:
def load_emails_txt(path: str, split_str: str = "From r ") -> list[str]:
    with open(path, "r", encoding="utf-8", errors="ignore") as file:
        text = file.read()

    emails = text.split(split_str)

    return emails
If you examine the text data, you will see that each email has two main sections:
- Metadata (starts with From r) which contains the email sender, subject, etc.
- Email text (starts with Status: O or Status: RO)
I use the first pattern to split the continuous text file into a list of emails. In total, we should get 3,977 emails, which we put into a Polars DataFrame for further analysis.
emails = load_emails_txt("fradulent_emails.txt")
emails_pl = pl.DataFrame({"emails": emails})

print(len(emails))
>>> 3977
Extracting the relevant fields
Now the hard part begins. How do we extract the relevant fields from this mess of text? Unfortunately, the answer is regex.
Sender and subject
Examining the metadata further (below), you can see that it has the From: and Subject: fields, which are going to be very useful for us.
From r Wed Oct 30 21:41:56 2002
Return-Path: <james_ngola2002@maktoob.com>
X-Sieve: cmu-sieve 2.0
Return-Path: <james_ngola2002@maktoob.com>
Message-Id: <200210310241.g9V2fNm6028281@cs.CU>
From: "MR. JAMES NGOLA." <james_ngola2002@maktoob.com>
Reply-To: james_ngola2002@maktoob.com
To: webmaster@aclweb.org
Date: Thu, 31 Oct 2002 02:38:20 +0000
Subject: URGENT BUSINESS ASSISTANCE AND PARTNERSHIP
X-Mailer: Microsoft Outlook Express 5.00.2919.6900 DM
MIME-Version: 1.0
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: 8bit
Status: O
If you keep scrolling through the emails, you’ll notice that there are a few formats of the From: field. The first format, which you see above, has both a name and an email. The second format contains only the email, e.g. From: 123@abc.com or From: “123@abc.com”. With this in mind, we’ll need three regex patterns – one for the subject, and two for the sender (name with email, and email only).
email_pattern = r"From:\s*([^<\n\s]+)"
subject_pattern = r"Subject:\s*(.*)"
name_email_pattern = r'From:\s*"?([^"<]+)"?\s*<([^>]+)>'
Polars has a str.extract method that can match the above patterns against our text and (you guessed it) extract the matching groups. Here’s how it can be applied to the emails_pl DataFrame.
emails_pl = emails_pl.with_columns(
    # Extract the first match group as sender name
    pl.col("emails").str.extract(name_email_pattern, 1).alias("sender_name"),
    # Extract the second match group as sender email
    pl.col("emails").str.extract(name_email_pattern, 2).alias("sender_email"),
    # Extract the subject
    pl.col("emails").str.extract(subject_pattern, 1).alias("subject"),
).with_columns(
    # In cases where we didn't extract an email
    pl.when(pl.col("sender_email").is_null())
    # Try the other pattern (email only)
    .then(pl.col("emails").str.extract(email_pattern, 1))
    # If we do have an email, do nothing
    .otherwise(pl.col("sender_email"))
    .alias("sender_email")
)
As you can see, in addition to str.extract we also use the pl.when().then().otherwise() expression (the Polars version of if/else) to account for the email-only format. If you print out the results, you’ll see that it works correctly (and incredibly fast) in most cases. We now have the sender_name, sender_email and subject fields for our analysis.
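If you want to sanity-check the extraction before moving on, a quick peek at a few rows is enough. This is just an illustrative inspection snippet, not part of the pipeline:

# Inspect the extracted fields for the first few emails
print(emails_pl.select(["sender_name", "sender_email", "subject"]).head(5))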
Email text
As mentioned above, the email body starts after either Status: O (opened) or Status: RO (read and opened), which means we can use this pattern to split each email into “metadata” and “text” parts. Below you can see the three steps we need to take to extract the required field and the corresponding Polars methods to perform them.
- Replace Status: RO with Status: O so that we only have one “split” pattern – use str.replace
- Split the actual string by Status: O – use str.split
- Get the second element of the resulting list (the text) – use arr.get(1)
emails_pl = emails_pl.with_columns(
    # Apply operations to the emails column
    pl.col("emails")
    # Make these two statuses the same
    .str.replace("Status: RO", "Status: O", literal=True)
    # Split using the status string
    .str.split("Status: O")
    # Get the second element
    .arr.get(1)
    # Rename the field
    .alias("email_text")
)
And voilà! We extracted the important fields in just a few milliseconds. Let’s put all of this into one coherent function that we can later use in the pipeline.
def extract_fields(emails: pl.DataFrame) -> pl.DataFrame:
    email_pattern = r"From:\s*([^<\n\s]+)"
    subject_pattern = r"Subject:\s*(.*)"
    name_email_pattern = r'From:\s*"?([^"<]+)"?\s*<([^>]+)>'

    emails = (
        emails.with_columns(
            pl.col("emails").str.extract(name_email_pattern, 2).alias("sender_email"),
            pl.col("emails").str.extract(name_email_pattern, 1).alias("sender_name"),
            pl.col("emails").str.extract(subject_pattern, 1).alias("subject"),
        )
        .with_columns(
            pl.when(pl.col("sender_email").is_null())
            .then(pl.col("emails").str.extract(email_pattern, 1))
            .otherwise(pl.col("sender_email"))
            .alias("sender_email")
        )
        .with_columns(
            pl.col("emails")
            .str.replace("Status: RO", "Status: O", literal=True)
            .str.split("Status: O")
            .arr.get(1)
            .alias("email_text")
        )
    )

    return emails
Now we can move on to the feature generation part.
Feature engineering
From personal experience, scam emails tend to be very detailed and long (because the scammers are trying to win your trust), so the character length of an email is going to be quite informative. Also, they heavily use exclamation marks and digits, so calculating the proportion of non-alphabetic characters in an email can be useful as well. Finally, scammers love to use caps lock, so let’s calculate the proportion of capital letters too. There are, of course, many more features we could create, but to keep this post from getting too long, let’s focus on these three.
The first feature can be created very easily using the built-in str.n_chars() method. The other two features can be calculated using regex and str.count_match(). Below you can find the function to calculate these three features. Like the previous function, it uses with_columns() to carry over the old features and create new ones on top of them.
def email_features(data: pl.DataFrame, col: str) -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col).str.n_chars().alias(f"{col}_length"),
    ).with_columns(
        (pl.col(col).str.count_match(r"[A-Z]") / pl.col(f"{col}_length")).alias(
            f"{col}_percent_capital"
        ),
        (pl.col(col).str.count_match(r"[^A-Za-z ]") / pl.col(f"{col}_length")).alias(
            f"{col}_percent_digits"
        ),
    )

    return data
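As a quick usage sketch (these calls mirror how the function is applied in the full pipeline later on), you can run it on any of the extracted text columns:

# Add length / % capitals / % non-alphabetic features for the email body
emails_pl = email_features(emails_pl, "email_text")
# The same function works for any other extracted column, e.g. the subject
emails_pl = email_features(emails_pl, "subject")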
Clean up the text
If you print out some of the emails we’ve obtained, you’ll notice some things that need to be cleaned up. For example:
- HTML tags are still present in some emails
- Many non-alphabetic characters are used
- Some emails are written in lowercase, some in uppercase, and some use mixed case
Just like above, we’re going to use regex to clean up the data. However, the method of choice now is str.replace_all because we want to replace all matched instances, not just the first one. In addition, we’ll use str.to_lowercase() to make all the text lowercase.
emails_pl = emails_pl.with_columns(
    # Apply operations to the email text column
    pl.col("email_text")
    # Remove all the data in <..> (HTML tags)
    .str.replace_all(r"<.*?>", "")
    # Replace non-alphabetic characters (except whitespace) in text
    .str.replace_all(r"[^a-zA-Z\s]+", " ")
    # Replace multiple whitespaces with one whitespace
    # We need to do this because of the previous cleaning step
    .str.replace_all(r"\s+", " ")
    # Make all text lowercase
    .str.to_lowercase()
    # Keep the field's name
    .keep_name()
)
Now let’s convert this chain of operations into a function so that it can be applied to other columns of interest as well.
def email_clean(
    data: pl.DataFrame, col: str, new_col_name: str | None = None
) -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col)
        .str.replace_all(r"<.*?>", " ")
        .str.replace_all(r"[^a-zA-Z\s]+", " ")
        .str.replace_all(r"\s+", " ")
        .str.to_lowercase()
        .alias(new_col_name if new_col_name is not None else col)
    )

    return data
Tokenization of text
As the final step of the preprocessing pipeline, we’re going to tokenize the text. Tokenization will be done using the already familiar str.split() method, where we specify a space as the separator.
emails_pl = emails_pl.with_columns(
    pl.col("email_text").str.split(" ").alias("email_text_tokenised")
)
Again, let’s put this code in our final pipeline function.
def tokenise_text(data: pl.DataFrame, col: str, split_token: str = " ") -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col).str.split(split_token).alias(f"{col}_tokenised")
    )

    return data
Removing stop words
If you’ve worked with text data before, you know that removing stop words is a key step in preprocessing tokenized texts. Removing these words allows us to focus our analysis on only the important parts of the text.
To remove these words, we first need to define them. Here, I’m going to use the default set of stop words from the nltk library plus a set of HTML-related words.
stops = set(
    stopwords.words("english")
    + ["", "nbsp", "content", "type", "text", "charset", "iso", "qzsoft"]
)
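If you haven’t used the NLTK stop word corpus before, you will most likely need to download it once. This is standard NLTK setup rather than anything specific to this pipeline:

import nltk

# One-off download of the stop word corpus used above
nltk.download("stopwords")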
Now, we need to check whether each token is in the stop word set and, if it is, drop it. For this we’ll need to use the arr.eval method because it allows us to run Polars expressions (e.g. .is_in) against every element of the tokenised list. Be sure to read the comments below to understand what each line does, as this part of the code is more complicated.
emails_pl = emails_pl.with_columns(
    # Apply to the tokenised column (it's a list)
    pl.col("email_text_tokenised")
    # For every element, check that it's not in the stop word set and only then return it
    .arr.eval(
        pl.when(
            (~pl.element().is_in(stops)) & (pl.element().str.n_chars() > 2)
        ).then(pl.element())
    )
    # For every element of the new list, drop nulls (items that were in the stop word set)
    .arr.eval(pl.element().drop_nulls())
    .keep_name()
)
As usual, let’s refactor this piece of code into our final pipeline function.
def remove_stopwords(
    data: pl.DataFrame, stopwords: set | list, col: str
) -> pl.DataFrame:
    data = data.with_columns(
        pl.col(col)
        .arr.eval(pl.when(~pl.element().is_in(stopwords)).then(pl.element()))
        .arr.eval(pl.element().drop_nulls())
    )
    return data
Although this pattern may seem quite complicated, it’s worth sticking to the pre-defined str and arr expressions to get the best performance.
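To make the performance point concrete, here is a hypothetical pure-Python alternative that pulls the tokenised column out of Polars and filters it in a list comprehension. It produces the same result, but it runs row by row in Python instead of using parallelised Polars expressions, which is exactly the overhead the str and arr expressions avoid (this snippet is for illustration only, not part of the pipeline):

# Slower, pure-Python alternative to the arr.eval expressions above
tokens_python = emails_pl["email_text_tokenised"].to_list()
filtered = [
    [t for t in tokens if t not in stops and len(t) > 2] if tokens is not None else None
    for tokens in tokens_python
]
emails_slow = emails_pl.with_columns(pl.Series("email_text_tokenised", filtered))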
Complete pipeline
So far, we’ve defined the preprocessing functions and seen how they can be applied to a single column. Polars provides a very convenient pipe method that allows us to chain Polars operations specified as functions. Here’s what the final pipeline looks like:
emails = load_emails_txt("fradulent_emails.txt")
emails_pl = pl.DataFrame({"emails": emails})

emails_pl = (
    emails_pl.pipe(extract_fields)
    .pipe(email_features, "email_text")
    .pipe(email_features, "sender_email")
    .pipe(email_features, "subject")
    .pipe(email_clean, "email_text")
    .pipe(email_clean, "sender_name")
    .pipe(email_clean, "subject")
    .pipe(tokenise_text, "email_text")
    .pipe(tokenise_text, "subject")
    .pipe(remove_stopwords, stops, "email_text_tokenised")
    .pipe(remove_stopwords, stops, "subject_tokenised")
)
Note that we can now easily apply all feature engineering, sanitization, and tokenization functions to all extracted columns, not just the email text as in the examples above.
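If you want to check what the pipeline produced, listing the resulting columns and shape is a quick sanity check (the exact output will depend on your data, so this is just an illustrative snippet):

# Columns and shape of the final DataFrame
print(emails_pl.columns)
print(emails_pl.shape)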
If you’ve made it this far – great job! We read, cleaned, processed, tokenized, and engineered key features at a rate of roughly 4k text records per second (at least on my Mac M2 machine). Now let’s enjoy the fruits of our labour and do some basic text analysis.
First, let’s look at a word cloud of email texts and marvel at all the nonsense we can find.
# Word cloud function
def generate_word_cloud(text: str):
    wordcloud = WordCloud(
        max_words=100, background_color="white", width=1600, height=800
    ).generate(text)

    plt.figure(figsize=(20, 10), facecolor="k")
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.tight_layout(pad=0)
    plt.show()
# Prepare data for word cloud
text_list = emails_pl.select(pl.col("email_text_tokenised").arr.join(" "))[
"email_text_tokenised"
].to_list()
all_emails = " ".join(text_list)
generate_word_cloud(all_emails)
Bank accounts, relatives, security companies and deceased relatives – it’s all there. Let’s see how this looks for text clusters created using simple TF-IDF and K-Means.
# TF-IDF with 500 words
vectorizer = TfidfVectorizer(max_features=500)
transformed_text = vectorizer.fit_transform(text_list)
tf_idf = pd.DataFrame(transformed_text.toarray(), columns=vectorizer.get_feature_names_out())

# Cluster into 5 clusters
n = 5
cluster = KMeans(n_clusters=n, n_init="auto")
clusters = cluster.fit_predict(tf_idf)

for c in range(n):
    cluster_texts = np.array(text_list)[clusters == c]
    cluster_text = " ".join(list(cluster_texts))
    generate_word_cloud(cluster_text)
Below you can see some interesting clusters I identified:
Additionally, I found a few junk clusters, which means there is still room for improvement when it comes to cleaning the text. Still, we were able to extract some useful clusters, so let’s call it a success. Let me know which clusters you find!
This post covered a wide variety of preprocessing and cleaning operations that the Polars library allows you to do. We’ve seen how to use Polars to:
- Extract specific patterns from texts
- Split texts into lists based on a separator
- Calculate the length of texts and the number of regex matches
- Clean texts using regex
- Tokenize texts and filter out stop words
I hope you found this post useful and will give Polars a chance in your next NLP project. Please consider subscribing, clapping and commenting below.
- Radev, D. (2008), CLAIR Fraud Email Collection, ACL Data and Code Repository, ADCR2008T001, http://aclweb.org/aclwiki
- Project GitHub: https://github.com/aruberts/tutorials/tree/main/metaflow/fraud_email
- Polars User Guide: https://pola-rs.github.io/polars-book/user-guide/