# What we'll do

Journalists frequently encounter the mountains of messy data generated by our periphrastic society. This vast and verbose corpus boasts everything from longhand entries in police reports to the legalese of legislative bills.

Understanding and analyzing this data is critical to the job but can be time-consuming and inefficient. Computers can help by automating the work of sorting through blocks of text, extracting key details and flagging unusual patterns.

A common goal in this work is to classify text into categories. For example, you might want to sort a collection of emails as “spam” or “not spam,” or identify corporate filings that suggest a company is about to go bankrupt.

Traditional techniques for classifying text, like keyword searches or regular expressions, can be brittle and error-prone. Machine learning models can be more flexible, but they require large amounts of human training and a high level of computer programming expertise, and they often yield unimpressive results.
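To make that brittleness concrete, here is a minimal sketch of a keyword-style classifier for the spam example. The pattern, the `classify` function and the sample messages are all hypothetical, invented only for illustration.

```python
import re

# Hypothetical rule: treat a message as spam if it contains one of two phrases.
SPAM_PATTERN = re.compile(r"\b(free money|you have won)\b", re.IGNORECASE)


def classify(text: str) -> str:
    """Label text as 'spam' or 'not spam' using a single regular expression."""
    return "spam" if SPAM_PATTERN.search(text) else "not spam"


# Made-up sample messages, not real data.
examples = [
    "You have won a prize! Claim your free money now.",
    "You've won a prize! Claim your fr33 cash now.",
    "The council debated whether free money for startups is wise.",
]

for text in examples:
    print(classify(text), "|", text)
```

The first message is caught. The second slips through because it uses a contraction and an altered spelling, and the third, an ordinary news sentence, is wrongly flagged because it happens to contain one of the phrases. Patching each miss with another rule is what makes this approach hard to maintain.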

Large language models offer a better deal. We will demonstrate how you can use them to get superior results with less hassle.

## Our example case

To show the power of this approach, we’ll focus on a specific data set: Maryland state grants.

State agencies provide millions of dollars each year to governments, non-profit organizations and other institutions. Tracking that spending can reveal patterns and lead to important stories.

But it’s no easy task. The names of grantees are not standardized and neither are the descriptions of the purpose of the spending.

We will create a classifier that can scan the grants and categorize them.