Augmenting Legal Teams with Machine Learning

Whitepaper

Jun 9

TL;DR

Much like their portfolio companies, private equity and venture capital firms are under pressure to control their costs during this year’s economic headwinds.
Legal costs are one area where firms are looking for efficiencies, and machine learning (ML) can be applied to redistribute activities so that expensive legal resources are spent on the most strategic elements of a deal.
Use cases like automated data room organization, contract risk analysis, and contract variance review can free up legal teams, and they rely on battle-tested ML techniques.
Text embeddings (i.e. a way to convert unstructured text data into structured numerical values) are a critical foundation to ensure that the nuances of contracts are encoded for downstream processing.
Firms should be critical when they decide which models to build. It is often a good idea to start by building automation for contracts that the firm reviews in high volumes and that tend to have a standard format.

In the realm of deal making, legal considerations occupy a pivotal but frequently under-celebrated role. As companies engage in complex transactions, legal expertise becomes an indispensable asset for mitigating risks, ensuring compliance, and safeguarding deal success. However, legal costs associated with deal making have surged in recent years, creating significant financial burdens for firms. An independent survey of 300 in-house lawyers working in private equity and venture capital concluded that two thirds of firms saw an increase in legal spend during 2022 and they expect the same to continue through 2023. As fund returns face headwinds from inflation and rising interest rates, the pressure to control legal costs is building. The study also highlighted that it is not just fund managers putting the pressure on legal teams to reduce costs. A whopping 84% of those surveyed agreed that Limited Partners (LPs) are taking matters into their own hands and are challenging increasing legal fees too.

To address these challenges, firms need to evaluate how they are using legal resources. High cost legal professionals should assume strategic roles as negotiation advisors and due diligence experts. By leveraging machine learning (ML) technologies, organizations can optimize resource allocation, enabling professionals to focus on complex, high-value activities. This Perspective explores the transformative potential of ML in legal activities specifically within the context of deal making. It highlights key use cases but more importantly, it illustrates the challenges you are likely to face and the approaches to take to overcome them.

Rather than list dozens of use cases for machine learning in the legal domain (there are enough blogs and LinkedIn posts doing this), we will discuss three specific ones that we know to be feasible and provide measurable value. We will focus on the value of each solution and the important data and machine learning concepts you will encounter while tackling them.

Use Case 1: Legal Data Room Organization & Document Verification

After a letter of intent (LOI) is signed and due diligence kicks off, the exchange of documents accelerates and quickly consumes the legal team’s time. Countless documents, including vendor contracts, leases, customer contracts, option agreements, employment agreements, and investor side letters, are shared in the data room. External legal counsel assign large teams of junior resources to sift through these documents, ensuring they are properly organized and comparing the list of documents received to their due diligence checklists. It is important that they work through this data organization task quickly to give their side as much time as possible to perform due diligence, but when done manually, the process is labor-intensive (i.e. expensive) and prone to mistakes. ML offers a solution by augmenting the legal team's resources. It does not seek to change the thoroughness of the process or the steps involved. Rather, the goal of the solution is to multiply the capacity of a firm’s junior legal resources with a system that does not make mistakes when it gets tired.

At the core, sorting documents in a data room is a document classification problem in the field of machine learning. Like a junior legal analyst, the model learns from historical contracts, reviews a new contract's title and contents, and determines what type of contract it is. Document classification is a proven technique in ML. To build a model for your firm, you will need to collect and annotate relevant data, train your model, and then integrate that model into your existing process.

Step 1: Data Collection & Annotation

To learn how to distinguish between the different contract types, a document classification model needs to be trained on historical contracts (training data) that have already been labeled with the correct contract type. Creating a training dataset large enough to accomplish this is usually a time-consuming task, but private equity legal teams have a unique advantage that accelerates this step. Portfolio company contract databases provide an ideal source because they contain numerous contracts from different businesses that have already been organized manually by teams of legal analysts. Software can be written to automatically ingest contracts for the model to be trained on. But even if data needs to be collected manually, the time to gather training data has already been drastically reduced because the contracts are stored centrally and organized by type.
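As a minimal sketch of the automated ingestion described above, the snippet below turns a folder tree into labeled training examples. It assumes a hypothetical layout in which contracts have already been exported to plain text and filed as root/<contract type>/<file>.txt; the function name and the layout are illustrative, not a description of any particular contract database.

```python
from pathlib import Path


def load_labeled_contracts(root):
    """Return (text, label) pairs, using each top-level folder name as the label.

    Assumes a hypothetical layout such as root/<contract_type>/<file>.txt,
    i.e. contracts already exported to plain text and filed by type.
    """
    examples = []
    for type_dir in sorted(Path(root).iterdir()):
        if not type_dir.is_dir():
            continue  # ignore stray files at the top level
        for path in sorted(type_dir.glob("*.txt")):
            text = path.read_text(encoding="utf-8").strip()
            if text:  # skip empty exports
                examples.append((text, type_dir.name))
    return examples
```

Because the folder name doubles as the label, the manual filing work that analysts have already done is reused directly as annotation.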

Step 2: Model Development

Once the training data is collected, the next step is to build the model that will perform the document classification for a contract that it has not been trained on. An essential component of document classification models is the text embedding that is used. A text embedding is a method that takes an unstructured document (i.e. text in a contract) and converts it into a vector that the classification algorithm can understand and learn from. A well-designed text embedding captures the nuances of the contract, enabling the model to better understand its contents and subsequently classify contracts with better precision. Figure 1 below provides a visual representation of a simplified embedding with only 10-dimensions for some common terminology that you might find in a non-disclosure agreement.
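To make the idea of "text in, vector out" concrete, here is a deliberately simple sketch that maps any text to a 10-dimensional vector using the hashing trick (each word is hashed into one of 10 buckets and the counts are normalized). This is an assumption chosen for illustration only: it ignores word order and meaning, whereas the embeddings discussed in this paper would typically be learned from data. The function name `embed` is hypothetical.

```python
import hashlib
import math
import re


def embed(text, dims=10):
    """Map text to a fixed-length, unit-norm vector using the hashing trick."""
    vector = [0.0] * dims
    for token in re.findall(r"[a-z]+", text.lower()):
        # A stable hash (unlike the built-in hash()) keeps vectors reproducible.
        digest = hashlib.md5(token.encode("utf-8")).digest()
        vector[int.from_bytes(digest[:4], "big") % dims] += 1.0
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else vector
```

Whatever technique is used, the contract is the same: every document, regardless of length, becomes a vector of the same size that downstream models can compare and learn from.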

Figure 1: Visual representation of a simplified embedding

With a text embedding in place, techniques like linear discriminant analysis (LDA) or clustering are employed to train the classification model. These methods use the text embeddings to understand what information the contract contains so that the model can determine what type of contract it is most likely to be.
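The sketch below illustrates the train-then-predict pattern on embedding vectors. Note the swap: rather than LDA, it uses a nearest-centroid classifier, a simpler stand-in that averages the vectors of each contract type and assigns a new document to the most similar average. The function names and the toy three-dimensional vectors in the usage note are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(vectors, labels):
    """Average the embedding vectors of each contract type into one centroid."""
    grouped = defaultdict(list)
    for vector, label in zip(vectors, labels):
        grouped[label].append(vector)
    return {
        label: [sum(column) / len(group) for column in zip(*group)]
        for label, group in grouped.items()
    }


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify(vector, centroids):
    """Return the label whose centroid is most similar to the vector."""
    return max(centroids, key=lambda label: cosine(vector, centroids[label]))
```

For example, training on `[[1.0, 0.0, 0.1], [0.9, 0.1, 0.0]]` labeled "nda" and `[[0.0, 1.0, 0.2], [0.1, 0.8, 0.1]]` labeled "lease" yields two centroids, and `classify([0.8, 0.2, 0.0], centroids)` then picks "nda".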

Step 3: Integration into Process

A trained document classification model alone is not enough to create value because it is not integrated into the legal team’s process. The model needs to be wrapped in software that allows a user to plug it into the team’s data room review process. Some examples of the software that firms build around their models include:

Integration into their virtual data room technology so that the model has direct access to pull contracts that need to be analyzed and organize them into the appropriate folders after determining what type of contract it is
A light user interface so that a user can tell the model where to look for uncategorized contracts and where the model should save the contracts after they have been classified
A dashboard so that the user can check on the classification process and see how many contracts have been classified
A digital version of the due diligence checklist that the ML solution updates so that teams can quickly identify if there are any types of contracts missing or if the volume of a particular contract type seems insufficient (e.g. fewer customer contracts than expected)
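The digital checklist in the last item can be sketched in a few lines. The example assumes the classifier has already produced one predicted contract type per document, and that the deal team supplies expected minimum counts; both the function name and the thresholds are hypothetical.

```python
from collections import Counter


def checklist_status(predicted_types, expected_minimums):
    """Compare classified document counts with a due diligence checklist.

    expected_minimums maps contract type -> minimum number of documents the
    deal team expects to receive (hypothetical values set by the reviewer).
    """
    counts = Counter(predicted_types)
    status = {}
    for contract_type, minimum in expected_minimums.items():
        received = counts.get(contract_type, 0)
        if received == 0:
            state = "missing"
        elif received < minimum:
            state = "below expected volume"
        else:
            state = "ok"
        status[contract_type] = (received, state)
    return status
```

A reviewer would still decide what to do about a "missing" or "below expected volume" entry; the code only surfaces the gap.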

As you can see, this ML-powered system does not change the steps in the process, nor does it eliminate the need for legal expertise. Rather, it allows the firm to re-allocate activities, lets legal professionals act as reviewers, and frees up time for more complex due diligence activities.

Use Case 2: Contract Risk Analysis

During due diligence, legal teams meticulously review contracts to identify red flags and assess risks that could impact the outcome or price of a deal. Examples include analyzing customer contracts for potential churn, vendor contracts for dependencies and high cancellation costs, historical litigation for future financial liabilities, and employment agreements for unusual terms, payments, bonuses, or equity grants. However, these documents are often lengthy and contain extensive amounts of text that are irrelevant to the risk assessment. This increases the time it takes for legal professionals to complete a risk assessment because significant effort is required just to locate the pertinent information.

An ML solution can address this challenge by extracting key terms from each contract type in a way that provides legal professionals with the data points they need to make a risk assessment. Ultimately, it allows legal professionals to spend their time assessing the risk associated with the contract’s terms, rather than searching for the information needed to make the assessment.

You can visualize the solution to this problem as a multi-stage filter (see figure 2) that emulates the thought process that a legal professional uses. It uses several machine learning models including document classification, topic classification, and named entity recognition (NER) to first identify what information needs to be extracted from the document and then search the document to extract the relevant data points.

Figure 2: Filter architecture for contract risk analysis solution
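The multi-stage filter can be sketched as three functions applied in sequence. In the example below the three models are passed in as plain callables, so the wiring is visible without committing to any particular model; the `RELEVANT_TOPICS` mapping and all names are hypothetical placeholders.

```python
# Hypothetical mapping from contract type to the section topics worth reading.
RELEVANT_TOPICS = {
    "vendor contract": {"pricing", "cancellation"},
    "employment agreement": {"base compensation", "equity-based compensation"},
}


def run_filters(document, classify_document, label_sections, extract_entities):
    """Apply the three filters in order and return extracted data points by topic."""
    # Filter 1: the contract type decides which topics matter downstream.
    contract_type = classify_document(document)
    wanted = RELEVANT_TOPICS.get(contract_type, set())

    results = {}
    # Filter 2: label_sections yields (topic, section_text) pairs.
    for topic, section_text in label_sections(document):
        if topic in wanted:
            # Filter 3: pull data points only from the relevant sections.
            results.setdefault(topic, []).extend(extract_entities(section_text))
    return contract_type, results
```

Each stage narrows what the next stage has to read, which mirrors how a legal professional skims a contract before reading the relevant clauses closely.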

Filter 1: Document classification to identify the type of contract

Document classification, explained in the previous section, serves as the initial filter to identify the type of contract being analyzed. Different types of documents give rise to different risks, and different risks need to be assessed using different data points. For example, the risk inherent in a customer contract is analyzed in a completely different way than the risk in the executive team’s employment and option agreements. The document classification model ensures that the downstream filters look for and eventually extract the information needed to make a proper risk assessment.

Filter 2: Topic classification to identify the different sections of a contract

After identifying the document type, the ML solution will then read the document contents to identify the sections in the document. The model’s approach is similar to a legal professional’s as it performs this task. It will split the document into parts, likely based on section headers, and it will use a combination of the section title itself and the content of that section to determine what the section is about and whether it contains information relevant to the risk assessment.

In machine learning, this process is referred to as topic classification. The steps to build a topic classifier are similar to those for the document classification model described above. The model is trained on historical examples of contracts in which the sections of those documents have been labeled with the relevant topic. For example, a customer contract might be labeled with topics such as scope of work, pricing, cancellation, and intellectual property rights, whereas an employment contract will be labeled with topics such as base compensation, variable compensation, vacation, equity-based compensation, non-solicitation, and role description. After being trained on historical contracts, the model will then be able to read a section in a new contract and determine whether that section belongs to one of the topics that it has been trained to identify.
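The split-then-label flow can be illustrated as follows. The sketch assumes sections start with numbered headers such as "3. Termination", and it uses hand-written keyword lists as a stand-in for a trained topic classifier; the keyword lists, the double weighting of title words, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical keyword lists standing in for a trained topic classifier.
TOPIC_KEYWORDS = {
    "pricing": {"fee", "fees", "price", "payment", "invoice"},
    "cancellation": {"terminate", "termination", "cancel", "cancellation", "notice"},
    "intellectual property": {"intellectual", "license", "ownership"},
}

HEADER = re.compile(r"^\s*(\d+)\.\s+(.+)$")


def split_sections(contract_text):
    """Split a contract into (title, body) pairs on lines like '3. Termination'."""
    sections, title, body = [], None, []
    for line in contract_text.splitlines():
        match = HEADER.match(line)
        if match:
            if title is not None:
                sections.append((title, " ".join(body)))
            title, body = match.group(2).strip(), []
        elif title is not None and line.strip():
            body.append(line.strip())
    if title is not None:
        sections.append((title, " ".join(body)))
    return sections


def classify_section(title, body):
    """Pick the topic with the most keyword hits; title words count double."""
    title_words = re.findall(r"[a-z]+", title.lower())
    body_words = re.findall(r"[a-z]+", body.lower())
    scores = {
        topic: 2 * sum(w in keywords for w in title_words)
        + sum(w in keywords for w in body_words)
        for topic, keywords in TOPIC_KEYWORDS.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "other"
```

A trained model would replace `classify_section`, but the inputs (section title plus section content) and the output (one topic per section, or "other") stay the same.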

Filter 3: Named entity recognition to extract data points from the contract

The final filter is named entity recognition (NER), and it allows the model to extract the specific data points that a legal professional needs to complete a risk assessment. For instance, when reviewing vendor contracts, legal teams will be interested in understanding the financial penalty associated with contract cancellation. By examining sections related to the scope of services, pricing, and cancellation, an NER model can extract entities such as contract price, contract term, contract initiation date, cancellation penalty, and notice period, all of which are required for a professional to determine the risk and financial implications of cancelling the contract.
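To show the shape of the output this filter produces, the sketch below pulls dollar amounts and durations out of a section using regular expressions. This is a rule-based stand-in, not a trained NER model: it finds candidate values but cannot tell a cancellation penalty from a contract price, which is exactly the context a learned model adds. The patterns and function name are hypothetical.

```python
import re

# Dollar amounts such as $5000, $25,000 or $25,000.00.
MONEY = re.compile(r"\$\s?\d+(?:,\d{3})*(?:\.\d{2})?")
# Durations such as "60 days" or "2 years".
DURATION = re.compile(r"\b(\d+)\s+(day|month|year)s?\b", re.IGNORECASE)


def extract_entities(section_text):
    """Return candidate amounts and durations found in one contract section."""
    return {
        "amounts": [m.group(0) for m in MONEY.finditer(section_text)],
        "durations": [
            (int(m.group(1)), m.group(2).lower())
            for m in DURATION.finditer(section_text)
        ],
    }
```

Running it only on sections already labeled as pricing or cancellation keeps the number of irrelevant matches low.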

It is possible to automate the actual risk assessment task as well, but building the three filters required to extract entities is a great first step on its own. It provides significant value to legal teams and drastically reduces the time required to review contracts.

Use Case 3: Automated Red-Line Review

Another time consuming but crucial activity that legal teams are responsible for is the drafting and reviewing of deal agreements (e.g. non-disclosure agreement, letter of intent, merger agreement, closing documents, etc.). Typically, one party’s legal team will write the first draft, send it to the other party’s team for review, and then receive a “red-line” version back that includes the other party’s proposed changes. Once the team receives the red-line, the first questions that deal teams ask are, “What did they change, and how significant are their changes?”. Much of the back-and-forth is reviewed manually by legal professionals and executives who are starting to ask themselves how they can apply automation to this process.

An ML solution that can be applied in this case is a contract similarity tool that compares a red-lined contract to a standard version of that contract, assesses how different the red-line version is, and highlights the notable differences to a legal professional. Before diving into how this type of solution is built, it is important to first understand the types of contract reviews that are most likely to benefit from this type of solution. Figure 3 below provides a useful framework to prioritize the types of legal document reviews that the solution could be built for. This model is by no means a complete accounting of all the types of legal documents reviewed during a deal process but should illustrate how some contract types are better suited to ML-based analysis than others.

Figure 3: Contract type prioritization framework

As you can see, the documents that are typically the most standard and reviewed in high volumes (green quadrant) are the best candidates to be augmented by ML. The expected return on investment is high because they strike the right balance between time saving and technical feasibility.

The solution for this use case is built on two important ML components: a text embedding and a text similarity model. Like the use cases above, the text embedding component is critical to encode an unstructured document like an NDA into a format that is usable by ML. The choice of text embedding technique matters because the more representative the embedding is of the document details, the more precise the downstream models can be.

Why is this solution better than Microsoft Word document comparison?

When the concept of text similarity is first introduced to most people, their reaction tends to be “Can I not just use document comparison to accomplish the same thing?” Although they have correctly identified the way that text similarity can be used, they have not understood what text similarity models do. These models produce a score (often a probability) that describes how semantically similar two pieces of text are to one another. The advantage that this has over word-by-word document comparison is that two pieces of text do not have to be identical to be considered the same. This is especially useful when slightly different language is used in legal documents but the clauses effectively have the same meaning. This is illustrated by the two samples below:

Confidential Information: "Confidential Information" refers to any non-public, proprietary, or confidential information, whether disclosed in written, oral, electronic, or any other form, that is marked or identified as confidential or that should be reasonably understood to be confidential given the nature of the information and the circumstances of disclosure.

Confidential Information: For the purposes of this Agreement, "Confidential Information" refers to any non-public, proprietary, or confidential information disclosed by the Disclosing Party to the Receiving Party, whether in written, oral, electronic, or any other form, that is marked or identified as confidential or that should be reasonably understood to be confidential given the nature of the information and the circumstances of disclosure.

These are both confidentiality terms extracted from non-disclosure agreements. A document comparison tool, such as the one in Microsoft Word, reports that the two paragraphs are different. That will prompt a legal analyst to review the term, note the differences, and eventually determine that although they use different language, the meaning is the same. An ML text similarity model, however, scores the two terms as similar, so the legal analyst is not prompted to review them and can move on to other terms in the contract that are semantically different and might introduce risk.
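The contrast between "are these strings identical?" and "how similar are they?" can be sketched with the two clauses above. The example scores them with cosine similarity over simple word counts, which is a lexical stand-in chosen to keep the sketch self-contained; a real similarity model would score learned embeddings and could also recognize paraphrases that share few words. The 0.85 review threshold is a hypothetical value.

```python
import math
import re
from collections import Counter

CLAUSE_A = (
    '"Confidential Information" refers to any non-public, proprietary, or '
    "confidential information, whether disclosed in written, oral, electronic, "
    "or any other form, that is marked or identified as confidential or that "
    "should be reasonably understood to be confidential given the nature of "
    "the information and the circumstances of disclosure."
)
CLAUSE_B = (
    'For the purposes of this Agreement, "Confidential Information" refers to '
    "any non-public, proprietary, or confidential information disclosed by the "
    "Disclosing Party to the Receiving Party, whether in written, oral, "
    "electronic, or any other form, that is marked or identified as "
    "confidential or that should be reasonably understood to be confidential "
    "given the nature of the information and the circumstances of disclosure."
)


def bag_of_words(text):
    """Count lowercase word occurrences in the text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine_similarity(text_a, text_b):
    """Cosine similarity between the word-count vectors of two texts."""
    a, b = bag_of_words(text_a), bag_of_words(text_b)
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(c * c for c in a.values())) * math.sqrt(
        sum(c * c for c in b.values())
    )
    return dot / norm if norm else 0.0


def needs_review(text_a, text_b, threshold=0.85):
    """Flag a clause pair for human review only when similarity is low."""
    return cosine_similarity(text_a, text_b) < threshold
```

An exact comparison (`CLAUSE_A == CLAUSE_B`) is false, so a diff tool flags the pair, while `needs_review(CLAUSE_A, CLAUSE_B)` is false because the score sits above the threshold.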

There are several machine learning techniques that can be used to build text similarity models and the most appropriate approach is usually identified through experimentation and evaluation against one another. Some approaches that would be used include:

Word Embeddings with cosine similarity: Techniques like Word2Vec, GloVe, or FastText can be used to create word embeddings. To compare two texts, the word embeddings of each text are first combined into a single vector (for example, by averaging them), and the two resulting vectors are then compared using a measure like cosine similarity, which scores how closely the two vectors point in the same direction.
Siamese Neural Networks: Siamese networks consist of two identical neural networks with shared weights. Each network processes one text, and their outputs are compared to determine similarity. The networks are trained with pairs of text, where the similarity between the texts is known. The model learns to encode the texts into fixed-length vectors and optimize the similarity metric.
Recurrent Neural Networks (RNNs): RNNs, such as Long Short-Term Memory (LSTM) or Gated Recurrent Unit (GRU), can be used to model sequences of words in paragraphs. By feeding the paragraphs through an RNN, it can capture the context and meaning of the words in each paragraph. The final hidden states of the RNN can be compared using similarity measures.
Transformer-based Models: Transformer models, such as BERT (Bidirectional Encoder Representations from Transformers), have achieved state-of-the-art results in various natural language processing tasks. BERT can be used in this case to obtain contextualized word representations for each term. By comparing the representations at the term level, semantic similarity can be inferred.

The approach that you decide to implement will ultimately depend on the nature of the documents being compared and whether you have training data available to take a supervised approach or whether you need to follow one of the unsupervised approaches.
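One way to run the experimentation described above is to score every candidate approach against the same set of clause pairs that a legal professional has labeled as "same meaning" or "different meaning". The sketch below shows such a harness with two toy candidates (exact match and word-overlap Jaccard similarity) standing in for the real models; the names, the 0.5 threshold, and the sample pairs in the usage note are hypothetical.

```python
def exact_match(a, b):
    """Baseline: texts are similar only if they are identical."""
    return 1.0 if a == b else 0.0


def jaccard(a, b):
    """Share of distinct words that the two texts have in common."""
    set_a, set_b = set(a.lower().split()), set(b.lower().split())
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 1.0


def accuracy(similarity_fn, labeled_pairs, threshold=0.5):
    """Share of pairs where (score >= threshold) matches the human label."""
    correct = sum(
        (similarity_fn(a, b) >= threshold) == is_same
        for a, b, is_same in labeled_pairs
    )
    return correct / len(labeled_pairs)


def pick_best(candidates, labeled_pairs, threshold=0.5):
    """Return (name, accuracy) for the best-scoring candidate."""
    scores = {
        name: accuracy(fn, labeled_pairs, threshold)
        for name, fn in candidates.items()
    }
    best = max(scores, key=scores.get)
    return best, scores[best]
```

Given pairs such as `("notice period of thirty days", "thirty days notice period", True)`, exact match fails where word overlap succeeds; swapping in embedding-based models only requires adding them to the `candidates` dictionary.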

In closing

The exploration of these use cases demonstrates the feasibility and benefits of applying well-researched ML techniques to the various legal activities required for deal making. The crucial step of creating appropriate text embeddings lays the foundation for downstream models. The success of models built upon these embeddings, such as document classification, topic classification, and text similarity, greatly depends on the ability of the embeddings to capture the intricate details of contracts. However, it is important to note that the models alone do not provide value to the firm. To fully leverage their potential, investments should also be made in software integration, enabling seamless incorporation into existing legal processes and systems. Once implemented and adopted, ML-powered solutions like the ones described above will improve the efficiency of legal teams and enable professionals to focus on the more complex parts of the deal making process like negotiation which will ultimately improve deal outcomes for the firm and improve returns.

Matthew Smiarowski