Large Language Models in Gas Trading: A Hands-On Use Case

Sep 19, 2024

Wondering how to leverage AI to filter news for commodity trading? What if you have an idea of which news might be relevant for you, but it is not neatly packed in a dataset? What if the data is too sensitive to send to ChatGPT? These were some of the issues we encountered; below we share our experience in solving them.

The Use Case

Our client is in the gas trading business in Europe, which can be significantly affected by sudden events happening worldwide. While the client’s traders monitor news in their day-to-day operations, this was never done in a time-efficient and structured way: traders searched news individually on various websites and shared and discussed it in a Microsoft (MS) Teams channel.

Our Role

The client wanted to see if a better way was possible. They contacted a news data provider and negotiated a trial period of 3 months with full access to historical and live news reports. We were tasked with creating a Proof-of-Concept system that is fully operational from end to end, which:

– Fetches data from the data provider’s website

– Uses an AI (Artificial Intelligence) model to filter the incoming news, keeping only those relevant for our client, and estimates a potential “impact” for each

– Sends the filtered news to an MS Teams channel

An example of the desired output is shown below (Figure 1).

Figure 1. Example of the desired output: a filtered news report posted to the MS Teams channel.

Lastly, the client wanted a weeklong live test, with the system running and the traders able to monitor the channel. Everything therefore had to be ready at least a week before the 3-month trial expired. We were short on time.

What Is Relevant?

Setting up a system that fetches reports from the data provider’s API endpoint and forwards them to an MS Teams channel was a straightforward task without much data science involved. The most glaring issue was the question: what is “relevant”?
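For illustration, the plumbing side could look roughly like the sketch below, assuming a REST endpoint with bearer-token authentication on the provider’s side and an incoming webhook on the Teams side. The URLs, field names, and parameters are placeholders, not the provider’s actual API.

```python
import requests

# Placeholder URLs -- the real provider endpoint and Teams webhook differ.
PROVIDER_URL = "https://api.example-newsprovider.com/v1/reports"
TEAMS_WEBHOOK_URL = "https://example.webhook.office.com/webhookb2/..."

def fetch_latest_reports(api_key: str) -> list:
    """Pull the most recent news reports from the provider's REST endpoint."""
    resp = requests.get(
        PROVIDER_URL,
        headers={"Authorization": f"Bearer {api_key}"},
        params={"limit": 50},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()["reports"]

def post_to_teams(report: dict) -> None:
    """Send a single filtered report to the MS Teams channel via an incoming webhook."""
    payload = {"text": f"**{report['headline']}**\n\n{report['body'][:500]}"}
    requests.post(TEAMS_WEBHOOK_URL, json=payload, timeout=30).raise_for_status()
```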

We started by using the historical news reports that we had access to. We filtered based on energy-related topics[1] and got a dataset of 1.2 million reports over 2 years. From this dataset, we pulled 1000 random news reports.
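As a sketch of that filtering and sampling step (the file and column names are assumptions, not the provider’s actual schema):

```python
import pandas as pd

# Assumed file and column names -- the provider's actual schema will differ.
reports = pd.read_parquet("historical_reports.parquet")

# Keep only energy-related topics using the provider's own tags (see note [1]).
ENERGY_TOPICS = {"Natural Gas", "LNG"}
energy_reports = reports[reports["topic"].isin(ENERGY_TOPICS)]

# Draw a fixed random sample of 1000 reports for manual labeling by the traders.
label_sample = energy_reports.sample(n=1000, random_state=42)
label_sample.to_csv("labeling_sample.csv", index=False)
```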

We then gave the news reports to 2 members of our client’s gas trading team and asked them to label what they thought was “relevant” for them. It was key that they did this independently of each other, so that we could gauge how much “relevant” differs from person to person. As expected, it turned out to differ a lot: their picks overlapped only about 50% of the time. Each labeled around 50 articles (5%) as relevant, but both agreed on only 25 of them. At this point, we decided to define as “relevant” anything at least one of them had picked, or around 75 articles. With this, we were aiming to send to the MS Teams channel every report that someone would find relevant, rather than only those that everyone would find relevant[2].
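A small sketch of how the two sets of labels can be combined into that working definition of “relevant” (the column names are hypothetical):

```python
import pandas as pd

# Hypothetical layout: one row per article, one 0/1 column per annotating trader.
labels = pd.read_csv("labeling_sample_labeled.csv")  # columns: article_id, text, trader_a, trader_b

picked_by_both = ((labels["trader_a"] == 1) & (labels["trader_b"] == 1)).sum()
picked_by_either = ((labels["trader_a"] == 1) | (labels["trader_b"] == 1)).sum()
overlap_share = picked_by_both / labels["trader_a"].sum()  # roughly 50% in our case

print(f"Both: {picked_by_both}, at least one: {picked_by_either}, overlap: {overlap_share:.0%}")

# Working definition for the PoC: relevant = at least one trader flagged it.
labels["relevant"] = ((labels["trader_a"] == 1) | (labels["trader_b"] == 1)).astype(int)
```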

The Methods

We had a dataset. However, for Machine Learning, 1000 samples are rather few. Additionally, there was a “class imbalance”: only 75 relevant reports against 925 irrelevant ones. Having both a small sample and a class imbalance complicates traditional Natural Language Processing methods like BERT[3]. Furthermore, we suspected that 1000 samples out of 1.2 million might not capture enough of the richness of information within those reports. We needed something “smarter”.

This is where Large Language Models (e.g. ChatGPT) come in. Apart from other useful applications, they excel at “few-shot learning”, meaning that the model can perform a task after seeing only a few examples. While “few” can mean as few as two or three, we had 75 relevant examples (and 925 irrelevant ones). That was enough to give the AI some as examples and keep the rest aside to see how it would perform on reports it had not seen – the usual “train/test split”.
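A minimal sketch of that split, building on the labeled sample from above; the 50/50 ratio is an assumption, not something we prescribe:

```python
from sklearn.model_selection import train_test_split

# Split the labeled sample, keeping the 75/925 class ratio in both halves.
train, test = train_test_split(
    labels,
    test_size=0.5,                  # assumed ratio; any hold-out size works in principle
    stratify=labels["relevant"],
    random_state=42,
)

# Relevant reports in the training half become candidates for few-shot examples;
# the test half stays unseen so recall/precision can be measured on it later.
few_shot_pool = train[train["relevant"] == 1]
```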

The Infrastructure

While ChatGPT is state-of-the-art among LLMs, we ran into a serious problem: the data provider prohibited the use of third-party API endpoints. This meant that ChatGPT, Google Gemini, and all other hosted LLMs were out of the question. The solution was to deploy and host an open-source LLM ourselves. After looking at open-source LLM leaderboards, Mixtral 8x7b[4] looked like an appealing choice. It is roughly equivalent to GPT-3.5, which is less powerful than OpenAI’s top model, GPT-4, but still a solid choice. It did, however, require 4 A100 Nvidia GPUs at a cost of approximately 24,000 Euros per month[5]. That was not a viable cost for the PoC. The solution was to use a “quantized” version of Mixtral 8x7b.

Quantization is the process of reducing the floating-point precision (i.e. the size) of the model, thus allowing it to run on weaker infrastructure. It comes at a performance cost, but tests have shown that the loss is not linear: there is a “sweet spot” around 6 bits (down from 32), where the model is much smaller with a bearable quality drop. We could deploy[6] the quantized version of Mixtral 8x7b on a comparatively cheap Azure container with 8 CPUs, 52 GB RAM, and 1 T4 GPU at a cost of ~600 Euro per month – a far more bearable cost.
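To give an idea of what serving the quantized model looks like, here is a minimal sketch using the llama-cpp-python bindings for llama.cpp (see note [6]); the GGUF file name and the offloading settings are assumptions, not our exact deployment:

```python
from llama_cpp import Llama  # Python bindings for llama.cpp

# A 6-bit (Q6_K) GGUF build of Mixtral 8x7B Instruct; the file name is illustrative.
llm = Llama(
    model_path="mixtral-8x7b-instruct-v0.1.Q6_K.gguf",
    n_ctx=8192,        # context window reserved for prompt + answer
    n_gpu_layers=12,   # offload what fits into the T4's 16 GB; the rest runs on CPU
    n_threads=8,       # match the 8 CPU cores of the Azure container
)

response = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Is this report relevant for a European gas trader? ..."}],
    max_tokens=128,
    temperature=0.0,   # deterministic output for classification-style answers
)
print(response["choices"][0]["message"]["content"])
```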

Defining the Prompt

A “prompt” is the text one inputs into an LLM to explain what one wants it to do. “Prompt engineering” is the science (or art) of knowing what to ask an LLM to get the best results. LLMs are trained on huge datasets containing knowledge from various domains. By asking the model to act as an expert in some field, one guides it to pay attention to the specific domain it has seen during training.

This is where we started. We told the LLM that “you are a gas trading expert” and “your task is to state whether the news article {X} would be relevant for you”[7]. We then added some examples from our dataset and asked it to label the rest of our 1000 samples. The result was about 95% recall (it caught more than 9 out of 10 relevant articles) but only about 20% precision (of everything the LLM flagged as relevant, only 2 out of 10 truly were). In other words, we had far too many false positives – the model considered far too many reports relevant. This was a problem because it would have created too much “noise” in the MS Teams channel.
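Schematically, the prompt looked roughly like this (only the gist, see note [7]; the exact wording and answer format are simplified):

```python
def build_prompt(article: str, examples: list) -> str:
    """Role prompt plus a handful of labeled examples (few-shot)."""
    shots = "\n\n".join(
        f"Report: {ex['text']}\nRelevant: {'yes' if ex['relevant'] else 'no'}"
        for ex in examples
    )
    return (
        "You are a gas trading expert working in the European market.\n"
        "Your task is to state whether the following news report would be relevant for you.\n"
        "Answer with 'yes' or 'no'.\n\n"
        f"Examples:\n{shots}\n\n"
        f"Report: {article}\nRelevant:"
    )
```

Comparing the model’s yes/no answers on the held-out half with the traders’ labels is what gives the recall and precision figures quoted above.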

Retrieval Augmented Generation

A more advanced technique is RAG (Retrieval Augmented Generation). If one has not two or three examples but 100, one cannot add all 100 to the prompt every time the model is asked to label a report. But one can have a secondary model in the backend that, given an incoming article, finds the examples closest to it in meaning and adds those to the prompt. The enriched prompt is then fed into the LLM. That is RAG in essence. With it, we improved the precision to 30%. Still not satisfactory.
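In code, such a retriever could look roughly like this, assuming a locally hosted embedding model (which keeps the data in-house, in line with the provider’s restrictions); the model choice and the top-k value are illustrative:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Any locally hosted embedding model works; this particular one is only an example.
embedder = SentenceTransformer("all-MiniLM-L6-v2")

example_texts = train["text"].tolist()        # the labeled training half (both classes)
example_labels = train["relevant"].tolist()
example_vecs = embedder.encode(example_texts, normalize_embeddings=True)

def retrieve_examples(article: str, k: int = 5) -> list:
    """Return the k labeled examples closest in meaning to the incoming article."""
    query_vec = embedder.encode([article], normalize_embeddings=True)[0]
    scores = example_vecs @ query_vec         # cosine similarity (vectors are normalized)
    top_idx = np.argsort(scores)[::-1][:k]
    return [(example_texts[i], example_labels[i]) for i in top_idx]
```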

Let’s Think Step-By-Step

Why was the model not understanding what is relevant? Perhaps asking “what is relevant” was too vague? At this point, we switched strategy. LLMs are considered good at reasoning, so perhaps if we could explain why something was relevant, together with the examples, we would have more success. We went back to our client, asked for an interview with the people who had done the labeling, and asked them about their reasoning. That allowed us to break the labeling problem down into smaller steps (also a recommended technique when prompting an LLM) and to derive guiding questions: instead of “Is this article relevant?”, we asked “Does the information in the report pertain to gas supply or demand?”, “Does the report contain a change in the status quo?”, “Has the market reacted to the reported news?” etc. That approach cut the false positives down to 4 out of 10 (60% precision).
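As a sketch, the decomposed prompt could be assembled along these lines (again only the gist; the real guiding questions and wording were richer):

```python
GUIDING_QUESTIONS = [
    "Does the information in the report pertain to gas supply or demand?",
    "Does the report contain a change in the status quo?",
    "Has the market reacted to the reported news?",
    # ...further questions derived from the traders' own reasoning
]

def build_stepwise_prompt(article: str, examples_block: str) -> str:
    """Break 'is this relevant?' into smaller reasoning steps before the final verdict."""
    steps = "\n".join(f"{i + 1}. {q}" for i, q in enumerate(GUIDING_QUESTIONS))
    return (
        "You are a gas trading expert working in the European market.\n"
        "Answer the following questions about the report one by one, "
        "then conclude with 'Relevant: yes' or 'Relevant: no'.\n\n"
        f"{steps}\n\n"
        f"Examples:\n{examples_block}\n\n"
        f"Report: {article}"
    )
```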

Success With A Grain Of Salt

We started with an open problem, no real dataset, and data protection and infrastructure limitations. We ended with a fully functional end-to-end system with above 90% recall and 60% precision. And we did live user testing. And all that in 3 months. Not too bad for a PoC.

Still, one can always do better. The user testing showed that one does pay a cost for doing more with less. The model occasionally made ridiculous mistakes, such as labeling reports on soybean futures as relevant[8]. It was getting confused, this time because of too much instruction rather than too little. With many steps, guiding questions, and examples, our prompt had grown quite large – thousands of words. That led to our model “losing attention” – in other words, the sheer length of the task description started to confuse the AI. Usually, this should not be a problem: Mixtral was trained for roughly 6000 words of input length. But remember that we had a quantized version of the model, which, while smaller and faster, was in the end also less intelligent. There is, indeed, no free lunch.
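One cheap safeguard would be to check the prompt length against the model’s context budget before sending it, for example with the tokenizer exposed by llama-cpp-python; the budget below is illustrative:

```python
# Count tokens with the model's own tokenizer before sending the request.
prompt = build_stepwise_prompt(article="...", examples_block="...")
n_tokens = len(llm.tokenize(prompt.encode("utf-8")))

TOKEN_BUDGET = 4000  # illustrative: leave headroom for the model's answer
if n_tokens > TOKEN_BUDGET:
    print(f"Prompt is {n_tokens} tokens; consider trimming examples or guiding questions")
```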

 

Author: Ivan Dochev


Notes:

[1] The data provider had already given some labels, so we selected topics like “Natural Gas”, “LNG” etc.

[2] Of course, having more than 2 people doing the labeling would have perhaps improved the quality of the sample, but the client’s time is costly, and we deemed this to be good enough for a PoC.

[3] Bidirectional Encoder Representations from Transformers (BERT) was the state-of-the-art language model for classifying text before Large Language Models came along. It, however, required fine-tuning on a modestly large dataset to perform well.

[4] Developed by Mistral AI (https://mistral.ai/)

[5] As of April 2024. Note that it varies if one uses Azure or AWS and/or Databricks or other layers on top, but it would be in the tens of thousands.

[6] Using llama.cpp (https://github.com/ggerganov/llama.cpp, kudos to the developers!)

[7] We present here only the gist of what our prompt contained.

[8] Because of the “commodity markets” tag, such reports also slipped past our initial, tag-based filters.
