Last Updated on November 15, 2023
Note that this article does not refer to custom GPTs. Using custom GPTs, it is possible to build a Warren Buffett chatbot without code. Learn more about custom GPTs here.
Table of contents:
- How to build a Warren Buffett Chatbot
- What is OpenAI?
- What is ChatGPT?
- What is Whisper?
- What is StableWhisper?
- Getting started
- What is Diarization?
- How to perform diarization and transcription with Whisper?
- What is fine-tuning?
- How to organize the data for ChatGPT fine-tuning?
- How to fine-tune ChatGPT on your own data?
- How to chat with Warren Buffett using ChatGPT?
- My thoughts and ideas
- Full code
How to build a Warren Buffett Chatbot using OpenAI?
First, we will collect data on Warren Buffett’s market knowledge via video interviews. Next, we will clean and process this data. Lastly, using OpenAI’s API, we will fine-tune our own ChatGPT model by training it on the prepared data.
What is OpenAI?
OpenAI is an American artificial intelligence research laboratory primarily focused on NLP tasks using Large Language Models (LLMs) such as ChatGPT, Whisper, and the like.
Link to website: https://openai.com
What is ChatGPT?
ChatGPT is an LLM developed by OpenAI that is used for solving a wide range of tasks ranging from finance to bedtime stories.
Link to website: https://chat.openai.com
What is Whisper?
Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. It is great for transcription and similar tasks.
Link to repo: https://github.com/openai/whisper
What is StableWhisper?
StableWhisper is an alteration of OpenAI’s Whisper model that produces more reliable timestamps.
The issue with the original Whisper is that it is prone to overshooting segments and thus producing unreliable timestamps, which ends with sentences being cut off halfway and similar artifacts. This isn't a big issue if we only want summaries, but for diarization it reduces our accuracy.
There are procedures that can realign the original Whisper output, such as using Wav2Vec2 or different punctuation models. Depending on your use case, this might be overkill, so I've opted for StableWhisper, which doesn't solve all the issues but helps considerably.
Link to the repo: jianfch/stable-ts: ASR with reliable word-level timestamps using OpenAI’s Whisper (github.com)
Getting started
The goal of this article is to create an AI chatbot that simulates chatting with Warren Buffett using ChatGPT.
To achieve such a goal, we will need to do several things that will be laid out below:
- Collect training data
- Clean and organize the training data
- Finetune ChatGPT on the data
- Chat
We won't use too much data in this article, both for brevity and so that the reader can follow along without issues.
The audio we will use comes from these three semi-randomly picked YouTube videos in which Warren Buffett was interviewed or hosted:
- Watch CNBC’s full interview with Berkshire Hathaway CEO Warren Buffett
- Warren Buffett reveals his investment strategy for mastering the market
- Warren Buffett and Charlie Munger join the 2023 Berkshire Hathaway annual meeting — 5/6/23
Each video is different and presents its own challenge. The difference between videos 1 and 2 is that the interviewers are of different sexes, and the male voice is more “similar” to Warren’s, which could hurt the diarization accuracy.
The third video is very long and has many speakers, which showcases the issues these models face with panel-style discussions.
When it comes to text data, for best results it would require a different treatment and use of a vector database which I’ll show in another article.
One thing to note is that you can easily extend the training data by adding more videos. You can also add other personas, such as Soros, and thus build a hybrid of a value investor and a macro trader.
You could also split the data to get different versions of Warren Buffett, such as one trained on data before a given year and another on data after it.
For this, I’ll be using my own PC that has an ASUS ROG Strix RTX3060 12GB GPU. For those that don’t have a GPU or have a weaker one, I suggest using Kaggle or Google Colab as they offer free ones.
Now that we have covered the main points, let us work with the videos.
Make sure that you have a fresh environment to work in. I usually use conda:
conda create -n buffett
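Once created, activate it so everything below gets installed into it:
conda activate buffett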
What is Diarization?
Diarization is a process in speech processing that involves identifying and segmenting different speakers in an audio recording. Essentially, it’s about determining “who spoke when” which is useful for tasks such as transcribing multi-speaker conversations.
How to perform diarization and transcription with Whisper?
To perform diarization and transcription with Whisper, we will need another approach, as Whisper sadly doesn’t perform diarization out of the box at the time of writing.
Moreover, we will be using the StableWhisper library rather than the original Whisper to get better performance.
There are only a couple of diarization projects out there with mixed results. It seems that there aren’t a lot of contributions to the diarization process overall (well, at least open-source ones) that solve the problem with high accuracy.
I tried out a few of them, such as the pyannote gated HuggingFace model, and found that it doesn’t perform better than a simple clustering pipeline I wrote. I found one HuggingFace project and refactored it, as its code wasn’t the best and was kept strictly inside HuggingFace.
The refactored version that can perform both transcription and diarization can be found on our AlgoTrading101 GitHub repository and can easily be used and cloned into Kaggle, Google Colab, locally and etc.
The project features a Gradio UI and has improved and cleaned code compared to its original version on HuggingFace, which wasn’t coded by me. My version features a more accurate diarization process, more accurate transcriptions and timestamps, and other quality improvements.
It was a quick refactor, and PRs are welcome.
You can also use the HuggingFace version and skip the bottom installation header part but it often fails for me.
Installing pre-requisites
We’ll start by installing the prerequisites that will ensure that you can use the above repo. First, clone the repo and install its requirements in a fresh environment that we created.
$ git clone https://github.com/AlgoTrading101/AlgoTrading101-Warren-Buffett-ChatGPT.git
$ pip install -r requirements.txt
$ pip install -qq https://github.com/pyannote/pyannote-audio/archive/refs/heads/develop.zip
After that, create a .env file in the root of the repository by following the example template in env.example. The most important variable to set is HUGGINGFACE_TOKEN, as it is needed to pull the models that we will use from HuggingFace.
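For reference, the most important line in your .env is the token itself (the value below is just a placeholder; env.example lists any other variables):
HUGGINGFACE_TOKEN=hf_your_token_here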
If you don’t have a token, please navigate to their website and create an account by clicking the “Sign Up” button in the upper right corner.
When you log into your account, click on your account icon in the upper right corner and navigate to Settings and then on the left menu click Access Tokens.
There you can create a new token with “read” permissions and paste it into your env.
The next two things are only for users who want to run this locally. Ensure that you have ffmpeg installed. To quickly check, open up a terminal and run ffmpeg. If you don’t see output with the version number and similar, there is an installation guide for each major OS here.
Because we are using PyTorch, you will probably want to re-install it with the CUDA version that matches your setup by using this link here. I will rely on the CUDA that conda provides by following this link and selecting the latest supported version:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
To see the highest CUDA version your GPU driver supports, you can run this command and read the version number in the upper right corner:
nvidia-smi
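Once PyTorch is installed, a quick sanity check confirms that it can actually see your GPU:
import torch

# Should print True and the GPU name if the CUDA install went well
print(torch.cuda.is_available())
print(torch.cuda.get_device_name(0) if torch.cuda.is_available() else "CPU only")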
To start the Gradio UI, we run the following (if you face issues with HuggingFace, I suggest running it as an administrator for the first time so it can pull the models easily):
python main.py
Now navigate to this URL to access it: http://127.0.0.1:7860/
If you want all of this inside a notebook, you can move the contents of main.py into a notebook and run it that way.
Transcription and diarization
We begin our transcription and diarization process by pasting our YouTube video URL and then clicking the “Download YouTube video” button. After the video has been downloaded, you will be able to play it to confirm that it is functional.
Note: If the video is long, it sometimes won’t appear straight away as it is still being rendered; you can check whether it has downloaded by looking at your file structure or console output. You still need to wait for it to appear in the UI before continuing, which can take some time depending on the video length.
The next step is to select the Whisper model you want to use, and the assumed number of speakers in the video.
The model that I suggest using is the large-v2 one as it offers the best transcription accuracy and segmentation. I’ve tried the lower ones and they had mixed accuracy results. The base one is also fine.
The goal of this article is to show you what you can do and how to do it. You can bring in more advanced pipelines and experiment with different models and use cases.
When it comes to the number of speakers, setting it to 0 makes the pipeline try to infer the number of speakers dynamically. It is not the best at doing this, so I suggest setting the number yourself if you know it, or aren’t too lazy to check, as the dynamic recognition needs improvements to be more precise. An educated guess for the number of speakers also works.
Now, click the “Transcribe audio and diarization” button and watch the magic happen. When it completes, and it usually doesn’t take too long, you will see your dataset at the bottom of the page. You will also be able to find its CSV in the output folder.
The diarization works by applying three clustering algorithms (KMeans, Agglomerative Clustering, and a Gaussian Mixture Model) on the audio segment embeddings and labeling the segment with a vote where 2/3 algorithms agreed.
This improves the accuracy of the diarization and avoids relying on just one algorithm. I’ve found this approach to work well, and it doesn’t require the use of fancy neural networks or other big models.
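For intuition, here is a minimal sketch of that voting step, not the repo’s exact code. It assumes you already have one embedding vector per audio segment and a known speaker count, and it first remaps the labels of the second and third algorithm onto the KMeans ones so the vote compares like with like:
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

def align_to_reference(labels, reference):
    # Cluster IDs are arbitrary per algorithm, so map each cluster
    # to the reference label it overlaps with the most
    aligned = np.empty_like(labels)
    for cluster in np.unique(labels):
        mask = labels == cluster
        aligned[mask] = np.bincount(reference[mask]).argmax()
    return aligned

def vote_speaker_labels(embeddings, n_speakers):
    km = KMeans(n_clusters=n_speakers, n_init=10).fit_predict(embeddings)
    ac = align_to_reference(
        AgglomerativeClustering(n_clusters=n_speakers).fit_predict(embeddings), km
    )
    gm = align_to_reference(
        GaussianMixture(n_components=n_speakers).fit_predict(embeddings), km
    )
    # Keep the label that at least 2 of the 3 algorithms agree on
    # (ties fall back to the lowest label)
    votes = np.vstack([km, ac, gm])
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(votes.shape[1])])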
If you are using this on a Cloud or the like, you can also download the output to your local computer by pressing the download button inside the UI.
The dataset will have your transcript with adequate timestamps and speaker(s).
In the Full code section of this article, you will find all the datasets inside the output folder of our repo.
Cleaning videos and transcripts
When it comes to (pre/post)processing your videos and transcripts, you can cut out the parts of the videos where Warren Buffett isn’t part of the conversation, as you can argue that we don’t need those.
This is the case for the last video, labeled number 3, in which Buffett starts speaking around the 1-hour mark. Thus, we will need to intervene in the workflow. I’ll trim the part up to the 1-hour mark and split the remainder into 4 parts so that it fits into my memory without issues.
The next thing I’ll do for all transcripts is deal with the fact that even StableWhisper overshoots or undershoots the occasional segment, which can throw off the diarization clustering algorithms as they get leakage from a neighboring segment and produce odd results.
To combat this and increase accuracy, I will create a cleaning pipeline that checks whether a sentence was cut off midway and tapes it back together where it is obvious, for example when a row ends with a comma or the next row starts with a lowercase word or a number.
We will also merge consecutive rows from the same speaker into one. The pipeline can be found inside the clean.ipynb notebook and is an optional but recommended step. Feel free to think of ways to improve this pipeline and make it more robust.
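The real pipeline lives in the clean.ipynb notebook, but its core heuristics boil down to something like the following sketch (the Speaker and Text column names match the generated dataset; the exact rules in the notebook are a bit more involved):
import pandas as pd

def clean_transcript(df: pd.DataFrame) -> pd.DataFrame:
    rows = []
    for _, row in df.iterrows():
        text = str(row["Text"]).strip()
        if rows:
            prev = rows[-1]
            # A segment was likely cut midway if the previous row ends with a comma
            # or the current row starts with a lowercase letter or a digit
            cut_midway = prev["Text"].rstrip().endswith(",") or text[:1].islower() or text[:1].isdigit()
            # Consecutive rows from the same speaker get merged into one
            same_speaker = prev["Speaker"] == row["Speaker"]
            if same_speaker or cut_midway:
                prev["Text"] = prev["Text"] + " " + text
                continue
        rows.append({"Speaker": row["Speaker"], "Text": text})
    return pd.DataFrame(rows)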
To completely fix the Whisper issue, we would need more heavy AI algorithms that I’ve mentioned before. As improved models of Whisper and diarization algorithms come out, this will be less of an issue.
Now that the datasets are cleaned, I will rename the speaker labels to “BUFFETT” where it makes sense and merge all datasets into a single one. I will also do some manual cleaning, as Buffett’s stutters can mess up the segments and the overall diarization.
Let’s move on to preparing a schema that can be used for ChatGPT fine-tuning.
What is fine-tuning?
Fine-tuning is a technique in machine learning in which a model that was pre-trained on a general-purpose dataset has its weights further trained on a smaller, task-specific dataset to improve its performance on that task.
How to organize the data for ChatGPT fine-tuning?
To organize the data for ChatGPT fine-tuning, we will need to do two things. The first thing that we will do is restructure our dataset to be in a ChatGPT-friendly fine-tuning format. This means that we will have two columns labeled prompt and completion.
The prompt column will hold our prompt and overall context, while the completion column will always be tied to Buffett’s answer. In order to make this more interesting and precise, I will make the 5 rows preceding Buffett’s answer into a single prompt.
Note that you can use any number of previous rows as your prompt; I’ve chosen 5 based on some similar projects I did a while ago.
import pandas as pd

df = pd.read_csv("output/full_transcripts.csv")
df = df.dropna()
df = df.reset_index(drop=True)

prompt_response = {'prompt': [], 'completion': []}

for i in range(len(df)):
    if df['Speaker'][i] == 'BUFFETT':
        # The 5 rows preceding Buffett's answer become the prompt (context)
        prompt = df['Text'][i-5:i]
        prompt = ' '.join(prompt)
        response = df['Text'][i]
        prompt_response['prompt'].append(prompt)
        prompt_response['completion'].append(response)

# Shift by 2 and drop the resulting NaNs, which discards the first two pairs
df_clean = pd.DataFrame(prompt_response).shift(-2)
df_clean = df_clean.dropna()
df_clean
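Before handing the data to OpenAI’s helper, we save these pairs to the CSV path that the next command expects:
df_clean.to_csv("output/train.csv", index=False)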
By having a larger prompt, we allow the model to understand the context of the overall conversation more deeply and provide better answers. Usually, about 500 prompt-completion pairs are needed for fine-tuning to have a noticeable effect, and we have 788, which is a solid amount.
Now, you would need to do more post-processing of the data in order to make fine-tuning possible. Luckily, OpenAI provides a helper that we run like this:
openai tools fine_tunes.prepare_data -f .\output\train.csv
This will take us through all the steps and create a file in the jsonl format that is used for training. It will ask whether we want certain adjustments that make the fine-tuning process better, and we’ll say “yes” to all of them.
Now, we are finally ready to commence the fine-tuning process!
How to fine-tune ChatGPT on your own data?
To fine-tune ChatGPT on your own data, you need to have the data prepared in a fine-tuning-friendly format, which is the jsonl file structure. We covered how to do this in the previous sections.
To start the fine-tuning process, make sure to have your OpenAI API key ready as it will be needed to run this process. If you don’t have one, proceed to your OpenAI profile and click the “View API keys” button under the profile dropdown menu.
There, you can press the “Create new secret key” button and give it a name. After that, we will export the key to our environment so that it can be used. To do so I write:
export OPENAI_API_KEY="<OPENAI_API_KEY>"
To start fine-tuning I do:
openai api fine_tunes.create -t <TRAIN_FILE_ID_OR_PATH> -m <BASE_MODEL>
The training file will be the file we generated with their helper, and the base model will be davinci.
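Assuming the helper saved the prepared file as output/train_prepared.jsonl (it typically appends _prepared to the input file name), the concrete call would look like this:
openai api fine_tunes.create -t output/train_prepared.jsonl -m davinci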
After you’ve started a fine-tuning job, it may take some time to complete as your job may be queued behind other jobs on their system, and training the model can take minutes or hours depending on the model and dataset size.
If the event stream is interrupted for any reason, you can resume it by running:
openai api fine_tunes.follow -i <YOUR_FINE_TUNE_JOB_ID>
When the job is done, it should display the name of the fine-tuned model.
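If you miss that message, you can list your fine-tuning jobs and read the model name from there:
openai api fine_tunes.list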
I think that you should be able to use our fine-tuned model so that you don’t need to pay for the fine-tuning costs.
How to chat with Warren Buffett using ChatGPT?
To chat with Warren Buffett using ChatGPT, we can send prompts to our fine-tuned ChatGPT model and observe its results. To send a prompt you write:
import openai

openai.api_key = "YOUR_KEY_HERE"

# Send a prompt to the fine-tuned model and print its answer
response = openai.Completion.create(
    model="<YOUR_MODEL_HERE>",
    prompt="<YOUR_PROMPT_HERE>",
    max_tokens=1000,
    temperature=0.9,
    top_p=1,
)

print(response["choices"][0]["text"])
Feel free to adjust the number of tokens that you want returned. If you want the maximum content length that my model can output, which is 2049 tokens, keep in mind that the prompt also counts toward that cap. Each token costs about $0.00004397.
By adjusting the temperature, you are changing the amount of divergence that ChatGPT can make from what it was fine-tuned on. The lower the number, the more deterministic it will be.
Let’s try this out by sending over a few prompts and see how it behaves. I’ll start by taking a sentence from the train data as is, then I’ll make two more prompts that are more creative.
Prompt 1: “Income inequality, wealth inequality, you’ve talked about the earned income tax credit. Is there more to it than that? Should we adjust tax policy? It seems to be going the other way right now.”
Answer: “Well, we should adjust tax policy to get the income inequality to improve, not to deteriorate. And that means making sure that we have a very progressive income tax. A lot of people have been fooled by the fact that the payroll tax, which was originally designed to pay for Social Security and Medicare, has a regressive feature to it, that the first $127,000 of income is not taxed. And then above that, it’s taxed at a higher rate. And that’s consistent with the idea that you want to have some funding for Social Security and Medicare, and it doesn’t burden lower-income people as much as higher-income people. But I would like to see us move to a progressive income tax where the rate did get higher. Today, it turns out it’s about a third, effectively, as high on a billionaire as it is on somebody making $200,000, because of all the deductions. And we shouldn’t have a tax system encouraging people to get into the sort of artificial economy, where you move to Bermuda or the Cayman Islands, where you make all your money, but you don’t live in the United States. …“
The full response can be found in the examples
folder inside our GitHub repository.
Prompt 2: “Warren, what is your all-time favorite book?”
Answer: “Well, it changes every day. There’s a couple thousand books in our library. I’ve read most of them just one time. I don’t go back and read them again. So they are my favourite book for that day I read them. I’ve read some of them 20 or 30 times. I’ve read Ben Graham’s book, The Intelligent Investor, six times or seven times. But I’ve read 30 other books that I’ve read once. So the answer really is that the book that I’m enjoying at the moment is my favorite book.“
After this answer, the AI went on to simulate another question being asked.
Prompt 3: “Hello, Mr. Buffett! I am Igor from AlgoTrading101 and I am interested what are your thoughts about the future of AI and the stock market?”
Answer: “Yeah, on AI and market, you know, I think it’ll work well. I mean, there’ll be some mistakes made by a machine as wasn’t made by a human. That is an area where I think will be a significant advance. I actually think it will be more in diagnosis rather than actually prescribing drugs. I think we’ll be more advanced in that. There are two companies that I think will be subject to the fate of AI. One is a company called Aetna that I think is right out there in its death struggle. I mean, life insurance business. I don’t think they can, I don’t think they should try and repurchase their preferred – now they’re trying to repurchase their preferred. I mean, they can’t make money in the life insurance business. They had 100 million of their own money. I think they’ve spent nine billion trying to keep the preferred from getting their money. And it’s not working. Now, the money is coming due next year where I think they’ll have to have a good deal of the preferred to have to. I mean, they can’t pay it. And they’re charging higher medical costs, which is understandable. It’s human nature. But I think it’s going to be a huge trouble. On the other hand, you’ve got the pharmacy business. And if you have a bad cold and you’re looking for some relief, you’re going to continue to use the people, I think, a lot of extra stuff. But if I have cancer, I would like to see an AI system on the side prescribing what I could have to maximize the chance of living for 10 years or so. And I think you’ll see that in cancer and I think you’ll see it in maybe a lot of other areas. I find – I saw an IBM story today that the IBM computer beat the humans in another game, three games of the 90s, bridge. Anyway, there are a lot of things that will involve artificial intelligence. And I think of it as something that will be a plus to society. It’ll make mistakes, too, like we make mistakes. But I think the net will be upward. But I don’t think it’s going to change the nature of the market very much. I think it will be a great help in diagnosis, if not prescribing.“
My thoughts and ideas
Based on the answers and examples above, we can call this a success given the amount of data we passed to the model, which isn’t large. By improving the diarization and transcription accuracy and the quality of the overall prompts, we could make this a lot better.
About 90% of my time was spent exploring different transcription and diarization approaches and obtaining data with them.
What worked well on paper, or on the first two videos, didn’t work on the third video, and vice versa. There are also many outdated repositories and models out there that can be a time sink.
In my experience, data collection and cleaning always take the most time, while the ML part is a small cog in the overall machinery. This article was a bit mixed, as you needed good ML algorithms to obtain the data in the first place, but you get the idea.
Feel free to explore different transcription and diarization approaches and different parameters for the model completion response. You can find all available parameters here. Another idea is to try combining people and have the whole panel be completions so that you involve Charlie Munger too.