First of all, this link is just to C# bindings of llama.cpp and so doesn’t contain the actual implementation.
I know, it’s my code. I refactored it from some much less readable and usable C# code. I picked it because it shows the steps involved in generating text more clearly.
How do you know LLMs can’t look ahead? […] How do you know it hasn’t written out the entire response in memory already after which it only shows you the first word?
Firstly, it goes against everything we know so far of how they operate, and secondly… because they can’t.
If you look at the C# code, the first step is in the _process_tokens function, where it feeds the context into llama_eval. That goes through each token and updates the internal memory / model state. Since it saves state, if you have already processed some of the tokens you can tell it to skip them and start on the new ones.
After this function you have a state in memory, the current state of the LLM, as a result of the tokens it’s seen so far.
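To make the "skip what was already processed" idea concrete, here is a minimal sketch in Python (not the C# code under discussion). `ToyState` and `process_tokens` are hypothetical stand-ins; the real llama_eval does far more than append to a list.

```python
class ToyState:
    """Hypothetical stand-in for the model's internal state (not llama.cpp)."""

    def __init__(self):
        self.processed = []  # tokens already folded into the state
        self.eval_calls = 0  # how many tokens we actually had to evaluate

    def eval(self, tokens):
        # The real model would update its internal memory here.
        self.processed.extend(tokens)
        self.eval_calls += len(tokens)


def process_tokens(state, context):
    """Feed only the not-yet-seen tail of `context` into the state."""
    n_past = len(state.processed)
    # The cached state is only reusable if it matches the start of the context.
    if state.processed != context[:n_past]:
        raise ValueError("context diverged from cached state")
    new_tokens = context[n_past:]
    state.eval(new_tokens)
    return len(new_tokens)
```

Calling `process_tokens` first with `[1, 2, 3]` and then with `[1, 2, 3, 4]` evaluates three tokens the first time and only one the second time.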
When we are done with that, we go to the more interesting part, the _predict_next_token function. Note that it takes a samplingparams parameter. It then sets some options: if top k is not set, it’s set to the length of the model’s vocabulary (the number of tokens it knows about), and repeat_last_n, if not set, is set to the length of the existing context.
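A rough Python sketch of that defaulting step, for illustration only. The `SamplingParams` fields and the "not set" sentinels (`top_k <= 0`, `repeat_last_n < 0`) are assumptions here, not taken from the C# code.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class SamplingParams:
    # Sentinel values meaning "not set" are assumptions for this sketch.
    top_k: int = 0
    repeat_last_n: int = -1


def resolve_defaults(params, n_vocab, n_context):
    """Fill in unset options from the vocabulary size and context length."""
    top_k = params.top_k if params.top_k > 0 else n_vocab
    repeat_last_n = (
        params.repeat_last_n if params.repeat_last_n >= 0 else n_context
    )
    return replace(params, top_k=top_k, repeat_last_n=repeat_last_n)
```

With nothing set, top k covers the whole vocabulary (so it filters nothing) and the repetition window covers the whole context.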
The code then gets the model’s vocabulary, aka all the tokens it knows about, and then it generates the logits. The logits are an array the length of the vocabulary, with a score for each token showing how likely that one is to be the next token. The code then adds any specified token bias to that token’s score. Already here, even if it had a specific answer in mind, you can see problems starting.
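As a small illustration of the bias step (a Python sketch with a made-up three-token vocabulary, not the actual bindings), adding a bias can change which token ranks highest:

```python
def apply_logit_bias(logits, bias):
    """Return a copy of `logits` with per-token biases added.

    `bias` maps token ID -> value to add; the name and shape are
    assumptions for this sketch.
    """
    out = list(logits)
    for token_id, delta in bias.items():
        out[token_id] += delta
    return out


logits = [1.0, 2.0, 0.5]                     # token 1 is the model's favourite
biased = apply_logit_bias(logits, {0: 1.5})  # caller nudges token 0 upwards
```

Here `biased` is `[2.5, 2.0, 0.5]`, so token 0 now outranks the token the model itself scored highest.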
Then the code applies the token repetition penalty, based on the samplingparams. This means that if a token repeats inside the given history, its value will be lowered according to the repeat_penalty. Again, even if it had a specific answer, this has a high chance of messing that up. The same is done for the frequency and presence penalties. For more details on what those native functions do, you can see the llama.cpp source; they have the same names there.
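The sketch below shows one common way such penalties are computed, written in Python. The exact formulas are my understanding of the llama.cpp-style approach and should be treated as an assumption; check the native source for the real thing. The caller is expected to pass only the last repeat_last_n tokens as `history`.

```python
from collections import Counter


def apply_penalties(logits, history, repeat_penalty=1.1,
                    freq_penalty=0.0, presence_penalty=0.0):
    """Lower the scores of tokens that already appear in `history`."""
    out = list(logits)
    for tok, count in Counter(history).items():
        # Repetition penalty: always push the score towards "less likely".
        if out[tok] > 0:
            out[tok] /= repeat_penalty
        else:
            out[tok] *= repeat_penalty
        # Frequency penalty scales with the count; presence is a flat cost.
        out[tok] -= count * freq_penalty + presence_penalty
    return out
```

For example, `apply_penalties([2.0, -2.0, 1.0], [0, 1, 1], repeat_penalty=2.0)` gives `[1.0, -4.0, 1.0]`: both repeated tokens drop, and the unseen token 2 is left alone.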
After all the penalties are applied, it’s time to pick the token. If the temperature is 0 or lower, it just picks the highest rated token (aka greedy sampling). This tends to give very boring and flat responses, but it’s predictable and reproducible, so it’s often used in benchmarks of various kinds.
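Greedy sampling is just an argmax over the final scores. A minimal Python sketch (the function name is made up for illustration):

```python
def pick_greedy(logits):
    """Return the ID of the highest-scoring token; ties go to the lowest ID."""
    return max(range(len(logits)), key=logits.__getitem__)
```

Because there is no randomness involved, the same scores always give the same token, which is what makes this mode reproducible.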
But if that’s not used (which it almost never is in “real” use), there are several methods. You have Mirostat, which tries to create more consistent quality between different answer lengths, and the “traditional” approach using top-k, top-p and temperature.
What they have in common, however, is that internally they produce a shortlist of candidates and then pick one at random. And that’s why an LLM can’t plan ahead.
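Here is a rough Python sketch of the "traditional" path: shortlist with top-k, trim with top-p, apply temperature, then draw at random. The order of the steps and the function names are assumptions for illustration and may not match llama.cpp exactly. It expects `top_k >= 1` and `temperature > 0`.

```python
import math
import random


def softmax(xs):
    """Turn raw scores into probabilities (shifted by the max for stability)."""
    m = max(xs)
    weights = [math.exp(x - m) for x in xs]
    total = sum(weights)
    return [w / total for w in weights]


def sample_token(logits, top_k, top_p, temperature, rng):
    # Top-k: keep only the k highest-scoring token IDs.
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)
    candidates = order[:top_k]
    # Top-p: keep the smallest prefix whose cumulative probability
    # reaches top_p.
    probs = softmax([logits[i] for i in candidates])
    kept, cumulative = [], 0.0
    for tok, p in zip(candidates, probs):
        kept.append(tok)
        cumulative += p
        if cumulative >= top_p:
            break
    # Temperature: flatten (>1) or sharpen (<1) what is left, then draw.
    final = softmax([logits[i] / temperature for i in kept])
    return rng.choices(kept, weights=final, k=1)[0]
```

The last line is the random pick: whatever the scores say, the token that comes out is a weighted draw from the shortlist, so two runs with different random states can diverge from the very first token.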
When a token ID is eventually produced, the function returns it. The new ID gets added to the context, the text equivalent of the token is looked up and sent back to the UI, and the new context is fed into llama_eval so the process starts again.
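The whole loop can be sketched in a few lines of Python. Everything here is a toy: `VOCAB`, `toy_logits` (a stand-in for llama_eval plus reading the logits) and the stop-on-token-0 rule are invented for illustration.

```python
VOCAB = ["<eos>", "the", "cat", "sat", "mat"]


def toy_logits(context):
    """Hypothetical model: always favours the token after the last one."""
    favourite = (context[-1] + 1) % len(VOCAB)
    return [1.0 if i == favourite else 0.0 for i in range(len(VOCAB))]


def generate(prompt, pick, max_new_tokens=8):
    """Predict one token at a time, feeding each result back as context."""
    context = list(prompt)
    pieces = []
    for _ in range(max_new_tokens):
        logits = toy_logits(context)      # evaluate the current context
        token_id = pick(logits)           # greedy or random pick
        if token_id == 0:                 # treat token 0 as end of text
            break
        context.append(token_id)          # new ID joins the context
        pieces.append(VOCAB[token_id])    # text that would go to the UI
    return context, pieces
```

The `pick` argument is where a greedy or random sampler plugs in. Note that the loop only ever asks for one token at a time; nothing beyond the next token is produced or stored between iterations, other than the context itself.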
For the LLM to even be able to plan an answer ahead it must know of all penalties and parameters (or have none applied), and greedy token prediction must be used.
And that is why, even if it had some sort of near magical ability to plan ahead that we just don’t know is there, at the end of the day it could still not plan a specific response.
Wow, very nice! First of all, I will preface by admitting that I have not worked with LLMs to the degree of making a toy implementation. Your explanation of the sampling techniques is insightful but doesn’t clear up my confusion. Why does sampling imply the absence of higher level structure in the model?
For example, even though poker is highly influenced by chance, I can still have a plan that will increase my likelihood of winning. I don’t know what card will be drawn next but I can prepare strategies for each possible card. I can have preferences for which cards I want to be drawn next.
You know what, I don’t have a good answer for you here. I did a few small experiments on ChatGPT and it seems like it has some knowledge of whether it will be able to complete it or not. This was with a pretty well-known question though.
I tried to recreate an earlier experiment where I asked it to write about a friend of mine, who was in the news some time ago and apparently has a few entries in its training data, but very little. ChatGPT would then consistently hallucinate facts about the person, including date of birth and sometimes date of death. In that case it knew the pattern of writing about a person including date of birth, and sometimes date of death, but it didn’t know it didn’t have that info and just filled in plausible-looking data there. Now it insists on not knowing who that person is at all and refuses to write anything about him.
Anyway, you’ve given me some things to think about, thanks.