
Microsoft Semantic Kernel – Some Tips & Tricks To Get Prompt & Completion Tokens

In my previous post, I talked about how you can get rendered prompts. In this post, I am going to talk about ways to get prompt and completion tokens when using Microsoft Semantic Kernel.

What are Tokens?

Let’s first begin with what tokens really are. In very simple terms, a token is the smallest unit of data that a Large Language Model (LLM) can understand and process. The data could be text, an image, a video, a sound clip or any other data.

When it comes to text, a token could be a single character (like “a”), a partial word (like “happiness” in “unhappiness”), a full word (like “apple”) or even a combination of words. The key thing to remember is that a token is the smallest unit of data that an LLM can understand.
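
To get a feel for this, you can count tokens yourself. Here is a minimal sketch, assuming the Tiktoken NuGet package (the same library I use in approach #4 later in this post) and a gpt-4 style tokenizer; the exact counts depend on the model's tokenizer:

// A rough illustration of how text maps to tokens. The model name is just an example.
var encoding = Tiktoken.Encoding.TryForModel("gpt-4");
if (encoding != null)
{
    Console.WriteLine(encoding.CountTokens("apple"));        // a short, common word is often a single token
    Console.WriteLine(encoding.CountTokens("unhappiness"));  // a longer word may be split into multiple tokens
    Console.WriteLine(encoding.CountTokens("I love apples and oranges.")); // token count != word count
}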

When dealing with tokens, you will come across two terms: Prompt tokens and Completion tokens.

  • Prompt tokens are the tokens representing the input prompt (i.e. the data being fed to an LLM).
  • Completion tokens are the tokens representing the output generated by an LLM in response to the prompt (i.e. the data outputted by an LLM).

Why should you care about Tokens?

So, the next question is why you should care about them. The simple answer is: you pay for these tokens :). This is how the service providers make money. You are charged for both prompt and completion tokens, and the pricing varies based on the service provider and the model you are using.
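
As a rough illustration, the cost of a call is typically computed per 1,000 (or 1,000,000) tokens, with separate rates for prompt and completion tokens. The rates below are made-up placeholders purely for this sketch; check your provider's pricing page for real numbers:

// Hypothetical rates purely for illustration; real prices vary by provider and model.
const decimal promptPricePer1KTokens = 0.01m;
const decimal completionPricePer1KTokens = 0.03m;

decimal EstimateCost(int promptTokens, int completionTokens) =>
    (promptTokens / 1000m) * promptPricePer1KTokens +
    (completionTokens / 1000m) * completionPricePer1KTokens;

// e.g. a call that used 1,200 prompt tokens and 400 completion tokens
var cost = EstimateCost(1200, 400); // 0.012 + 0.012 = 0.024 (in your billing currency)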

Because you are paying for the tokens, you have to be extra careful with them.

You should ensure that your prompts are complete (otherwise you will not get a proper result) but concise and to the point so that you don’t end up breaking the bank.

Through your prompts, you should also put some constraints on the desired output. Be very explicit in your prompts about the size and kind of data you expect an LLM to produce. For example, if you expect the output to be no more than 500 characters, explicitly mention that.
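
For example, a prompt function might state the size constraint in the prompt text and also cap the output via execution settings. This is just a sketch of mine (the prompt wording and the 150-token cap are arbitrary), assuming a Kernel instance built the way shown later in this post:

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

// A sketch: state the size constraint in the prompt itself and also cap the
// completion via MaxTokens so the model cannot run much past it.
// 'kernel' is assumed to be a Kernel built as shown later in this post.
var summarize = kernel.CreateFunctionFromPrompt(
    """
    Summarize the following text in no more than 500 characters.
    Text: {{$input}}
    """,
    new OpenAIPromptExecutionSettings { MaxTokens = 150 });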

How to get prompt and completion tokens?

So, how do you get prompt and completion tokens in your LLM application built using Microsoft Semantic Kernel?

I will show you four ways to accomplish that. So, let’s start!

1. Hooking into Kernel Events (Obsolete, Not Recommended)

The first way is by hooking into (Semantic) Kernel events. The Kernel in Semantic Kernel exposes a FunctionInvoked event which gets fired when a function is invoked. You can handle this event to get the consumed tokens.

My code would be something like the following:

private Kernel GetKernel()
{
    var kernelBuilder = Kernel.CreateBuilder();
    var deploymentId = "your-azure-openai-deployment-id";
    AzureOpenAIClient client = GetAzureOpenAIClientSomehow();
    kernelBuilder.AddAzureOpenAIChatCompletion(deploymentId, client);
    var kernel = kernelBuilder.Build();
    kernel.FunctionInvoked += (sender, args) =>
    {
        // Token usage (if available) is surfaced via the "Usage" entry in the result metadata.
        var metadata = args.Metadata;
        if (metadata == null || !metadata.ContainsKey("Usage")) return;
        var usage = metadata["Usage"] as CompletionsUsage;
        if (usage == null) return;
        var promptTokens = usage.PromptTokens;
        var completionTokens = usage.CompletionTokens;
        // do something with the tokens
    };
    return kernel;
}

However, you should not be using this approach as it has been marked as obsolete in the latest version. In fact, if you use this approach with version 1.3.0 of Semantic Kernel (the most current version at the time of writing this post), you will get a compiler warning telling you not to use it.

2. Using Filters (Experimental)

This is another approach that you can take. I believe this feature was introduced recently, and it is recommended over hooking into kernel events.

Using filters is really easy. You basically create a custom filter class that implements the IFunctionFilter interface and then implement the OnFunctionInvoking and OnFunctionInvoked methods to suit your requirements. For example, I could simply write the prompt and completion tokens to the console.

So my code would be something like:

private class FunctionFilter : IFunctionFilter
{
    public void OnFunctionInvoking(FunctionInvokingContext context)
    {
        // Nothing to do before the function runs.
    }

    public void OnFunctionInvoked(FunctionInvokedContext context)
    {
        // Token usage (if available) is surfaced via the "Usage" entry in the result metadata.
        var metadata = context.Result.Metadata;
        if (metadata == null || !metadata.ContainsKey("Usage")) return;
        var usage = metadata["Usage"] as CompletionsUsage;
        if (usage == null) return;
        var promptTokens = usage.PromptTokens;
        var completionTokens = usage.CompletionTokens;
        Console.WriteLine($"Prompt tokens: {promptTokens}, Completion tokens: {completionTokens}");
    }
}

And this is how I would wire up the filter in the Kernel:

kernel.FunctionFilters.Add(new FunctionFilter());

My complete code for the kernel would be:

private Kernel GetKernel()
{
    var kernelBuilder = Kernel.CreateBuilder();
    var deploymentId = "your-azure-openai-deployment-id";
    AzureOpenAIClient client = GetAzureOpenAIClientSomehow();        
    kernelBuilder.AddAzureOpenAIChatCompletion(deploymentId, client);
    var kernel = kernelBuilder.Build();
    kernel.FunctionFilters.Add(new FunctionFilter());
    return kernel;
}

Please note that this is still experimental and may change (or even be removed) in future versions.
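
As an aside, if you are constructing the kernel through dependency injection, my understanding is that the filter can also be registered as an IFunctionFilter service on the builder instead of adding it to kernel.FunctionFilters directly. Here is a sketch under that assumption; do verify it against the Semantic Kernel version you are using:

using Microsoft.Extensions.DependencyInjection;
using Microsoft.SemanticKernel;

var kernelBuilder = Kernel.CreateBuilder();
// Assumption: IFunctionFilter services registered on the builder are picked up by the built Kernel.
kernelBuilder.Services.AddSingleton<IFunctionFilter, FunctionFilter>();
var kernel = kernelBuilder.Build();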

3. Using Function Result

If you are invoking functions in a non-streaming way, i.e. you are waiting for the complete result to come back, you can make use of the FunctionResult to get the tokens.

So my code would be something like:

var function = GetKernelFunctionSomehow();
var kernelArguments = GetKernelArgumentsSomehow();
var result = await kernel.InvokeAsync(function, kernelArguments);
// Token usage (if available) is surfaced via the "Usage" entry in the result metadata.
var metadata = result.Metadata;
if (metadata == null || !metadata.ContainsKey("Usage")) return;
var usage = metadata["Usage"] as CompletionsUsage;
if (usage == null) return;
var promptTokens = usage.PromptTokens;
var completionTokens = usage.CompletionTokens;
// do something with tokens
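
For completeness, the GetKernelFunctionSomehow and GetKernelArgumentsSomehow placeholders above could be as simple as the following. These are hypothetical stand-ins of mine; your function and arguments will obviously look different:

// Hypothetical stand-ins for the placeholders used above.
KernelFunction GetKernelFunctionSomehow() =>
    KernelFunctionFactory.CreateFromPrompt("Answer the question: {{$question}}");

KernelArguments GetKernelArgumentsSomehow() =>
    new KernelArguments { ["question"] = "What is a token?" };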

4. Using a 3rd Party Library

Approach 3 above works great if you are waiting for the entire response to come back to your application; however, what I have noticed is that the user experience (UX) is not great in this case.

LLMs send the response in a streaming fashion, that is, they spit out partial responses as they become available, and if possible, you should try to stream that response back to the user.

However, approach 3 above would not work in that case. From what I am told, Azure OpenAI does not even return the token usage as part of its response in this case, and hence Semantic Kernel also does not provide this information.

In this case, what you can do is make use of 3rd party libraries. One such library that I have used is Tiktoken, which you can use to calculate the token consumption. There are many other similar libraries available.

So my code would be something like:

var prompt = GetPromptSomehow();
var function = GetKernelFunctionSomehow();
var kernelArguments = GetKernelArgumentsSomehow();
var result = kernel.InvokeStreamingAsync(function, kernelArguments);
var responseStringBuilder = new StringBuilder();
await foreach (var item in result)
{
    var response = item.ToString();
    // store the partial response; we will use it in the end to calculate completion tokens
    responseStringBuilder.Append(response);
}
var answer = responseStringBuilder.ToString();
// use the encoding for your model, e.g. "gpt-4", "gpt-3.5-turbo" or "gpt-35-turbo"
var encodingForModel = Tiktoken.Encoding.TryForModel("gpt-4");
var promptTokens = encodingForModel.CountTokens(prompt);
var completionTokens = encodingForModel.CountTokens(answer);
// do something with tokens
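
One thing to keep in mind with this approach: counting tokens on the client side with a tokenizer library gives you a close estimate, but it may differ slightly from what the service actually bills (for example, chat completion requests add a few formatting tokens per message), so treat these numbers as approximate.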

Summary

That’s it for this post. I hope you have found it useful. My recommendation would be to use approach #3 or #4 based on your scenario (non-streaming vs. streaming). You can use #2, but definitely stay away from #1.

I would like to end this post with the same bit of warning I gave in my previous post: Semantic Kernel (and AI tools in general) is changing very rapidly (quite evident from the fact that kernel events were deprecated within a few minor releases). I would highly recommend referencing the official documentation for the most current functionality.

In the next post I will talk about the AI assistant I have been building. A lot of my learning came from building that AI assistant. I am pretty excited about it and can’t wait to share more with you.

Until then, Happy Coding!

