Function Calling Optimizations (GPT4 vs Opus vs Haiku vs Sonnet)

Function Calling accuracy across GPT4 vs Opus vs Haiku vs Sonnet

Code: https://github.com/SamparkAI/Composio-Function-Calling-Benchmark/.

New: Check out the updated model scores with GPT-4o

In the last blog, we introduced the ClickUp function calling benchmark and experimented with different optimisation approaches for improving function calling using gpt-4-turbo-preview.

This time, we wanted to check a selection of other models, which might or might not claim to be superior in performance 😅. We also wanted to make our benchmark more general, so that we could find which optimisation approaches suit which models for function calling.

Function Call Optimisation Techniques

As function calling is a new concept and not much literature is available, we looked at different experiments from the community. From these and our own intuition, we expected that techniques like flattening the schema structure, focusing the system prompt on function calls, improving function names, descriptions and parameter descriptions, and adding examples would improve function calling performance. So, we decided on this elaborate experiment. The methods we experimented with are:

  • No System Prompt: Only the problem statement
  • Flattening Schema: All the hierarchical parameters are flattened to a shallow tree structure
  • Flattened Schema + Simple System Prompt: Added a simple system prompt mentioning that function calling needs to be used
  • Flattened Schema + Focused System Prompt: Added characterisation on its role in solving function calling problems.
  • Flattened Schema + Focused System Prompt + Function Name Optimised: The function names were elaborated.
  • Flattened Schema + Focused System Prompt + Function Description Optimised: Explained the descriptions clearly.
  • Flattened Schema + Focused System Prompt containing Schema summary: Added summarised version of all function schema to the system prompts
  • Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimised: Summarised function schema in system prompt, with elaborated function names.
  • Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimised: Summarised function schema in system prompt, with clearly explained function descriptions.
  • Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised: Additionally, the description of the parameters was improved
  • Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised + Function Call examples: Examples of function calls were added along with function descriptions.
  • Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimised + Function Parameter examples added: Examples of parameter values were added to parameter descriptions.
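To make the "Flattening Schema" step concrete, here is a minimal sketch of one way it could be done. It is not the benchmark's actual code: the function names, the `__` separator, and the example schema are assumptions made for illustration. Nested object parameters are collapsed into a single level, and a small inverse helper turns the model's flat arguments back into the nested shape the real API expects.

```python
from typing import Any, Dict, List


def flatten_parameters(schema: Dict[str, Any], sep: str = "__") -> Dict[str, Any]:
    """Collapse nested object properties into one shallow level.

    Nested names are joined with `sep` (an assumed convention),
    e.g. task -> name becomes "task__name".
    """
    flat_props: Dict[str, Any] = {}
    flat_required: List[str] = []

    def walk(node: Dict[str, Any], prefix: str, parent_required: bool) -> None:
        required_here = set(node.get("required", []))
        for name, spec in node.get("properties", {}).items():
            key = f"{prefix}{sep}{name}" if prefix else name
            # A leaf is required only if every object on the path to it is required.
            is_required = parent_required and name in required_here
            if spec.get("type") == "object" and "properties" in spec:
                walk(spec, key, is_required)
            else:
                flat_props[key] = spec
                if is_required:
                    flat_required.append(key)

    walk(schema, "", True)
    return {"type": "object", "properties": flat_props, "required": flat_required}


def unflatten_arguments(args: Dict[str, Any], sep: str = "__") -> Dict[str, Any]:
    """Rebuild the nested argument dict from the model's flat arguments."""
    nested: Dict[str, Any] = {}
    for key, value in args.items():
        parts = key.split(sep)
        cursor = nested
        for part in parts[:-1]:
            cursor = cursor.setdefault(part, {})
        cursor[parts[-1]] = value
    return nested


# Hypothetical nested schema, loosely in the style of a task-creation endpoint.
nested_schema = {
    "type": "object",
    "properties": {
        "list_id": {"type": "integer", "description": "Target list"},
        "task": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "archived": {"type": "boolean"},
            },
            "required": ["name"],
        },
    },
    "required": ["list_id", "task"],
}

flat_schema = flatten_parameters(nested_schema)
print(sorted(flat_schema["properties"]))
print(flat_schema["required"])
```

The separator has to be a string that never appears in real parameter names, otherwise `unflatten_arguments` would split a name in the wrong place.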

OpenAI Models for Function Calling Optimisation

As we checked gpt-4-turbo-preview in the previous experiment, we wanted to test the performance of both its predecessor, gpt-4-0125-preview, and its successor, gpt-4-turbo. As we have seen before, newer models may score higher on general benchmarks without being better across the board. So, alongside our previous scores, here is the performance of these two OpenAI models.

| Optimization Approach | gpt-4-turbo-preview | gpt-4-turbo | gpt-4-0125-preview |
|---|---|---|---|
| No System Prompt | 0.36 | 0.36 | 0.353 |
| Flattening Schema | 0.527 | 0.487 | 0.533 |
| Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 |
| Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 |
| Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 |
| Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 |
| Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 |


So we can see that, in most cases, gpt-4-0125-preview scored highest or tied for the highest score. When we added examples, whether of function calls or of parameter values in the parameter descriptions, gpt-4-0125-preview scored at least as high as the other two models. In the cases where we optimised or elaborated only the function names or descriptions, gpt-4-turbo seems to do better: it leads in three of those four configurations.
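As a quick sanity check on that reading, the sketch below counts how often each OpenAI model has the top score (ties included) across the twelve configurations. The scores are copied from the table above; the helper name `count_top_scores` is just an illustrative choice.

```python
from typing import Dict, Sequence, Tuple

MODELS = ("gpt-4-turbo-preview", "gpt-4-turbo", "gpt-4-0125-preview")

# One row per optimisation approach, in the same order as the table.
SCORES = [
    (0.36, 0.36, 0.353),
    (0.527, 0.487, 0.533),
    (0.553, 0.533, 0.54),
    (0.633, 0.633, 0.64),
    (0.553, 0.607, 0.587),
    (0.633, 0.66, 0.673),
    (0.64, 0.553, 0.64),
    (0.70, 0.707, 0.686),
    (0.687, 0.707, 0.68),
    (0.767, 0.767, 0.787),
    (0.693, 0.6, 0.707),
    (0.787, 0.693, 0.787),
]


def count_top_scores(
    scores: Sequence[Tuple[float, ...]], models: Sequence[str]
) -> Dict[str, int]:
    """Count, per model, the rows where it has the best score (ties count for all)."""
    counts = {model: 0 for model in models}
    for row in scores:
        best = max(row)
        for model, score in zip(models, row):
            if score == best:
                counts[model] += 1
    return counts


top_counts = count_top_scores(SCORES, MODELS)
print(top_counts)
```

With ties counted for every tied model, gpt-4-0125-preview is on top in 7 of the 12 rows, and the other two models in 4 each.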

GPT-4 vs GPT-4o

GPT-4o is 3x faster and 50% cheaper, with almost the same accuracy.


Anthropic Models

Next, we ran the same experiment with Anthropic's Claude-3 series of models. Claude-3 has three models, haiku, sonnet and opus, in increasing order of size and performance (at least, that is what is expected).

When we tried these models, we discovered that Claude models, especially opus, are very costly and very slow! One run of the whole benchmark took ~4 minutes with GPT-4, while claude-3-opus-20240229 took ~13 minutes. claude-3-haiku-20240307 and claude-3-sonnet-20240229 took ~3 minutes and ~6 minutes, respectively.

We faced several problems while running the benchmark for Claude models. For example, unlike with OpenAI models, most function/tool calls from Claude models are preceded by a block of "thoughts" text, which required some changes in our benchmark code.
Then, when we ran it, we found that the scores were incredibly low in some cases, to the point of being absurd.
After some digging, we found that the models sometimes predicted boolean values as strings: True was predicted as "True" and False as "False". We added a fix for that and finally obtained our results.
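The two fixes described above could look roughly like the sketch below. It is a hedged illustration and not the benchmark's actual code. It assumes the response content arrives as a list of plain dicts, with `"type": "text"` blocks for the free-text thoughts and `"type": "tool_use"` blocks carrying `name` and `input`, modelled on the Anthropic Messages API. The helper names and the sample response are hypothetical.

```python
from typing import Any, Dict, List

_BOOL_STRINGS = {"true": True, "false": False}


def coerce_bool_strings(value: Any) -> Any:
    """Recursively turn "True"/"False" strings (any casing) into real booleans."""
    if isinstance(value, str) and value.strip().lower() in _BOOL_STRINGS:
        return _BOOL_STRINGS[value.strip().lower()]
    if isinstance(value, dict):
        return {key: coerce_bool_strings(item) for key, item in value.items()}
    if isinstance(value, list):
        return [coerce_bool_strings(item) for item in value]
    return value


def extract_tool_calls(content_blocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Keep only the tool-call blocks, skipping any free-text blocks before them."""
    calls = []
    for block in content_blocks:
        if block.get("type") == "tool_use":
            calls.append(
                {
                    "name": block["name"],
                    "arguments": coerce_bool_strings(block.get("input", {})),
                }
            )
    return calls


# Hypothetical response content: a thoughts block followed by one tool call.
sample_content = [
    {"type": "text", "text": "I should create the task in the given list."},
    {
        "type": "tool_use",
        "id": "toolu_example",
        "name": "create_task",
        "input": {"name": "Write report", "archived": "False", "flags": ["True", "x"]},
    },
]

tool_calls = extract_tool_calls(sample_content)
print(tool_calls)
```

Blind coercion like this would also convert a string parameter whose legitimate value happens to be "true" or "false". A stricter variant would consult the function schema and coerce only parameters declared as boolean.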

| Optimization Approach | claude-3-haiku-20240307 | claude-3-sonnet-20240229 | claude-3-opus-20240229 |
|---|---|---|---|
| No System Prompt | 0.48 | 0.6 | 0.42 |
| Flattening Schema | 0.5 | 0.58 | 0.5 |
| Flattened Schema + Simple System Prompt | 0.54 | 0.6 | 0.54 |
| Flattened Schema + Focused System Prompt | 0.54 | 0.54 | 0.54 |
| Flattened Schema + Focused System Prompt + Function Name Optimized | 0.52 | 0.62 | 0.52 |
| Flattened Schema + Focused System Prompt + Function Description Optimized | 0.52 | 0.6 | 0.52 |
| Flattened Schema + Focused System Prompt containing Schema summary | 0.46 | 0.62 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.5 | 0.64 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.5 | 0.6 | 0.6 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.58 | 0.74 | 0.58 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.6 | 0.76 | 0.64 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.68 | 0.76 | 0.66 |


Now I know what you're thinking: they must have mixed up the haiku and opus scores. But believe me, I am equally surprised, and I can assure you that we ran the opus benchmark multiple times and checked the code carefully for possible bugs.

In the non-optimised baseline (no system prompt), opus, sonnet and haiku all outperform the GPT models. sonnet matches or outpaces haiku in every configuration, as expected. Had opus continued that trend, it would likely have surpassed the OpenAI models.

Final Conclusion of Function Calling Optimization for Different Models

OpenAI models, especially gpt-4-turbo-preview, are still the better choice regarding performance and cost.

| Optimization Approach | gpt-4-turbo-preview | gpt-4-turbo | gpt-4-0125-preview | claude-3-haiku-20240307 | claude-3-sonnet-20240229 | claude-3-opus-20240229 |
|---|---|---|---|---|---|---|
| No System Prompt | 0.36 | 0.36 | 0.353 | 0.48 | 0.6 | 0.42 |
| Flattening Schema | 0.527 | 0.487 | 0.533 | 0.5 | 0.58 | 0.5 |
| Flattened Schema + Simple System Prompt | 0.553 | 0.533 | 0.54 | 0.54 | 0.6 | 0.54 |
| Flattened Schema + Focused System Prompt | 0.633 | 0.633 | 0.64 | 0.54 | 0.54 | 0.54 |
| Flattened Schema + Focused System Prompt + Function Name Optimized | 0.553 | 0.607 | 0.587 | 0.52 | 0.62 | 0.52 |
| Flattened Schema + Focused System Prompt + Function Description Optimized | 0.633 | 0.66 | 0.673 | 0.52 | 0.6 | 0.52 |
| Flattened Schema + Focused System Prompt containing Schema summary | 0.64 | 0.553 | 0.64 | 0.46 | 0.62 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Name Optimized | 0.70 | 0.707 | 0.686 | 0.5 | 0.64 | 0.46 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function Description Optimized | 0.687 | 0.707 | 0.68 | 0.5 | 0.6 | 0.6 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized | 0.767 | 0.767 | 0.787 | 0.58 | 0.74 | 0.58 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Call examples added | 0.693 | 0.6 | 0.707 | 0.6 | 0.76 | 0.64 |
| Flattened Schema + Focused System Prompt containing Schema summary + Function and Parameter Descriptions Optimized + Function Parameter examples added | 0.787 | 0.693 | 0.787 | 0.68 | 0.76 | 0.66 |


All the code is available at: https://github.com/SamparkAI/Composio-Function-Calling-Benchmark/.

We're currently deciding which models to test next, perhaps Mistral or open-source options like Functionary or NexusRaven. Check out our repository and try running these models to compare their performance. If you have questions or suggestions, please submit a pull request. Thank you!
