Exploring DeepSeek-R1 s Agentic Capabilities Through Code Actions
I ran a fast experiment investigating how DeepSeek-R1 performs on agentic tasks, in spite of not supporting tool usage natively, and I was quite amazed by initial results. This experiment runs DeepSeek-R1 in a single-agent setup, dokuwiki.stream where the model not only plans the actions but also formulates the actions as executable Python code. On a subset1 of the GAIA validation split, DeepSeek-R1 surpasses Claude 3.5 Sonnet by 12.5% absolute, from 53.1% to 65.6% correct, and other models by an even bigger margin:
The experiment followed design usage guidelines from the DeepSeek-R1 paper and oke.zone the design card: Don't use few-shot examples, prevent adding a system timely, and securityholes.science set the temperature level to 0.5 - 0.7 (0.6 was used). You can find further assessment details here.
Approach
DeepSeek-R1's strong coding abilities enable it to act as a representative without being clearly trained for tool use. By allowing the design to create actions as Python code, it can flexibly communicate with environments through code execution.
Tools are executed as Python code that is consisted of straight in the prompt. This can be a simple function meaning or a module of a larger plan - any code. The model then produces code actions that call these tools.
Arise from performing these actions feed back to the model as follow-up messages, driving the next steps till a last response is reached. The representative structure is an easy iterative coding loop that moderates the discussion between the design and its environment.
Conversations
DeepSeek-R1 is utilized as chat design in my experiment, where the design autonomously pulls extra context from its environment by utilizing tools e.g. by utilizing a search engine or bring information from websites. This drives the conversation with the environment that continues up until a final answer is reached.
In contrast, o1 models are known to perform poorly when utilized as chat designs i.e. they do not attempt to pull context throughout a discussion. According to the connected post, o1 designs carry out best when they have the full context available, with clear directions on what to do with it.
Initially, I likewise tried a complete context in a single prompt method at each action (with arise from previous actions consisted of), but this caused considerably lower ratings on the GAIA subset. Switching to the conversational method explained above, setiathome.berkeley.edu I was able to reach the reported 65.6% performance.
This raises a fascinating question about the claim that o1 isn't a chat model - perhaps this observation was more relevant to older o1 models that did not have tool use capabilities? After all, bybio.co isn't tool usage support a crucial system for allowing models to pull extra context from their environment? This conversational technique certainly appears reliable for yewiki.org DeepSeek-R1, though I still need to carry out similar explores o1 models.
Generalization
Although DeepSeek-R1 was mainly trained with RL on math and coding jobs, it is impressive that generalization to agentic jobs with tool usage through code actions works so well. This capability to generalize to agentic jobs advises of recent research study by DeepMind that shows that RL generalizes whereas SFT memorizes, although generalization to tool usage wasn't examined in that work.
Despite its ability to generalize to tool use, DeepSeek-R1 typically produces very long thinking traces at each action, compared to other designs in my experiments, restricting the effectiveness of this model in a single-agent setup. Even easier tasks sometimes take a long time to complete. Further RL on agentic tool use, be it through code actions or not, might be one option to improve effectiveness.
Underthinking
I also observed the underthinking phenomon with DeepSeek-R1. This is when a thinking model regularly switches in between different thinking thoughts without sufficiently exploring appealing paths to reach a proper service. This was a major reason for overly long thinking traces produced by DeepSeek-R1. This can be seen in the tape-recorded traces that are available for download.
Future experiments
Another typical application of reasoning models is to utilize them for planning only, while utilizing other designs for generating code actions. This could be a potential new function of freeact, addsub.wiki if this separation of roles shows beneficial for more complex tasks.
I'm likewise curious about how reasoning models that already support tool use (like o1, o3, ...) perform in a single-agent setup, with and without producing code actions. Recent developments like OpenAI's Deep Research or Hugging Face's open-source Deep Research, which likewise utilizes code actions, look fascinating.