DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Maojun Sun*1, Yue Wu*1, Yifei Xie*1, Ruijian Han†1, Binyan Jiang1, Defeng Sun2, Yancheng Yuan†2, Jian Huang†1,2
1Department of Data Science and Artificial Intelligence, 2Department of Applied Mathematics,
The Hong Kong Polytechnic University
{mj.sun, yue0301.wu, yifei.xie}@connect.polyu.hk
{ruijian.han, yancheng.yuan, j.huang}@polyu.edu.hk

Abstract

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
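The core idea of DARE, conditioning retrieval on properties of the data at hand rather than on the query text alone, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the choice of summary statistics, the concatenation-based fusion, and all function names here are assumptions for exposition.

```python
import numpy as np

def dist_features(x):
    """Summarize a numeric column as a small feature vector.
    The statistics chosen here (mean, std, skewness, missing rate)
    are illustrative assumptions, not DARE's actual features."""
    x = np.asarray(x, dtype=float)
    ok = x[~np.isnan(x)]
    mean, std = ok.mean(), ok.std()
    skew = ((ok - mean) ** 3).mean() / (std ** 3 + 1e-9)
    return np.array([mean, std, skew, np.isnan(x).mean()])

def fuse(text_emb, data):
    """Concatenate a text embedding with distributional features,
    then L2-normalize so cosine similarity reduces to a dot product."""
    v = np.concatenate([np.asarray(text_emb, dtype=float),
                        dist_features(data)])
    return v / np.linalg.norm(v)

def retrieve(query_vec, index):
    """Rank candidate function vectors by cosine similarity
    (index: name -> fused, normalized vector)."""
    names, mat = zip(*index.items())
    sims = np.stack(mat) @ query_vec
    return [names[i] for i in np.argsort(-sims)]
```

Because the data's distributional signature enters the query vector itself, two textually identical queries issued over, say, a heavily skewed column and a symmetric one produce different rankings, which is exactly the failure mode of purely semantic function search that the paper targets.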

Methods

Figure 1: Comparison of traditional semantic function search methods and Distribution-Aware Retrieval Embedding (DARE) method.

Figure 2: The overall framework of DARE training process.

Figure 3: An example of RCodingAgent for realistic statistical analysis.

Figure 4: Upper panel: Pipeline for constructing R-based statistical evaluation tasks. Lower panel: Overview of selected domains and R packages covered in the benchmark.

Experimental Results

Table 1: Performance comparison with open-source state-of-the-art embedding models on the RPKB test set. Despite having only 23M parameters, DARE significantly outperforms all other models (up to 568M parameters) across all retrieval metrics.
Model Params NDCG@10 MRR@10 Recall@10 Recall@1
Snowflake/arctic-embed-l 335M 0.7932 0.7510 0.9235 0.6549
intfloat/e5-large-v2 335M 0.7513 0.7086 0.8838 0.6152
jina-embeddings-v2-base-en 137M 0.7429 0.6965 0.8873 0.5969
BAAI/bge-m3 568M 0.7308 0.6843 0.8758 0.5847
mxbai-embed-large-v1 335M 0.7068 0.6565 0.8639 0.5508
UAE-Large-V1 335M 0.7066 0.6556 0.8658 0.5479
gte-large-en-v1.5 435M 0.6639 0.6122 0.8257 0.5040
all-mpnet-base-v2 110M 0.6606 0.6057 0.8330 0.4937
Base Model (MiniLM) 23M 0.6127 0.5553 0.7936 0.4412
DARE (Ours) 23M 0.9347 0.9176 0.9863 0.8739
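The retrieval metrics in Table 1 are standard; for concreteness, a minimal sketch of how they are computed, under the simplifying assumption of binary relevance with a single relevant package per query (the actual evaluation protocol may differ):

```python
import math

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k with binary relevance and one relevant item per query:
    DCG is 1/log2(rank + 1) if the item appears in the top k,
    and the ideal DCG is 1, so no further normalization is needed."""
    for i, item in enumerate(ranked[:k]):
        if item == relevant:
            return 1.0 / math.log2(i + 2)
    return 0.0

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the relevant item within the top k."""
    for i, item in enumerate(ranked[:k]):
        if item == relevant:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked, relevant, k=10):
    """1 if the relevant item is retrieved in the top k, else 0."""
    return 1.0 if relevant in ranked[:k] else 0.0
```

Corpus-level scores, as reported in the table, are the mean of these per-query values; Recall@1 in particular is just the fraction of queries whose top-ranked result is the correct package.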

Figure 5: Results of QPS and Latency.

Downstream Agent Performance

Table 2: End-to-end success rates of various LLM agents on the statistical analysis tasks, with and without the DARE module. Values in parentheses give the absolute percentage-point improvement over the corresponding agent without DARE.
Model RCodingAgent (w/o DARE) RCodingAgent with DARE
claude-haiku-4.5 6.25% 56.25% (50.00%)
deepseek-v3.2 18.75% 56.25% (37.50%)
gpt-5.2 25.00% 62.50% (37.50%)
grok-4.1-fast 18.75% 75.00% (56.25%)
mimo-v2-flash 12.50% 62.50% (50.00%)
minimax-m2.1 12.50% 68.75% (56.25%)
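The parenthetical values in Table 2 are absolute percentage-point gains, i.e. the with-DARE rate minus the baseline rate; a one-line helper (hypothetical, for checking the table's arithmetic) makes this explicit:

```python
def abs_gain(baseline_pct, with_dare_pct):
    """Absolute percentage-point improvement from adding DARE."""
    return round(with_dare_pct - baseline_pct, 2)
```

For example, claude-haiku-4.5 goes from 6.25% to 56.25%, a gain of 50.00 points, matching the table.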