DARE: Aligning LLM Agents with the R Statistical Ecosystem via Distribution-Aware Retrieval

Maojun Sun*1, Yue Wu*1, Yifei Xie*1, Ruijian Han†1, Binyan Jiang1, Defeng Sun2, Yancheng Yuan†2, Jian Huang†1,2
1Department of Data Science and Artificial Intelligence, 2Department of Applied Mathematics,
The Hong Kong Polytechnic University
{mj.sun, yue0301.wu, yifei.xie}@connect.polyu.hk
{ruijian.han, yancheng.yuan, j.huang}@polyu.edu.hk

Abstract

Large Language Model (LLM) agents can automate data-science workflows, but many rigorous statistical methods implemented in R remain underused because LLMs struggle with statistical knowledge and tool retrieval. Existing retrieval-augmented approaches focus on function-level semantics and ignore data distribution, producing suboptimal matches. We propose DARE (Distribution-Aware Retrieval Embedding), a lightweight, plug-and-play retrieval model that incorporates data distribution information into function representations for R package retrieval. Our main contributions are: (i) RPKB, a curated R Package Knowledge Base derived from 8,191 high-quality CRAN packages; (ii) DARE, an embedding model that fuses distributional features with function metadata to improve retrieval relevance; and (iii) RCodingAgent, an R-oriented LLM agent for reliable R code generation and a suite of statistical analysis tasks for systematically evaluating LLM agents in realistic analytical scenarios. Empirically, DARE achieves an NDCG@10 of 93.47%, outperforming state-of-the-art open-source embedding models by up to 17% on package retrieval while using substantially fewer parameters. Integrating DARE into RCodingAgent yields significant gains on downstream analysis tasks. This work helps narrow the gap between LLM automation and the mature R statistical ecosystem.
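The core idea of DARE, conditioning retrieval on properties of the data at hand rather than on the query text alone, can be sketched in a few lines. This is an illustrative sketch only, not the authors' implementation: the choice of summary statistics, the concatenation-based fusion, and all function names here are assumptions for exposition.

```python
import numpy as np

def dist_features(x):
    """Summarize a numeric column as a small feature vector.
    The statistics chosen here (mean, std, skewness, missing rate)
    are illustrative assumptions, not DARE's actual features."""
    x = np.asarray(x, dtype=float)
    ok = x[~np.isnan(x)]
    mean, std = ok.mean(), ok.std()
    skew = ((ok - mean) ** 3).mean() / (std ** 3 + 1e-9)
    return np.array([mean, std, skew, np.isnan(x).mean()])

def fuse(text_emb, data):
    """Concatenate a text embedding with distributional features,
    then L2-normalize so cosine similarity reduces to a dot product."""
    v = np.concatenate([np.asarray(text_emb, dtype=float),
                        dist_features(data)])
    return v / np.linalg.norm(v)

def retrieve(query_vec, index):
    """Rank candidate function vectors by cosine similarity
    (index: name -> fused, normalized vector)."""
    names, mat = zip(*index.items())
    sims = np.stack(mat) @ query_vec
    return [names[i] for i in np.argsort(-sims)]
```

Because the data's distributional signature enters the query vector itself, two textually identical queries issued over, say, a heavily skewed column and a symmetric one produce different rankings, which is exactly the failure mode of purely semantic function search that the paper targets.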

Methods

Figure 1: Comparison of traditional semantic function search methods and Distribution-Aware Retrieval Embedding (DARE) method.

Figure 2: The overall framework of DARE training process.

Figure 3: An example of RCodingAgent for realistic statistical analysis.

Figure 4: Upper panel: Pipeline for constructing R-based statistical evaluation tasks. Lower panel: Overview of selected domains and R packages covered in the benchmark.

Experimental Results

Table 1: Performance comparison with open-source state-of-the-art embedding models on the RPKB test set. Despite having only 23M parameters, DARE significantly outperforms all other models (up to 568M parameters) across all retrieval metrics.
Model Params NDCG@10 MRR@10 Recall@10 Recall@1
Snowflake/arctic-embed-l 335M 0.7932 0.7510 0.9235 0.6549
intfloat/e5-large-v2 335M 0.7513 0.7086 0.8838 0.6152
jina-embeddings-v2-base-en 137M 0.7429 0.6965 0.8873 0.5969
BAAI/bge-m3 568M 0.7308 0.6843 0.8758 0.5847
mxbai-embed-large-v1 335M 0.7068 0.6565 0.8639 0.5508
UAE-Large-V1 335M 0.7066 0.6556 0.8658 0.5479
gte-large-en-v1.5 435M 0.6639 0.6122 0.8257 0.5040
all-mpnet-base-v2 110M 0.6606 0.6057 0.8330 0.4937
Base Model (MiniLM) 23M 0.6127 0.5553 0.7936 0.4412
DARE (Ours) 23M 0.9347 0.9176 0.9863 0.8739
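The retrieval metrics in Table 1 are standard; for concreteness, a minimal sketch of how they are computed, under the simplifying assumption of binary relevance with a single relevant package per query (the actual evaluation protocol may differ):

```python
import math

def ndcg_at_k(ranked, relevant, k=10):
    """NDCG@k with binary relevance and one relevant item per query:
    DCG is 1/log2(rank + 1) if the item appears in the top k,
    and the ideal DCG is 1, so no further normalization is needed."""
    for i, item in enumerate(ranked[:k]):
        if item == relevant:
            return 1.0 / math.log2(i + 2)
    return 0.0

def mrr_at_k(ranked, relevant, k=10):
    """Reciprocal rank of the relevant item within the top k."""
    for i, item in enumerate(ranked[:k]):
        if item == relevant:
            return 1.0 / (i + 1)
    return 0.0

def recall_at_k(ranked, relevant, k=10):
    """1 if the relevant item is retrieved in the top k, else 0."""
    return 1.0 if relevant in ranked[:k] else 0.0
```

Corpus-level scores, as reported in the table, are the mean of these per-query values; Recall@1 in particular is just the fraction of queries whose top-ranked result is the correct package.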

Figure 5: Results of QPS and Latency.

Downstream Agent Performance

Table 2: End-to-end success rates of various LLM agents on the statistical analysis tasks, with and without the DARE module. Values in parentheses give the absolute percentage-point improvement over the corresponding agent without DARE.
Model RCodingAgent (w/o DARE) RCodingAgent with DARE
claude-haiku-4.5 6.25% 56.25% (50.00%)
deepseek-v3.2 18.75% 56.25% (37.50%)
gpt-5.2 25.00% 62.50% (37.50%)
grok-4.1-fast 18.75% 75.00% (56.25%)
mimo-v2-flash 12.50% 62.50% (50.00%)
minimax-m2.1 12.50% 68.75% (56.25%)
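The parenthetical values in Table 2 are absolute percentage-point gains, i.e. the with-DARE rate minus the baseline rate; a one-line helper (hypothetical, for checking the table's arithmetic) makes this explicit:

```python
def abs_gain(baseline_pct, with_dare_pct):
    """Absolute percentage-point improvement from adding DARE."""
    return round(with_dare_pct - baseline_pct, 2)
```

For example, claude-haiku-4.5 goes from 6.25% to 56.25%, a gain of 50.00 points, matching the table.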