Hongliang Lu(卢红亮)

Welcome to my personal homepage! I am a third-year M.E. student in Mechanical Engineering at the College of Engineering, Peking University, advised by Prof. Zaiwen Wen. I received my B.E. degree in Robotics Engineering from Peking University in 2023.

My research centers on the synergy between Reinforcement Learning and Large Language Models, focusing on three key directions:

RL for LLMs: Developing data-efficient reinforcement learning algorithms to enhance post-training effectiveness, aiming to improve model performance and alignment with human preferences;
Agentic RL: Designing novel RL methods to advance autonomous agent capabilities, with a particular emphasis on self-evolving mechanisms that push the boundaries of agent performance through continuous self-improvement and autonomous capability scaling;
LLMs for Optimization: Leveraging the reasoning capabilities of large language models to tackle complex optimization modeling and decision-making problems.

I have interned at two leading AI companies. At Alibaba’s QuarkLLM team(2025.05-2025.09), I contributed to the Deep Search project, designing RL algorithms to strengthen the Deep Search agent’s performance on tasks requiring multi-step reasoning and complex retrieval . Previously, at Moonshot AI (202501-2025.05), I worked on data synthesis and RL training for their WebAgent.

news

Jan 28, 2026	Our paper “Search Self-Play: Pushing the Frontier of Agent Capability without Supervision” has been accepted to ICLR 2026! 🎉
Oct 22, 2025	We are excited to release our latest research work in Agentic RL: “Search Self-Play: Pushing the Frontier of Agent Capability without Supervision”! 🚀 The paper has been submitted to ICLR 2026 and explores novel self-play training methods for enhancing agent capabilities without supervision.
May 01, 2025	Our paper “OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling” has been accepted as a poster presentation at ICML 2025! 🎉

selected publications

ICLR 2026

Search Self-Play: Pushing the Frontier of Agent Capability without Supervision

Hongliang Lu^*, Yuhang Wen^*, Pengyu Cheng, and 7 more authors

The Fourteenth International Conference on Learning Representations, 2026

Abs arXiv Code

Reinforcement learning with verifiable rewards (RLVR) has become the mainstream technique for training LLM agents. However, RLVR highly depends on well-crafted task queries and corresponding groundtruth answers to provide accurate rewards, which requires massive human efforts and hinders the RL scaling processes, especially under agentic scenarios. Although a few recent works explore task synthesis methods, the difficulty of generated agentic tasks can hardly be controlled to provide effective RL training advantages. To achieve agentic RLVR with higher scalability, we explore self-play training for deep search agents, in which the learning LLM utilizes multi-turn search engine calling and acts simultaneously as both a task proposer and a problem solver. The task proposer aims to generate deep search queries with well-defined ground-truth answers and increasing task difficulty. The problem solver tries to handle the generated search queries and output the correct answer predictions. To ensure that each generated search query has accurate ground truth, we collect all the searching results from the proposer’s trajectory as external knowledge, then conduct retrieval-augmentation generation (RAG) to test whether the proposed query can be correctly answered with all necessary search documents provided. In this search self-play (SSP) game, the proposer and the solver co-evolve their agent capabilities through both competition and cooperation. With substantial experimental results, we find that SSP can significantly improve search agents’ performance uniformly on various benchmarks without any supervision under both from-scratch and continuous RL training setups.
ICML 2025

OptMATH: A Scalable Bidirectional Data Synthesis Framework for Optimization Modeling

Hongliang Lu^*, Zhonglin Xie^*, Yaoyu Wu, and 3 more authors

Forty-Second International Conference on Machine Learning, 2025

Abs arXiv Code Poster Slides Website

Despite the rapid development of large language models (LLMs), a fundamental challenge persists: the lack of high-quality optimization modeling datasets hampers LLMs’ robust modeling of practical optimization problems from natural language descriptions (NL). This data scarcity also contributes to the generalization difficulties experienced by learning-based methods. To address these challenges, we propose a scalable framework for synthesizing a high-quality dataset, named OptMATH. Starting from curated seed data with mathematical formulations (MF), this framework automatically generates problem data (PD) with controllable complexity. Then, a backtranslation step is employed to obtain NL. To verify the correspondence between the NL and the PD, a forward modeling step followed by rejection sampling is used. The accepted pairs constitute the training part of OptMATH. Then a collection of rejected pairs is identified and further filtered. This collection serves as a new benchmark for optimization modeling, containing difficult instances whose lengths are much longer than these of NL4OPT and MAMO. Through extensive experiments, we demonstrate that models of various sizes (0.5B-32B parameters) trained on OptMATH achieve superior results on multiple modeling benchmarks, thereby validating the effectiveness and scalability of our approach.

latest posts

Nov 08, 2024	Scaling Law
Jul 26, 2024	Transformer Architecture Explained: Attention is All You Need
Jul 26, 2024	Understanding Attention Mechanism: Self-Attention and Attention Models