About

Time: Wednesdays
Location: CoRE 431 (DIMACS seminar room)
Mailing list: ak2530@scarletmail.rutgers.edu

The seminar explores the fast-growing interface between mathematical reasoning, formal methods, and modern AI systems. It is co-organized by DIMACS and the Mathematics Department at Rutgers University, New Brunswick. Recordings of the talks are posted to the seminar's YouTube channel.

Organizers

Emily First, Ian Jauslin, Ayush Khaitan, Alex Kontorovich, Konstantin Mischaikow, Lirong Xia

Seminar Schedule

Fall 2025
Alex Kontorovich (Rutgers)
An Introduction to Lean + AI for Research Mathematicians

Abstract. We'll do a "show and tell" of what it's like to try to formalize some basic mathematics in Lean, with help from AI.
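
To give a flavor of what such a formalization looks like, here is a minimal Lean 4 example (illustrative only, not taken from the talk): the statement that the sum of two even integers is even, proved with Mathlib's Even API.

import Mathlib

-- Illustrative example (not from the talk): the sum of two even
-- integers is even. `Even.add` is the Mathlib lemma doing the work.
example (m n : ℤ) (hm : Even m) (hn : Even n) : Even (m + n) :=
  hm.add hn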

Emily First (Rutgers)
A Survey on AI for Proof Assistants

Abstract. In this talk, I'll provide an overview of recent advances in AI and machine learning for proof assistants such as Lean, Rocq, and Isabelle/HOL. I'll discuss both neural and symbolic techniques for various proof assistant tasks, including automated proof synthesis, autoformalization, premise selection, and lemma conjecturing. With respect to neural techniques, I'll cover the basics of applying language models, reinforcement learning, and graph neural networks to the theorem-proving domain.
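
As a concrete illustration of applying language models to proof synthesis, the sketch below shows the standard best-first search loop in which a model proposes tactics and the proof assistant checks them. The model and prover interfaces here are hypothetical placeholders, not a real API.

import heapq

# Hedged sketch of neural proof search. `model.suggest` and the
# `prover` methods are hypothetical stand-ins for a tactic-prediction
# model and a proof assistant interaction layer.
def best_first_proof_search(model, prover, theorem, max_expansions=1000):
    root = prover.init_state(theorem)           # initial proof state
    frontier = [(0.0, 0, root, [])]             # (cost, tiebreak, state, tactic path)
    counter = 1
    while frontier and max_expansions > 0:
        max_expansions -= 1
        cost, _, state, path = heapq.heappop(frontier)
        if state.no_goals():                    # all goals closed: proof found
            return path
        # The model scores candidate tactics given the printed proof state.
        for tactic, log_prob in model.suggest(state.pp(), k=8):
            next_state = prover.run_tactic(state, tactic)
            if next_state is not None:          # tactic applied successfully
                heapq.heappush(
                    frontier,
                    (cost - log_prob, counter, next_state, path + [tactic]))
                counter += 1
    return None                                 # search budget exhausted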

Chi Jin (Princeton)
Goedel-Prover-V2: The Strongest Open-Source Theorem Prover to Date

Abstract. This talk introduces Goedel-Prover-V2, an open-source model that sets a new state of the art in automated theorem proving.

Tomer Galanti (Texas A&M)
LLM-ERM: Sample-Efficient Program Learning via LLM-Guided Search

Abstract. We seek algorithms for program learning that are both sample-efficient and computationally feasible. Classical results show that targets admitting short program descriptions (e.g., short Python code) can be learned with a "small" number of examples (scaling with the size of the code) via length-first program enumeration, but the search is exponential in description length. Gradient-based training avoids this cost, yet can require exponentially many samples on certain short-program families.

To address this gap, we introduce LLM-ERM, a propose-and-verify framework that replaces exhaustive enumeration with an LLM-guided search over candidate programs while retaining ERM-style selection on held-out data. Specifically, we draw k candidates with a pretrained reasoning-augmented LLM, compile and check each on the data, and return the best verified hypothesis, with no feedback, adaptivity, or gradients. Theoretically, we show that coordinate-wise online mini-batch SGD requires many samples to learn certain short programs. Empirically, LLM-ERM solves tasks such as parity variants, pattern matching, and primality testing with as few as 200 samples, while SGD-trained transformers overfit even with 100,000 samples. These results indicate that language-guided program synthesis recovers much of the statistical efficiency of finite-class ERM while remaining computationally tractable, offering a practical route to learning succinct hypotheses beyond the reach of gradient-based training.
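
Since the propose-and-verify loop is simple enough to state in a few lines, here is a hedged Python reconstruction of it from the abstract; llm_sample is a hypothetical stand-in for the pretrained reasoning-augmented LLM, and candidates are assumed to define a function f.

# Hedged reconstruction of the LLM-ERM loop described above (not the
# authors' code). `llm_sample` stands in for a pretrained LLM that
# returns k candidate programs as source text, drawn independently
# with no feedback, adaptivity, or gradients.
def llm_erm(task_description, train_set, heldout_set, llm_sample, k=100):
    best_program, best_score = None, -1.0
    for source in llm_sample(task_description, k):
        try:
            namespace = {}
            exec(source, namespace)                 # compile the candidate
            f = namespace["f"]                      # candidate must define f(x)
        except Exception:
            continue                                # discard non-compiling candidates
        try:
            # Verify: keep only candidates consistent with the training data.
            if not all(f(x) == y for x, y in train_set):
                continue
            # ERM-style selection: score survivors on held-out data.
            score = sum(f(x) == y for x, y in heldout_set) / len(heldout_set)
        except Exception:
            continue                                # discard crashing candidates
        if score > best_score:
            best_program, best_score = source, score
    return best_program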

Ayush Khaitan (Rutgers)
O-Forge: A Verifiable, LLM-Driven Framework for Proving Inequalities in Research Mathematics

Abstract. We introduce a framework that combines an LLM with computer algebra software to prove sophisticated inequalities in research mathematics. We first ask a frontier LLM to break a problem into its simplest parts, and then ask a computer algebra system to complete the proof of each part. In doing so, we solve a concrete question posed by Terence Tao. This is joint work with Vijay Ganesh (Georgia Tech).
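
The division of labor described here can be illustrated with a small, hedged sketch (not the actual O-Forge code): an LLM would propose a case split of the domain, and a computer algebra system certifies the inequality on each piece. Below, the split is hard-coded for illustration and SymPy plays the role of the CAS.

import sympy as sp

# Illustrative decompose-then-verify sketch (not the O-Forge code).
x = sp.symbols('x', real=True)
f = x**2 - x + 1                       # claim: f(x) > 0 for all real x

# An "LLM-proposed" decomposition of the real line into simple pieces;
# here it is hard-coded for illustration.
pieces = [sp.Interval(-sp.oo, 0), sp.Interval(0, 1), sp.Interval(1, sp.oo)]

for piece in pieces:
    # Ask the CAS for counterexamples to f > 0 on this piece.
    violations = sp.solveset(f <= 0, x, domain=piece)
    assert violations == sp.S.EmptySet, f"not verified on {piece}"
print("inequality certified on every piece")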

Chenyang An (Amazon)
DeRL: Diverse-Exploration Reinforcement Learning for Large Language Models

Abstract. Current reinforcement-learning (RL) pipelines for large language models (LLMs) that tackle mathematical reasoning and formal theorem proving tend to over-exploit a few high-probability chain-of-thought (CoT) sequences. Because rewards are granted solely for producing correct answers, the policy quickly converges on those paths, neglecting the rich space of alternative proofs and solution strategies that math problems usually admit. We address this limitation with Diverse-Exploration RL (DeRL), a simple yet effective modification to the reward function and the RL prompts. During training, the model is explicitly instructed to solve each problem without relying on its previously generated CoT. An auxiliary LLM judge then verifies how dissimilar the new output's approach is from the previous CoT. Combined with the correctness metric, this new reward encourages exploration of novel reasoning paths while preserving accuracy. We test DeRL on both natural-language math questions with boxed answers and formal theorem-proving problems in Lean. Across the MATH benchmark and the Leanabell dataset, DeRL yields a relative gain of more than 10% over the PPO baseline on the Pass@1 metric, and it consistently improves results on the Pass@N metric as well. Our findings demonstrate that diversity-aware rewards facilitate broader exploration and enhance the reasoning capabilities of LLMs, indicating a promising direction for improving current reinforcement learning pipelines.
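
The reward modification is easy to sketch; the following is a hedged reconstruction from the abstract, with judge_dissimilar standing in for the auxiliary LLM judge and check_answer for the usual correctness check.

# Hedged reconstruction of the diversity-aware reward (not the
# authors' code). `check_answer` and `judge_dissimilar` are
# hypothetical stand-ins for answer checking and the LLM judge.
def derl_reward(new_cot, new_answer, previous_cots, reference_answer,
                check_answer, judge_dissimilar, diversity_weight=0.5):
    # Standard RL-for-reasoning reward: 1 for a correct boxed answer
    # (or a Lean proof that checks), 0 otherwise.
    correctness = 1.0 if check_answer(new_answer, reference_answer) else 0.0
    if not previous_cots:
        return correctness
    # The judge scores how different the new approach is from each
    # previously generated chain of thought (1.0 = entirely new).
    novelty = min(judge_dissimilar(new_cot, old) for old in previous_cots)
    # Correct and novel solutions earn the full reward; correct but
    # repetitive ones are discounted, encouraging new reasoning paths.
    return correctness * (1.0 - diversity_weight + diversity_weight * novelty)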

Vijay Ganesh (Georgia Tech)
Auto-formalization via Joint Embeddings

Abstract. In recent years we have witnessed a symbiotic trend wherein LLMs are combined with provers, solvers, and computer algebra systems, resulting in dramatic breakthroughs in AI for math. Following this trend, my research group has developed two lines of work. The first is the idea that "good" joint embeddings (JEs) can dramatically improve the efficacy of LLM-based auto-formalization tools. We say that JEs are good if they respect the following invariant: semantically equivalent but formally dissimilar objects (e.g., pairs of semantically equivalent natural- and formal-language proofs) must be "close by" in the embedding space, and semantically inequivalent ones "far apart." We use such JE models as part of a successful RAG-based auto-formalization pipeline, demonstrating that such JEs are a critical AI-for-math technology. The second idea is Reinforcement Learning with Symbolic Feedback (RLSF), a class of techniques that addresses the LLM hallucination problem in contexts, such as math, physics, and code, where we have access to rich symbolic feedback, demonstrating that these techniques too are critical to the success of AI for math.
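
The invariant on joint embeddings can be written down as a standard contrastive objective; the sketch below (illustrative, not the group's training code) uses an InfoNCE-style loss in PyTorch, where matched rows of the two batches are paired natural-language and formal proofs.

import torch
import torch.nn.functional as F

# Hedged sketch of training a joint embedding with a contrastive
# objective (illustrative, not the actual pipeline). Row i of
# `informal` and row i of `formal` embed a semantically equivalent
# pair; all other pairings are treated as negatives.
def joint_embedding_loss(informal, formal, temperature=0.07):
    informal = F.normalize(informal, dim=-1)
    formal = F.normalize(formal, dim=-1)
    logits = informal @ formal.T / temperature       # pairwise cosine similarities
    targets = torch.arange(informal.size(0), device=informal.device)
    # Symmetric cross-entropy: each informal proof should retrieve its
    # formal counterpart, and vice versa.
    return (F.cross_entropy(logits, targets)
            + F.cross_entropy(logits.T, targets)) / 2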

Kaiyu Yang (FAIR, Meta)
Verina: Benchmarking Verifiable Code Generation

Abstract. Large language models (LLMs) are increasingly integrated into software development, but ensuring correctness in LLM-generated code remains challenging and often requires costly manual review. Verifiable code generation (jointly generating code, specifications, and proofs of code-specification alignment) offers a promising path to address this limitation and further unleash LLMs' benefits in coding. Yet there is a significant gap in evaluation: current benchmarks often lack support for end-to-end verifiable code generation. In this work, we introduce VERINA (Verifiable Code Generation Arena), a high-quality benchmark enabling comprehensive and modular evaluation of code, specification, and proof generation, as well as their compositions. VERINA consists of 189 manually curated coding tasks in Lean, with detailed problem descriptions, reference implementations, formal specifications, and extensive test suites. Our extensive evaluation of state-of-the-art LLMs reveals significant challenges in verifiable code generation, especially in proof generation, underscoring the need for improving LLM-based theorem provers in verification domains. The best model, OpenAI o4-mini, generates only 61.4% correct code, 51.0% sound and complete specifications, and 3.6% successful proofs, with one trial per task. We hope VERINA will catalyze progress in verifiable code generation by providing a rigorous and comprehensive benchmark.
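
To make the three artifacts concrete, here is a toy Lean 4 example in the shape of a verifiable-code-generation task (illustrative only, not a VERINA task): an implementation, a formal specification, and a proof that the two align.

-- Illustrative toy example (not a VERINA task).

-- (1) Code: an implementation to be verified.
def myMax (a b : Nat) : Nat := if a ≤ b then b else a

-- (2) Specification: what the implementation must satisfy.
def maxSpec (a b r : Nat) : Prop := a ≤ r ∧ b ≤ r ∧ (r = a ∨ r = b)

-- (3) Proof: the implementation meets the specification.
theorem myMax_correct (a b : Nat) : maxSpec a b (myMax a b) := by
  unfold myMax maxSpec
  split <;> omega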

Wuyang Chen (Simon Fraser University)
Hybrid Learning Machines Bridge AI and Physical Modeling

Abstract. Recent progress in large language models (LLMs) has transformed text and code generation, yet models still falter on partial differential equations (PDEs), where correctness, constraints, and physical consequences are critical. This talk explores how formal LLM reasoning can advance symbolic PDE modeling. First, our PDE-Controller formalizes informal PDEs, synthesizes solver-ready code, and plans subgoals to tackle nonconvex control via interactions with external solvers. Second, our Lean Finder accelerates PDE formalization with a semantics-aware search engine for Lean/Mathlib that retrieves relevant theorems, outperforming GPT models and earning a strong reception in the AI-for-math community. Building on these efforts, we aim to design a semantics-first LLM that autoformalizes informal PDE problems into machine-checked specifications, synthesizes solver-ready code, and plans subgoals, closing the loop between formal analysis and LLM reasoning and surpassing human heuristics across diverse PDEs.
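
The retrieval idea behind a semantics-aware search engine can be sketched in a few lines; the snippet below is illustrative (names and interfaces are hypothetical, not Lean Finder's), embedding library statements offline and matching a natural-language query by cosine similarity.

import numpy as np

# Hedged sketch of embedding-based search over a theorem library
# (illustrative; `embed` is a hypothetical text-embedding model).
def build_index(embed, statements):
    vecs = np.stack([embed(s) for s in statements])
    return vecs / np.linalg.norm(vecs, axis=1, keepdims=True)

def search(embed, index, statements, query, top_k=5):
    q = embed(query)
    q = q / np.linalg.norm(q)
    scores = index @ q                    # cosine similarity to every statement
    best = np.argsort(-scores)[:top_k]
    return [(statements[i], float(scores[i])) for i in best]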

Schedule subject to change; check back for abstracts and slides.