From Graphs to Tokens: Substructure-Aware Molecular Representation for Large Language Models
Published in Information Processing & Management, 2026
Abstract
Large language models have shown promise for molecular reasoning, but graph tokenization remains a bottleneck. This paper introduces S2Token, a substructure-aware tokenizer that fragments molecular graphs into chemically meaningful, reusable units rather than relying solely on atom-level or graph-level tokens. The method is designed to preserve both representativeness and generalization when aligning molecular graphs with LLM token spaces.
Key ideas
- Treat functional molecular substructures as discrete tokens for molecular LLMs.
- Build a balanced substructure vocabulary through graph decomposition.
- Learn dual-view token embeddings that capture both structural and functional attributes of each substructure.
- Model inter-substructure dependencies with a substructure-level alignment mechanism in the LLM embedding space.
- Evaluate generalization with curated cross-dataset benchmarks spanning four molecular tasks.
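The vocabulary-building and tokenization steps above can be sketched in a simplified form. The snippet below is a minimal illustration, not the paper's method: it assumes molecules have already been decomposed into fragment strings (e.g., by a BRICS-style graph decomposition), and builds a frequency-thresholded vocabulary in the same spirit as subword vocabularies in NLP. All names (`build_vocab`, `tokenize`, the `min_count` cutoff) are hypothetical.

```python
from collections import Counter

def build_vocab(fragmented_molecules, min_count=2):
    """Keep fragments seen at least `min_count` times; map the rest to <unk>.
    `fragmented_molecules` is a list of molecules, each a list of fragment
    strings produced by some upstream graph decomposition (assumed here)."""
    counts = Counter(f for mol in fragmented_molecules for f in mol)
    vocab = {"<unk>": 0}
    for frag, c in counts.most_common():
        if c >= min_count:
            vocab[frag] = len(vocab)
    return vocab

def tokenize(fragments, vocab):
    """Map one molecule's fragment list to substructure token ids."""
    return [vocab.get(f, vocab["<unk>"]) for f in fragments]

# Toy corpus: fragments written as SMILES-like strings for readability.
mols = [
    ["c1ccccc1", "C(=O)O"],        # benzene ring + carboxyl
    ["c1ccccc1", "N"],             # benzene ring + amine
    ["C(=O)O", "c1ccccc1", "Cl"],  # "Cl" occurs once, so it maps to <unk>
]
vocab = build_vocab(mols, min_count=2)
print(tokenize(mols[2], vocab))
```

Rare fragments fall back to `<unk>`, which is one simple way to keep the vocabulary balanced between coverage of common chemistry and robustness to unseen substructures.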
Main results
- On the molecular caption benchmark, S2Token achieves a 12.6% average improvement across six metrics over the best LLM-based baseline.
- For forward reaction and retrosynthesis, it improves synthesized-molecule fingerprint similarity by 6.6% and 8.6%, respectively.
- On molecular property prediction within the generalist evaluation suite, it reduces MAE by 9.7% and 3.0% compared with graph-centric and node-centric tokenization methods, respectively.
Why it matters
The paper argues that molecular substructures behave like chemically meaningful “subwords.” By tokenizing at that level and explicitly modeling both intra-substructure semantics and inter-substructure dependencies, S2Token gives language models a more transferable representation space for unseen molecules.
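The dual-view embedding idea mentioned above can be sketched as a concatenate-and-project pattern. This is a toy illustration under stated assumptions, not the paper's architecture: it assumes each substructure carries a structural vector (say, from a graph encoder) and a functional vector (say, from property descriptors), and shows one generic way to map both views into a single LLM-sized embedding. The dimensions and the projection `W` are made up.

```python
import random

random.seed(0)
D_STRUCT, D_FUNC, D_LLM = 4, 3, 6  # toy dimensions, not from the paper

# Toy projection matrix W: (D_STRUCT + D_FUNC) -> D_LLM, random for illustration.
W = [[random.uniform(-1, 1) for _ in range(D_STRUCT + D_FUNC)]
     for _ in range(D_LLM)]

def dual_view_embed(structural, functional):
    """Concatenate the structural and functional views of one substructure,
    then linearly project the result into the LLM embedding space."""
    x = structural + functional  # list concatenation = vector concat
    return [sum(w_i * x_i for w_i, x_i in zip(row, x)) for row in W]

e = dual_view_embed([0.1, 0.2, 0.3, 0.4], [1.0, 0.0, 0.5])
print(len(e))
```

In a real system the projection would be learned jointly with the LLM so that substructure tokens land in a region of the embedding space the model can attend over, which is where inter-substructure dependencies would be modeled.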
Recommended citation: Runze Wang, Zijie Xing, *Xingyue Liu*, Mingqi Yang, Che He, Yanming Shen, “From Graphs to Tokens: Substructure-Aware Molecular Representation for Large Language Models,” Information Processing & Management, vol. 63, no. 6, p. 104771, Sep. 2026.
