From Graphs to Tokens: Substructure-Aware Molecular Representation for Large Language Models

Published in Information Processing & Management, 2026

Abstract

Large language models have shown promise for molecular reasoning, but graph tokenization remains a bottleneck. This paper introduces S2Token, a substructure-aware tokenizer that fragments molecular graphs into chemically meaningful reusable units instead of relying on atom-level or graph-level tokens alone. The method is designed to preserve both representativeness and generalization when aligning molecular graphs with LLM token spaces.

Key ideas

  • Treat functional molecular substructures as discrete tokens for molecular LLMs.
  • Build a balanced substructure vocabulary through graph decomposition.
  • Learn dual-view token embeddings that capture both structural and functional attributes of each substructure.
  • Model inter-substructure dependencies with a substructure-level alignment mechanism in the LLM embedding space.
  • Evaluate generalization with curated cross-dataset benchmarks spanning four molecular tasks.
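
To make the first two ideas concrete, here is a minimal sketch of fragment-level tokenization on toy molecular graphs. It is not the S2Token algorithm: the names `fragment`, `fragment_key`, `build_vocabulary`, and `tokenize` are hypothetical, the bonds to cut are supplied by hand rather than chosen by a decomposition rule, and the fragment label is a crude element count rather than a true canonical form. It only illustrates how reusable substructures can share one token id across molecules.

```python
from collections import Counter


def fragment(atoms, bonds, cut_bonds):
    """Split a molecular graph into fragments by dropping the bonds in cut_bonds.

    atoms: dict mapping node id -> element symbol
    bonds: iterable of (u, v) node-id pairs
    cut_bonds: set of frozenset({u, v}); assumed given, not learned
    """
    adjacency = {node: set() for node in atoms}
    for u, v in bonds:
        if frozenset((u, v)) not in cut_bonds:
            adjacency[u].add(v)
            adjacency[v].add(u)

    seen, fragments = set(), []
    for start in atoms:  # dict order makes the fragment order deterministic
        if start in seen:
            continue
        seen.add(start)
        stack, component = [start], []
        while stack:  # depth-first search over the remaining bonds
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        fragments.append(sorted(component))
    return fragments


def fragment_key(atoms, fragment_nodes):
    # Crude label: element counts only. Fragments with the same atoms but
    # different connectivity would collide, so this is illustrative only.
    counts = Counter(atoms[n] for n in fragment_nodes)
    return "".join(f"{element}{counts[element]}" for element in sorted(counts))


def build_vocabulary(molecules):
    # Assign token ids in order of first appearance.
    vocab = {}
    for atoms, bonds, cut_bonds in molecules:
        for frag in fragment(atoms, bonds, cut_bonds):
            vocab.setdefault(fragment_key(atoms, frag), len(vocab))
    return vocab


def tokenize(molecule, vocab):
    atoms, bonds, cut_bonds = molecule
    return [vocab[fragment_key(atoms, f)] for f in fragment(atoms, bonds, cut_bonds)]


# Heavy-atom graphs; hydrogens and bond orders are omitted for brevity.
ethyl_acetate = (
    {0: "C", 1: "C", 2: "O", 3: "O", 4: "C", 5: "C"},
    [(0, 1), (1, 2), (1, 3), (3, 4), (4, 5)],
    {frozenset((3, 4))},  # cut between the ester oxygen and the ethyl group
)
ethanol = (
    {0: "C", 1: "C", 2: "O"},
    [(0, 1), (1, 2)],
    {frozenset((1, 2))},  # cut between the ethyl group and the oxygen
)

vocab = build_vocabulary([ethyl_acetate, ethanol])
acetate_tokens = tokenize(ethyl_acetate, vocab)
ethanol_tokens = tokenize(ethanol, vocab)
```

In this toy setup both molecules contain the two-carbon ethyl fragment, so both token sequences include the same id for it. A practical tokenizer would need a chemically motivated cut rule and a canonical fragment identifier, neither of which is specified on this page.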

Main results

  • On the molecular caption benchmark, S2Token achieves a 12.6% average improvement across six metrics over the best LLM-based baseline.
  • For forward reaction and retrosynthesis, it improves synthesized-molecule fingerprint similarity by 6.6% and 8.6%, respectively.
  • On molecular property prediction within the generalist evaluation suite, it reduces MAE by 9.7% and 3.0% compared with graph-centric and node-centric tokenization methods, respectively.
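
For readers unfamiliar with the quantities behind these numbers, the sketch below shows one common way to compute a fingerprint similarity, an MAE, and a relative change. Two assumptions are made here that this page does not confirm: that fingerprint similarity is the Tanimoto coefficient over sets of on-bits, and that the reported percentages are relative changes against a baseline. The function names are illustrative.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bit indices."""
    a, b = set(fp_a), set(fp_b)
    union = a | b
    if not union:
        return 1.0  # convention chosen here: two empty fingerprints count as identical
    return len(a & b) / len(union)


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between targets and predictions."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def relative_change(baseline, new):
    """Signed change of `new` relative to `baseline` (negative means a reduction)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (new - baseline) / baseline


# Toy values, not data from the paper.
similarity = tanimoto({1, 2, 3}, {2, 3, 4})                      # 2 shared bits / 4 total
mae = mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.0])
mae_change = relative_change(baseline=2.0, new=1.5)
```

Under these assumptions, a "9.7% MAE reduction" would correspond to `relative_change` returning about -0.097 when the baseline method's MAE is passed as `baseline`.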

Why it matters

The paper argues that molecular substructures behave like chemically meaningful “subwords.” By tokenizing at that level and explicitly modeling both intra-substructure semantics and inter-substructure dependencies, S2Token gives language models a more transferable representation space for unseen molecules.


Recommended citation: Runze Wang, Zijie Xing, *Xingyue Liu*, Mingqi Yang, Che He, Yanming Shen, “From Graphs to Tokens: Substructure-Aware Molecular Representation for Large Language Models,” Information Processing & Management, vol. 63, no. 6, p. 104771, Sep. 2026.