TL;DR: We reframed text chunking as a global optimization problem and built LASER (Least Action Semantic Router). It places first on every retrieval benchmark we tested — FinanceBench, CUAD, MSMARCO, HotpotQA — using a single configuration and the cheapest embedding model available. No per-dataset tuning. No LLM calls. Just math.

pip install lasr
from lasr import chunk

chunks = chunk(document)  # that's it

The problem nobody talks about

Every RAG pipeline has a chunking step. You take documents, split them into pieces, embed the pieces, and retrieve the most relevant ones at query time.

Most teams spend weeks choosing their vector database, days evaluating embedding models, and about fifteen minutes on chunking — usually settling for whatever LangChain's RecursiveCharacterTextSplitter does by default.

This is backwards. Chunking determines what your retrieval system can find. If a chunk splits a critical clause in half, no embedding model or reranker will recover that information. If your chunks are too small, you lose context. Too large, you dilute the signal. The boundaries matter more than most people realize.

And yet the most widely-used chunking methods are embarrassingly simple: split every N tokens, or split when a sliding window's cosine similarity drops below a threshold. These approaches are local and greedy — they decide each boundary independently, without considering how that decision affects the rest of the document.
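To make the contrast concrete, here is a toy version of that greedy splitter. The function names and the 0.8 threshold are illustrative, not any particular library's API:

```python
def cosine(a, b):
    """Cosine similarity between two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb)

def threshold_split(vecs, threshold=0.8):
    """Greedy splitter: cut wherever adjacent-sentence similarity dips.

    Each decision looks only at one local pair of sentence vectors, so a
    single noisy dip forces a boundary regardless of global structure.
    """
    boundaries, start = [], 0
    for i in range(1, len(vecs)):
        if cosine(vecs[i - 1], vecs[i]) < threshold:
            boundaries.append((start, i))
            start = i
    boundaries.append((start, len(vecs)))
    return boundaries
```

Once a dip triggers a cut, nothing downstream can undo it; that irreversibility is the core weakness the rest of this post is about.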

We asked a different question: what if you optimized all the boundaries at once?


Chunking as physics

LASER treats chunking as a global optimization problem. Instead of scanning through a document and making local split decisions, it considers every possible way to partition the text and selects the one that minimizes a global objective we call Action.

The intuition comes from physics. The Action has two terms:

Action = Tension + Cost

Tension measures semantic dispersion inside each chunk — how scattered the content is. A chunk that covers one coherent topic has low tension. A chunk that mixes unrelated paragraphs has high tension. The optimizer wants to minimize this.

Cost is a boundary penalty. Every split you introduce costs you something. The optimizer has to justify each boundary by showing it reduces tension enough to be worth the cost. This prevents over-fragmentation.

The balance between these two forces is controlled by a single parameter α (alpha). Higher α means splits are more expensive, producing fewer, larger chunks. Lower α produces more, smaller chunks.

The key insight: because the optimizer sees the entire document at once, it can make globally coherent decisions. It won't split a contract clause in half just because the local similarity dipped. It won't fragment a paragraph that needs to stay together. It finds boundaries that make sense for the whole document, not just for the local window.

The optimization is solved exactly using dynamic programming. No approximations, no heuristics, no stochastic sampling. Given the same document and parameters, LASER always produces the same output.
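Here is a minimal sketch of that dynamic program, under two stated assumptions: tension is the sum of squared distances to the chunk centroid, and each boundary costs a flat α. LASER's exact tension term may differ; the point is the shape of the exact O(n²) recurrence:

```python
def tension(vecs):
    """Sum of squared distances from each vector to the chunk centroid."""
    dim = len(vecs[0])
    mean = [sum(v[d] for v in vecs) / len(vecs) for d in range(dim)]
    return sum(sum((v[d] - mean[d]) ** 2 for d in range(dim)) for v in vecs)

def optimal_partition(vecs, alpha):
    """Exact minimum-Action partition of sentence vectors via DP.

    best[i] holds the minimum Action of any partition of vecs[:i];
    back[i] remembers where the last chunk of that partition starts.
    """
    n = len(vecs)
    best = [0.0] * (n + 1)
    back = [0] * (n + 1)
    for i in range(1, n + 1):
        best[i] = float("inf")
        for j in range(i):  # candidate last chunk is vecs[j:i]
            cost = best[j] + tension(vecs[j:i]) + (alpha if j > 0 else 0.0)
            if cost < best[i]:
                best[i], back[i] = cost, j
    # Recover chunk boundaries by walking the back-pointers.
    cuts, i = [], n
    while i > 0:
        cuts.append((back[i], i))
        i = back[i]
    return cuts[::-1]
```

Because the recurrence is exhaustive and deterministic, the same vectors and the same α always yield the same boundaries, and raising α visibly suppresses splitting.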


The results

We benchmarked LASER against seven established chunking methods across five retrieval datasets. Every single result below uses α = 2.5 and all-MiniLM-L6-v2, a 22-million parameter open-source embedding model with 384-dimensional vectors. This is deliberately the smallest, cheapest sentence embedding model in common use.

Main benchmark table

| Dataset | Domain | Docs | LASER Recall@5 | Next Best | Their Recall@5 | Δ |
|---|---|---|---|---|---|---|
| MSMARCO | Web passages | 500 | 0.999 | recursive | 0.985 | +0.014 |
| HotpotQA | Multi-hop QA | 500 | 0.974 | LlamaIndex SemanticSplitter | 0.972 | +0.002 |
| FinanceBench | SEC filings | 84 | 0.930 | LlamaIndex SemanticSplitter | 0.629 | +0.302 |
| CUAD | Legal contracts | 102 | 0.826 | LlamaIndex SemanticSplitter | 0.775 | +0.051 |
| QuALITY | Long-form articles | 230 | 0.057 | LlamaIndex SemanticSplitter | 0.052 | +0.005 |

LASER is first on every benchmark. On FinanceBench, the margin is 30 percentage points. On MSMARCO, it misses one passage out of 500 queries.

QuALITY is the exception that proves the rule — all methods score near zero because the questions require literary inference that no sentence embedding model can capture. When the embedding model can't bridge the gap between question and answer, no chunking strategy helps. The spread across all eight chunkers is under 2 percentage points.
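For readers less familiar with the metrics in the tables below, here is how Recall@k and MRR@k are conventionally computed. This is a simplified sketch; the benchmark harness's exact protocol may differ:

```python
def recall_at_k(ranked, relevant, k=5):
    """1.0 if any relevant chunk id appears in the top k results, else 0.0."""
    return 1.0 if any(c in relevant for c in ranked[:k]) else 0.0

def mrr_at_k(ranked, relevant, k=5):
    """Reciprocal rank of the first relevant chunk in the top k (0.0 if none)."""
    for rank, c in enumerate(ranked[:k], start=1):
        if c in relevant:
            return 1.0 / rank
    return 0.0
```

Scores are averaged over all queries, so a Recall@5 of 0.999 on 500 queries corresponds to a single miss.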

Full FinanceBench results (84 SEC filings, k=5)

| Chunker | Recall@5 | MRR@5 | NDCG@5 | Avg Chunks |
|---|---|---|---|---|
| LASER | 0.930 | 0.899 | 0.907 | 1.57 |
| LlamaIndex SemanticSplitter | 0.629 | 0.561 | 0.578 | 2.40 |
| contextual | 0.293 | 0.250 | 0.261 | 7.01 |
| recursive | 0.232 | 0.185 | 0.197 | 7.01 |
| fixed-size | 0.218 | 0.176 | 0.186 | 6.83 |
| semantic-threshold | 0.204 | 0.175 | 0.182 | 12.98 |
| hierarchical | 0.188 | 0.157 | 0.124 | 21.15 |
| late-chunking | 0.063 | 0.031 | 0.039 | 10.85 |

LASER achieves 93% recall with an average of 1.57 chunks per document. It's looking at 84 SEC filings and deciding that most of them need at most two chunks. And it's right — the relevant passage almost always lands in the top result (MRR 0.899).

Full CUAD results (102 legal contracts, k=5)

| Chunker | Recall@5 | MRR@5 | NDCG@5 | Avg Chunks |
|---|---|---|---|---|
| LASER | 0.826 | 0.557 | 0.610 | 11.43 |
| LlamaIndex SemanticSplitter | 0.775 | 0.509 | 0.563 | 12.18 |
| contextual | 0.597 | 0.486 | 0.495 | 61.95 |
| recursive | 0.538 | 0.379 | 0.400 | 61.95 |
| semantic-threshold | 0.527 | 0.371 | 0.392 | 142.04 |
| fixed-size | 0.461 | 0.310 | 0.331 | 58.84 |
| hierarchical | 0.460 | 0.346 | 0.231 | 183.47 |
| late-chunking | 0.280 | 0.215 | 0.196 | 91.98 |

LASER beats the next best (LlamaIndex SemanticSplitter) while using fewer chunks and no LLM inference. Notice the chunk counts: semantic-threshold produces 142 chunks per contract. Hierarchical produces 183. LASER produces 11. Fewer, better chunks.

Full MSMARCO results (500 web passages, k=5)

| Chunker | Recall@5 | MRR@5 | NDCG@5 | Avg Chunks |
|---|---|---|---|---|
| LASER | 0.999 | 0.900 | 0.926 | 1.70 |
| recursive | 0.985 | 0.666 | 0.747 | 4.60 |
| contextual | 0.972 | 0.687 | 0.758 | 4.60 |
| LlamaIndex SemanticSplitter | 0.901 | 0.704 | 0.755 | 3.07 |
| fixed-size | 0.838 | 0.572 | 0.640 | 4.41 |
| hierarchical | 0.739 | 0.555 | 0.381 | 13.56 |
| late-chunking | 0.353 | 0.207 | 0.241 | 7.03 |
| semantic-threshold | 0.293 | 0.184 | 0.211 | 22.48 |

MSMARCO is the benchmark everyone in information retrieval recognizes, and LASER essentially solves it — 0.999 recall means one missed passage in 500 queries. The MRR of 0.900 means the correct chunk is almost always ranked first.


LASER adapts its granularity automatically

One of the most interesting emergent behaviors: LASER automatically discovers the right level of granularity for each domain, without being told anything about the document type.

| Dataset | Domain | Avg Chunks | What LASER decided |
|---|---|---|---|
| HotpotQA | Wikipedia paragraphs | 1.0 | Don't split — paragraphs are already atomic |
| FinanceBench | SEC filings | 1.57 | Minimal splitting — filings have clear sections |
| MSMARCO | Web passages | 1.70 | Minimal splitting — passages are mostly coherent |
| CUAD | Legal contracts | 11.43 | Clause-level splitting — contracts need segmentation |
| QuALITY | Long articles | 13.30 | Section-level splitting — articles have topical structure |

Same algorithm, same parameters. The optimizer looks at each document's internal semantic structure and makes globally optimal boundary decisions. Short, coherent documents get kept whole. Long, multi-topic documents get segmented at natural topic boundaries.

This is not a special case or a fallback — it's the same optimization producing different outputs because the inputs have different structure. When the boundary penalty exceeds the tension reduction from splitting, the optimizer doesn't split. When splitting reduces tension enough to justify the cost, it does.


The α parameter: one knob to control everything

The boundary penalty α controls the tradeoff between granularity and coherence. Higher α means fewer, larger chunks. Lower α means more, smaller chunks.

We swept α across our benchmarks to map the cost-quality frontier:

FinanceBench α sweep (84 filings)

| α | Recall@5 | Avg Chunks |
|---|---|---|
| 1.0 | 0.893 | 2.79 |
| 1.5 | 0.924 | 1.93 |
| 2.0 | 0.932 | 1.68 |
| 2.5 | 0.930 | 1.57 |

FinanceBench is remarkably insensitive to α. Even at α=1.0, LASER crushes every baseline. The optimizer finds the right boundaries regardless of how much you penalize splitting, because the documents have such clear topical structure.

CUAD α sweep (102 contracts)

| α | Recall@5 | Avg Chunks |
|---|---|---|
| 1.0 | 0.685 | 27.6 |
| 1.5 | 0.762 | 16.6 |
| 2.0 | 0.806 | 13.0 |
| 2.5 | 0.826 | 11.4 |

Legal contracts show clear α sensitivity. At α=1.0, LASER over-fragments — 27.6 chunks per contract means it's splitting mid-clause. As α increases, chunks consolidate into coherent clauses and recall climbs. LASER crosses LlamaIndex SemanticSplitter (0.775) around α≈1.8 and keeps climbing.

Even at LASER's worst α (1.0 on CUAD), it still beats every baseline except LlamaIndex SemanticSplitter. You can't misconfigure it badly enough to lose to fixed-size or recursive splitting.

For most use cases, α=2.5 works well across domains.


Context bleed: a small trick that matters

When LASER finds the optimal partition, each chunk gets a 1-sentence "bleed" from its neighbors — one sentence of context before and one sentence after. The core text stays pure (for embedding), but the enriched version is available via chunk.text_with_context for vector database insertion.

This costs nothing computationally and helps retrieval systems that benefit from slightly overlapping context windows. The DP-optimal boundaries are not affected — bleed is applied after optimization, purely for downstream enrichment.
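As a sketch of what bleed does, each stored chunk borrows one sentence from each neighbor while the embedded core stays untouched. The helper below is hypothetical, not LASER's internal code; it assumes you already have the sentence list and the DP-optimal (start, end) boundaries:

```python
def with_bleed(sentences, boundaries):
    """Attach a 1-sentence bleed on each side of every chunk.

    Returns (text_with_context, core_text) pairs: the core is what you
    embed; the enriched text is what you store in the vector database.
    """
    enriched = []
    for start, end in boundaries:
        core = " ".join(sentences[start:end])
        pre = sentences[start - 1] if start > 0 else ""
        post = sentences[end] if end < len(sentences) else ""
        enriched.append((" ".join(filter(None, [pre, core, post])), core))
    return enriched
```

Because the bleed is computed from fixed boundaries after the fact, it cannot change which partition the optimizer selects.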


What the benchmarks reveal about existing methods

Our results expose a few things the chunking community doesn't talk about:

Semantic-threshold chunkers fail catastrophically on real documents. Greg Kamradt's approach (split when cosine similarity drops) is widely used, but it produced 142 chunks per contract on CUAD, 263 chunks per article on our segmentation benchmark, and consistently scored near the bottom. Local similarity dips are terrible signals for topic boundaries in structured text.

Our implementation of late chunking scored poorly across all benchmarks. Despite significant interest in the technique, the approach — running full documents through a transformer and pooling token-level hidden states within fixed-size spans — produced 0.063 recall on FinanceBench, 0.280 on CUAD, 0.353 on MSMARCO. We implemented the core principle (contextual token embeddings pooled per chunk) but not Jina's specific model or ColBERT-style late interaction, so these results reflect the general approach rather than any specific product.
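For clarity, the pooling step our description refers to looks roughly like this. The token vectors are assumed to come from a full-document transformer pass, and the helper is a simplified stand-in, not any vendor's implementation:

```python
def pool_spans(token_vecs, span_size):
    """Mean-pool contextual token vectors within fixed-size spans.

    Each span of `span_size` tokens becomes one chunk embedding; the
    final span may be shorter if the document length is not a multiple.
    """
    dim = len(token_vecs[0])
    chunks = []
    for s in range(0, len(token_vecs), span_size):
        span = token_vecs[s:s + span_size]
        chunks.append([sum(v[d] for v in span) / len(span) for d in range(dim)])
    return chunks
```

Note that the spans themselves are still fixed-size: the token embeddings are contextual, but the boundaries are not, which is one plausible reason this family of methods struggled on our benchmarks.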

Embedding-based splitting (LlamaIndex SemanticSplitter) was our strongest competitor. It uses percentile-based dissimilarity thresholds with sentence buffering and consistently placed second. On CUAD and FinanceBench, it significantly outperformed simpler methods. However, LASER beats it on every benchmark while requiring only a single parameter (α) instead of buffer sizes and percentile thresholds.

Fixed-size splitting is more competitive than people think. On MSMARCO, recursive splitting hits 0.985 recall — nearly matching LASER. For short, well-structured web passages, naive splitting actually works. The gap only appears on longer, structured documents where boundary placement matters.


All of this uses a free, open-source embedding model

Every result in this post uses all-MiniLM-L6-v2 — a 22-million parameter model with 384-dimensional embeddings. It's the smallest, cheapest, fastest sentence embedding model in common use. People typically use it when they need speed and don't care about quality.

LASER with this tiny embedding model outperforms methods that cost orders of magnitude more per document.

This tells you something important: the optimization framework is doing the heavy lifting, not the embedding model. When you optimize boundaries globally, even noisy, low-dimensional embeddings provide enough signal to find the right partition. Methods that make local, greedy boundary decisions need better signal to compensate for their suboptimal strategy.


When LASER doesn't help

We believe in reporting limitations honestly.

Literary comprehension (QuALITY): When questions require inferential reasoning over narrative text, embedding-based retrieval fails regardless of chunking. All methods scored under 6% recall. This is an embedding and retrieval paradigm limitation, not a chunking one.

Topic segmentation on very short text: On a Wikipedia section boundary detection benchmark (3,543 documents averaging 300 characters), LASER's boundary penalty prevents splitting because the documents are too short for tension reduction to justify a boundary. Dedicated segmentation methods that optimize for boundary detection rather than retrieval quality outperform LASER on this specific task. This is by design — LASER optimizes for retrieval, not for matching human-annotated boundaries.

Chunking latency: LASER requires computing sentence embeddings and running the DP optimization. On CUAD (102 contracts), the chunking step takes approximately 510 seconds versus 0.1 seconds for recursive splitting. The vast majority of this time (~99%) is spent on sentence encoding, not the DP solver itself — the dynamic programming runs in milliseconds. For offline indexing pipelines, this is irrelevant: you chunk once and query many times, and LASER's smaller index makes every subsequent query faster. For real-time streaming applications, the encoding latency may matter. GPU acceleration and batched encoding reduce it significantly.


Try it

pip install lasr
from lasr import chunk

# Default — works well across domains
chunks = chunk(document)

# Control granularity
chunks = chunk(document, alpha=2.5)

# Each chunk has context for richer retrieval
for c in chunks:
    c.text                # core DP-optimal text (for embedding)
    c.text_with_context   # with 1-sentence neighbor bleed (for storage)

LASER integrates with any pipeline that accepts a list of text chunks. Drop it in, replace your existing chunker, and measure the difference.


Star us on GitHub, try it on your documents, and tell us what breaks. We're particularly interested in results on domains we haven't tested — medical, scientific, technical documentation. If your retrieval pipeline matters, your chunking should too.