Beyond the Pin: Rethinking How We Annotate Visual Documents

Nearly every design tool, image editor, and document review platform handles collaborative feedback the same way: you click somewhere on the canvas, a little pin appears, and you type your comment. The pin remembers where you clicked. That's it. That's the whole mechanism. It has been for so long that we've stopped noticing how badly it fits the work it's supposed to support.

The fundamental problem is that pins anchor to coordinates when reviewers really want to anchor to things. "Adjust the contrast on this wheel" is a comment about a wheel, not about pixel (427, 318). The wheel might move. It might get cropped. It might get nested inside a new group. The intent of the comment stays the same; the coordinates do not. And so the pin drifts off, the comment loses context, and the reviewer's careful thinking turns into a question mark next to whatever happens to be at position (427, 318) after the layout changed.

This isn't a small UX gripe. It's a structural failure that compounds across every collaborative document, and it's worth thinking carefully about what a better model looks like.

Why pins were good enough for so long

It's worth being fair to the pin. When collaborative review tools first emerged, coordinate-anchored comments were the only thing that worked. The tools didn't understand the contents of the canvas. To them, a design file was a flat image. The only thing the system could reliably reference was the position you clicked. Pins were the highest-resolution interaction the underlying systems could support.

And for simple documents, pins worked fine. A single image, a single round of feedback, no major edits between drafts. The reviewer pinned, the designer responded, the pin stayed roughly where it should. Everyone moved on.

But documents got more complex. Reviewers got more demanding. Iteration cycles got faster. Cropping, reordering, regrouping, and reflowing became normal mid-review activities. And the gap between what pins could do and what reviewers needed got wider every year.

The actual failure mode

Here's a concrete example. A designer shares a canvas with three cars. A reviewer wants to leave specific feedback on the front wheel of the second car.

→ The pin version

Three pins. Pin 1 is near a wheel. Which car's wheel? Whichever was at those coordinates when the pin was placed.

→ The semantic version

The reviewer mentioned @car_2.front_wheel. The reference points to a thing, not a place.

Now imagine the designer edits the file. They reorder the cars. They crop the canvas to focus on one of them. They move the second car to a different position. With pins, the reviewer's comments are now stranded at coordinates that no longer mean anything. The pin near "the wheel" might now be next to empty space, or hovering over a completely different car.

The pin failed because it never actually understood what the reviewer was talking about. It only ever knew where they clicked, not what they were looking at.

Pins anchor to coordinates. Reviewers think in objects. The whole problem is in that gap.
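
To make the gap concrete, here is a minimal Python sketch. The `Element` class, the bounding boxes, and the element ids are illustrative assumptions, not any real tool's data model. It resolves the same comment two ways after the designer swaps two cars: once by stored coordinates, once by stored element id.

```python
from dataclasses import dataclass


@dataclass
class Element:
    element_id: str  # hypothetical stable id, e.g. "car_2.front_wheel"
    x: float         # left edge of the bounding box
    y: float         # top edge of the bounding box
    w: float
    h: float

    def contains(self, px: float, py: float) -> bool:
        return self.x <= px <= self.x + self.w and self.y <= py <= self.y + self.h


def resolve_pin(elements, px, py):
    """A pin only knows a point; return whichever element is there now."""
    for el in elements:
        if el.contains(px, py):
            return el.element_id
    return None  # the pin is hovering over empty space


def resolve_reference(elements, element_id):
    """A semantic reference looks the element up by id, wherever it sits."""
    for el in elements:
        if el.element_id == element_id:
            return el.element_id
    return None


# Version 1: the reviewer clicks the front wheel of the second car.
v1 = [
    Element("car_1.front_wheel", 100, 300, 40, 40),
    Element("car_2.front_wheel", 410, 300, 40, 40),
]
# Version 2: the designer swaps the two cars.
v2 = [
    Element("car_2.front_wheel", 100, 300, 40, 40),
    Element("car_1.front_wheel", 410, 300, 40, 40),
]

pin = (427, 318)
print(resolve_pin(v1, *pin))                       # car_2.front_wheel
print(resolve_pin(v2, *pin))                       # car_1.front_wheel (the wrong car)
print(resolve_reference(v2, "car_2.front_wheel"))  # car_2.front_wheel
```

The pin never reports an error after the swap. It simply points at a different wheel, which is why this failure is easy to miss.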

What it would mean to anchor to things instead

The alternative is straightforward to describe and harder to build: instead of letting reviewers click on coordinates, give them a way to reference the actual visual elements they're talking about. Not pixels, but objects. Not regions, but semantically identified things with stable identities that survive editing.

The reviewer's comment shouldn't be "place this pin at (427, 318)." It should be something more like:

→ A comment, semantically anchored

Reviewer (2 minutes ago): Can we adjust the contrast on @car_2.front_wheel? It's getting lost against the road. Same goes for @car_2.back_wheel.

Now the comment references the wheel, not that spot. The system knows which wheel because it has a stable identifier for it. Hovering the reference in the comment highlights the wheel on the canvas. Editing the document, even significantly, doesn't break the reference, because the reference was never about coordinates in the first place.

This isn't a UI tweak. It's a completely different model of what an annotation is.

Building it: the four problems to solve

Making this real requires solving four distinct problems, each interesting in its own right.

Segment the canvas into elements

The system needs to know that there are three cars, not just a soup of pixels. This is a classic computer vision problem, and modern segmentation models (the SAM family, or anything else that produces high-quality masks) handle it well for many kinds of content. Run the model over the canvas, get back a set of masks, and you have your initial vocabulary of "things on the page."

The interesting twist is that you don't want one flat list of segments. You want a hierarchy.

Build a hierarchy of compositional relationships

A car contains wheels. A wheel contains a hubcap. The reviewer might want to comment on any level of this hierarchy: the car, the wheel, or the hubcap. The system needs to know all three exist and how they relate.

The approach: segment recursively. Run the segmentation model on the full canvas to find top-level elements. For each one, run it again on just that region to find sub-elements. Repeat until nothing useful comes back. Parent–child relationships fall out naturally from spatial containment plus visual similarity within the same semantic family.
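
The recursion can be sketched in a few lines of Python. Everything named here is an assumption made for illustration: `fake_segment` and `FAKE_RESULTS` stand in for a real segmentation model, and `max_depth` is a safety guard the text does not mention. The sketch covers only the containment half of the idea; the visual-similarity check is left out.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height) in canvas coordinates


@dataclass
class Node:
    label: str
    box: Box
    children: List["Node"] = field(default_factory=list)


# A segmenter takes a region and returns the (label, box) pairs found inside it.
Segmenter = Callable[[Box], List[Tuple[str, Box]]]


def build_hierarchy(label: str, box: Box, segment: Segmenter, max_depth: int = 4) -> Node:
    """Segment a region, then recurse into each element that was found."""
    node = Node(label, box)
    if max_depth == 0:
        return node
    for child_label, child_box in segment(box):
        node.children.append(build_hierarchy(child_label, child_box, segment, max_depth - 1))
    return node


# Stand-in for a real model: a fixed lookup from region to the elements inside it.
FAKE_RESULTS = {
    (0, 0, 800, 400): [("car", (50, 200, 300, 120)), ("car", (450, 200, 300, 120))],
    (50, 200, 300, 120): [("wheel", (80, 280, 40, 40)), ("wheel", (280, 280, 40, 40))],
    (80, 280, 40, 40): [("hubcap", (90, 290, 20, 20))],
}


def fake_segment(region: Box) -> List[Tuple[str, Box]]:
    # An empty list means "nothing useful came back", which ends the recursion.
    return FAKE_RESULTS.get(region, [])


def render(node: Node, depth: int = 0) -> List[str]:
    """Flatten the tree into indented lines for display."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


scene = build_hierarchy("scene", (0, 0, 800, 400), fake_segment)
print("\n".join(render(scene)))
```

In this toy setup the parent of each element is simply the region it was found in, so containment is true by construction. A real pipeline would also need to deduplicate overlapping masks, which this sketch does not attempt.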

Give every element a stable identity

This is the hardest part. The reviewer's comment is going to live for weeks. The document is going to change. The system needs to identify "car 2" as the same car 2 even after it's moved, resized, or partially re-rendered.

The technique is content fingerprinting: combine perceptual hashes, shape features, and semantic class outputs into a fingerprint that describes what the element is rather than where it is. When the document changes, run segmentation again and match new elements to old fingerprints with a similarity score. High match: it's the same car, just somewhere else now. The identifier sticks.
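
Here is one way such a fingerprint could look, as a hedged sketch rather than a production recipe. It uses an average hash (a simple perceptual hash that thresholds each pixel against the mean brightness), an aspect ratio as a crude shape feature, and a class label assumed to come from a classifier. The `Fingerprint` fields and the 0.7/0.3 weights are arbitrary illustrative choices.

```python
from dataclasses import dataclass
from typing import List


def average_hash(pixels: List[List[int]]) -> int:
    """One bit per pixel, set when the pixel is brighter than the mean.

    `pixels` is assumed to be a small grayscale thumbnail of the element.
    """
    flat = [p for row in pixels for p in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for p in flat:
        bits = (bits << 1) | (1 if p > mean else 0)
    return bits


@dataclass(frozen=True)
class Fingerprint:
    semantic_class: str   # e.g. "wheel", as reported by a classifier
    phash: int            # perceptual hash of the thumbnail
    n_bits: int           # number of bits in the hash
    aspect_ratio: float   # width / height


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Score in [0, 1]. Position is deliberately not an input."""
    if a.semantic_class != b.semantic_class or a.n_bits != b.n_bits:
        return 0.0
    hamming = bin(a.phash ^ b.phash).count("1")
    hash_score = 1.0 - hamming / a.n_bits
    shape_score = min(a.aspect_ratio, b.aspect_ratio) / max(a.aspect_ratio, b.aspect_ratio)
    return 0.7 * hash_score + 0.3 * shape_score


thumb_a = [[10, 10, 200, 200], [10, 10, 200, 200], [10, 200, 200, 10], [10, 10, 10, 10]]
# Same pattern, slightly brighter: the kind of change a re-render might cause.
thumb_b = [[p + 15 for p in row] for row in thumb_a]
# A visibly different pattern.
thumb_c = [[200, 200, 10, 10], [200, 200, 10, 10], [10, 10, 10, 10], [200, 10, 10, 200]]

wheel = Fingerprint("wheel", average_hash(thumb_a), 16, 1.0)
moved_wheel = Fingerprint("wheel", average_hash(thumb_b), 16, 1.0)
other_wheel = Fingerprint("wheel", average_hash(thumb_c), 16, 1.0)

print(similarity(wheel, moved_wheel) > 0.9)  # True
print(similarity(wheel, other_wheel) < 0.5)  # True
```

Because the fingerprint contains no coordinates, moving the element cannot change the score. Resizing and re-rendering can, which is why the match is a similarity rather than an equality test.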

Expose the elements through hierarchical references

Once the system has a stable, hierarchical vocabulary of things, the UI side is almost trivial. Comments support hierarchical references like @car_2.front_wheel. Autocomplete suggests elements as the reviewer types. Hovering a reference highlights the element on the canvas. References can be nested arbitrarily deep.
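
The UI plumbing might look something like this sketch. The reference grammar (an `@` followed by dot-separated lowercase names) and the flat `KNOWN` list are assumptions made for the example; a real tool would read paths from its element hierarchy and would need a stricter grammar so that things like email addresses are not mistaken for references.

```python
import re
from typing import Dict, List

# Assumed syntax: "@" followed by dot-separated names made of a-z, 0-9, and "_".
REFERENCE = re.compile(r"@([a-z0-9_]+(?:\.[a-z0-9_]+)*)")

# Hypothetical element paths known to the system.
KNOWN = [
    "car_1", "car_1.front_wheel", "car_1.back_wheel",
    "car_2", "car_2.front_wheel", "car_2.back_wheel",
    "car_3",
]


def extract_references(comment: str) -> List[str]:
    """Return every reference path in the comment, in order of appearance."""
    return REFERENCE.findall(comment)


def resolve(comment: str, known: List[str]) -> Dict[str, bool]:
    """Map each reference to True if it names a known element, else False."""
    return {ref: ref in known for ref in extract_references(comment)}


def autocomplete(prefix: str, known: List[str]) -> List[str]:
    """Suggest element paths that start with what was typed after '@'."""
    return sorted(p for p in known if p.startswith(prefix))


comment = (
    "Can we adjust the contrast on @car_2.front_wheel? "
    "It's getting lost against the road. Same goes for @car_2.back_wheel."
)
print(extract_references(comment))      # ['car_2.front_wheel', 'car_2.back_wheel']
print(autocomplete("car_2.", KNOWN))    # ['car_2.back_wheel', 'car_2.front_wheel']
```

Note that the sentence-ending period after the second reference is not swallowed, because the pattern only accepts a dot when another name follows it.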

What the hierarchy actually looks like

The end result is a tree (or really, a graph) of named elements that the reviewer can reference at any granularity:

→ A canvas, semantically described

The element graph behind the comment

@scene  root
    @car_1  car, ordinal=1
        @car_1.front_wheel  wheel, position=front
        @car_1.back_wheel  wheel, position=back
    @car_2  car, ordinal=2
        @car_2.front_wheel  wheel, position=front
        @car_2.back_wheel  wheel, position=back
    @car_3  car, ordinal=3

The path through the hierarchy becomes the identifier. The handle @car_2.front_wheel stays stable across most reasonable edits, because those edits leave the content fingerprint of the second car's front wheel largely unchanged.

→ Why this generalizes

The same model applies far beyond cars. A web design has section → component → button → label. A photograph has scene → person → face → eye. A medical image has region → organ → lesion. Any visual document with compositional structure benefits from the same hierarchical, content-addressable approach.

Surviving change

The whole point of this model is that comments stay anchored to the right things even when the document changes. Let's be specific about what "change" means and why each kind is handled gracefully:

Edit type	What happens with pins	What happens with semantic references
Element repositioned	Pin stays at old coordinates, becomes orphaned	Fingerprint matches; identifier follows the element
Element resized	Pin no longer aligns with intended target	Shape and semantic features still match the fingerprint
Canvas cropped	Pins outside the crop disappear; pins inside drift	Visible elements keep their identifiers; cropped elements are flagged but not lost
Element regrouped	No effect — pins don't understand grouping	Hierarchy updates; references follow into the new group
Layout reflowed	Most pins end up in the wrong place	Identifiers stable; references resolve correctly
Similar new element added	No effect, but adjacent pins become confusing	Disambiguation through ordinals or context; new element gets its own identifier

The mechanism that makes all of this work is the fingerprint matching. When the document is edited, segmentation runs again and produces a fresh hierarchy. Each new element gets matched to the closest fingerprint from the previous version, using a similarity score that combines visual descriptors with semantic class. If the match is above threshold, the identifier is preserved. If not, it's a new element.
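
A rough sketch of that matching step follows. The details are assumptions: fingerprints are reduced to a class label plus a bit string, similarity is the fraction of matching bits, and matching is greedy and one-to-one (the text only says "closest", so the one-to-one constraint is my addition). The `new_N` ids are placeholders for whatever naming scheme the system uses.

```python
from typing import Dict, List, Tuple

# (semantic class, bit string standing in for the visual descriptors)
Fingerprint = Tuple[str, str]


def similarity(a: Fingerprint, b: Fingerprint) -> float:
    """Fraction of matching bits, or 0.0 when the semantic classes differ."""
    if a[0] != b[0] or len(a[1]) != len(b[1]):
        return 0.0
    same = sum(1 for x, y in zip(a[1], b[1]) if x == y)
    return same / len(a[1])


def carry_over_ids(
    old: Dict[str, Fingerprint],
    new: List[Fingerprint],
    threshold: float = 0.8,
) -> Tuple[List[str], List[str]]:
    """Return (one id per new element, old ids that found no match)."""
    # Score every old/new pair, then walk the pairs from best to worst.
    pairs = sorted(
        (
            (similarity(fp_old, fp_new), old_id, i)
            for old_id, fp_old in old.items()
            for i, fp_new in enumerate(new)
        ),
        key=lambda t: t[0],
        reverse=True,
    )
    assigned: Dict[int, str] = {}
    used = set()
    for score, old_id, i in pairs:
        if score < threshold:
            break  # everything after this is below threshold too
        if old_id in used or i in assigned:
            continue  # each id and each element is matched at most once
        assigned[i] = old_id
        used.add(old_id)

    ids: List[str] = []
    fresh = 0
    for i in range(len(new)):
        if i in assigned:
            ids.append(assigned[i])
        else:
            fresh += 1
            ids.append(f"new_{fresh}")  # below threshold: treat as a new element
    lost = [old_id for old_id in old if old_id not in used]
    return ids, lost


old = {"car_1": ("car", "11110000"), "car_2": ("car", "00001111")}
# After the edit: car_2 changed slightly and now comes first; a third car was added.
new = [("car", "00001101"), ("car", "11110000"), ("car", "10101010")]

ids, lost = carry_over_ids(old, new)
print(ids)   # ['car_2', 'car_1', 'new_1']
print(lost)  # []
```

The `lost` list matters as much as the matches: those are the identifiers whose comments need attention rather than silent reassignment.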

This is fundamentally the same idea as content-addressable storage in version control. Git stores a file's contents under a hash of those contents and records the path separately, so it doesn't matter where the file lives in your repository tree. Move the file, rename it, restructure the project: the content hash is unchanged, and Git can still tell what changed and what didn't. The proposal here is essentially Git's model applied to visual documents: address things by what they are, not by where they sit.
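
The Git half of the analogy is easy to check. With Git's SHA-1 object format, a blob id is the SHA-1 of a short header (`blob`, a space, the byte length, a NUL byte) followed by the file contents. The file paths and SVG contents below are made up for the example.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Object id Git assigns to file contents under the SHA-1 object format."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


# The path is not an input, so a moved or renamed file keeps the same blob id.
wheel_svg = b"<svg>wheel</svg>\n"
tree_before = {"src/wheel.svg": git_blob_id(wheel_svg)}
tree_after = {"assets/cars/wheel.svg": git_blob_id(wheel_svg)}

print(tree_before["src/wheel.svg"] == tree_after["assets/cars/wheel.svg"])  # True
print(git_blob_id(b""))  # e69de29bb2d1d6434b8b29ae775ad8c2e48c5391 (Git's empty blob)
```

The analogy has a limit worth keeping in mind: Git's hash is exact, so a single changed byte produces a different id, while fingerprint matching on visual elements has to tolerate small changes and is therefore approximate.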

Where this gets interesting

Once you have a stable, hierarchical vocabulary of things in a document, a lot of features become natural that were previously hard:

Cross-document references. "Match the styling of @design_v2.hero.cta_button." If both documents share semantic vocabulary, you can reference elements across them.

Versioned comments. Because identifiers are stable, you can see the full history of comments on a single element across all versions of the document. "Everything that was ever said about @car_2.front_wheel" becomes a coherent thread.

Bulk operations. "Apply this change to all @*.front_wheel." Pattern-matching against the hierarchy lets reviewers and designers work with classes of elements, not just instances.

Better AI assistance. If an LLM-based assistant is helping with design review, it can understand and produce comments grounded in semantic references rather than pixel coordinates. The whole conversation becomes more precise.

Accessibility. A semantic vocabulary of canvas elements is, almost for free, the basis for a better screen reader experience. Visually impaired reviewers can navigate the hierarchy rather than guessing at pin positions.
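
As an illustration of the bulk-operations idea, patterns like @*.front_wheel can be matched with ordinary glob rules. This sketch uses Python's standard `fnmatch` module; the `ELEMENTS` list and the `select` helper are hypothetical.

```python
from fnmatch import fnmatchcase
from typing import List

# Hypothetical element paths from the hierarchy.
ELEMENTS = [
    "car_1", "car_1.front_wheel", "car_1.back_wheel",
    "car_2", "car_2.front_wheel", "car_2.back_wheel",
    "car_3",
]


def select(pattern: str, elements: List[str]) -> List[str]:
    """Return element paths matching a glob-style pattern such as '@*.front_wheel'."""
    glob = pattern.lstrip("@")
    return [e for e in elements if fnmatchcase(e, glob)]


print(select("@*.front_wheel", ELEMENTS))  # ['car_1.front_wheel', 'car_2.front_wheel']
print(select("@car_2.*", ELEMENTS))        # ['car_2.front_wheel', 'car_2.back_wheel']
```

One design caveat: in `fnmatch`, `*` also matches dots, so `@*.front_wheel` would match a front wheel at any depth. If a pattern should match exactly one level, the matcher needs to split paths on dots and compare segment by segment.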

The tradeoffs to be honest about

This isn't a free win. There are real costs:

Segmentation quality is everything. If the model can't reliably identify the elements in your document, the whole system falls apart. For domains where segmentation is well-solved (natural images, common UI patterns), this works. For domains where it isn't (highly abstract art, unusual schematics, novel illustrations), the approach has to fall back gracefully.

Hierarchy disagreements are real. Reviewers and the system may disagree about what counts as an element. Is a button's label a separate element, or part of the button? Different domains want different answers, and the system needs to be tolerant enough to handle either.

Compute cost. Recursive segmentation and fingerprint matching aren't free, especially on large canvases. The user-perceived experience needs to feel instant, which means careful work on incremental updates rather than full recomputes.

Identifier stability under aggressive edits. If you redraw the entire scene from scratch, the system has no way to know that the new car is "the same" car as the old one, and probably shouldn't pretend it does. There's a graceful-degradation story to design around: when the fingerprint match falls below the confidence threshold, flag the comment as needing reassociation rather than silently breaking.
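
The graceful-degradation behavior from the last point could be modeled roughly like this. The `Anchor` fields, the `seg_N` segment names, and the 0.8 threshold are all assumptions for the sake of the example.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Optional


class AnchorState(Enum):
    RESOLVED = "resolved"
    NEEDS_REASSOCIATION = "needs_reassociation"


@dataclass
class Anchor:
    element_id: str                   # the stable id the comment refers to
    state: AnchorState
    segment: Optional[str] = None     # segment in the new version it points at, if any
    suggestion: Optional[str] = None  # closest candidate, shown to a human, never applied silently


def reanchor(element_id: str, candidates: Dict[str, float], threshold: float = 0.8) -> Anchor:
    """`candidates` maps segments in the new version to a similarity score in [0, 1]."""
    if not candidates:
        return Anchor(element_id, AnchorState.NEEDS_REASSOCIATION)
    best_segment, best_score = max(candidates.items(), key=lambda kv: kv[1])
    if best_score >= threshold:
        return Anchor(element_id, AnchorState.RESOLVED, segment=best_segment)
    # Below threshold: keep the comment, flag it, and offer the best guess.
    return Anchor(element_id, AnchorState.NEEDS_REASSOCIATION, suggestion=best_segment)


ok = reanchor("car_2.front_wheel", {"seg_4": 0.93, "seg_9": 0.41})
redrawn = reanchor("car_2.front_wheel", {"seg_1": 0.35})

print(ok.state.value, ok.segment)               # resolved seg_4
print(redrawn.state.value, redrawn.suggestion)  # needs_reassociation seg_1
```

The key choice is that a low-confidence match never changes where the comment points. It only changes the comment's state, so a person makes the final call.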

None of these are dealbreakers. They're the engineering challenges that make the difference between a demo and a product.

The bigger pattern

Step back from the specifics and there's a broader principle worth naming. We've spent decades building collaborative tools that anchor information to positions: pixel coordinates, file paths, line numbers, page numbers. Every one of these breaks when the underlying thing being referenced changes location.

The systems that survive change well are the ones that anchor to content. Git uses content hashes. Modern databases use logical IDs that survive physical reorganization. Hypermedia survives URL changes through redirects and canonical references. Each of these solved the "anchor to a thing, not a place" problem in its own domain.

Visual collaboration tools are overdue for the same shift. Pins were the temporary solution we built when the systems couldn't see what they were holding. The systems can see now. The annotations should be able to point at what they actually mean.

The pin worked because it was the best we could do at the time. The best we can do now is better. Anchoring to things instead of places isn't a feature; it's the right model finally catching up with the work.