Product misattribution in merchandising data

Most merchandising teams are sitting on more product data than ever before. The raw material for better decisions is there. But when that data is used for assortment planning, the numbers often don't quite add up. The likely culprit: the same product appearing in multiple places.

This is product misattribution: the condition in which the same product is recorded under different names, codes, or identifiers across systems, with no reliable link between them. It is not the only reason merchandising numbers drift. Calendar cuts, gross-versus-net treatment, and channel-specific exclusions all play a part. But misattribution is a structural cause with a structural fix, and it tends to compound the effect of every other data quality problem a planning team faces.

What makes misattribution distinctive is that it has two separate causes, each operating at a different point in the product lifecycle. Understanding both is the starting point for fixing it properly.

  

Two sources of misattribution

Product misattribution does not have a single origin. It arises from two distinct mechanisms that compound each other across the commercial lifecycle of every product you range.

SOURCE 1:  Product creation misattribution
  • Originates during product development, before a unit is manufactured

  • Caused by duplicate records, naming variations, and data inconsistencies introduced in PLM and ERP systems during ranging and buying

  • Gets locked in at the PLM-to-ERP handoff and persists throughout the product's commercial life

SOURCE 2:  System break misattribution
  • Accumulates as the product travels through the commercial value chain to the customer and back

  • Caused by external identifiers applied by channel partners, marketplaces, logistics providers, and returns systems, each under their own data conventions

  • Compounds at every system boundary and arrives back in your data environment in forms that do not map to the internal product record

Both sources produce the same planning consequence: fragmented records that cannot be counted, compared, or analysed with confidence.


Source 1: Product creation misattribution

In a product-led business, a new product exists as a data record long before it exists as a physical object. That record is created early in the development process, often six to eighteen months before the season, by a product line manager or category merchant working in a PLM system. It is then built out iteratively by multiple people across design, technical, and buying functions over many months.

This is where misattribution enters. Duplicate records are created when someone cannot find an existing entry and creates a new one. Naming variations accumulate when different team members apply their own conventions to the same attributes: "Grey Marl" versus "Marl Grey", "Knitwear" versus "Knit". These are not errors in the conventional sense, merely variations the system has no mechanism to prevent.

The critical moment is the PLM-to-ERP handoff. When a product becomes commercially live, its records transfer from the PLM into the ERP, where purchase orders are raised and stock is allocated. The two systems have different data structures and field lengths, so truncations and reformats occur in the transfer. Duplicate records that existed in the PLM may or may not be caught before they cross over. Once a record is in the ERP with a SKU and purchase orders attached, it is very difficult to change. Whatever state the data was in at the point of transfer tends to persist throughout the product's commercial life.
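The truncation failure mode described above can be made concrete. A minimal sketch, assuming a hypothetical 20-character ERP description field; the field length and the naive copy logic are illustrative, not any specific ERP's behaviour:

```python
# Illustrative only: assume a hypothetical ERP description field capped at 20 characters.
ERP_DESC_MAX = 20

def to_erp_description(plm_name: str) -> str:
    """Mimic a naive PLM-to-ERP handoff that truncates to the ERP field length."""
    return plm_name[:ERP_DESC_MAX]

a = to_erp_description("Crew Neck Sweatshirt Grey Marl")
b = to_erp_description("Crew Neck Sweatshirt Marl Grey")

# Two distinct colourway records arrive in the ERP with identical descriptions,
# so downstream reporting can no longer tell them apart.
print(a, "|", b, "|", a == b)
```

Once purchase orders attach to the truncated record, this collision is the state that persists for the product's commercial life.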


Where PIM fits

Product Information Management (PIM) systems are designed to govern product content and control how it flows downstream. For organisations without one, implementing a PIM reduces the rate of product creation misattribution going forward. The important qualification is that PIM governs the data your own teams create, from the point of implementation forward. It does not retroactively resolve fragmentation already locked into the ERP, and it does not govern data arriving from outside the organisation. PIM prevents new fragmentation; it does not resolve the fragmentation that already exists or that originates outside it.

 

Source 2: System break misattribution

The second source operates entirely outside the brand's own systems. When a product crosses into an external data environment, it almost always acquires a different identity. Wholesale partners work with their own buyer codes. Marketplace platforms issue their own identifiers. Third-party logistics providers create their own warehouse references. Returns data arrives carrying whatever reference the external system used: the marketplace identifier, the buyer code, a plain-text description, or an order number.

The EAN or barcode is intended to serve as the universal constant across all these boundaries. In practice, EANs are duplicated across product generations, misapplied in production, or mapped inconsistently by channel partners. The same physical product accumulates a different identity at each boundary it crosses.

An example: the same product as it may appear across four systems:

  • E-commerce platform

    • "Grey Essential Crew Sweatshirt"

    • Platform SKU: EC-2024-0847

  • Wholesale portal

    • "SWEAT-CREW-GRY-AW24"

    • Buyer code: WHL-CR-490

  • Store EPOS

    • "Crew Neck Sweatshirt, Grey Marl"

    • EPOS code: 50091234

  • Returns system

    • "Grey Sweatshirt (Crew)"

    • Return ref: RTN-88821

Each of these is the same product. To any system trying to aggregate performance across channels, they are four separate things.
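The structural fix is an explicit crosswalk from each external identifier to one canonical internal record. A minimal sketch using the four identifiers from the example above; the canonical ID "STY-CREW-GRY" is invented for illustration:

```python
# Hypothetical canonical ID; the external identifiers are the ones from the example above.
CROSSWALK = {
    "EC-2024-0847": "STY-CREW-GRY",  # e-commerce platform SKU
    "WHL-CR-490":   "STY-CREW-GRY",  # wholesale buyer code
    "50091234":     "STY-CREW-GRY",  # store EPOS code
    "RTN-88821":    "STY-CREW-GRY",  # returns reference
}

def resolve(external_id: str):
    """Map an external identifier back to the internal product record, if known."""
    return CROSSWALK.get(external_id)

# All four channel records now aggregate under one product.
ids = {resolve(x) for x in ["EC-2024-0847", "WHL-CR-490", "50091234", "RTN-88821"]}
print(ids)
```

In practice this crosswalk is the output of entity resolution rather than a hand-maintained table; maintaining it manually is exactly the reconciliation burden described later.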

  

A necessary complication

Before describing the solution, it is worth touching on a complexity: in planning, "the same product" is not a fixed concept. It depends on the decision being made.

A merchandiser may need to analyse performance at style level for range architecture decisions, at style-colour level for option count and buy depth, at SKU level for size curve and replenishment, at franchise level for lifecycle and investment decisions, or at channel-exclusive variant level for assortment strategy. Two records that are technically the same physical item may not be the same planning entity. A colourway that is a core carryover in one region may be a seasonal exclusive in another. A size range that is standard in one channel may be truncated in another.

This matters for entity resolution because the grain at which records should be resolved depends on the decision the resolution is meant to support. Resolving too broadly, collapsing records that should remain distinct for planning purposes, is as damaging as leaving genuine duplicates unresolved. Over-merging can obscure channel-specific performance, flatten meaningful variant differences, and create planning aggregates that are numerically tidy but commercially misleading.
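The grain point can be made concrete: the same records resolve to a different number of planning entities depending on which attributes form the match key. The record fields below are illustrative:

```python
# Illustrative records: one style, two colourways, several sizes.
records = [
    {"style": "CREW-SWT", "colour": "Grey Marl", "size": "M"},
    {"style": "CREW-SWT", "colour": "Grey Marl", "size": "L"},
    {"style": "CREW-SWT", "colour": "Navy",      "size": "M"},
]

def entities(records, grain):
    """Distinct planning entities at a given grain (a tuple of key fields)."""
    return {tuple(r[f] for f in grain) for r in records}

print(len(entities(records, ("style",))))                   # 1 entity at style level
print(len(entities(records, ("style", "colour"))))          # 2 at style-colour level
print(len(entities(records, ("style", "colour", "size"))))  # 3 at SKU level
```

Resolving everything to style level here would be correct for range architecture and wrong for size-curve decisions, which is why the resolution grain has to follow the decision being supported.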

 

Where accurate attribution changes merchandising outcomes

The case for addressing misattribution is a planning accuracy argument. Inflated option counts carry ghost records into range reviews, distorting buy depth and OTB. Sell-through is calculated on a subset of sales when channel records do not resolve to the same product, making styles appear to underperform and driving pessimistic ranging decisions for the following season. Returns arrive under external identifiers that fail to map back to the internal SKU, making net performance figures unreliable. At franchise level, a core style that has accumulated different identifiers across seasons and channels cannot be tracked as a coherent commercial entity at all. All of this makes any future planning based on historical comps either unreliable or a time sink.

What accurate attribution changes in practice:

  • Option count: Ghost records removed. Range architecture reflects actual distinct styles.

  • Sell-through: All channel sales attributed to the same product. True cross-channel rate visible.

  • Returns: External return identifiers resolved back to internal product. Net performance accurate.

  • Franchise tracking: Carryover identity maintained across seasons. Lifecycle and investment decisions grounded.
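The sell-through effect is simple arithmetic. A sketch with illustrative figures, reusing two identifiers from the earlier four-system example, where one product's sales sit under two unlinked records:

```python
# Illustrative figures: a 1,000-unit buy whose sales are split across two records.
bought = 1000
sales_by_record = {"EC-2024-0847": 420, "50091234": 310}  # e-commerce and EPOS fragments

# Unresolved: each fragment is judged against the full buy, so both look weak.
for rec, sold in sales_by_record.items():
    print(rec, f"{sold / bought:.0%}")

# Resolved: the fragments are the same product, and the true rate emerges.
resolved_rate = sum(sales_by_record.values()) / bought
print(f"resolved: {resolved_rate:.0%}")  # 73% rather than 42% or 31%
```

A style showing 42% sell-through gets ranged down the following season; the same style at its true 73% does not, which is how misattribution turns into a buying error.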


The standard responses

Naming conventions, pattern-matching rules, and manual reconciliation are the three responses most teams reach for first. All have real value at the margins. None of them address misattribution at scale. 

Naming governance controls data your own teams create going forward. It has no reach over historical fragmentation, and no authority over data arriving from suppliers, partners, or external channels. Pattern-matching rules break quickly in product data, where names share so much vocabulary that rules generating correct matches also generate costly false positives, merging records that should remain distinct. Manual reconciliation is accurate but slow, and any tool running on the data is only as clean as the last time someone had the bandwidth to do it.
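The false-positive problem with pattern matching is easy to demonstrate: product names share so much vocabulary that a single similarity threshold cannot separate a naming variation from a genuinely different product. A sketch using Python's standard-library string matcher, with illustrative names:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Character-level similarity ratio between two product names."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

# A genuine naming variation of the same product (words reordered)...
same = similarity("Crew Neck Sweatshirt, Grey Marl", "Grey Marl Crew Neck Sweatshirt")

# ...scores lower than two genuinely different colourways sharing a prefix.
different = similarity("Crew Neck Sweatshirt, Grey Marl", "Crew Neck Sweatshirt, Navy")

print(f"same product: {same:.2f}, different products: {different:.2f}")
# Any single threshold either misses the first pair or wrongly merges the second.
```

This is why name-only rules "break quickly": the signal needed to separate the two cases is not in the name field at all.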

The common limitation is that all three approaches address the symptom rather than the mechanism. A different class of tool is needed, but that tool only delivers value if it is implemented with the right governance model around it.

  

Governance and trust

A resolution layer that planners cannot interrogate or challenge will be rejected regardless of how good the underlying data science is. Before deployment, four questions need owners:

  1. Who holds canonical product identity and can adjudicate borderline matches

  2. How match decisions are made visible to the people using the output

  3. How exceptions are reviewed, overridden, or escalated

  4. How the resolved master is kept current as new products, channels, and partners arrive

If those questions are answered after the tool goes live rather than before, planners will maintain a parallel version they trust more. The technology is the easier part.
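One way to make the exception-handling question operational is confidence-banded triage: auto-accept high-confidence matches, queue a borderline band for the canonical-identity owner, and leave the rest unmerged. A minimal sketch; the thresholds are illustrative, not a recommendation, since the right values depend on the cost of a wrong merge:

```python
# Illustrative thresholds for routing entity-resolution match decisions.
AUTO_ACCEPT = 0.95
REVIEW_FLOOR = 0.60

def triage(match_confidence: float) -> str:
    """Route a proposed match according to the governance model."""
    if match_confidence >= AUTO_ACCEPT:
        return "accept"   # applied automatically, logged for audit
    if match_confidence >= REVIEW_FLOOR:
        return "review"   # queued for the canonical-identity owner
    return "reject"       # records kept distinct

print([triage(c) for c in (0.98, 0.72, 0.40)])  # ['accept', 'review', 'reject']
```

The review band is where the governance model earns its keep: it is the mechanism that makes match decisions visible and overridable rather than silently applied.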

 

The technology that makes resolution possible

With the business case established and the governance questions in view, the technology stack becomes a means rather than an end. The definitions below cover the key components. They are presented as a vocabulary for evaluating solutions and asking the right questions of data teams or technology vendors, not as a prescription.


Entity Resolution (ER)
  The process of determining that two or more records, in different systems with different names or codes, refer to the same real-world product. Rather than matching on a single field, ER weighs a combination of attributes together to calculate the probability of a match. This multi-attribute, probabilistic approach handles both the naming variations from product creation and the identifier differences from system breaks.

ER packages (e.g. Senzing, Splink, Dedupe)
  Software libraries that run entity resolution automatically on your data. They handle naming variation, missing fields, different character sets, and multi-attribute matching without requiring manual rules or large labelled training datasets. Better tools produce a confidence score alongside each match and surface borderline cases for human review rather than applying all matches automatically.

Knowledge Graph
  A structured representation of data that captures not just products, but the relationships between them: which products belong to which range, which supplier provides which product, which products share attributes, which external identifiers map to which internal record. Unlike a flat table, a knowledge graph makes it possible to navigate and query those relationships directly.

Entity-Resolved Knowledge Graph (ERKG)
  A knowledge graph built on top of resolved product data. Every node in the graph represents a unique, confirmed product identity, with all its fragmented records (internal and external) linked to it. This is the structure that allows a single query to retrieve accurate performance data across every channel and system the product has touched.

RAG (Retrieval-Augmented Generation)
  A technique that allows an AI assistant to look up information in your own data before generating an answer. When you ask an AI tool a question about your product range, RAG is the mechanism by which it retrieves relevant records and uses them to form its response. Its accuracy is bounded directly by the quality of the underlying product identity model.

GraphRAG
  RAG that retrieves from an entity-resolved knowledge graph rather than a flat database. Because the graph captures relationships between products and their identifiers across systems, GraphRAG can answer questions that require traversing those connections. Dr. Sullivan's research, and subsequent work by Microsoft Research and data.world, shows consistently that it outperforms standard RAG on questions requiring relationship reasoning across multiple records.

These tools sit downstream of the governance decisions described above. The ER package produces matches; the governance process adjudicates them. The knowledge graph structures the resolved data; the planning hierarchy defines the grain at which it is structured. The RAG layer queries it; the merchandising team defines what questions it needs to answer. In each case, the technology enables what the operating model permits.
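The "produces matches" step can be sketched by hand to show what multi-attribute probabilistic scoring means in practice. The weights and field names below are illustrative; real ER packages estimate comparison weights from the data rather than hard-coding them:

```python
from difflib import SequenceMatcher

# Illustrative weights; real ER packages learn these from the data.
WEIGHTS = {"name": 0.4, "colour": 0.3, "category": 0.2, "supplier": 0.1}

def field_sim(a: str, b: str) -> float:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def match_score(rec_a: dict, rec_b: dict) -> float:
    """Weighted combination of per-attribute similarities, in [0, 1]."""
    return sum(w * field_sim(rec_a[f], rec_b[f]) for f, w in WEIGHTS.items())

# Two fragments of (plausibly) the same product, illustrative values.
a = {"name": "Grey Essential Crew Sweatshirt", "colour": "Grey Marl",
     "category": "Knitwear", "supplier": "ACME Textiles"}
b = {"name": "Crew Neck Sweatshirt, Grey Marl", "colour": "Marl Grey",
     "category": "Knit", "supplier": "ACME Textiles"}

score = match_score(a, b)
print(f"{score:.2f}")  # a single score the governance process can threshold
```

The point of the sketch is the shape, not the formula: no single field decides the match, and the output is a score for the triage process to adjudicate, not a binary merge.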


What this enables

The benefits of a resolved product master scale with planning maturity but start from the moment the data is more accurate. At the simplest level, a clean reference table exported from the ER package and loaded into a spreadsheet is enough to make option counts reliable and sell-through attributable. For teams on BI tooling, the ERKG becomes the data source feeding reports and dashboards, enabling cross-channel aggregation without double-counting. For forecasting, models train on real demand signals rather than on artefacts of how data was recorded. For AI decision tools, GraphRAG retrieves from a product graph where every external identifier maps back to a canonical record, producing answers that reflect actual commercial performance rather than whichever fragment of it happened to resolve.
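The underlying structure can be sketched without any graph database, as plain adjacency maps. The canonical ID "STY-CREW-GRY" is invented; the external identifiers are the ones from the four-system example earlier:

```python
# A tiny in-memory graph: edges keyed by (source, relation) -> list of targets.
edges: dict[tuple[str, str], list[str]] = {}

def add_edge(src: str, rel: str, dst: str) -> None:
    edges.setdefault((src, rel), []).append(dst)

# One canonical product node (hypothetical ID) with its resolved external identifiers.
for ext in ("EC-2024-0847", "WHL-CR-490", "50091234", "RTN-88821"):
    add_edge(ext, "resolves_to", "STY-CREW-GRY")
    add_edge("STY-CREW-GRY", "identified_by", ext)  # reverse edge for traversal

# One traversal retrieves every identifier the product carries across systems,
# which is the query that makes cross-channel aggregation double-count-free.
print(edges[("STY-CREW-GRY", "identified_by")])
```

An ERKG is this structure at scale, with confidence scores on the resolution edges and the planning hierarchy layered over the product nodes.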


What resolved data enables at each planning level, and the tools involved:

  • Spreadsheet planning: accurate option counts, valid LFL, attributed sell-through (ER package + CSV export)

  • BI and reporting: cross-channel metrics, supplier scorecards, franchise health (ERKG + BI connector)

  • Predictive forecasting: models trained on real demand signals, not artefacts (ERKG + forecasting layer)

  • AI decision intelligence: natural language querying, causal analysis, range optimisation (GraphRAG + LangChain)

  

The bottom line

Product misattribution is a structural problem with a structural fix. Entity resolution tooling establishes which records belong together across all systems, producing a canonical product master that makes planning data count correctly. The technology is accessible. What determines whether it changes how decisions are made is the governance model around it: clear ownership, transparent match logic, and a process for handling exceptions.


Contact:

hello@thehelm.ai