Identifiers are better off without meaning

https://varoa.net/2024/05/01/identifiers-are-better-off-without-meaning.html

Once at Last.fm we had an integer overflow in an identifier field. I can’t recall where exactly. But I do remember that the inconvenience of having a bunch of Hadoop jobs disrupted while we rushed to update the relevant type couldn’t spoil the collective pride in having more than 2 billion of whatever needed so many IDs.
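For the curious, here is a small Python sketch of how that kind of overflow shows up. I don’t remember which type was involved; “more than 2 billion” is consistent with a signed 32-bit integer, so that is the assumption here, and the function name is made up for illustration.

```python
import ctypes

INT32_MAX = 2**31 - 1  # 2,147,483,647: the largest signed 32-bit value


def next_id_int32(current: int) -> int:
    """Increment an ID and store it the way a signed 32-bit field would.

    ctypes.c_int32 does no overflow checking, so the value silently wraps.
    """
    return ctypes.c_int32(current + 1).value


# One past the maximum wraps around to the most negative value.
wrapped = next_id_int32(INT32_MAX)
```

Widening the field to 64 bits moves the ceiling to roughly 9.2 quintillion, which is why being frugal here rarely pays off.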

Being frugal with identifiers is seldom a good idea, but for me the worst identifier-related headaches came from IDs that had semantic value.

At Tuenti (once the largest social network in Spain) there was a concept similar to Facebook pages. Pages had types and subtypes. A page type might have been “group”, which had subtypes “business” or “community”. Another type could be “place” with subtypes like “store” or “landmark”.

Page identifiers were strings composed by concatenating the numeric identifiers of the type and subtype with an incrementing field in a DB. If you visited https://tuenti.com/p/3_2_6691 you’d instantly know the meaning: it was a “place” (3) of subtype “store” (2), and the store ID was 6691. Decomposing page IDs in this way was useful for multiple purposes: choosing a type-specific implementation of the controllers that compose the relevant page, routing to a database shard, that kind of thing.
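A minimal Python sketch of that kind of decomposition follows. Only the 3_2_6691 example and its reading come from the description above; the lookup tables, the shard count, the modulo routing and every function name are hypothetical, and this is not Tuenti’s actual code.

```python
from typing import NamedTuple

# Hypothetical lookup tables. Only "place" (3) and "store" (2) appear in the post.
PAGE_TYPES = {3: "place"}
PAGE_SUBTYPES = {(3, 2): "store"}
NUM_SHARDS = 4  # assumed shard count, for illustration only


class PageId(NamedTuple):
    type_id: int
    subtype_id: int
    seq: int


def parse_page_id(raw: str) -> PageId:
    """Split a 'type_subtype_sequence' string into its numeric parts."""
    parts = raw.split("_")
    if len(parts) != 3:
        raise ValueError(f"malformed page id: {raw!r}")
    type_id, subtype_id, seq = (int(p) for p in parts)
    return PageId(type_id, subtype_id, seq)


def describe(page: PageId) -> str:
    """Resolve the taxonomy purely from the ID, without touching a database."""
    page_type = PAGE_TYPES[page.type_id]
    subtype = PAGE_SUBTYPES[(page.type_id, page.subtype_id)]
    return f"{page_type}/{subtype}"


def shard_for(page: PageId) -> int:
    """Pick a shard from the sequence number (modulo routing is an assumption)."""
    return page.seq % NUM_SHARDS


page = parse_page_id("3_2_6691")
```

Every caller of something like describe() now depends on the taxonomy tables never changing, which is where the trouble starts.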

At some point Product wanted to change the classification of pages. To their frustration, this became problematic because the entire taxonomy was baked into the code all the way from URLs to databases.

Another example is New Relic entities. Entities are an abstraction that broadly represents anything that can send telemetry to New Relic. A host, a Kubernetes cluster, an application, a JVM or a network router can be entities. Of course those are all things you want to identify, so every entity has a Global Unique IDentifier, or GUID. Every single telemetry datapoint is stamped with the GUID of the entity that produced it, so GUIDs act as the keystone of the New Relic platform. Features like service maps, distributed tracing, entity relationships, and many others are built upon them.

Entity GUIDs have meaning. An example of a GUID is 1|APM|APPLICATION|23 where 1 is the account, APM is the domain, APPLICATION is the unique type within that domain, and 23 is a unique identifier within the domain and type. That application might be running on a host 1|INFRA|HOST|12. If we wanted to store that relation, we’d have an entry in some database saying:

"1|INFRA|HOST|12" RUNS "1|APM|APPLICATION|23".

As at Tuenti, these semantics come in handy. If you’re processing millions of telemetry datapoints per second, it’s useful to tell the type of the reporting entity on the fly by decomposing the GUID, rather than performing an expensive lookup against an external service. Account IDs can be used to route data to cells (this talk from Andrew Bloomgarden explains how NR used this pattern to scale).
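As a rough Python sketch of that on-the-fly decomposition, using the plain pipe-separated form from the example above. The class, the function names, the cell count and the modulo routing are all assumptions made for illustration, not New Relic’s implementation.

```python
from dataclasses import dataclass

NUM_CELLS = 8  # assumed number of cells, for illustration only


@dataclass(frozen=True)
class EntityGuid:
    account_id: int
    domain: str
    entity_type: str
    domain_id: str


def parse_guid(raw: str) -> EntityGuid:
    """Split 'account|domain|type|id' into fields without any external lookup."""
    parts = raw.split("|")
    if len(parts) != 4:
        raise ValueError(f"malformed GUID: {raw!r}")
    account, domain, entity_type, domain_id = parts
    return EntityGuid(int(account), domain, entity_type, domain_id)


def cell_for(guid: EntityGuid) -> int:
    """Route by account ID (simple modulo routing is an assumption)."""
    return guid.account_id % NUM_CELLS


app = parse_guid("1|APM|APPLICATION|23")
host = parse_guid("1|INFRA|HOST|12")
```

Both entities belong to account 1, so this sketch routes them to the same cell. It is cheap and convenient, and it quietly turns the account, domain and type into things that can never change.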

Domains like INFRA or APM corresponded to the original product verticals at New Relic. Years later Product decided (with good reason) that they created unnecessary fragmentation in the user experience. Types change, get renamed, or get merged with others. Sometimes (more often than you’d think) entities had to migrate from one account to another. In all these cases you would be altering the identifier of many entities.

It was painful but possible to work around many of these problems. But replacing GUIDs with semantic-free identifiers was flat-out impossible. By virtue of being present in thousands of URLs, NRQL queries, etc., GUIDs had become a public API that thousands of customers relied upon. A technical solution to replace identifiers would have been a major project, but doable. What wasn’t possible was to run a find/replace across the private documentation and workflows of your entire customer base.
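For contrast, here is a sketch of what a semantic-free identifier could look like: an opaque random ID, with account, domain and type kept as ordinary attributes that are free to change. This is an illustration of the idea, not something New Relic or Tuenti built; the in-memory registry and the function names are hypothetical, and the new domain name is invented.

```python
import uuid

# Hypothetical in-memory registry standing in for a database table.
registry: dict[str, dict[str, str]] = {}


def register(account: str, domain: str, entity_type: str) -> str:
    """Create an entity under an opaque ID that encodes nothing about it."""
    entity_id = uuid.uuid4().hex
    registry[entity_id] = {
        "account": account,
        "domain": domain,
        "type": entity_type,
    }
    return entity_id


def reclassify(entity_id: str, **changes: str) -> None:
    """Change account, domain or type; the identifier itself stays the same."""
    registry[entity_id].update(changes)


entity_id = register("1", "APM", "APPLICATION")
reclassify(entity_id, domain="SERVICES")  # "SERVICES" is a made-up new domain
```

The cost is the one described earlier: finding out the type now needs a lookup (or a cache) instead of a string split. The benefit is that URLs, queries and customer documentation holding the ID survive a reclassification.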

If you look closely, the world is full of semantic identifiers. Sometimes they are hard to avoid, but they are almost always a pain in the neck, because they embed a specific model of the world, and models become obsolete faster than we’d like.

Addresses make notable examples. The “complex and idiosyncratic” Japanese address system reflects the organic growth of its urban areas. In British postal codes the final part can designate anything from a street to a flat depending on the amount of mail received by the premises.

When I was a kid, license plates would give away the province where the owner lived, causing an array of nuisances. These were alleviated by the adoption of European standards, but only partly. The root problem is that, as in the Domain Name System, identifiers (“Galo’s website”) remain tied to administrative authorities (“.net”), which can change regulations, or even disappear.

Nowadays I find most semantic identifiers in resource management. For some reason, when infrastructure teams define access rules to the resources of a particular service, they prefer to create a group named after the team that owns that service, something like “owner-team-access-list”. Identifiers tied to the org chart don’t like it when the service moves to another team, or owner-team is reorged away.

Update: a commenter on Reddit pointed to another great example of the problems with semantic identifiers: the German tank problem. Do send more!

Update: Discussion on Hacker News

May 1, 2024

