From 6dd6798b14f7fe39a36cbb96348eb23b3e0fbc28 Mon Sep 17 00:00:00 2001
From: Aki
Date: Sat, 15 Jun 2024 16:27:01 +0200
Subject: Published Sequential Bias
---
 sequential_bias.html | 85 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 85 insertions(+)
 create mode 100644 sequential_bias.html
Sequential Bias
When dealing with containers we may loosely focus on three major attributes:

- order: whether the elements form a meaningful sequence,
- uniqueness: whether duplicate elements are allowed,
- association: whether elements are addressable, by index or by key.

And so a list could be considered an ordered, non-unique, associative container, while a set could be an unordered, unique and non-associative one.
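As a rough sketch of these attributes in Python, using the built-in list and set:

```python
# A list is ordered, non-unique and (by index) associative.
items = ["b", "a", "b"]
assert items[0] == "b"           # accessible by index
assert items.count("b") == 2     # duplicates allowed
assert items != ["a", "b", "b"]  # order matters for equality

# A set is unordered, unique and non-associative.
unique_items = {"b", "a", "b"}
assert len(unique_items) == 2      # duplicates collapse
assert unique_items == {"a", "b"}  # order plays no role
```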

Of course, there are two significant gotchas here. This is a simplified classification that completely ignores the operations supported by a container, and a number of abstract containers are defined exactly by those. And let's not forget about structure and implementation details, especially for the very specialized data structures.

The other gotcha is that under normal circumstances we would reserve "associative" for containers like associative arrays, not simple lists. Here, I decided to classify a list as associative because it is indexed and accessible. A stack, where elements are not necessarily accessible unless they are at the top and where indexing is a hidden detail, would be considered non-associative in this scenario. In general, I'd consider order, indexing, accessibility and association to play closely together.
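To make the stack case concrete, here is a minimal sketch: a wrapper that exposes only push, pop and peek, so the internal indexing never leaks out and only the top element is ever accessible.

```python
class Stack:
    """Minimal stack: only the top element is accessible."""

    def __init__(self):
        self._items = []  # indexing is a hidden implementation detail

    def push(self, item):
        self._items.append(item)

    def pop(self):
        return self._items.pop()

    def peek(self):
        return self._items[-1]


stack = Stack()
stack.push(1)
stack.push(2)
assert stack.peek() == 2  # only the top is visible
assert stack.pop() == 2
assert stack.pop() == 1
```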

The detailed nature of this classification method, while interesting, is not the topic for today. For today's purpose, this simplified approach seems coherent and sufficient.


With that out of the way, let's talk about generating documents. A sudden change, I know. +

In one of our projects we were generating quite a lot of documents. In situations where we were handling data that had no significant order by itself, I noticed I had a tendency to sort it before putting it "on paper". The vague reason was that this way each revision of the document "felt similar". It felt similar because it was quicker to compare by hand. This was reinforced by the documentation process, which heavily relied on redlines and manual reviews. Consistent ordering across the lifetime of a multi-revision document kept redlines small and easy to skim through.

But if I represent unsorted data as sorted in two different orders, is the data any different? No, because order does not matter. Why do I sort it then? Because it's easier to see differences. And why is that? Because I use tools (wetware or software) designed to handle sequential data. Why? Because these tools are commonly available.
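This is easy to simulate: two insertion orders of the same unordered data serialize to different text, while a canonical, sorted representation stays stable across revisions. A minimal sketch using JSON:

```python
import json

data_v1 = {"b": 2, "a": 1}
data_v2 = {"a": 1, "b": 2}  # same data, different insertion order

# Naive serialization preserves insertion order, so the text differs...
assert json.dumps(data_v1) != json.dumps(data_v2)

# ...while a canonical, sorted representation is identical for both,
# which keeps redlines and git diffs small across revisions.
assert json.dumps(data_v1, sort_keys=True) == json.dumps(data_v2, sort_keys=True)
```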

The exact same thing happened when we were migrating data from the very same system into markdown text files tracked by git. Since both git and our brains were looking for ordered data, we exported ordered data.

Now, consider a Python script:

def do_one_thing():
    pass


def do_another_thing():
    pass


def do_both_things():
    do_one_thing()
    do_another_thing()

If we move do_both_things to the top, will the script change? Yes: it's a list of statements. Python interprets the code sequentially and the code itself is represented as text. But will it really change? No: it's a map of functions. Python will bind the names the same way in both cases and the behaviour will not change. Both answers can be reasoned to be true.
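The "map of functions" answer can be checked directly. Below, two module texts differ only in definition order; executing both shows they bind the same names to the same behaviour (names inside do_both_things are resolved at call time, not at definition time):

```python
module_a = """
def do_one_thing(): return "one"
def do_another_thing(): return "another"
def do_both_things(): return (do_one_thing(), do_another_thing())
"""

# Same functions, with do_both_things moved to the top.
module_b = """
def do_both_things(): return (do_one_thing(), do_another_thing())
def do_one_thing(): return "one"
def do_another_thing(): return "another"
"""

ns_a, ns_b = {}, {}
exec(module_a, ns_a)
exec(module_b, ns_b)

# As text (a list of statements), the modules differ...
assert module_a != module_b
# ...but as a map of functions they behave identically.
assert ns_a["do_both_things"]() == ns_b["do_both_things"]()
```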

Of course, if we take a look at the AST, we can confirm that from the parser's standpoint the first answer is true. The body of a module is a list, so the order matters. This makes sense, since Python allows putting behaviour at module level. What if we simulate this situation in different languages?
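You can verify this in the standard library's ast module: ast.parse returns a Module whose body is literally a Python list of statements.

```python
import ast

source = "def f(): pass\ndef g(): pass\n"
module = ast.parse(source)

# The module body is a plain list, so order is part of the representation.
assert isinstance(module.body, list)
assert all(isinstance(node, ast.FunctionDef) for node in module.body)
assert [node.name for node in module.body] == ["f", "g"]
```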

Take C++ next. One can shuffle things around, but only after preparing enough declarations first. Funnily enough, this reinforces both sides equally. If you shuffle things without declarations, they will easily break, meaning the program changed. If you prepare all the declarations needed, you can shuffle or split things as much as you want, meaning the program does not change.

In the C++ case, we could summarise it as legacy and compatibility reasons. Can I continue to analyse and reason about every single case, or would it sometimes make sense to use an unordered representation? How would changing the assumptions change the problems, tooling and processes? One way to find out. For programming languages, I currently see one concrete thing: semantic diff and merge tools. You may already find those in the wild, but it also sounds like a cool weekend project to work on. Moreover, on the more theoretical side of things, I think there are still things to consider regarding structure and representation. I'm preparing a post about those, but the summary is "I'm quite stupid."
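As a toy sketch of what such a semantic diff might compare (assuming, for simplicity, that we only care about top-level function definitions): instead of diffing lines of text, diff the map from function name to function source, so pure reordering produces no difference.

```python
import ast


def function_map(source):
    """Map top-level function names to their unparsed source,
    ignoring where in the module each one appears."""
    tree = ast.parse(source)
    return {
        node.name: ast.unparse(node)
        for node in tree.body
        if isinstance(node, ast.FunctionDef)
    }


old = "def f():\n    return 1\n\ndef g():\n    return 2\n"
new = "def g():\n    return 2\n\ndef f():\n    return 1\n"  # reordered only

# A line-based diff reports changes; the map-based view sees none.
assert old != new
assert function_map(old) == function_map(new)
```

A real tool would also need to handle classes, module-level statements (where order genuinely matters) and renames, but the core idea is the same: compare the structure, not the text.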

Nonetheless, I plan to keep this little bias in mind, whether it is a result of my brain, my tools, legacy or anything else. If I perform actions on a certain representation of data, with all its constraints, I will get results for exactly that representation and not for the abstract data being represented. That's why analysing and representing problems should always be one of the priorities.
