When dealing with containers, we may loosely focus on three major attributes:
- order, whether elements or slots are ordered or not
- uniqueness, whether the same element can show up multiple times
- association, whether slots are identifiable by another value

And so a list could be considered an ordered, non-unique, associative container, while a set could be an unordered, unique, and non-associative one.
Of course, there are two significant gotchas here. The first is that this is a simplified classification which completely ignores the operations supported by a container, and a number of abstract containers are defined exactly by those. And let's not forget about structure and implementation details, especially for the very specialized data structures.
The other gotcha is that under normal circumstances we would reserve "associative" for containers like associative arrays, not simple lists. Here, I decided to classify a list as associative because it is indexed and accessible. A stack, where elements are not necessarily accessible unless they are at the top and where indexing is a hidden detail, would be considered non-associative in this scheme. In general, I'd consider order, indexing, accessibility, and association to play closely together.
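To make that concrete, here is a small sketch of the distinction (the values are made up for illustration): a list's slots are identified by their indices, much like a dict keyed by integers, while a stack only hands you its top.

```python
# A list's slots are identified by their indices...
xs = ["a", "b", "c"]
print(xs[1])                # "b" -- the slot is addressed by index 1

# ...which mirrors an integer-keyed dict:
as_dict = dict(enumerate(xs))
print(as_dict[1])           # "b"

# A stack only exposes its top; indexing stays a hidden detail.
stack = ["a", "b"]
print(stack.pop())          # "b" -- no slot is addressed by the caller
```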
The detailed nature of the classification method, while interesting, is not the topic for today. For today's purpose, this simplified approach seems to be coherent and sufficient.
With that out of the way, let's talk about generating documents. A sudden change, I know.
In one of my projects we were generating quite a lot of documents. In situations where we were handling data that had no significant order by itself, I noticed I had a tendency to sort it before putting it "on paper". The vague reason was that this way each revision of the document "felt similar". It felt similar because it was quicker to compare by hand. This was reinforced by the documentation process, which heavily relied on redlines and manual reviews. Consistent ordering across the lifetime of a multi-revision document kept redlines small and easy to skim through.
But if I represent the same unordered data sorted in two different ways, is the data any different? No, because order does not matter. Why do I sort it then? Because it's easier to see differences. And why is that? Because I use tools (wetware or software) designed to handle sequential data. Why? Because these tools are commonly available.
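As a minimal sketch of that habit (the data here is made up): sorting before serializing yields a canonical text, so a line-based diff between two exports of the same data only shows real changes.

```python
import json

# Hypothetical unordered data; the set itself has no meaningful order.
records = {"banana", "apple", "cherry"}

# Sorting before serializing makes every export line up the same way,
# so a line-based diff (or a reviewer's eye) sees only real changes.
print(json.dumps(sorted(records), indent=2))

# The same idea for mappings: sort_keys gives a canonical key order.
print(json.dumps({"beta": 2, "alpha": 1}, sort_keys=True, indent=2))
```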
The exact same thing happened when we were migrating data from that very system into markdown text files tracked by git. Since both git and our brains expected ordered data, we exported ordered data.
Now, consider a Python script:
```python
def do_one_thing():
    pass

def do_another_thing():
    pass

def do_both_things():
    do_one_thing()
    do_another_thing()
```
If we move do_both_things to the top, will the script change? Yes. It's a list of statements. Python interprets the code sequentially, and the code itself is represented as text. But will it really change? No. It's a map of functions. Python will bind the names the same way in both cases, and the behaviour will not change. Both answers can be reasoned to be true.
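A quick way to see the second answer: in the shuffled version below, the names inside do_both_things are only looked up when it is called, and by that point all three functions are bound.

```python
# Shuffled version: do_both_things comes first, yet nothing breaks,
# because the names inside it are resolved at call time, not at
# definition time.
def do_both_things():
    do_one_thing()
    do_another_thing()

def do_one_thing():
    pass

def do_another_thing():
    pass

do_both_things()  # works: both names exist by the time this runs
```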
Of course, if we take a look at the AST, we will confirm that from the parser's standpoint the first answer is true. The body of a module is a list, so the order matters. This makes sense, since Python allows putting behaviour at module level.
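We can check this directly with the standard ast module; the sample source below is just the script from above, inlined:

```python
import ast

source = """
def do_one_thing():
    pass

def do_another_thing():
    pass

def do_both_things():
    do_one_thing()
    do_another_thing()
"""

module = ast.parse(source)
print(type(module.body))  # <class 'list'> -- an ordered sequence
print([type(node).__name__ for node in module.body])
# ['FunctionDef', 'FunctionDef', 'FunctionDef']
```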
What if we simulate this situation in different languages? Take C++ next. One can shuffle things around, but only after preparing enough declarations first. Funnily enough, this reinforces both sides equally. If you shuffle things without declarations, they will easily break - meaning the program changed. If you prepare all the declarations needed, you can shuffle or split things as much as you want - meaning the program does not change.
In the C++ case, we could summarise it as legacy and compatibility reasons. Can I continue to analyse and reason through every single case, or will it sometimes make sense to use an unordered representation? How would changing the assumptions change the problems, tooling, and processes? One way to find out. For programming languages, I currently see one concrete thing: semantic diff and merge tools. You may already find those in the wild, but it also sounds like a cool weekend project to work on; a rough sketch of the idea follows below.
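The sketch below is built on my own assumptions (the helper names are mine, not an existing tool's): it treats a module as a map from top-level function names to their ASTs, so reordering definitions compares as equal.

```python
import ast

def top_level_defs(source):
    # Map top-level function names to a canonical dump of their AST,
    # deliberately discarding the order in which they appear.
    return {
        node.name: ast.dump(node)
        for node in ast.parse(source).body
        if isinstance(node, ast.FunctionDef)
    }

def semantically_same(a, b):
    # True when two modules define the same functions with the same
    # bodies, regardless of their order in the file.
    return top_level_defs(a) == top_level_defs(b)

original = "def f():\n    pass\n\ndef g():\n    f()\n"
shuffled = "def g():\n    f()\n\ndef f():\n    pass\n"

print(semantically_same(original, shuffled))  # True
```

A real tool would of course also have to handle module-level statements, whose order, as the AST check above showed, genuinely matters.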
Moreover, on the more theoretical side of things, I think there are still things to consider regarding structure and representation. I'm preparing a post about those, but the summary is "I'm quite stupid."
Nonetheless, I plan to keep this little bias in mind, whether it is a result of my brain, tools, legacy, or anything else. If I perform actions on a certain representation of data, with all its constraints, I will get results for exactly that representation and not for the actual abstract data being represented. That's why analysing and representing problems should always be one of the priorities.