Collecting and Categorizing Web Links in Org-mode

Tagwood Forest

Whether they’re called bookmarks, favorites, or weblinks, we all accumulate a large number of them over time. At some point, it all becomes overwhelming and needs organizing. So, I wrote a Python tool for that.

What Has Happened So Far

New weblinks always originate from my browser, whether it's Firefox on my desktop and laptop or on my phone. I could simply leave them in Firefox, allowing me to sync and access them across devices. Sounds good enough, right?

But I also maintain a page with links on this blog that I want to share publicly. Ideally, I wouldn’t have to manage these links separately. That’s why, years ago, I decided not to use built-in or online bookmark managers. Instead, I store all my weblinks as plaintext in Org-mode format. Each link is a heading with the page title, linked to the respective URL.

If I feel like it, I add a short note or a quote from the page. On my laptop and desktop, a small browser add-on (Org Capture) appends new links to my Links.org file using Emacs’ org-capture protocol. On my phone, this add-on doesn’t work, so I share the page with the Orgzly Revived app, which lets me edit Org-mode files on Android. The link also ends up in Links.org, and I sync my entire org directory across all my devices.

The Links.org File in Detail

The file starts with a set of metadata lines that my static site generator needs to create the public link page. It looks like this:

#+BEGIN_COMMENT
.. title: Links
.. description: All the links I can recommend
.. slug: links
.. date: 2019-01-01 00:00:00 UTC+01:00
.. tags: links, emacs, orgmode, qutebrowser, plaintext, pim
.. previewimage: qrcodebus.png
#+END_COMMENT
#+OPTIONS: tags:nil

The last line ensures that tags assigned to individual links aren’t exported to HTML.

All new links start as second-level headings. The first-level headings are divided into four sections:

1. Quicklinks

These are my frequently used links, marked with :noexport:, so they don’t appear in the public link list. They are placed at the top of the file for quick access.

* ------ Quick Links ----- :noexport:
 :PROPERTIES:
 :VISIBILITY: children
 :END:
| Social Media | Shopping | News            | Finance    | Home Tech      | Accounts      | Other       |
|--------------+----------+-----------------+------------+----------------+---------------+-------------|
| Mastodon     | Idealo   | Tagesschau      | Volksbank  | FritzBox       | BibLoad       | Thingiverse |
| Feddit       | Amazon   | The Guardian    | Barclays   | Wechselrichter | Stadtbücherei | Cults       |
|              | eBay     |                 | Schw. Hall | Stromzähler    |               |             |
| Codeberg     | Otto     | Wetter WAF      | Paypal     | Home Assistant |               | Oryx        |
| Wallabag     | Kaufland | Stat. Bundesamt |            | CUPS           |               |             |
|              |          | Dashboard DE    |            | Syncthing      |               |             |
| ChatGPT      |          | Statista        |            | Jenkins        |               |             |
| Wiki         |          |                 |            |                |               |             |

2. Public Links

These links are visible on the blog. My site generator processes this section and turns it into a dynamic list using a bit of CSS magic.

3. Private Links

This is where things get interesting. Right now (March 4, 2025), this section holds around 450 links, categorized with up to four tags. The organization, however, is… let’s say, questionable. More of a dumpster fire, really.

4. The INBOX

Basically, the same chaotic mess as the private links, just without any tags assigned yet.

So, my tool’s job is to process sections 3 and 4.

What’s the Goal?

The tool should take all weblinks from sections 3 and 4 and organize them into one logical tree structure based on their tags. The tricky words here are one and logical.

Are There Multiple Possible Trees?

Of course; otherwise, it would be too easy. Tags are inherently non-hierarchical. If I tag a link with a and b:

* Link :a:b:

I can create two different trees:

├── a
│   └── b
│       └── Link
└── b
    └── a
        └── Link

As soon as I have three tags, I can insert the additional tag at three different positions in each of the two trees, resulting in six possible trees.

The mathematical representation is 3! (i.e., 3 factorial). If anyone remembers exponential growth from the COVID era: that’s child’s play compared to factorial growth.

With six tags, we’re already at 720 possible trees, and just eight tags result in 40,320 trees.
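The growth is easy to verify in a few lines of Python (a standalone illustration, not part of the tool):

```python
# Each ordering of a link's tags corresponds to one possible tree,
# so n tags yield n! trees.
from itertools import permutations
import math

tags = ["a", "b", "c"]
orderings = list(permutations(tags))
print(len(orderings))     # 6, i.e. 3!
print(math.factorial(6))  # 720
print(math.factorial(8))  # 40320
```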

And what do I do with all these trees?

The common approach would be to use them all. That means a link with four tags would appear 24 times in the link list, but there would also be 24 ways to find it. Roughly estimated, my 450 links would result in about 3,000 entries in the output file. And that’s too much for me.

Many of these paths wouldn’t even make sense. Suppose I bookmark the location where I last parked my car:

* Auto :germany:munic:mainstreet:

Then I don’t think, "Hmm, I remember parking on Main Street, so show me all the countries that have cities with a Main Street so I can pick the right one."

So, I need to cut down the trees that don’t make sense. This brings in the second key word: the structure should be a logical tree. But, as should have been clear even before Trump, there are some pretty strange interpretations of logic, and that’s not even considering alternative logic.

Computers aren’t particularly good at determining what makes sense in a search. So instead, I rely on a different criterion: frequency. Because counting is something computers are really good at.

If my file contains the tag emacs 200 times but the tag orgmode only 100 times, then the tree looks like this:

└── emacs
    └── orgmode
        └── Link
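The frequency rule boils down to a simple sort key. Here is a minimal sketch; the counts are made up for illustration and not taken from my real Links.org:

```python
# Order a link's tags by global frequency, most frequent first.
from collections import Counter

tag_counts = Counter({"emacs": 200, "orgmode": 100})

def order_tags(tags):
    """Sort tags by descending frequency; ties break alphabetically."""
    return sorted(tags, key=lambda t: (-tag_counts[t], t))

print(order_tags(["orgmode", "emacs"]))  # ['emacs', 'orgmode']
```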

That works pretty well, but now and then, some strange hierarchies still appear, especially with tags that are rarely used overall. In those cases, a single extra use can already tip the sorting one way or the other.

And now I’m even cheating a little.

In most cases, rare tags don’t really matter. But when this happens with major (frequent) tags, it can be quite annoying. That’s why I cheat a bit. Sticking with the car example, I’d actually write it like this:

* Auto :Germany:Munic:mainstreet:

See the difference? The first two tags are now capitalized. For my tool, this signals that country and city are hierarchically more important than the street. This still leaves two possible trees, but it eliminates four others right away. I can only enforce one strict hierarchy level this way, but it helps handle the edge cases quite well.
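Enforcing this constraint is just a filter over the permutations. A minimal sketch (my tool's actual implementation may differ):

```python
# Treat capitalized tags as hierarchically senior: discard any ordering
# in which a lowercase tag precedes a capitalized one.
from itertools import permutations

def valid_orderings(tags):
    for perm in permutations(tags):
        seen_lower = False
        ok = True
        for t in perm:
            if t[0].isupper():
                if seen_lower:      # a capitalized tag after a lowercase one
                    ok = False
                    break
            else:
                seen_lower = True
        if ok:
            yield perm

print(list(valid_orderings(["Germany", "Munic", "mainstreet"])))
# [('Germany', 'Munic', 'mainstreet'), ('Munic', 'Germany', 'mainstreet')]
```

Exactly the two trees the text promises: the street can no longer outrank country or city.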

Why is this cheating? Well, according to the textbook, tags don’t carry hierarchical information, and my programmer’s heart does feel a little dirty about it.

Now, on to the technical part, in case anyone wants to implement this in another language:

The Algorithm

  1. Collecting Links:
    • First, I gather all the links I want to process and store them in a dictionary.
    • The key is a hash of the actual link, which eliminates accidental duplicates.
  2. Generating Permutations:
    • I generate all possible tag permutations.
    • Any permutations that violate my "cheated" tag hierarchy are skipped.
  3. Building the Tree:
    • The list of links is passed into the core function that constructs the tree.
    • Since trees lend themselves well to recursion, the function is recursive.
    • Each function call processes one hierarchy level and then calls itself for deeper levels.
    • The recursion stops when no unplaced links remain.
  4. Inside the Recursive Function:
    • The current tag is added as a node in the tree.
    • All links that fit exactly at this level (i.e., having exactly the required tags for this depth) are added.
    • If links remain, the voting process begins:
      • Each remaining link votes for the tag it would like to see added next.
      • The tag with the most votes is chosen.
      • A new list of links that voted for this tag is created.
      • The function calls itself with this list and the chosen tag, building the subtree.
      • Once processed, those links are removed, and another voting round occurs if any links remain.
  5. Exporting the Tree:
    • Once fully built, the tree is written back into an Org-mode file.
    • Done!
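Steps 3 and 4 can be condensed into a short sketch. The data structures here are simplified (a link is a title plus a set of tags, the tree is a nested dict), so this is an illustration of the idea rather than my tool's actual code:

```python
# Recursive tree build with voting: each call handles one hierarchy level.
from collections import Counter

def build_tree(links, path=()):
    node = {"links": [], "children": {}}
    remaining = []
    for title, tags in links:
        if tags == set(path):          # fits exactly at this level
            node["links"].append(title)
        else:
            remaining.append((title, tags))
    while remaining:
        # Voting round: each link votes for every tag it still needs.
        votes = Counter()
        for _, tags in remaining:
            for t in tags - set(path):
                votes[t] += 1
        # Most votes wins; ties break alphabetically.
        winner = min(votes, key=lambda t: (-votes[t], t))
        voters = [l for l in remaining if winner in l[1]]
        node["children"][winner] = build_tree(voters, path + (winner,))
        remaining = [l for l in remaining if winner not in l[1]]
    return node

links = [("Org manual", {"emacs", "orgmode"}), ("Emacs wiki", {"emacs"})]
tree = build_tree(links)
print(tree["children"]["emacs"]["links"])                         # ['Emacs wiki']
print(tree["children"]["emacs"]["children"]["orgmode"]["links"])  # ['Org manual']
```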

The Result

The output file now contains a hierarchically structured and sorted list of tags:

  • The most frequent tags appear first.
  • If two tags have the same frequency, they are sorted alphabetically.
  • All links are included in the tree.
  • No duplicate links exist.

This gives me a link tree that is highly usable. While it may not be the perfect tree I would create manually, it's close enough. And the best part? There's still plenty of room for optimization in the algorithm.

Tweaking the Voting

Anyone interested in voting systems will have realized that there are dozens, if not hundreds, of different voting methods, each leading to slightly different results and optimized for various goals (fixed number of seats, weighting by stock share, minimizing the days traveled by horse to the capital, etc.).

This also applies to the voting in this algorithm. Currently, every link casts one vote for each tag that benefits it in some way. However, other variants could also be considered:

  • If a link sees 3 possible trees for Tag A and 2 possible trees for Tag B, it could vote with 3 and 2 votes for the respective tags.
  • The votes could be weighted based on whether a tag is immediately needed at this level or possibly further down the tree.
  • etc.
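The first variant only changes how a ballot is tallied. A hedged sketch, where the per-tag weights are hypothetical precomputed values rather than anything my tool currently does:

```python
# Weighted voting: each link submits a {tag: weight} ballot instead of
# one flat vote per tag.
from collections import Counter

def weighted_vote(ballots):
    """Return the tag with the highest total weight across all ballots."""
    votes = Counter()
    for ballot in ballots:
        for tag, weight in ballot.items():
            votes[tag] += weight
    return votes.most_common(1)[0][0]

# One link sees 3 possible trees for 'a' and 2 for 'b'; another only needs 'b'.
print(weighted_vote([{"a": 3, "b": 2}, {"b": 2}]))  # b
```

With flat one-vote-per-tag counting, 'a' and 'b' would tie here; the weights shift the outcome.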

Overall, I am already very satisfied with the results. My repository with the current state of the tool can be found (here).

2025-03-04