The internet never forgets.


The internet has its own unique memory that forgets almost nothing. Part of this memory is archive.org, a project initiated by Brewster Kahle in 1996, which has made it its mission to archive the internet. A central component of archive.org is the Wayback Machine.

According to its own figures, the Wayback Machine has access to a database of approximately 1 trillion web pages. Similar to Google, the Wayback Machine is operated via a simple search field. In this search field, you can search for either a specific internet domain or a specific keyword. If something related to the search term is stored in archive.org’s database, the calendar view shows the date on which a so-called snapshot was created. All content from a domain that was freely accessible on that day was included in the snapshot. This makes it easy to recover content that has already been deleted.
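The lookup the calendar view performs can also be done programmatically via archive.org's public Availability API. A minimal Python sketch, assuming only the documented endpoint and response shape (the sample response below is abridged and illustrative, not a live reply):

```python
from urllib.parse import urlencode

def availability_url(url, timestamp=None):
    """Build a query URL for archive.org's public Availability API."""
    params = {"url": url}
    if timestamp:
        params["timestamp"] = timestamp  # YYYYMMDDhhmmss, may be truncated
    return "https://archive.org/wayback/available?" + urlencode(params)

def closest_snapshot(api_response):
    """Return the URL of the closest archived snapshot, or None."""
    snap = api_response.get("archived_snapshots", {}).get("closest")
    if snap and snap.get("available"):
        return snap["url"]
    return None

# Abridged, illustrative response in the documented shape:
sample = {"archived_snapshots": {"closest": {
    "available": True,
    "url": "http://web.archive.org/web/20130919044612/http://example.com/",
    "timestamp": "20130919044612",
    "status": "200"}}}

print(availability_url("example.com", "20130919"))
print(closest_snapshot(sample))
```

Fetching the built URL returns the snapshot closest to the requested timestamp, which is handy when you want the state of a page at a specific date rather than the newest capture.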

However, when working with the Wayback Machine, you need to be aware of certain conditions. Archive.org is a non-profit organization financed by donations, but it still operates under some limitations, and it is headquartered in the United States. Considering the enormous costs incurred simply for collecting and storing the data, it is more than just a suspicion that the project has close ties to government agencies. Official bodies also have considerable reasons for wanting such a service without having to adhere to the strict regulations that bind official government organizations.

One problem that arises when working with the Wayback Machine is the frequency with which archived websites change. Especially with small websites, several changes may occur between two snapshots. But even large websites, like spiegelonline.de, don’t have a daily snapshot, as one might expect. The reasons for this are quite varied. In addition, there are various mechanisms that prevent crawlers from indexing a website; the purpose of such measures can be, among other things, to limit traffic on the server itself, so that resources remain available to readers instead of being consumed by bots.
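How often a site was actually captured can be checked with the Wayback Machine's public CDX API, which lists individual captures. A minimal sketch; the response rows below are abridged and illustrative, not real data:

```python
from urllib.parse import urlencode

def cdx_query(domain, limit=100):
    """Build a query for the Wayback Machine's CDX API, which lists
    individual captures of a URL or domain."""
    params = {"url": domain, "output": "json",
              "fl": "timestamp,statuscode", "limit": limit}
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

def capture_days(rows):
    """The first CDX row (output=json) is a header; the rest are captures.
    Returns the distinct days on which a capture was taken."""
    header, *captures = rows
    i = header.index("timestamp")
    return sorted({row[i][:8] for row in captures})

# Abridged, illustrative CDX response (output=json):
sample_rows = [
    ["timestamp", "statuscode"],
    ["20240101120000", "200"],
    ["20240101180000", "200"],
    ["20240305090000", "200"],
]
print(capture_days(sample_rows))  # three captures, but only two distinct days
```

Comparing the list of capture days against a site's actual publishing rhythm makes the gaps between snapshots visible at a glance.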

Another issue arising from this massive amount of data is, of course, its potential use as training data for large language models (LLMs). Large platforms fear losing their users, an aspect I addressed back in 2023. In February 2026, there was also a public discussion on this topic between the Wayback Machine’s director, Mark Graham, and Nieman Lab, which can also be found as a blog post at archive.org. Most website operators face this problem, as creating and publishing content costs both time and money. In the case of elmar-dott.com, this includes expenses for the server, domain, books, and various subscriptions. Since we explicitly oppose automated content creation, all articles on elmar-dott.com are based on concrete experience and in-depth research into the respective topics. This also means that many of the solutions described are actually used by the authors themselves. To prevent AI from harvesting the content, which would leave web crawlers as our main visitors, high-quality information is only accessible via subscription. This applies particularly to references, source code, and selected articles.
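A common, if purely voluntary, first line of defence against such harvesting is a robots.txt that disallows known AI crawlers. The user-agent tokens below (GPTBot, CCBot, Google-Extended) are published by their respective operators, but compliance is entirely at the crawler's discretion; this is an illustrative fragment, not the configuration actually used by elmar-dott.com:

```text
# Block common AI training crawlers (voluntary compliance only)
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```

Because robots.txt is only a request, paywalls or subscriptions remain the more reliable way to keep content out of training corpora.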

Another aspect, of course, is the trustworthiness of the stored content. Even though archive.org operates as a non-profit and is committed to a freely accessible internet, this doesn’t mean it couldn’t pursue other, unofficial interests. Electronically stored content is notoriously easy to manipulate. Content collected via archiving services should therefore be treated as an indicator rather than as proof. There are, of course, ways to protect the collected content from alteration; blockchain technology would be one way to make manipulation detectable.

In the premium article “Harvest Time,” I describe how to gather information using various free and paid APIs. The Wayback Machine can also be used for sensitive research tasks, because, as so often in business, mistakes happen. Small mishaps are simply human, and sometimes companies can ‘accidentally’ publish sensitive internal information, such as error messages on the website that reveal which DBMS or server is in use. As soon as you become aware that potentially misusable information appears in any database, the first step is to contact the database owner and request its removal. Often, an explanation and a friendly word are all it takes.
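Such research can be sketched with the same public CDX API: its filter parameter accepts a regular expression per field, so archived captures that answered with a server error can be listed. A hedged sketch of the query construction, assuming only the documented CDX parameters; whether useful captures exist depends entirely on the site:

```python
from urllib.parse import urlencode

def error_page_query(domain):
    """Build a CDX query for archived captures that returned an HTTP 5xx
    status; such pages sometimes expose stack traces or DBMS details."""
    params = {
        "url": domain + "/*",          # all paths under the domain
        "output": "json",
        "filter": "statuscode:5..",    # regex: any 5xx response
        "fl": "timestamp,original,statuscode",
        "collapse": "urlkey",          # one row per distinct URL
    }
    return "https://web.archive.org/cdx/search/cdx?" + urlencode(params)

print(error_page_query("example.com"))
```

Needless to say, such findings fall under responsible disclosure: report them to the site operator rather than exploiting them.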

Of course, archive.org isn’t solely focused on websites. Its goal is to create a comprehensive library, which naturally includes digitizing copyright-free books, similar to Project Gutenberg. Films, audio recordings, and software can also be found in the archive. Interestingly, archive.org is also reachable on the Tor network under its own onion address.

Of course, archive.org isn’t the only organization trying to preserve the internet. The website archive.today also has this goal. However, archive.today’s database isn’t as comprehensive. On the other hand, you can quickly submit your own URL via an input field, and your website will be added to their archive.
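For comparison, archive.org offers a similar on-demand capture through its “Save Page Now” feature, which takes the target URL as part of the path. A minimal sketch of building that request URL (requesting it asks archive.org to take a fresh snapshot; availability and rate limits apply):

```python
from urllib.parse import quote

def save_page_now(url):
    """Build the Wayback Machine's on-demand capture URL ('Save Page Now')."""
    # Keep ':' and '/' so the embedded URL stays readable in the path.
    return "https://web.archive.org/save/" + quote(url, safe=":/")

print(save_page_now("https://example.com/some/page"))
```

This is useful for preserving evidence: a snapshot taken by a neutral third party is harder to dispute than a local screenshot.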

As we can see, there are certainly some gems on the internet. You don’t have to be a journalist to delve deeply into research techniques. The field of reconnaissance in cybersecurity also requires a certain amount of intuition. There’s a reason they say: knowledge is power.


This entry was posted in Articles by Elmar Dott.
About Elmar Dott

Elmar Dott has been implementing large web applications as a freelance consultant in international projects for over 20 years. His focus is on DevOps, configuration management, software architectures & release management. As a trainer, he shares his knowledge in training courses and also speaks regularly on current topics at conferences.
