The story of the fight to archive the internet

(Image credit: Shutterstock / 300 librarians)

On the same day in 1996, Brewster Kahle founded two separate but closely connected organizations. The first went on to make him very wealthy, and the second has earned him not a single dime.

Alexa Internet (often confused with Alexa, the voice assistant) was a service that crawled the web for metadata and other information, which was then served up via the browser to help people make sense of the content on a website.

A few years later, the company was acquired by Amazon in a deal worth $250 million, and converted into an SEO service. However, despite the change of ownership, Alexa Internet continued to supply the data it collected to the second organization Kahle had founded: a non-profit called the Internet Archive.

It was Kahle’s vision that the Internet Archive would become a modern version of the Library of Alexandria, and provide “universal access to all knowledge," he told TechRadar Pro.

This digital library, over which he still presides, is now home to many billions of archived web pages (accessible for free via a service called the Wayback Machine) and millions of digitized books.

Earlier this year, the Archive celebrated a landmark 25th anniversary, but Kahle is still unsatisfied with its scope. The project is also facing threats unlike any it has encountered before.

Brewster Kahle, founder of the Internet Archive (Image credit: Brewster Kahle)

An early taste

Kahle’s preoccupation with both the internet and the exchange of information can be traced back to the Massachusetts Institute of Technology (MIT), where he studied for a degree in computer science in the 1980s.

At MIT, Kahle and his cohort had access to the Advanced Research Projects Agency Network (more commonly known as ARPANET), a precursor to the internet as it exists today and the source of the first ever email.

ARPANET allowed computers to communicate with one another over telephone lines using a technique called packet switching, whereby data is broken down into small chunks, fired across a network and reassembled at its destination. ARPANET quickly became a hotbed for innovation in the fields of computing and networking.

“We were using the ARPANET intranet for pretty much everything,” said Kahle. “And already we were witnessing some of the problems that would end up playing out over the next 40 years.”

He described an experiment whereby a mailing list was created that included all ARPANET users. The idea was to see what would happen if different virtual communities (represented at the time by a series of smaller mailing lists and Usenet groups) were thrown into one space.

“It was chaos, anarchy and misinformation - it was terrible!” explained Kahle, with a wry smile. “We could basically see civil discourse dissolving in front of our eyes.”

“However, we also saw the power of connecting people across institutions and across the world, with minimal friction and delay.”

From this time onwards, Kahle says, constructing a grand digital repository for knowledge became his primary focus. But he lacked almost all of the tools that would make this possible.

After leaving MIT, he channeled his ambitions into a company called Thinking Machines, which aimed to commercialize research into parallel computing architectures. Here, Kahle was lead engineer on a supercomputer called the Connection Machine (the fastest in the world at the time), which he later used to devise a form of search engine.

Brewster Kahle (second from the right) and his team, next to a prototype of Connected Machine-1. (Image credit: Tamiko Thiel)

The next step was to build a network publishing system that could be used to disseminate digital information. To fill this gap Kahle developed WAIS (short for Wide Area Information Server), an open system that was adopted by companies like the New York Times and Britannica, which wanted to control the distribution of their content in the coming digital age. All of this took place before the internet even existed, it must be remembered.

“I think we were seen as visionaries, but the goal was always to build the digital Library of Alexandria,” Kahle told us. “And this was not a new concept; there was already As We May Think, a key paper by Vannevar Bush from 1945, and Ted Nelson was already doing hypertext and Project Xanadu.”

“In the 1980s, [the library] was something that I thought was already promised, just not yet delivered. So I set out to build it.”

The Library of Alexandria 2.0

Since its conception, the Internet Archive has amassed an impressive 70 petabyte (70,000 terabyte) library of content, comprising 635 billion webpages, but also 34 million books, 14 million audio recordings and more.

This treasure trove of content is stored in high-capacity hard drives at the Internet Archive headquarters, but is also backed up partially in the Netherlands and (as a symbolic gesture) in Alexandria, Egypt.

The non-profit has so far preserved the writings of more than 100 million people, and Kahle has ambitions to increase this figure by a factor of ten. But with more content now published online than the Archive can hope to keep up with, the central question becomes: what is worthy of preservation?

“The Internet Archive crawls the World Wide Web in the same way search engines do,” Kahle explained. “To figure out what to crawl, we work with hundreds of libraries and librarians, who determine what is important to scrape and at what frequency. These people build collections on the subjects they are expert in.”

Approximately 3,000 crawls are performed simultaneously every day, each with different mandates. Some specialize in news, social media or a particular region, for example, and others are steered by the recommendations of the public, who submit web pages they believe are worth archiving.

These crawls capture a main web page, but also a number of offshoots that users can navigate between via the Wayback Machine, creating something that feels much more alive than a static screenshot.

“It’s a massive undertaking by thousands, if not hundreds of thousands, of people to decide what should be saved,” said Kahle. “We’re interested in any signal that can show us what’s worth preserving.”

As well as archiving web pages for posterity, the organization also sees its role as a tool for safeguarding digital evidence. It has been used by journalists, for example, to access material an individual or company has later removed from the public web. It is also fertile ground for students and academics studying the evolution of online culture and digital communication.

However, keeping the Wayback Machine updated with current data is just one way in which the organization seeks to achieve its ultimate goal; the digitization of books is another important facet.

The business of books

Asked whether the mission or purpose of the Internet Archive has changed over its quarter-century history, Kahle returned a resounding “no”. But while the core mission has remained the same, the way in which people use the resource has certainly evolved.

During the pandemic, for example, students were locked out of their libraries and school rooms, and forced to rely on e-learning services and the valiant efforts of parents. Kahle says the Archive saw the use of its digital book lending service skyrocket, and received a flood of messages from libraries that wanted to lend their collections in digital form.

Spurred into action, the Internet Archive launched the National Emergency Library. Usually, the organization lends one digital book for every physical copy it owns (a practice known as controlled digital lending), which means a digital copy can only be loaned out to one person at a time. But under this emergency scheme, the waitlist-based system was discarded for a period of fourteen weeks.

Many students, teachers and other readers celebrated the initiative, but the Emergency Library was met with disgust by copyright organizations that saw it as a flagrant breach of the rights of authors, who were also struggling due to the pandemic. A collective of publishers (including Penguin Random House, Harper Collins, Hachette and Wiley) is also taking the Internet Archive to court over “wilful mass copyright infringement”.

"The Internet Archive does not seek to 'free knowledge'; it seeks to destroy the carefully calibrated ecosystem that makes books possible in the first place — and to undermine the copyright law that stands in its way,” assert the publishers.

As you might imagine, Kahle disagrees. “We’ve been lending books for ten years. These publishers contend that we are not allowed to lend - and it’s outrageous,” he said, with uncharacteristic forcefulness.

“What libraries do is buy, preserve and lend materials. But these lawsuits represent a massive threat to the core function of libraries in the digital world; publishers are saying you cannot buy, cannot preserve and cannot lend.”

At the time of writing, the lawsuit is in discovery, with further statements to be delivered in the spring.

An opportunity lost

Over the years, the Internet Archive has been sustained by a combination of funds from Kahle’s own pocket, fees charged to libraries for digitization services, and contributions from members of the public.

However, keeping its services operational will become more and more expensive as the library expands, unless technical advances cut the cost of data storage, server hosting and the other technologies on which the non-profit relies.

Although Kahle says his personal wealth is sufficient to guarantee the longevity of the Internet Archive (or at least its trove of data), he recently put out a call for donations to help fight the ongoing lawsuit, but also other obstacles to the free flow of information.

“The internet community has not done enough to build reliable and responsible organizations to support the digital world. And we could see the dangers from the very beginning,” said Kahle, referring both to the crisis of misinformation and the stranglehold of Big Tech.

“If we do not strike a good balance, we could end up with an information environment where everything we read is monitored and vetted by a small group of companies and governments. We will have lost the opportunity the internet has given us.”

To highlight these issues, the Internet Archive recently launched the Wayforward Machine, a satirical take on the Wayback Machine that promises to let users “visit the future of the internet”.

Plugging a URL into the Wayforward Machine generates a page plastered with an endless stream of pop-ups, some of which demand payment or personal information, while others simply note that access to information is denied. The message is hardly subtle.

“We don’t hold the levers of power, but we run a library. Although a library cannot solve all these problems, it’s a necessary component for a digital ecosystem. We need libraries to be supported, used and defended. If we do not defend our open institutions, they will be crushed,” said Kahle.

“We can have platforms and systems that are driven by altruism, not advertising models. We can have a world with many winners, where people participate, learn and find new communities.”

Asked whether he is optimistic about reaching this utopian ideal, Kahle nodded: “But we need to really want it.”

Also check out our list of the best cloud hosting services around

Joel Khalili is the News and Features Editor at TechRadar Pro, covering cybersecurity, data privacy, cloud, AI, blockchain, internet infrastructure, 5G, data storage and computing. He's responsible for curating our news content, as well as commissioning and producing features on the technologies that are transforming the way the world does business.

An early taste

Are you a pro? Subscribe to our newsletter

The Library of Alexandria 2.0

The business of books

An opportunity lost