Build your own search engine with YaCy
Search the web on your own terms
Mainstream search engines like Google are pretty good at what they do, but many people choose not to use them because of privacy concerns. Then there are those who are concerned about content falling through the cracks just because the creator hasn’t followed the best practices for search engine optimization (SEO).
YaCy, an open source distributed search engine, works pretty much like its mainstream peers, but doesn’t suffer from any of their ills. YaCy uses a peer-to-peer (P2P) network, so every user running an instance of the search engine joins in the effort to index the internet. The index is distributed and redundant across all YaCy users.
To further bolster its privacy credentials, YaCy ensures that no one can tell who has searched for what words, in essence making all searches functionally anonymous.
YaCy only indexes publicly accessible, non-password-protected pages. You can also use it as a search engine for your website, or use it to index pages on the intranet, which it ensures aren’t accessible to anyone outside your network.
Installation
YaCy is written in Java and runs on Windows, macOS, and Linux. Search engines are complex beasts, but thanks to YaCy’s distributed nature, you don't need a fast machine, nor a lot of space to run a YaCy client.
Installation is fairly simple. Before you begin, ensure you have Java installed on the machine. Windows and macOS users can obtain pre-built binaries from Adoptium, while Linux users can install it from their distribution's official repositories.
For instance, Debian users can use sudo apt install default-jdk, while Fedora users can search for the available versions with sudo dnf search openjdk, before installing the latest version with sudo dnf install <openjdk-package-name>.
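Before moving on, it can help to confirm that a Java runtime is actually on your PATH. The following is a minimal sketch of such a check; the function names have_command and check_java are illustrative and are not part of YaCy or Java.

```shell
#!/bin/sh
# Illustrative helpers (hypothetical names, not shipped with YaCy).

# Succeed if the named command can be found on the PATH.
have_command() {
    command -v "$1" >/dev/null 2>&1
}

# Print a one-line status for the Java runtime.
check_java() {
    if have_command java; then
        echo "java found"
    else
        echo "java missing"
    fi
}
```

Calling check_java after installing the JDK should report that Java was found. Running java -version afterwards shows which release is installed.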
Once you have Java installed, download the YaCy release for your platform and extract it. For instance, on Linux the command sudo tar --extract --file yacy_*z --directory /opt -v extracts the archive under the /opt directory. Now change into the extracted directory and start YaCy:
# cd /opt/yacy
# ./startYACY.sh
YaCy is now running on port 8090 on your computer. Fire up a web browser, and head to http://localhost:8090 to access the YaCy instance. You can now search the internet just as you would using a regular search engine.
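You can also query the instance without a browser. The sketch below assumes that your YaCy build exposes its JSON search endpoint at /yacysearch.json with a query parameter, and that curl is installed. The helper names and the YACY_HOST and YACY_PORT variables are hypothetical conveniences, not YaCy settings.

```shell
#!/bin/sh
# Hypothetical helpers for talking to a local YaCy instance.

# Print the base URL of the instance (defaults: localhost, port 8090).
yacy_base_url() {
    printf 'http://%s:%s\n' "${YACY_HOST:-localhost}" "${YACY_PORT:-8090}"
}

# Send a search query and print the raw JSON response.
# --get makes curl append the URL-encoded query to the URL.
# Assumption: the endpoint is /yacysearch.json and takes "query".
yacy_search() {
    curl --silent --get \
        --data-urlencode "query=$1" \
        "$(yacy_base_url)/yacysearch.json"
}
```

With YaCy running, yacy_search "open source" prints the response for that query. Set YACY_PORT first if you have moved the instance to a different port.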
Crawl the internet
There's much more you can do with the YaCy search engine than just search passively. For instance, since P2P indexing is user-driven, you can ask YaCy to crawl any website.
To access the advanced administrative controls of your search engine, click the Administration button in the top-right corner. This brings up the admin panel, which among other things lets you tweak how your YaCy instance interacts with other YaCy clients in the network.
To initiate a manual web crawl, go to the Load Web Pages, Crawler option under the First Steps menu. Enter the URL in the space provided and hit Start New Crawl. As the crawler gets underway, it shows a range of statistics about the crawl, and you can scroll down to view the structure of the crawled website graphically.
After initiating the crawl, head to Monitoring > Index Browser to view how many pages have been indexed and view other details, such as their name and number of outbound links.
For now you can go with the default options, and explore the others, such as limiting the crawler, once you get comfortable with YaCy. The search engine can run multiple crawls at the same time: you can either start them one after another from the First Steps section, or head to Production > Advanced Crawler to crawl multiple websites at once.
Once the crawl job starts, YaCy indexes the URLs you enter and stores the index on your local machine. To ensure your index is available to YaCy users all over the globe, you’ll have to join YaCy’s P2P network.
For this you must open port 8090 in your router's firewall. Log into your router’s administration page and look for a configuration panel controlling the firewall or port forwarding.
Once you find the preferences for your router's firewall, add port 8090 to the whitelist. If your router is doing port forwarding, then you must forward the incoming traffic to your computer's IP address, using the same port.
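Before changing anything on the router, it is worth confirming that YaCy is actually listening on port 8090 on your machine. The sketch below assumes a Linux system with the ss tool from iproute2. The function name listens_on_port is illustrative, and the check only inspects local listening sockets, so it says nothing about whether the router forwards traffic correctly.

```shell
#!/bin/sh
# Hypothetical helper: read `ss -ltn` style output on stdin and
# succeed if any socket listens on the given port.
# Assumes the fourth column holds the local address as ADDRESS:PORT.
listens_on_port() {
    awk -v port="$1" '
        NR > 1 {                      # skip the header line
            n = split($4, parts, ":") # the port is the last ":" field
            if (parts[n] == port) found = 1
        }
        END { exit !found }
    '
}
```

A typical use is ss -ltn | listens_on_port 8090 && echo "YaCy is listening", run on the machine that hosts YaCy.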
After you’ve joined the YaCy network, you can toggle the Do remote indexing option under the Advanced Crawler. This enables your client to broadcast the URLs it is indexing, and other clients on the network that have opted to accept requests can help you perform the crawl.
Your very own Google
Instead of searching the web, you can use YaCy to search through your own data or to implement a search system for local file shares inside your corporate intranet.
For this you’ll need to run YaCy as an internal indexer. In this mode, only people in your local network can use your personalized instance of YaCy to find shared files, and none of the data is shared with users outside your network.
Head to Administration > First steps > Use Case & Account. Here you can specify basic details such as the language for YaCy’s interface.
You’ll also be able to change the behavior of your YaCy instance from here. The default option is to use your client as part of YaCy’s global P2P network to help crawl and index the web.
To create a search portal for your own website, you need to select the Search portal for your own web pages option. Then scroll down and press the Set Configuration button. Next, you need to crawl your domain to generate the content that will be available through your search tool.
To integrate the search into your website, scroll down the left-side column to the Search Portal Integration section. This takes you to the Portal Configuration page, where you can customize YaCy’s appearance with your corporate branding to blend it into your website. When you are done, hit the Change Search Page button. You can now use any of the generated iframe code snippets to integrate the YaCy-powered customized search into your website.
Similarly, to use YaCy to index the local network, you’ll have to select the third option in the First Steps section. You can then use the Advanced Crawler to crawl your intranet.
Conclusion
There’s so much more you can do with YaCy. The project doesn’t offer enough documentation to cover all the features of the search engine. However, the project is fairly intuitive, and its interface is verbose enough to help you toggle the correct option.
All things considered, YaCy is one of the best options if you want an unbiased, ad-free, privacy-respecting, anonymous web search engine that you can also use to help people search for content on your website or privately inside your intranet.
With almost two decades of writing and reporting on Linux, Mayank Sharma would like everyone to think he’s TechRadar Pro’s expert on the topic. Of course, he’s just as interested in other computing topics, particularly cybersecurity, cloud, containers, and coding.