This is a guest post by Pulkit Kedia, a backend engineer at Womaniya.
Solr is a search engine built on top of Apache Lucene. Apache Lucene uses an inverted index to store documents (data) and gives you search and indexing functionality via a Java API. However, to use features like full-text search you would need to write code in Java.
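To make the idea of an inverted index concrete, here is a minimal sketch in Python. It is only an illustration of the concept, not how Lucene implements it: the function names, the whitespace tokenizer and the sample documents are all assumptions made for this example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the sorted list of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():  # naive whitespace tokenizer
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, query):
    """Return the ids of documents that contain every term in the query."""
    postings = [set(index.get(term, [])) for term in query.lower().split()]
    if not postings:
        return []
    return sorted(set.intersection(*postings))


# Hypothetical sample documents keyed by id.
docs = {
    1: "Solr is built on Lucene",
    2: "Lucene uses an inverted index",
    3: "Solr adds faceting and sorting",
}
index = build_inverted_index(docs)
```

A lookup such as search(index, "solr lucene") only has to intersect two short lists of ids instead of scanning every document, which is why this structure suits text search.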
Solr is a more advanced version of Lucene's search. It offers more functionality and is designed for scalability. Solr comes loaded with features like pagination, sorting, faceting, auto-suggest, spell check, etc. Solr also uses a trie structure for numeric and date data types; for example, alongside the normal int field there is a tint field, which is the trie int field.
Solr is really fast at searching and analyzing text, thanks to its inverted index structure. If your application requires extensive text searching, Solr is a good choice. Several companies like Netflix, Verizon, AT&T, and Qualcomm use Solr as their search engine. Even Amazon CloudSearch, a search engine service by AWS, uses Solr internally.
This article provides a method to deploy Solr in production and deals with creating Solr collections. If you are just starting with Solr, you should start by building a Solr core. A core is a single-node Solr server with no shards or replicas, while a collection consists of multiple shards and their replicas, each of which is a core.
Implementation
In a distributed search, a collection is a logical index across multiple servers. The part of each server that runs a collection is called a core. So in a non-distributed search, a core and a collection are the same because there is only one server.
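The relationship between a collection, its shards, its replicas and the cores on each server can be sketched as a small data model. This is only an illustration: the Core class, the layout_collection function and the round-robin placement are assumptions made for this example, and SolrCloud's real placement logic may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Core:
    collection: str
    shard: int
    replica: int
    node: str  # server that hosts this core


def layout_collection(collection, num_shards, replication_factor, nodes):
    """Create one core per (shard, replica) pair, assigned to nodes round-robin."""
    cores = []
    slot = 0
    for shard in range(1, num_shards + 1):
        for replica in range(1, replication_factor + 1):
            cores.append(Core(collection, shard, replica, nodes[slot % len(nodes)]))
            slot += 1
    return cores


# Hypothetical collection: 2 shards, each with 2 replicas, on 2 servers.
cores = layout_collection("user", 2, 2, ["solr-node-1", "solr-node-2"])
```

With these numbers the collection is made of four cores, and each shard has one replica on each server, so losing one server still leaves a copy of every shard.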
In production, you need a collection rather than a single Solr core, because one core won't be able to hold production data (unless you scale vertically). Apache Zookeeper coordinates the connection across multiple servers.
There are two ways you can set this up:
- Run multiple Solr servers and use Zookeeper on one of them
- Run Zookeeper on a separate server and have all the Solr servers connect to it
We'll go through the implementation using the second approach. The first approach is similar, but the second is more scalable.
Installing Solr
Spin up 3 servers and install Solr on 2 of them (note: you can use any number of Solr servers; we use 2 Solr servers and 1 Zookeeper server in our example). To install Solr, you need to install Java first, then download the desired version and untar it.
Installation: wget http://archive.apache.org/dist/lucene/solr/8.1.0/solr-8.1.0.tgz
Untar: tar -zxvf solr-8.1.0.tgz
You can start Solr by going to the /home/ubuntu/solr-8.1.0 folder and running bin/solr start, or from the bin folder with ./solr start. This starts Solr on port 8983, and you can test it in the browser.
Replicate the exact same steps to install Solr on your 2nd server.
Also remember to set up the list of IPs and names for each server in /etc/hosts.
For example:
<IPv4 public IP of solr-node-1> solr-node-1
<IPv4 public IP of solr-node-2> solr-node-2
<IPv4 public IP of zookeeper-node> zookeeper-node
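If you want to generate these lines rather than type them on every server, a small sketch like the following can help. The hosts_entries function is hypothetical, and the addresses are placeholders from the 203.0.113.0/24 documentation range, not real server IPs.

```python
import ipaddress


def hosts_entries(nodes):
    """Return /etc/hosts lines ("<ip> <name>") for a mapping of name to IP."""
    lines = []
    for name, ip in nodes.items():
        ipaddress.ip_address(ip)  # raises ValueError if the IP is malformed
        lines.append(f"{ip} {name}")
    return "\n".join(lines)


# Placeholder addresses; replace them with the public IPs of your servers.
nodes = {
    "solr-node-1": "203.0.113.11",
    "solr-node-2": "203.0.113.12",
    "zookeeper-node": "203.0.113.20",
}
entries = hosts_entries(nodes)
```

The resulting text can be appended to /etc/hosts on each of the three servers.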
Installing Zookeeper
The 3rd server only needs Zookeeper, to which you will push configsets.
Installation: wget https://archive.apache.org/dist/zookeeper/zookeeper-3.4.9/zookeeper-3.4.9.tar.gz
Untar: tar -zxvf zookeeper-3.4.9.tar.gz
If you like, you can add the path to Zookeeper to your .bashrc file.
Next, in the zookeeper-3.4.9/conf folder, there is a sample configuration file that ships with Zookeeper: zoo_sample.cfg. Copy this file within the same folder and name the copy zoo.cfg. The configuration file contains various parameters like "dataDir", which specifies the directory used to store snapshots of the in-memory database and the transaction logs, and "maxClientCnxns", which limits the maximum number of client connections.
Open the zoo.cfg file, uncomment "autopurge.snapRetainCount=3" and "autopurge.purgeInterval=1", and set "dataDir=data".
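The same edits can be scripted. The sketch below applies them to the text of a config file; configure_zoo_cfg is a hypothetical helper, and SAMPLE is an abridged stand-in for zoo_sample.cfg, so check it against the file that ships with your Zookeeper version.

```python
# Abridged stand-in for the contents of zoo_sample.cfg (an assumption).
SAMPLE = """tickTime=2000
initLimit=10
syncLimit=5
dataDir=/tmp/zookeeper
clientPort=2181
#maxClientCnxns=60
#autopurge.snapRetainCount=3
#autopurge.purgeInterval=1
"""

AUTOPURGE_KEYS = ("autopurge.snapRetainCount=", "autopurge.purgeInterval=")


def configure_zoo_cfg(text, data_dir="data"):
    """Uncomment the autopurge settings and point dataDir at data_dir."""
    out = []
    for line in text.splitlines():
        uncommented = line.lstrip("#").strip()
        if uncommented.startswith(AUTOPURGE_KEYS):
            out.append(uncommented)            # drop the leading '#'
        elif line.startswith("dataDir="):
            out.append(f"dataDir={data_dir}")  # replace the sample path
        else:
            out.append(line)                   # keep everything else as-is
    return "\n".join(out) + "\n"


zoo_cfg = configure_zoo_cfg(SAMPLE)
```

Other commented-out settings, such as maxClientCnxns, are left untouched.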
Next, start Zookeeper:
bin/zkServer.sh start
Creating A Configset
Configsets are basically the blueprint of the data to be stored. They live in server/solr/configsets.
You can create your own configset and use it to store your data. Change the managed-schema file content to customise the config.
- You can modify the <field> tag to denote the data fields to be stored in one document.
- You can use an existing type or create a new one by defining it with the <fieldType> tag.
- The id field is compulsory, so you cannot delete it.
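As a rough illustration of what a <field> entry in managed-schema looks like, the sketch below generates one with Python's standard library. The field_xml helper and the field names are hypothetical; name, type, indexed, stored and multiValued are standard field attributes, and the "string" and "text_general" types are assumed to exist in the configset you copied.

```python
import xml.etree.ElementTree as ET


def field_xml(name, field_type, indexed=True, stored=True, multi_valued=False):
    """Return a <field .../> element as a string for use in managed-schema."""
    attrs = {
        "name": name,
        "type": field_type,
        "indexed": str(indexed).lower(),  # can the field be searched?
        "stored": str(stored).lower(),    # is the value returned in results?
    }
    if multi_valued:
        attrs["multiValued"] = "true"
    return ET.tostring(ET.Element("field", attrs), encoding="unicode")


# Hypothetical fields for a "user" document; the id field must stay in the schema.
id_field = field_xml("id", "string")
name_field = field_xml("name", "text_general")
tags_field = field_xml("tags", "string", multi_valued=True)
```

Each generated element would go inside the <schema> element of managed-schema, next to the existing field definitions.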
There are many other things you can do in Solr like dynamic fields, copy fields etc. Explaining each of them is beyond the scope of this blog but for more information, here is the official documentation.
Once you've created a config and run chmod -R 777 on the config folder, push the config to Zookeeper:
bin/solr zk upconfig -n config_folder_name -d /solr-8.1.0/server/solr/configsets/config_folder_name/ -z zookeeper-node:2181
After pushing the config, start SolrCloud on each Solr server. To install SolrCloud, refer to this documentation.
Connecting to Zookeeper
To start Solr in cloud mode and connect it to Zookeeper:
bin/solr start -cloud -s example/cloud/node1/solr/ -c -p 8983 -h solr-node-1 -z zookeeper-node:2181
Solr stores the inverted index at example/cloud/node1/solr/, so you need to pass that path when connecting. SolrCloud, using the cluster state kept in Zookeeper, automatically distributes shards and replicas over the two Solr servers. When you add data, a hash is generated for the document, and the document is stored in the shard responsible for that hash. You don't have to manage any of this yourself.
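The hash-based placement described above can be sketched as follows. This is a simplified illustration under stated assumptions: it uses CRC32 as a stand-in hash and splits the 32-bit hash space into equal ranges, whereas Solr's default router uses its own hash function (MurmurHash3) and its own range layout. The function names are hypothetical.

```python
import zlib


def shard_ranges(num_shards, hash_space=2**32):
    """Split the hash space into num_shards contiguous (start, end) ranges."""
    step = hash_space // num_shards
    ranges = []
    for i in range(num_shards):
        start = i * step
        # The last shard absorbs any remainder so the whole space is covered.
        end = hash_space - 1 if i == num_shards - 1 else (i + 1) * step - 1
        ranges.append((start, end))
    return ranges


def route(doc_id, num_shards):
    """Return the 1-based shard number whose range contains the id's hash."""
    h = zlib.crc32(doc_id.encode("utf-8"))  # stand-in hash, unsigned 32-bit
    for shard, (start, end) in enumerate(shard_ranges(num_shards), start=1):
        if start <= h <= end:
            return shard
    raise AssertionError("hash fell outside every range")
```

Because the shard depends only on the document id and the number of shards, any server can compute where a document belongs, which is why you can send data to either Solr node.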
To add data to the server you need to POST to the link http://<IP>:8983/solr/<collection_name>/update?commit=true .
The IP can be of any server as the data automatically gets distributed among the shards.
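As a minimal sketch of such a POST, the snippet below builds the request with Python's standard library without sending it. The build_update_request helper, the "user" collection name and the sample document are assumptions for illustration; the JSON body is a list of documents.

```python
import json
from urllib.request import Request


def build_update_request(host, collection, docs, port=8983):
    """Build (but do not send) a POST that adds docs and commits them."""
    url = f"http://{host}:{port}/solr/{collection}/update?commit=true"
    body = json.dumps(docs).encode("utf-8")  # JSON array of documents
    return Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical document for a collection named "user".
request = build_update_request("solr-node-1", "user", [{"id": "1", "name": "Asha"}])
```

To actually send it, pass the request to urllib.request.urlopen once your cluster is running and the collection exists.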
To get data from Solr, query http://<IP>:8983/solr/<collection_name>/select?q=<searchString>
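The sketch below shows one way to build such a query URL, including the pagination, sorting and faceting features mentioned at the start of this article, and to read the matching documents out of a JSON response. The helper names, the "user" collection and the sample response are assumptions for illustration; q, start, rows, sort, facet and facet.field are standard Solr query parameters.

```python
import json
from urllib.parse import urlencode


def build_select_url(host, collection, query, page=0, page_size=10,
                     sort=None, facet_field=None, port=8983):
    """Build a /select URL with optional pagination, sorting and faceting."""
    params = [
        ("q", query),
        ("start", page * page_size),  # offset of the first result to return
        ("rows", page_size),          # number of results per page
    ]
    if sort:
        params.append(("sort", sort))  # e.g. "id asc"
    if facet_field:
        params.append(("facet", "true"))
        params.append(("facet.field", facet_field))
    return f"http://{host}:{port}/solr/{collection}/select?{urlencode(params)}"


def extract_docs(response_text):
    """Return (total hit count, documents on this page) from a JSON response."""
    result = json.loads(response_text)["response"]
    return result["numFound"], result["docs"]


# Illustrative response body, shaped like Solr's JSON output.
sample_response = """
{"responseHeader": {"status": 0, "QTime": 1},
 "response": {"numFound": 1, "start": 0,
              "docs": [{"id": "1", "name": "Asha"}]}}
"""

url = build_select_url("solr-node-1", "user", "name:Asha",
                       page=1, page_size=5, sort="id asc")
total, docs = extract_docs(sample_response)
```

Building the URL with urlencode takes care of escaping characters such as the colon and the space in the query and sort values.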
Note: If you are using one of the Solr servers as the Zookeeper host, all the above steps are the same, but replace the Zookeeper IP with that Solr node's IP and use port 9983 instead of 2181.
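The port in this note follows from the fact that the Zookeeper instance embedded in Solr listens on the Solr port plus 1000. A tiny sketch of that rule, with a hypothetical zk_address helper:

```python
def zk_address(zk_host, embedded_solr_port=None):
    """Return the host:port string to pass to the -z option.

    For a standalone Zookeeper, the default client port 2181 is used.
    For the Zookeeper embedded in a Solr node, the port is Solr's port + 1000.
    """
    if embedded_solr_port is None:
        return f"{zk_host}:2181"
    return f"{zk_host}:{embedded_solr_port + 1000}"


standalone = zk_address("zookeeper-node")
embedded = zk_address("solr-node-1", embedded_solr_port=8983)
```

The first value matches the commands used in this article; the second is what you would pass to -z in the single-server variant.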
Troubleshooting
Here are a couple of common problems that may arise while setting up SolrCloud.
After you have created SolrCloud and are connecting to Zookeeper, you may see an error saying that port 8983 or 7574 is already in use.
Solution: fuser -k 8983/tcp
This finds the process using the port and kills it.
Another error you may see is that SolrCloud cannot find the newly created configset.
Solution: Run chmod 777 on the new configset. The more secure approach is to chown the folder to the solr user.
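For the port-in-use error above, it can help to check whether something is already listening before you start Solr. The sketch below does that with a plain TCP connection attempt; port_in_use is a hypothetical helper, and it only tells you that a listener exists, not which process owns it.

```python
import socket


def port_in_use(port, host="127.0.0.1"):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0
```

For example, port_in_use(8983) returning True before you start Solr suggests an old Solr process is still running on that port.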
Conclusion
Solr has a large community of experienced users and contributors and is more mature than its competitors. Solr faces competition from Elasticsearch, which is open source and is also built on Apache Lucene. Elasticsearch is considered to be better at searching dynamic data such as log data, while Solr handles static data better. In terms of scaling, while Elasticsearch has better built-in scalability features, with Zookeeper and SolrCloud it's easy to scale with Solr too.
Author bio: Pulkit Kedia is a backend engineer with experience in cloud services, system design and creating scalable backend systems. He loves to learn and integrate new backend technologies.