Have you ever needed to build a recommendation engine (something akin to Amazon’s) for a website? Or rank search results? Search at scale is hard. Scale can mean performance (how long do you really want to wait for a result to come back from a website?), size (the Amazon search database is huge, I’m sure) or velocity (have you ever seen the rate at which a domain controller produces data?). Now, I’m not saying “Amazon uses Solr” here. I’m saying you can do it too, but you have to have the right tool.
Search is hard
I’m going to throw it out there. Search is hard. At least, it’s hard when you need to do it at scale. Many companies exist to ensure you don’t need to know how hard it is. Just think what happens when you type a search keyword into the Google search box. It’s just plain hard.
You can simplify the problem by understanding the data you want to search, how you want the results to be presented and how you intend to get there; and then utilizing the appropriate tool for the job.
Choose by Understanding the data
Are you searching structured data? Does it look like an Excel table with headers at the top? Do you know what sort of data is in each column? If so, you’ve got a great case for a relational database queried with SQL. If you need enterprise features, I’d suggest SQL Server or Oracle. If not, then try MySQL or PostgreSQL.
Are you searching small messages, like events, in a time-series manner? Do your searches start with “tell me what happened between these two points in time”? Then you want something like ELK (Elasticsearch, Logstash and Kibana) or Splunk.
Smaller blobs of data? Well, if those blobs are key-value pairs, then try a NoSQL solution like MongoDB. If they are JSON, try Firebase.
How about bigger bits of data, like full documents? Then you want to go to a specific Document-based search system, like Apache Solr.
Sure, fans of each of these products will tell you that you can store any kind of data in their database. But each database is tuned for a specific use: SQL databases for structured data, Splunk for time-series data, Apache Solr for document data.
Installing Apache Solr
My favorite method of trying out these technologies right now is to use Docker. I can spin up a new image quickly, try it out and then shut it down. That’s not to say that I would want to run a container in production. I’m using Docker as a solution to stamp out an example environment quickly.
To install and start running Apache Solr quickly, use the following:
docker run -d -p 8983:8983 -t makuk66/docker-solr
Apparently, there are companies out there that run Apache Solr clusters of hundreds of machines. If that is the case, I’m not worried at this point about scaling (although I do wonder what they could be searching!).
Before I can use Apache Solr, I need to create a collection. First of all, I need to know what the name of the container is. I use docker ps to find that out:
In my case, the name is jolly_lumier. Don’t like the name? Well, let’s stop that instance and re-run it with a new name:
docker stop f4dc01217dd3
docker rm f4dc01217dd3
docker run -d -p 8983:8983 --name solr1 -t makuk66/docker-solr
Now I can reference the container by the name I gave it. To create a collection:
Note that I am referencing the container by its name (solr1) rather than by its ID.
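If you would rather create the collection over HTTP than from inside the container, here is a minimal sketch that builds the request URL for Solr’s Collections API. It assumes Solr is running in SolrCloud mode (a standalone instance uses cores instead), and the collection name and shard count are just example values, not the ones from my setup.

```python
from urllib.parse import urlencode


def create_collection_url(base_url, name, num_shards=1, replication_factor=1):
    """Build a Collections API CREATE URL (assumes SolrCloud mode)."""
    params = {
        "action": "CREATE",
        "name": name,
        "numShards": num_shards,
        "replicationFactor": replication_factor,
        "wt": "json",
    }
    # urlencode escapes the values, so the URL is safe to paste into a shell
    # as long as it is quoted.
    return f"{base_url}/admin/collections?{urlencode(params)}"


# Example values only: "collection1" with two shards.
url = create_collection_url("http://localhost:8983/solr", "collection1", num_shards=2)
print(url)
```

Opening that URL (with curl or a browser) is what actually asks Solr to create the collection; the sketch only builds it.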
But what are collections and shards?
Great question. Terminology tends to bite me a lot. Every technology has its own terminology and usually it isn’t defined well. There is an assumption that you already know what they are talking about. Fortunately, Apache Solr has a Getting Started Document to define some things.
A collection is a set of documents that have been indexed together. It’s the raw data plus the index together. Collections implement a scaling technique called sharding in which the collection is split into multiple shards in order to scale up the number of documents in a collection beyond what could physically fit on a single server. Those shards can exist on one or more servers.
If you are familiar with map-reduce, then this will sound familiar. Incoming search queries are distributed to every shard in the collection. The shards all respond and then the results are merged. Check out the Apache Solr Wiki for more information on this. For right now, it’s enough to know that a collection is a set of data you want to search and shards are where that data is actually stored.
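To make the idea concrete, here is a toy sketch of that scatter-and-merge pattern in plain Python. It is not how Solr implements sharding or scoring. The hash-based routing, the word-count score and the sample spell texts are all assumptions made up for illustration.

```python
import hashlib


def shard_for(doc_id, num_shards):
    """Route a document to a shard by hashing its ID (stable across runs)."""
    digest = hashlib.md5(doc_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards


def build_shards(docs, num_shards):
    """Split a {doc_id: text} mapping into num_shards smaller mappings."""
    shards = [{} for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shards[shard_for(doc_id, num_shards)][doc_id] = text
    return shards


def search_shard(shard, term):
    """Toy scoring: how many times the term appears as a word in the text."""
    term = term.lower()
    hits = []
    for doc_id, text in shard.items():
        score = text.lower().split().count(term)
        if score:
            hits.append((score, doc_id))
    return hits


def search(shards, term, rows=10):
    """Send the query to every shard, then merge and rank the results."""
    all_hits = []
    for shard in shards:
        all_hits.extend(search_shard(shard, term))
    # Highest score first; ties are broken by document ID.
    ranked = sorted(all_hits, key=lambda hit: (-hit[0], hit[1]))
    return [doc_id for _score, doc_id in ranked[:rows]]


docs = {
    "spell-1": "Fireball deals fire damage in a fire burst",
    "spell-2": "Wall of fire blocks a corridor",
    "spell-3": "Magic missile never misses",
}
shards = build_shards(docs, 2)
print(search(shards, "fire"))
```

The merged ranking is the same no matter which shard each document landed on, which is the point of the exercise.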
The Admin Interface
Apache Solr has a web-based admin interface. In my case, I’ve forwarded the local port 8983 to the port 8983 on the container. I can access the web-based admin interface at http://localhost:8983/solr. You will note that I could have created a collection (called a Core in the non-cloud version of Apache Solr) from the web interface.
Sending Documents to Solr
Solr provides a Linux script called bin/post to post data. It’s a wrapper around a Java program, which means you need Java on your system in order to use it. Want to index the entire Project Gutenberg? You can do it, but there are some extra steps, most notably installing Solr and Java on your client machine.
For my first test, I wanted to index some data I already had. I have a Dungeons and Dragons Spell List with all the statistics of the individual spells. This is in a single CSV file. To do this, I can do the following:
curl 'http://localhost:8983/solr/collection1/update?commit=true' --data-binary @Data/spells.csv -H 'Content-type:application/csv'
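Since I want to do this from code eventually, here is a minimal Python sketch of the same upload using only the standard library. It builds the request without sending it, so it runs without a Solr instance. The CSV columns in the sample are hypothetical and are not the real headers of my spell list.

```python
import urllib.request


def build_csv_update_request(base_url, collection, csv_bytes):
    """Build (but do not send) a POST that uploads CSV to the update handler."""
    url = f"{base_url}/{collection}/update?commit=true"
    return urllib.request.Request(
        url,
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


# Hypothetical sample data; in practice, read the bytes from Data/spells.csv.
sample_csv = b"id,name,level\n1,Fireball,3\n"
request = build_csv_update_request("http://localhost:8983/solr", "collection1", sample_csv)
print(request.get_method(), request.full_url)

# To actually send it (this needs a running Solr):
# with urllib.request.urlopen(request) as response:
#     print(response.read().decode("utf-8"))
```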
You will get something like this back:
<?xml version="1.0" encoding="UTF-8"?>
<lst name="responseHeader"><int name="status">0</int><int name="QTime">902</int></lst>
According to the manual, that means success (since the status is 0). Non-zero status means a failure of some description.
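If a script is doing the upload, it should check that status itself. A small sketch, assuming the response body looks like the XML above (the helper name is my own, and the wrapped 400 response in the usage is made up for the example):

```python
import xml.etree.ElementTree as ET


def solr_status(xml_bytes):
    """Return the integer status from a Solr XML response header."""
    root = ET.fromstring(xml_bytes)
    # iter() also visits the root, so this works whether or not the
    # responseHeader element is wrapped in an outer element.
    for node in root.iter("int"):
        if node.get("name") == "status":
            return int(node.text)
    raise ValueError("no status element found in response")


sample = (
    b'<?xml version="1.0" encoding="UTF-8"?>\n'
    b'<lst name="responseHeader"><int name="status">0</int>'
    b'<int name="QTime">902</int></lst>'
)
status = solr_status(sample)
print("success" if status == 0 else f"failed with status {status}")
```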
Now, let’s take a look at the data. We can do that by using the web-UI at http://localhost:8983/solr/collection1/browse – it’s fairly basic and can search for all sorts of things. I can not only search for things like “Fire” (and my favorite Fireball spell), but also for things like “Druid=Yes” to find all the Druid spells.
My keen interest is in using this programmatically, however. I don’t want my users to even be aware of the search capabilities of the Solr system. I’d prefer them not to know what I was running. After all, do you think “that’s a nice Solr implementation” when browsing your favorite web shop?
If I want to look for the Fireball spell, I do the following:
The syntax and options for a query are extensive. You can read all about them on the Solr wiki. The response is an XML document. If I want it in another format, I use the wt parameter:
It’s good practice to put quotes around your URL so that the shell doesn’t interpret special characters in it (like the ampersand, which puts a process in the background).
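Here is a rough sketch of building such a query URL in Python rather than by hand. It assumes the standard select handler on the collection1 collection used earlier. The Druid field name comes from a column in my CSV, and the JSON body at the end is a hand-written sample in the shape Solr returns with wt=json, not real output.

```python
import json
from urllib.parse import urlencode


def select_url(base_url, collection, query, response_format="json", rows=10):
    """Build a URL for the select handler; urlencode escapes special characters."""
    params = {"q": query, "wt": response_format, "rows": rows}
    return f"{base_url}/{collection}/select?{urlencode(params)}"


fireball_url = select_url("http://localhost:8983/solr", "collection1", "Fireball")
druid_url = select_url("http://localhost:8983/solr", "collection1", "Druid:Yes", "xml", 5)
print(fireball_url)
print(druid_url)

# Hand-written sample of a JSON response, for illustration only.
sample_response = (
    '{"responseHeader": {"status": 0},'
    ' "response": {"numFound": 1, "start": 0, "docs": [{"id": "1"}]}}'
)
found = json.loads(sample_response)["response"]["numFound"]
print(f"{found} matching document(s)")
```

Letting urlencode assemble the query string also sidesteps the quoting problem, because the colon in Druid:Yes is escaped for you.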
What else can Solr do?
Turns out – lots of things. Here are a bunch of my favorite things:
- Faceting – when you search for something and you get a table that says “keywords (12)” – that’s faceting. It groups things together to allow for better drill-down navigation.
- Geo-spatial – location-based search (find something “near here”)
- Query Suggestions – that drop-down from Google that suggests searches? Yep – you can do that too
- Clustering – automatically discover groups of related search hits
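Faceting is the easiest of these to try from code. A minimal sketch: the facet and facet.field parameters are standard Solr query parameters, and in a JSON response the facet counts come back as a flat list that alternates value and count. The School field and the numbers in the sample are hypothetical.

```python
from urllib.parse import urlencode


def facet_params(query, field):
    """Query string asking for facet counts on one field, with no documents."""
    return urlencode({
        "q": query,
        "rows": 0,
        "facet": "true",
        "facet.field": field,
        "wt": "json",
    })


def facet_counts(response, field):
    """Turn the flat [value, count, value, count, ...] list into a dict."""
    flat = response["facet_counts"]["facet_fields"][field]
    return dict(zip(flat[0::2], flat[1::2]))


print(facet_params("*:*", "School"))

# Hand-written sample in the shape of a faceted JSON response.
sample = {"facet_counts": {"facet_fields": {"School": ["Evocation", 12, "Illusion", 7]}}}
counts = facet_counts(sample, "School")
print(counts)
```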
A lot is missing from a base Apache Solr deployment. I’ll put some of the more important gaps down here, but there is a solution: check out LucidWorks. LucidWorks was founded by the guys who wrote Solr, and its Fusion product adds a lot of the enterprise features that you will want.
- Authentication – talking of enterprise features, top of the list is authentication. That’s right – Solr has no authentication – not even an encrypted channel. That means anyone (out of the box) can just submit a document to your Solr instance if they have a route to the port that it’s running on. It relies on the web container (Jetty, Tomcat or JBoss for example) to do the authentication. This isn’t really a big problem as authentication is pretty well documented. Incidentally, the Docker image uses Jetty for the web container.
- Getting Data In – I was going to call this crawling. However, it is more than that. If you have a fairly static set of data, then maybe the APIs and command line tools are good enough. What if you want to index the data in your SharePoint application? How about all the emails flowing through your Exchange server? You will need to write (quite complex) code for this purpose.
- Monitoring – if you are running a large Solr deployment, then you will want to monitor those instances. Solr exposes this stuff via JMX – not exactly the friendliest approach.
- Orchestration – this is only important if you have gone into production with a nice resilient cluster. How do you bring up additional nodes when the load gets high and how do you run multi-node systems? The answer is ZooKeeper, which is not pretty to set up and has several issues of its own.
What’s Solr not good at?
Solr isn’t good at time-series data. OK, it’s not that bad, but it’s still not the best tool for the job. Similarly, if you are doing per-row or per-field updates to records, then perhaps you should be using a relational database instead.
If you are indexing documents, however, then this is the tool to use. It’s easy to set up and get started. It has a REST interface for programmatic access and it likely does all the search and analytics related stuff you want.
Go ahead, explore!