Fun with Elasticsearch!

The basics of installing and querying Elasticsearch

Welcome to Elasticsearch!

Elasticsearch is a full-text search engine. It is scalable, exposes a RESTful interface, and is built on top of the Lucene search library.

Elasticsearch is similar to other search technology like Solr, in that they both run atop Lucene. However, Elasticsearch is optimized for scaling up and out, with built-in management that allows one Elasticsearch index to be replicated across many machines. Compared to Lucene, Elasticsearch also has a RESTful API, which means any program that can communicate over HTTP and parse JSON can easily interact with it.

Elasticsearch stores data in indices. You can have multiple indices on a single installation of Elasticsearch, and it's up to you, the user, to decide how to organize your data into different indices.
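To make that concrete, here's the kind of request we'll be building up to later in the class. The index name, type, and document below are made up purely for illustration; in this version of Elasticsearch, every document lives in an index, has a type, and has an ID, all of which appear in the URL:

curl -XPUT http://localhost:9200/my_library/book/1 -d '{
  "title": "Pride and Prejudice",
  "author": "Jane Austen"
}'

Don't worry about running this yet; we'll install Elasticsearch and load some real data first.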

In this class, we're going to go over the basics of installing Elasticsearch, loading it with data, and querying and filtering that data.

Requirements

Elasticsearch requires Java 1.7 or higher.

For Mac and Windows

Visit this link: http://www.oracle.com/technetwork/java/javase/downloads/jdk8-downloads-2133151.html

and select the distribution that matches your operating system. Accept the license agreement, download and install Java.

Windows Users Only

To interact with Elasticsearch, we'll be using a command line program called cURL. This is widely available on Unix-based operating systems, but not on Windows. The easiest way to install cURL is to install Git (also useful for other things!). You can install Git here: http://git-scm.com/download/win

When going through the install process, make sure to select the option that says "Install Git and optional Unix tools". Once the installation is complete, you should be able to open Windows PowerShell and the cURL command will be available.

For Ubuntu

For Ubuntu, you can paste the following into the command line:

sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java7-installer

and to make sure it installed correctly, you can run:

java -version
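If Java installed correctly, the output should look roughly like the following (your exact version and build numbers will differ):

java version "1.7.0_80"
Java(TM) SE Runtime Environment (build 1.7.0_80-b15)
Java HotSpot(TM) 64-Bit Server VM (build 24.80-b11, mixed mode)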

Getting Elasticsearch

You can download Elasticsearch here: http://www.elasticsearch.org/overview/elkdownloads/

Once you've downloaded it, extract the zip file. In your shell (terminal on Mac, Powershell on Windows), cd into the extracted directory and run

./bin/plugin -i elasticsearch/marvel/latest

This will install the Marvel plugin, which includes a more user-friendly console (called Sense) that we can use to interact with Elasticsearch.

Running Elasticsearch!

In your shell in the same directory as above, run

(Mac)

./bin/elasticsearch

(Windows)

./bin/elasticsearch.bat

You should see some output from Elasticsearch indicating that it's running. But let's query Elasticsearch to make sure! We'll do this two ways: from the command line and in the browser.

Open a new tab in your terminal, and run

curl -XGET http://localhost:9200

curl is a command line tool that transfers data to and from a server over a variety of protocols; in this case it's using HTTP, just like most internet traffic. The -XGET argument tells curl we want to perform a GET request on the server, and http://localhost:9200 is the default address of Elasticsearch.

Most server software that you run on your machine will show up at http://localhost/, and most software has a default port number where it listens for requests. For Elasticsearch, the default port is 9200, but you can configure it to run on any port you like.
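For example (you don't need to do this for the class), the HTTP port can be changed by editing config/elasticsearch.yml inside the extracted directory and restarting Elasticsearch:

# config/elasticsearch.yml
http.port: 9201

With that setting, you would query http://localhost:9201 instead. We'll stick with the default port 9200 for the rest of the class.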

The response should look like this (with a different name, though):

{
  "status" : 200,
  "name" : "Jack Frost",
  "version" : {
    "number" : "1.3.4",
    "build_hash" : "a70f3ccb52200f8f2c87e9c370c6597448eb3e45",
    "build_timestamp" : "2014-09-30T09:07:17Z",
    "build_snapshot" : false,
    "lucene_version" : "4.9"
  },
  "tagline" : "You Know, for Search"
}

Hooray! Elasticsearch works! You can also visit http://localhost:9200 in your web browser, and you should see the same response.

Let's Put Some Data in There

So that we can get started right away practicing querying, I've created a bulk file of chapters from Jane Austen's Pride and Prejudice. You can download the file here: https://raw.githubusercontent.com/kaitlin/elasticsearch/gh-pages/pride.json

Save the file somewhere you can cd into. Then we're going to do a bulk insert from the command line. After changing into the directory where you've downloaded the file, run

curl -s -XPOST localhost:9200/pride/_bulk --data-binary @pride.json; echo

This command looks similar to the one above, but we've changed a few things. It follows this loose format:

curl -s -XPOST HOSTNAME:PORT/INDEX_NAME/OPERATION_TYPE

Since we're sending data to Elasticsearch, we're using -XPOST instead of -XGET. We've also added to the URL we're posting to: we've decided to name the Elasticsearch index that will house this data pride, so that goes in the URL path, and since we're doing a bulk insert, we've added _bulk to the end of the URL to let Elasticsearch know it should expect bulk data.

The --data-binary argument tells curl to send the data exactly as-is, preserving the newlines that the bulk format requires. @pride.json dumps the JSON data out of the file and into the upload stream. The semicolon (;) ends the curl command, and the echo that follows prints a newline after curl's output so the response from Elasticsearch is easier to read.
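If you open pride.json you'll see why preserving newlines matters: the bulk format alternates an action line with a document line. The snippet below is only an illustration of that structure (the text is abbreviated), not the literal contents of the file:

{"index": {"_type": "chapter", "_id": "1"}}
{"chapter": 1, "text": "It is a truth universally acknowledged, ..."}
{"index": {"_type": "chapter", "_id": "2"}}
{"chapter": 2, "text": "..."}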

You should see a response that looks something like this:

{"took":2008,"errors":false,"items":[{"index":{"_index":"pride","_type":"chapter","_id":"1","_version":1,"status":201}},{"index":{"_index":"pride","_type":"chapter","_id":"2","_version":1,"status":201}},{"index":{"_index":"pride","_type":"chapter","_id":"3","_version":1,"status":201}},....

The response is a JSON object that tells you no errors occurred, and it enumerates the items that were inserted into Elasticsearch under the items key. Now we have some data to query.
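As an optional sanity check, you can ask Elasticsearch how many documents the index now holds using the _count API:

curl -XGET http://localhost:9200/pride/_count; echo

Assuming the bulk insert succeeded, the response should report a count of 61, one document per chapter.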

Getting Data Out of Elasticsearch

There are two main ways to fetch data from Elasticsearch: queries and filters. In general, queries should be used when you are searching text or need to order results by relevance, and filters should be used when you want to filter on an exact value, or for binary (true/false) searches. You can also combine the two if you need to search within a filtered result set, or filter a set of search results.

Querying Elasticsearch

Let's use the Sense console to make our queries. You can access Sense in your browser at http://localhost:9200/_plugin/marvel/sense/index.html

First, you'll see an overlay with some helpful tips on using Sense. After dismissing that, you'll see the left pane, which has a default query in it. This query:

GET /pride/_search
{
  "query": {
    "match_all": {}
  }
}

will match everything. To restrict the search to just our pride index, we've added the line GET /pride/_search above our query object. To run the query, click the green triangle.

You should see lots of results in the right pane. A total of 61 in fact (the number of chapters in Pride & Prejudice).
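Trimmed down, a search response has a standard shape, roughly like this (your took value will differ, the matching documents appear inside hits.hits, and only the first 10 are returned by default):

{
  "took": 4,
  "timed_out": false,
  "_shards": { "total": 5, "successful": 5, "failed": 0 },
  "hits": {
    "total": 61,
    "max_score": 1.0,
    "hits": [ ... ]
  }
}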

All Elasticsearch queries start out with this basic structure:

GET /pride/_search
{
    "query": {

    }
}

This is important because there are soooo many different types of queries you can perform in Elasticsearch. The only way to get familiar with them all is to read the documentation on the Query DSL (DSL == Domain Specific Language) located here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl.html

But for some reason, the examples in the documentation leave out this basic structure; they all read as if you're filling in the structure above with more detail. So if you're copying an example from the docs and you get a SearchParseException, try wrapping your query in the structure above.
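For example, the docs might show a bare query like the first snippet below; to actually run it in Sense, wrap it as in the second (the field and value here are just for illustration):

{ "term": { "text": "elizabeth" } }

GET /pride/_search
{
    "query": {
        "term": { "text": "elizabeth" }
    }
}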

Doesn't Pride and Prejudice Have a Famous Line?

Yes! And it has the words "truth universally acknowledged". Let's search for that and see what chapter it appears in.

GET /pride/_search
{
    "query": {
        "match": {
           "_all": "truth universally acknowledged"
        }
    }
}

This looks similar to our first query, except that we've added text to match. You should have 30 hits. Hmm, that didn't quite give us what we were looking for. Cruising through the text of the chapters, we can see that the query returned chapters containing any one of these three words. Let's try adding the and operator to the match query.

GET /pride/_search
{
    "query": {
        "match": {
           "_all": {
             "query": "truth universally",
             "operator": "and"
           }
        }
    }
}

In this query, we've added more attributes, so the value of the _all key is now an object rather than plain text. This object has query and operator keys. The query is our actual text query, and the operator "and" tells Elasticsearch that all of these words must appear.

This time, when we run the query we get two results. So, there are two chapters that have both "truth" and "universally" in them. But the words aren't necessarily next to each other. What we want now is a phrase query.

GET /pride/_search
{
    "query": {
        "match": {
           "_all": {
             "query": "truth universally",
             "type": "phrase"
           }
        }
    }
}

This query looks similar to the one above, but instead of using an operator, we're specifying a query type from a set of predefined types. In this case, it's phrase, indicating that not only do we want all these words to appear in the result, but we also want them to be next to each other. This shows us one result, Chapter One. In fact, it's the first line of the book!
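As an aside, Elasticsearch also has a match_phrase query, which is shorthand for the same thing; the following should return the same single hit:

GET /pride/_search
{
    "query": {
        "match_phrase": {
           "_all": "truth universally"
        }
    }
}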

OK, now let's say that I can't remember when Mr. Collins enters the story. Well, that's easy: I'll search for the phrase "Mr. Collins", just like above.

GET /pride/_search
{
    "query": {
        "match": {
           "text": {
             "query": "Mr. Collins",
             "type": "phrase"
           }
        }
    }
}

Notice that this time we're querying the text field directly instead of _all. The results look OK, except the chapters aren't ordered by their chapter number. I want to sort them in ascending order by the chapter attribute of each object. The sort key goes alongside your top-level query object, like this:

GET /pride/_search
{
    "query": {
        "match": {
           "text": {
             "query": "Mr. Collins",
             "type": "phrase"
           }
        }
    },
    "sort" : [
      {"chapter" : {"order" : "asc"}}
    ]
}

The sort key has an array of objects for its value. Each sort object has a key with the name of the attribute we're sorting by, and that key contains an object that describes the order of the sort. In this case, the order is "asc" for ascending.
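Sense is just sending HTTP requests, so you can run the same search with curl if you prefer; here's a sketch of the sorted query above from the command line (the body is identical to what we typed in Sense):

curl -XGET http://localhost:9200/pride/_search -d '{
    "query": {
        "match": {
           "text": { "query": "Mr. Collins", "type": "phrase" }
        }
    },
    "sort": [ {"chapter": {"order": "asc"}} ]
}'; echo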

There are dozens of ways to query Elasticsearch. Try reading through all the different query types at http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-queries.html and trying some out for yourself!

Filtering in Elasticsearch

Let's say we want to retrieve some specific chapters from the book. Since we're selecting on exact values, a filter is the most efficient choice. Here's an example that uses a range filter to retrieve chapters 10-20.

GET /pride/_search
{
    "query": {
      "constant_score" : {
        "filter" : {
              "range" : {
                  "chapter" : {
                      "gte": 10,
                      "lte": 20
                  }
              }
          }
      }
    }
}

This keeps only the chapters whose chapter number is greater than or equal to 10 and less than or equal to 20. What if I want to combine this with a search? We can add a term filter for that. But once we add more than one filter, the structure gets a bit more complicated.

GET /pride/_search
{
  "query": {
    "filtered" : {
        "filter" : {
            "and" : [
                {
                    "range" : {
                        "chapter" : {
                            "gte" : 10,
                            "lte" : 20
                        }
                    }
                },
                {
                    "term" : { "text" : "darcy" }
                }
            ]
        }
    }
  }
}

Notice how we've put each filter to be applied into an array with the key "and". This means we want all the filters to apply to the result set. The "and" array is nested inside a "filter" key, which is in turn nested inside a "filtered" key, and of course the whole thing is wrapped in a "query" object. One more detail: the term filter uses the lowercase "darcy". Term filters aren't analyzed, so the value has to match the token exactly as it was stored in the index, and the standard analyzer lowercases text when indexing it.
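As a sketch of an alternative, the same combination can also be written with a bool filter, whose must clause behaves like "and":

GET /pride/_search
{
  "query": {
    "filtered": {
      "filter": {
        "bool": {
          "must": [
            { "range": { "chapter": { "gte": 10, "lte": 20 } } },
            { "term": { "text": "darcy" } }
          ]
        }
      }
    }
  }
}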

Whew! As you can see the queries can get large and complicated, but the query DSL is powerful and specific enough to be used for pretty much any logical combination you can imagine. Continue reading about the different filters available here: http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-filters.html

That's It!

This covers the basics of installing and using Elasticsearch. Here are some more resources to keep you going:

Thanks for watching!