Notes taken during elasticsearch training.
Components
- node: instance
- cluster: one or more nodes
- document: data that you want to search
- index: collection of documents (analogous to a table)
- shard: a piece of an elasticsearch index
Clustering
Give the cluster a sensible name; it defaults to "elasticsearch"
Nodes
each node has a name; set it to something that makes sense, eg node1, node2, otherwise it uses the first 7 characters of its UUID
Documents
Index
Index details
Dynamic index creation
Normally a good idea to turn this off
set action.auto_create_index to false to disable it
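A minimal sketch of disabling it dynamically (it can also go in elasticsearch.yml):

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": false
  }
}
```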
Indexing with or without specifying an ID
leave off the ID to have elasticsearch generate one
- works only with POST, not PUT
- the generated ID comes back in the response
- rule of thumb: if you have an ID use PUT, for an auto-generated ID use POST
- PUT requires the full path including the ID (idempotent)
- POST lets elasticsearch assign the ID (non-idempotent)
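For example (index/type names follow the examples in these notes; field values are made up):

```
PUT my_index/doc/4
{ "comment": "explicit ID" }

POST my_index/doc
{ "comment": "elasticsearch generates the ID" }
```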
CRUD
- index: PUT my_index/doc/4
- create: PUT my_index/doc/4/_create
- read: GET my_index/doc/4
- update: POST my_index/doc/4/_update
  {
    "doc": {
      "comment": "updated comment"
    }
  }
- delete: DELETE my_index/doc/4
Retrieve
Use GET to retrieve documents
Bulk API
POST _bulk
when using curl the body needs an extra newline on the end; the Kibana console adds it for you
- create will only create if item doesn’t exist
- update will only update if item exists
- index will do both (update or create) depending on item status
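A minimal sketch of a bulk request (index/type/id values are hypothetical; each action line and source line must be on its own line, with a trailing newline at the end):

```
POST _bulk
{ "index": { "_index": "my_index", "_type": "doc", "_id": "1" } }
{ "comment": "first doc" }
{ "update": { "_index": "my_index", "_type": "doc", "_id": "1" } }
{ "doc": { "comment": "updated" } }
```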
Configuring
try to move startup options into the yml config file (config/elasticsearch.yml) instead of the command line
Index settings
read only and read write blocks on an index
make an index read only:
PUT my_tweets/_settings
{
  "index.blocks.write": true
}
(set it back to false to allow writes again)
node settings
in the yml config file you can specify different paths for
- data path (fast disk)
- log path (a slower disk is fine)
cluster settings
can turn on logging for the cluster; it's an instant (dynamic) change, but can produce a lot of data
persistent or transient (survives a restart or not)
be careful not to set a persistent node-count setting higher than the real number of nodes
eg if you set it way too high (eg 200) the cluster waits for 200 nodes and never starts; the setting can't be reduced because the cluster can't actually start, and you basically have to throw away your data to recover.
precedence
- transient
- persistent
- cli
- yml
some settings can only be set in the yml file
eg name of cluster, number of shards
examples: cluster and node name
the cluster name can be set on the cli, but don't; use the yml file
the node name defaults to its UUID under the hood, but that's crazy town; set it explicitly
http vs transport
http for rest api, port 9200
transport for internal, port 9300
http.host is localhost by default
transport.host is localhost by default
hardcoding network addresses can be avoided with special values, eg _site_ and _global_; see the docs
Development vs. Production Mode
transport.bind_host
- dev mode: not bound to an external interface (localhost only)
- prod mode: bound to an external interface, which triggers the bootstrap checks
Explicitly Enforcing Bootstrap Checks
DO NOT disable the bootstrap checks, you will regret it; to enforce them explicitly even in dev mode:
-Des.enforce.bootstrap.checks=true
JVM Configuration
do not set the jvm heap above ~32g: beyond that the JVM loses compressed object pointers, so there's no value and it's a waste of resources.
a good node has 64g mem: up to 32g for the elasticsearch heap, the rest left for the OS filesystem cache
Set Xms and Xmx to the same size, and typically to no more than 50% of your physical RAM
sometimes it's worth going smaller to speed up the cluster
JVM Heap Size
the default (2g) is too small; 8g is a good start; do not exceed ~30g (to stay safely under the ~32g compressed-oops cutoff)
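The corresponding heap settings live in config/jvm.options (8g chosen as an example):

```
# config/jvm.options
-Xms8g
-Xmx8g
```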
node roles
|---------------------------------|
|  Master   Master   Master       |
|                                 |
|  Data  Data  Data  Data  Data   |
|---------------------------------|
   Ingest  Ingest  ->  Coord  Coord
master nodes
minimum of 3 master-eligible nodes
set minimum master nodes for quorum to (n/2) + 1, rounded down
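eg for 3 master-eligible nodes, quorum is 2 (this is the pre-7.x discovery.zen setting from this training era):

```
# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2
```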
Dedicated Master and Data Nodes
do not send client requests to dedicated master nodes
dedicated master nodes do not need to be big and beefy, storage etc
Configuring Dedicated Nodes
- For dedicated master eligible node:
node.master: true
node.data: false
node.ingest: false
- For dedicated data node:
node.master: false
node.data: true
node.ingest: false
- For coordinating-only node:
node.master: false
node.data: false
node.ingest: false
leave the other nodes able to serve queries, in case the coord node goes down.
- For dedicated ingest node:
node.master: false
node.data: false
node.ingest: true
machine learning
give it a range, it learns what's normal, alerts when values fall outside the expected range
tribe node
lets you query across multiple clusters, but doesn’t scale well
deprecated, but see cross cluster search
dynamic indexes
can be disabled or enabled, or you can whitelist patterns
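A sketch of whitelisting patterns (the index pattern is hypothetical; a leading + allows, - denies, evaluated in order):

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "+logs-*,-*"
  }
}
```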
document routing
used to force documents onto a particular shard; useful for parent/child relationships, which must live on the same shard
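Routing is set per request (index, type, and routing value are hypothetical):

```
PUT my_index/doc/1?routing=user1
{ "comment": "routed by user1" }

GET my_index/doc/1?routing=user1
```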
deleting an index
can delete indexes with wildcards, but this can be disabled:
PUT _cluster/settings
{
  "persistent": {
    "action.destructive_requires_name": true
  }
}
alias for indexes
an alias is an alternative name for one or more indexes
see page 174 and 175 of training pdf
useful when you GET from lots of indexes
eg GET trx-20171112,trx-20171113/_search
or instead GET month/_search
useful to decouple the index name from code, eg when you want to move an old index to a new index with more shards.
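A sketch of creating such an alias with the aliases API (index and alias names taken from the example above):

```
POST _aliases
{
  "actions": [
    { "add": { "index": "trx-20171112", "alias": "month" } },
    { "add": { "index": "trx-20171113", "alias": "month" } }
  ]
}
```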
index templates
index templates are useful when indexes are created regularly, eg daily, and you want them to differ from the cluster default settings.
saves you applying the settings every time you create an index
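A minimal sketch (template and pattern names hypothetical; this "template" pattern key matches the 5.x era of these notes, newer versions use "index_patterns"):

```
PUT _template/daily_logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}
```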
chapter 6
exact values vs full text
- exact values: are not analyzed (eg the query must match case exactly if the data is mixed case)
- full text: converted into terms for the inverted index
stop words such as "to", "an", "a" are removed as part of full text analysis, eg:
desc: "An apple a day" (the raw value)
desc.keyword: "An apple a day" (exact value, stored as-is)
desc: ["apple", "day"] (analyzed terms in the inverted index)
- char filters: clean the raw input, eg remove html tags, divs
- tokenizer: splits the text into tokens, eg on whitespace
- token filter: modifies tokens, eg lowercasing, removing stop words
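You can inspect the analysis chain's output with the _analyze API (analyzer choice here is illustrative):

```
GET _analyze
{
  "analyzer": "english",
  "text": "An apple a day"
}
```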
mappings
field data types: text, keyword, date, integer etc
mappings are applied to json documents at ingestion and give the fields value types, eg integer
for values like err_1623 and err_1802, use keyword; a "text" field would get chopped up into terms, eg "err" and "1623"
(aside: nested means all items are updated if something is removed down the tree; parent/child doesn't)
define mappings
elasticsearch will guess the mapping but can get it wrong
define it when creating a new index
can you change a mapping? No, you need to reindex
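A sketch of defining mappings at index creation (field names are hypothetical; the type name "doc" matches the examples above):

```
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "code":    { "type": "keyword" },
        "comment": { "type": "text" },
        "created": { "type": "date" }
      }
    }
  }
}
```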
Searching
by default a search only returns the top 10 hits; set size to get more
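eg (index and field names from earlier examples):

```
GET my_index/_search
{
  "size": 25,
  "query": {
    "match": { "comment": "apple" }
  }
}
```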
segments
Live in shards; a new segment is created every 1 second (index.refresh_interval) or when the indexing buffer is full
can change index.refresh_interval, eg to 5 seconds
each segment is a file on disk under the data dir
be careful about running out of inodes
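A sketch of relaxing the refresh interval on an index (value matches the example above):

```
PUT my_index/_settings
{
  "index.refresh_interval": "5s"
}
```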
transaction log
the transaction log can be replayed after a sudden outage; it is replayed to rebuild segments that were in memory but not yet written to disk
flush
can force a flush, or a flush with sync; you shouldn't normally need to, but maybe before backups are taken?
merging segments
mostly automatic; for logs you want "merging with index only"
can use the force merge API; again useful before backups or when moving to cold storage
only use it on old data that won't be written to again
(note: check out a tool called “curator”: snapshot, restore, close, forcemerge)
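A sketch of a force merge call (merging an old index down to one segment):

```
POST my_index/_forcemerge?max_num_segments=1
```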
chapter 7
reindex
do I need to create the new index first? No, but any issues with the original will be copied as well, so it's often better to create the destination with the right settings first.
reindexing with a query is powerful for building specific indexes from a master index, eg "just documents with item == disney" etc
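A sketch of such a filtered reindex (index and field names are the hypothetical ones from the note above):

```
POST _reindex
{
  "source": {
    "index": "master_index",
    "query": {
      "term": { "item": "disney" }
    }
  },
  "dest": { "index": "disney_index" }
}
```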
multiple sources
eg combine multiple daily indexes into one index for the month.
can also "rollover" or "shrink"; may also be useful for combining logs etc
reindex from remote cluster
move data from eg dev to uat
closing index
reduces load on the cluster: no ram or cpu used, disk only
the index is then not available for operations, search etc
replicas are removed, so take a snapshot first or you will lose data if you lose a node
every primary shard is kept though, so make sure primaries aren't all on the same node or you could lose them.
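Closing and reopening is a simple POST (index name from earlier examples):

```
POST my_index/_close

POST my_index/_open
```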
delayed shard allocation
useful if the network dips for a few seconds and you don't want replicas etc recreated
primary shard recovery will always happen; the delay applies to replicas only
default is 1 min
useful when upgrading nodes, eg a node will be back in 5 mins after a reboot.
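The delay is an index setting; a sketch of raising it across all indexes before maintenance (5m matches the reboot example above):

```
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
```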
index priority
index.priority can be changed to make certain indexes be recovered before others
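eg (the priority value is illustrative; higher-priority indexes recover first):

```
PUT my_index/_settings
{
  "index.priority": 10
}
```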
total shards per node
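This presumably refers to the per-index allocation limit; a sketch (the setting name is from the elasticsearch docs, the value is illustrative):

```
PUT my_index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}
```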
hot warm data nodes
hot nodes: powerful servers, used for ingesting data, bulk indexing etc
warm nodes: less powerful servers in the cluster, used for queries on older data, spreading out the load
can also be useful for providing better searches when a customer pays more money
see shard filtering
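A sketch of shard filtering for hot/warm (the attribute name box_type is a common convention, not mandatory):

```
# elasticsearch.yml on a hot node:
#   node.attr.box_type: hot

PUT my_index/_settings
{
  "index.routing.allocation.require.box_type": "hot"
}
```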
chapter 8 capacity planing
can always use reindex from cluster to move into an archive cluster
shard over-allocation: number of shards greater than the number of nodes
makes it easier to grow the cluster onto new nodes
a little over-allocation is good, but too much is bad
capacity planning
depends on your use case
define sla before starting
number of primary shards
index in parallel, increasing load until you get error code 429 (over capacity); that's the limit for that shard count
scaling with replicas
capacity planning
fixed sized data
data grows slowly but lots of searches
time based data
lots of data but not huge searches, logs etc
stuff with timestamps etc
searching usually involves a time stamp
search for recent events
time frame
time based data is best organised using time based indices
see page 359 and 360
set up aliases for time based searches across multiple date spans
(note: an alias pointing to a single index can be written and read; an alias pointing to multiple indices can only be read)
VERY good tool to test cluster performance and benchmark is es_rally (open source)
xpack monitoring is free (xpack basic)
chapter 9 cluster management
forced awareness
You can configure forced awareness to avoid overwhelming a zone
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack_id",
        "allocation.awareness.force.my_rack_id.values": "rack1,rack2"
      }
    }
  }
}
installing plugins
xpack is a plugin from elastic
page 381
cluster backup
snapshot and restore
repository
the repository (eg a shared filesystem) must be accessible from every node
repository types, see page 389
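A sketch of registering a shared-filesystem repository (name and path are hypothetical; the location must be whitelisted in path.repo):

```
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}
```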
taking a snapshot
Try not to take snapshots of indexes that are currently being written to, eg today's index
snapshots are not encrypted, but are not readable outside elasticsearch
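Taking a snapshot into a registered repository (repository and snapshot names are hypothetical):

```
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
```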
restoring from a snapshot
renaming indices: you can rename indexes when you restore so you don't overwrite existing data
restore to a different cluster: also used to move data from another cluster
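A sketch of restoring with a rename (repository, snapshot, and index patterns are hypothetical):

```
POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "trx-*",
  "rename_pattern": "trx-(.+)",
  "rename_replacement": "restored_trx-$1"
}
```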
incremental snapshots
snapshots are segment level, as are increments
Chapter 10 Monitoring
_cluster/pending_tasks check cluster tasks if cluster is sluggish, will show what’s currently running
Monitoring (xpack)
if indexing latency goes up and doesn't come down, it's a good idea to scale out
index latency should ideally be less than 1000ms, but it depends
Stats API
Cluster, Node, Indicies stats; json
Tasks monitoring
check pending tasks first
GET _cluster/pending_tasks
Cat API
GET _cat/nodes
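Other handy cat endpoints (append ?v for column headers):

```
GET _cat/health?v
GET _cat/indices?v
GET _cat/shards?v
GET _cat/pending_tasks?v
```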
xpack monitoring
best practice is to set up monitoring on its own separate dedicated cluster
search latency, send alert if taking too long
The Search Slow Log is off by default, you have to turn it on; see page 443 for more info
check the thread_pool for rejected and long running queries on nodes; rejections show the cluster is having issues.
logstash
Chapter 11 upgrading cluster
minor upgrade: rolling upgrade
major upgrade: needs full cluster stop, upgrade, start
Rolling upgrade
Full cluster upgrade
use kafka as a message broker to buffer incoming data while the cluster is down
do a synced flush (POST _flush/synced) before each node restart; it commits everything to disk
Chapter 12 Production checklist
security
out of the box there is no authentication, authorization or encryption
- firewall
- reverse proxy
- secure REST with SSL
- read only configs
- xpack security
Bootstrap checks
disable swapping (eg bootstrap.memory_lock: true); related: set Xms and Xmx to the same size so the locked heap is allocated up front. see page 492
Linux checks
see page 494 for details
Best practices
avoid running over WAN Links
minimise hops between nodes
don't use LVM, raid etc, just a straight FS on each disk; with a striped volume, losing one disk loses the whole volume, better to lose individual disks and the node will be okay. see page 499
Todo
- read “rollover API (auto rollover big index and create alias)” and “shrink API”
- download the esrally tool and test data