Notes taken during elasticsearch training.
Components
- node: instance
- cluster: one or more nodes
- document: data that you want to search
- index: collection of documents (analogous to a table)
- shard: a piece of an elasticsearch index
Clustering
Give the cluster a sensible name; it defaults to "elasticsearch"
Nodes
each node has a name; set it to something that makes sense, eg node1, node2, otherwise it uses the first 7 characters of its UUID
Documents
Index
Index details
Dynamic index creation
Normally a good idea to turn this off
set action.auto_create_index to false to disable it
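A minimal sketch of disabling it dynamically (it can also go in elasticsearch.yml):

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": false
  }
}
```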
Indexing with or without specifying an ID
leave off the ID to have elasticsearch generate one
- works only with POST, not PUT
- the generated ID comes back in the response
- rule of thumb: if you have an ID use PUT, for an auto-generated ID use POST
- PUT requires the full path including the ID (idempotent)
- POST lets elasticsearch assign the ID (non-idempotent)
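For example (index/type names follow the examples in these notes; field values are made up):

```
PUT my_index/doc/4
{ "comment": "explicit ID" }

POST my_index/doc
{ "comment": "elasticsearch generates the ID" }
```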
CRUD
- index: PUT my_index/doc/4
- create: PUT my_index/doc/4/_create
- read: GET my_index/doc/4
- update: POST my_index/doc/4/_update
  {
    "doc": {
      "comment": "updated comment"
    }
  }
- delete: DELETE my_index/doc/4
Retrieve
Use GET to retrieve documents
Bulk API
POST _bulk
when using curl the body needs an extra newline on the end; the Kibana console adds it for you
- create will only create if item doesn’t exist
- update will only update if item exists
- index will do both (update or create) depending on item status
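A minimal sketch of a bulk request (index/type/id values are hypothetical; each action line and source line must be on its own line, with a trailing newline at the end):

```
POST _bulk
{ "index": { "_index": "my_index", "_type": "doc", "_id": "1" } }
{ "comment": "first doc" }
{ "update": { "_index": "my_index", "_type": "doc", "_id": "1" } }
{ "doc": { "comment": "updated" } }
```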
Configuring
try to move startup options into the yml config file (config/elasticsearch.yml) instead of the command line
Index settings
read only and read write blocks on an index
make an index read only:
PUT my_tweets/_settings
{
  "index.blocks.write": true
}
(set it back to false to allow writes again)
node settings
in the yml config file you can specify different paths for
- data path (fast disk)
- log path (a slower disk is fine)
cluster settings
can turn on logging for the cluster; it's an instant (dynamic) change, but can produce a lot of data
persistent or transient (survives a restart or not)
be careful not to set a persistent node-count setting higher than the real number of nodes
eg if you set it way too high (eg 200) the cluster waits for 200 nodes and never starts; the setting can't be reduced because the cluster can't actually start, and you basically have to throw away your data to recover.
precedence
- transient
- persistent
- cli
- yml
some settings can only be set in the yml file
eg name of cluster, number of shards
examples: cluster and node name
the cluster name can be set on the cli, but don't; use the yml file
the node name defaults to its UUID under the hood, but that's crazy town; set it explicitly
http vs transport
http for rest api, port 9200
transport for internal, port 9300
http.host is localhost by default
transport.host is localhost by default
hardcoding network addresses can be avoided with special values, eg _site_ and _global_; see the docs
Development vs. Production Mode
transport.bind_host
- dev mode: not bound to an external interface (localhost only)
- prod mode: bound to an external interface, which triggers the bootstrap checks
Explicitly Enforcing Bootstrap Checks
DO NOT disable the bootstrap checks, you will regret it; to enforce them explicitly even in dev mode:
-Des.enforce.bootstrap.checks=true
JVM Configuration
do not set the jvm heap above ~32g: beyond that the JVM loses compressed object pointers, so there's no value and it's a waste of resources.
a good node has 64g mem: up to 32g for the elasticsearch heap, the rest left for the OS filesystem cache
Set Xms and Xmx to the same size, and typically to no more than 50% of your physical RAM
sometimes it's worth going smaller to speed up the cluster
JVM Heap Size
the default (2g) is too small; 8g is a good start; do not exceed ~30g (to stay safely under the ~32g compressed-oops cutoff)
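The corresponding heap settings live in config/jvm.options (8g chosen as an example):

```
# config/jvm.options
-Xms8g
-Xmx8g
```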
node roles
|---------------------------------|
|  Master   Master   Master       |
|                                 |
|  Data  Data  Data  Data  Data   |
|---------------------------------|
   Ingest  Ingest  ->  Coord  Coord
master nodes
minimum of 3 master-eligible nodes
set minimum master nodes for quorum to (n/2) + 1, rounded down
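eg for 3 master-eligible nodes, quorum is 2 (this is the pre-7.x discovery.zen setting from this training era):

```
# elasticsearch.yml
discovery.zen.minimum_master_nodes: 2
```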
Dedicated Master and Data Nodes
do not send client requests to dedicated master nodes
dedicated master nodes do not need to be big and beefy, storage etc
Configuring Dedicated Nodes
- For dedicated master eligible node:
node.master: true
node.data: false
node.ingest: false
- For dedicated data node:
node.master: false
node.data: true
node.ingest: false
- For coordinating-only node:
node.master: false
node.data: false
node.ingest: false
leave the other nodes able to serve queries, in case the coord node goes down.
- For dedicated ingest node:
node.master: false
node.data: false
node.ingest: true
machine learning
give it a range, it learns what's normal, alerts when values fall outside the expected range
tribe node
lets you query across multiple clusters, but doesn’t scale well
deprecated, but see cross cluster search
dynamic indexes
can be disabled or enabled, or you can whitelist patterns
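A sketch of whitelisting patterns (the index pattern is hypothetical; a leading + allows, - denies, evaluated in order):

```
PUT _cluster/settings
{
  "persistent": {
    "action.auto_create_index": "+logs-*,-*"
  }
}
```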
document routing
used to force documents onto a particular shard; useful for parent/child relationships, which must live on the same shard
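Routing is set per request (index, type, and routing value are hypothetical):

```
PUT my_index/doc/1?routing=user1
{ "comment": "routed by user1" }

GET my_index/doc/1?routing=user1
```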
deleting an index
can delete indexes with wildcards, but this can be disabled:
PUT _cluster/settings
{
  "persistent": {
    "action.destructive_requires_name": true
  }
}
alias for indexes
an alias is an alternative name for one or more indexes
see page 174 and 175 of training pdf
useful when you GET from lots of indexes
eg GET trx-20171112,trx-20171113/_search
or instead GET month/_search
useful to decouple the index name from code, eg when you want to move an old index to a new index with more shards.
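A sketch of creating such an alias with the aliases API (index and alias names taken from the example above):

```
POST _aliases
{
  "actions": [
    { "add": { "index": "trx-20171112", "alias": "month" } },
    { "add": { "index": "trx-20171113", "alias": "month" } }
  ]
}
```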
index templates
index templates are useful when indexes are created regularly, eg daily, and you want them to differ from the cluster default settings.
saves you applying the settings every time you create an index
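A minimal sketch (template and pattern names hypothetical; this "template" pattern key matches the 5.x era of these notes, newer versions use "index_patterns"):

```
PUT _template/daily_logs
{
  "template": "logs-*",
  "settings": {
    "number_of_shards": 2,
    "number_of_replicas": 1
  }
}
```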
chapter 6
exact values vs full text
- exact values: are not analyzed (eg the query must match case exactly if the data is mixed case)
- full text: converted into terms for the inverted index
stop words such as "to", "an", "a" are removed as part of full text analysis, eg:
desc: "An apple a day" (the raw value)
desc.keyword: "An apple a day" (exact value, stored as-is)
desc: ["apple", "day"] (analyzed terms in the inverted index)
- char filters: clean the raw input, eg remove html tags, divs
- tokenizer: splits the text into tokens, eg on whitespace
- token filter: modifies tokens, eg lowercasing, removing stop words
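You can inspect the analysis chain's output with the _analyze API (analyzer choice here is illustrative):

```
GET _analyze
{
  "analyzer": "english",
  "text": "An apple a day"
}
```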
mappings
field data types: text, keyword, date, integer etc
mappings are applied to json documents at ingestion and give the fields value types, eg integer
for values like err_1623 and err_1802, use keyword; a "text" field would get chopped up into terms, eg "err" and "1623"
(aside: nested means all items are updated if something is removed down the tree; parent/child doesn't)
define mappings
elasticsearch will guess the mapping but can get it wrong
define it when creating a new index
can you change a mapping? No, you need to reindex
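A sketch of defining mappings at index creation (field names are hypothetical; the type name "doc" matches the examples above):

```
PUT my_index
{
  "mappings": {
    "doc": {
      "properties": {
        "code":    { "type": "keyword" },
        "comment": { "type": "text" },
        "created": { "type": "date" }
      }
    }
  }
}
```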
Searching
by default a search only returns the top 10 hits; set size to get more
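eg (index and field names from earlier examples):

```
GET my_index/_search
{
  "size": 25,
  "query": {
    "match": { "comment": "apple" }
  }
}
```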
segments
Live in shards; a new segment is created every 1 second (index.refresh_interval) or when the indexing buffer is full
can change index.refresh_interval, eg to 5 seconds
each segment is a file on disk under the data dir
be careful about running out of inodes
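A sketch of relaxing the refresh interval on an index (value matches the example above):

```
PUT my_index/_settings
{
  "index.refresh_interval": "5s"
}
```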
transaction log
the transaction log can be replayed after a sudden outage; it is replayed to rebuild segments that were in memory but not yet written to disk
flush
can force a flush, or a flush with sync; you shouldn't normally need to, but maybe before backups are taken?
merging segments
mostly automatic; for logs you want "merging with index only"
can use the force merge API; again useful before backups or when moving to cold storage
only use it on old data that won't be written to again
(note: check out a tool called “curator”: snapshot, restore, close, forcemerge)
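A sketch of a force merge call (merging an old index down to one segment):

```
POST my_index/_forcemerge?max_num_segments=1
```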
chapter 7
reindex
do I need to create the new index first? No, but any issues with the original will be copied as well, so it's often better to create the destination with the right settings first.
reindexing with a query is powerful for building specific indexes from a master index, eg "just documents with item == disney" etc
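A sketch of such a filtered reindex (index and field names are the hypothetical ones from the note above):

```
POST _reindex
{
  "source": {
    "index": "master_index",
    "query": {
      "term": { "item": "disney" }
    }
  },
  "dest": { "index": "disney_index" }
}
```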
multiple sources
eg combine multiple daily indexes into one index for the month.
can also "rollover" or "shrink"; may also be useful for combining logs etc
reindex from remote cluster
move data from eg dev to uat
closing index
reduces load on the cluster: no ram or cpu used, disk only
the index is then not available for operations, search etc
replicas are removed, so take a snapshot first or you will lose data if you lose a node
every primary shard is kept though, so make sure primaries aren't all on the same node or you could lose them.
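Closing and reopening is a simple POST (index name from earlier examples):

```
POST my_index/_close

POST my_index/_open
```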
delayed shard allocation
useful if the network dips for a few seconds and you don't want replicas etc recreated
primary shard recovery will always happen; the delay applies to replicas only
default is 1 min
useful when upgrading nodes, eg a node will be back in 5 mins after a reboot.
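The delay is an index setting; a sketch of raising it across all indexes before maintenance (5m matches the reboot example above):

```
PUT _all/_settings
{
  "settings": {
    "index.unassigned.node_left.delayed_timeout": "5m"
  }
}
```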
index priority
index.priority can be changed to make certain indexes be recovered before others
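eg (the priority value is illustrative; higher-priority indexes recover first):

```
PUT my_index/_settings
{
  "index.priority": 10
}
```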
total shards per node
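This presumably refers to the per-index allocation limit; a sketch (the setting name is from the elasticsearch docs, the value is illustrative):

```
PUT my_index/_settings
{
  "index.routing.allocation.total_shards_per_node": 2
}
```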
hot warm data nodes
hot nodes: powerful servers, used for ingesting data, bulk indexing etc
warm nodes: less powerful servers in the cluster, used for queries on older data, spreading out the load
can also be useful for providing better searches when a customer pays more money
see shard filtering
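A sketch of shard filtering for hot/warm (the attribute name box_type is a common convention, not mandatory):

```
# elasticsearch.yml on a hot node:
#   node.attr.box_type: hot

PUT my_index/_settings
{
  "index.routing.allocation.require.box_type": "hot"
}
```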
chapter 8 capacity planing
can always use reindex from cluster to move into an archive cluster
shard over-allocation: number of shards greater than the number of nodes
makes it easier to grow the cluster onto new nodes
a little over-allocation is good, but too much is bad
capacity planning
depends on your use case
define sla before starting
number of primary shards
index in parallel, increasing load until you get error code 429 (over capacity); that's the limit for that shard count
scaling with replicas
capacity planning
fixed sized data
data grows slowly but lots of searches
time based data
lots of data but not huge searches, logs etc
stuff with timestamps etc
searching usually involves a time stamp
search for recent events
time frame
time based data is best organised using time based indices
see page 359 and 360
set up aliases for time based searches across multiple date spans
(note: an alias pointing to a single index can be written and read; an alias pointing to multiple indices can only be read)
VERY good tool to test cluster performance and benchmark is es_rally (open source)
xpack monitoring is free (xpack basic)
chapter 9 cluster management
forced awareness
You can configure forced awareness to avoid overwhelming a zone
PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "routing": {
        "allocation.awareness.attributes": "my_rack_id",
        "allocation.awareness.force.my_rack_id.values": "rack1,rack2"
      }
    }
  }
}
installing plugins
xpack is a plugin from elastic
page 381
cluster backup
snapshot and restore
repository
the repository (eg a shared filesystem) must be accessible from every node
repository types, see page 389
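A sketch of registering a shared-filesystem repository (name and path are hypothetical; the location must be whitelisted in path.repo):

```
PUT _snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "/mnt/backups/my_backup"
  }
}
```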
taking a snapshot
Try not to take snapshots of indexes that are currently being written to, eg today's index
snapshots are not encrypted, but are not readable outside elasticsearch
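Taking a snapshot into a registered repository (repository and snapshot names are hypothetical):

```
PUT _snapshot/my_backup/snapshot_1?wait_for_completion=true
```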
restoring from a snapshot
renaming indices: you can rename indexes when you restore so you don't overwrite existing data
restore to a different cluster: also used to move data from another cluster
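A sketch of restoring with a rename (repository, snapshot, and index patterns are hypothetical):

```
POST _snapshot/my_backup/snapshot_1/_restore
{
  "indices": "trx-*",
  "rename_pattern": "trx-(.+)",
  "rename_replacement": "restored_trx-$1"
}
```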
incremental snapshots
snapshots are segment level, as are increments
Chapter 10 Monitoring
_cluster/pending_tasks check cluster tasks if cluster is sluggish, will show what’s currently running
Monitoring (xpack)
if indexing latency goes up and doesn't come down, it's a good idea to scale out
index latency should ideally be less than 1000ms, but it depends
Stats API
Cluster, Node, Indicies stats; json
Tasks monitoring
check pending tasks first
GET _cluster/pending_tasks
Cat API
GET _cat/nodes
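Other handy cat endpoints (append ?v for column headers):

```
GET _cat/health?v
GET _cat/indices?v
GET _cat/shards?v
GET _cat/pending_tasks?v
```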
xpack monitoring
best practice is to set up monitoring on its own separate dedicated cluster
search latency, send alert if taking too long
The Search Slow Log is off by default, you have to turn it on; see page 443 for more info
check the thread_pool for rejected and long running queries on nodes; rejections show the cluster is having issues.
logstash
Chapter 11 upgrading cluster
minor upgrade: rolling upgrade
major upgrade: needs full cluster stop, upgrade, start
Rolling upgrade
Full cluster upgrade
use kafka as a message broker to buffer incoming data while the cluster is down
do a synced flush (POST _flush/synced) before each node restart; it commits everything to disk
Chapter 12 Production checklist
security
out of the box there is no authentication, authorization or encryption
- firewall
- reverse proxy
- secure REST with SSL
- read only configs
- xpack security
Bootstrap checks
disable swapping (eg bootstrap.memory_lock: true); related: set Xms and Xmx to the same size so the locked heap is allocated up front. see page 492
Linux checks
see page 494 for details
Best practices
avoid running over WAN Links
minimise hops between nodes
don't use LVM, raid etc, just a straight FS on each disk; with a striped volume, losing one disk loses the whole volume, better to lose individual disks and the node will be okay. see page 499
Todo
- read “rollover API (auto rollover big index and create alias)” and “shrink API”
- download the esrally tool and test data