Elasticsearch is a perfect tool for efficient indexing of data; it does an impeccable job of what it does, making search and analytics operations much faster. In one of our previous articles we saw how to install and configure a basic Elasticsearch setup on Ubuntu. In this article we will see how you can stream data from MySQL to Elasticsearch with Logstash for indexing, which can then be used for search and analytics.
Logstash was initially created for log collection, processing, storage, and searching. These logs were originally system logs, but developers quickly found that Logstash can also be used to structure data in a way that Elasticsearch can understand and store.
Now, it looks like we have all the answers to index MySQL data in Elasticsearch with Logstash, right? Wrong. There is a teeny tiny problem in connecting MySQL with Logstash. Logstash runs on Java, and MySQL requires a separate JDBC driver, known as Connector/J, so that Logstash can fetch data from it.
Although this seems like a lot of work, it's actually quite simple. So, now that we have all the ingredients, let's start cooking.
This article assumes that you already have an Elasticsearch instance running on port 9200 and a MySQL instance running on port 3306. This article is written for an Ubuntu server.
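Before starting, you can quickly verify that Elasticsearch is up by hitting it on port 9200; it should respond with a small JSON document containing the cluster name and version:
$ curl http://localhost:9200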
The MySQL Story
Let's take a look at a simple MySQL database:
CREATE TABLE IF NOT EXISTS `student` (
`id` INT NOT NULL AUTO_INCREMENT,
`first_name` VARCHAR(255) NOT NULL,
`last_name` VARCHAR(255) NOT NULL,
PRIMARY KEY (`id`));
CREATE TABLE IF NOT EXISTS `marks` (
`id` INT NOT NULL AUTO_INCREMENT,
`marks` VARCHAR(45) NOT NULL,
`student_id` INT NOT NULL,
PRIMARY KEY (`id`),
INDEX `fk_marks_student_idx` (`student_id` ASC),
CONSTRAINT `fk_marks_student`
FOREIGN KEY (`student_id`)
REFERENCES `student` (`id`)
ON DELETE NO ACTION
ON UPDATE NO ACTION);
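If you want to follow along, you can seed the tables with a few sample rows; the names and marks here are made up purely for illustration:
INSERT INTO `student` (`first_name`, `last_name`) VALUES ('John', 'Doe'), ('Jane', 'Roe');
INSERT INTO `marks` (`marks`, `student_id`) VALUES ('85', 1), ('92', 2);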
Our sample database has 2 tables: student and marks. The marks table has an indexed foreign key, student_id, that references the student table. As the data in these tables grows, MySQL query performance decays. To get the marks of a student we have to use an expensive JOIN across the 2 tables. Hence, it would be much easier if only the required data were indexed in the form of denormalized tuples.
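For example, fetching the marks of a single student means joining both tables on every lookup; a query against the sample schema above might look like this:
SELECT s.first_name, s.last_name, m.marks
FROM student AS s
JOIN marks AS m ON m.student_id = s.id
WHERE s.id = 1;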
MySQL Connector/J
Let's start with MySQL's dependencies. To download and extract the MySQL Connector/J package, use the following command:
$ wget https://dev.mysql.com/get/Downloads/Connector-J/mysql-connector-java-5.1.42.zip && unzip mysql-connector-java-5.1.42.zip
The extracted package directory will contain a .jar file: mysql-connector-java-5.1.42-bin.jar.
Note the absolute path of the connector file, as we will be using this path in the Logstash configuration file.
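One way to get that absolute path (assuming you extracted the archive in your current directory) is:
$ readlink -f mysql-connector-java-5.1.42/mysql-connector-java-5.1.42-bin.jar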
Logstash to the rescue
Logstash is a standalone application and does not need to be installed, so it can be downloaded anywhere on the server where Elasticsearch is running. In this example we are going to use Logstash 5.4.1. You can use the following command to download and extract the package:
$ wget https://artifacts.elastic.co/downloads/logstash/logstash-5.4.1.zip && unzip logstash-5.4.1.zip
Your Logstash package is now extracted in the logstash-5.4.1 directory.
Let's now create a file student_marks.conf with the configuration below.
input {
  jdbc {
    # MySQL connection details
    jdbc_connection_string => "jdbc:mysql://localhost:3306/db_name"
    jdbc_user => "db_user"
    jdbc_password => "db_password"
    jdbc_validate_connection => true
    # Path to the Connector/J jar downloaded earlier
    jdbc_driver_library => "/path/to/mysql-connector-java/mysql-connector-java-5.1.42-bin.jar"
    jdbc_driver_class => "com.mysql.jdbc.Driver"
    jdbc_fetch_size => 100
    # The result of this query is what gets indexed
    statement => "SELECT a.id, b.marks FROM student AS a JOIN marks AS b ON b.student_id = a.id"
  }
}
output {
  elasticsearch {
    # Index, document type, and per-document ID for the indexed data
    index => "student_marks"
    document_type => "student_mark"
    document_id => "%{id}"
    hosts => "localhost:9200"
  }
}
The configuration file has 2 blocks: input and output. In the input block we have specified that the input is going to come from a JDBC connector. The jdbc block contains the MySQL connection string, the database user and password, and the path to the Connector/J jar we downloaded. Another important setting in the jdbc block is statement, which holds the SQL query whose result you want to index. In the above example the result of the statement is the primary key from the student table and its corresponding marks.
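Each row returned by the statement becomes one Elasticsearch document, so with our made-up sample data a single indexed document would look roughly like this (Logstash also adds its own @version and @timestamp fields):
{"id": 1, "marks": "85"}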
The output block has another block, elasticsearch, which defines the parameters of the data index. The document_id should be the primary ID column of the data being indexed, which in our case is the id column from the student table.
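Because document_id maps to the student's primary key, an individual document can later be fetched by that ID directly. For example, once indexing has run, the document for the student with id 1 can be retrieved with:
$ curl http://localhost:9200/student_marks/student_mark/1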
Stream data to Elasticsearch with Logstash
The main Logstash executable is in the bin directory within the Logstash directory. To start streaming data from MySQL to Elasticsearch, use the following command:
$ ./logstash -f /path/to/student_marks.conf
The above command will start pushing all the results of the MySQL statement to Elasticsearch. Once the streaming is complete, use the following command to check the index status. This command shows the total count of documents indexed:
$ curl http://localhost:9200/student_marks/_count
The output of the above command will be JSON that looks something like the following:
{"count":1172,"_shards":{"total":5,"successful":5,"failed":0}}
In the output you can see from the count property that the student_marks index has 1172 documents indexed.
That's it! You can now use the indexed data from Elasticsearch to put your search and analytics on steroids.
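For instance, a simple URI search against the new index finds students by their marks without touching MySQL at all (85 is just an illustrative value):
$ curl 'http://localhost:9200/student_marks/_search?q=marks:85'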