Category: MySQL General

Import data from MySQL to hadoop using SQOOP

As a part of our job we import/move a lot of data from relational databases (Mainly from Oracle and MySQL) to Hadoop. Most of our data stores are in Oracle with a few internal data stores running on MySQL.

SQOOP (SQL for Hadoop) is an Apache tool to import data from relational databases (There are separate drivers for each database) to hadoop. Here in this blog we will try to import data from a MySQL table to Hadoop file system.

Here, I have a MySQL instance running on the local machine on which my Hadoop cluster also running. You will have to download and place the driver in appropriate directory for SQOOP to connect to that database. Drivers are already present in my machine as SQOOP offers a very extensive support for MySQL.

Below link will give you a list of available drivers and their locations if you are using a different database.

https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1773570

First let me login to the primary node in my 3 node cluster (Virtual/Created by Vagrant and VirtualBox).

vagrant ssh node1

Let us check the connection and data in the MySQL database.

mysql -u root -h localhost -p
Enter password: ********
MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| my_test |
| mysql |
| performance_schema |
| test |
+--------------------+
5 rows in set (0.04 sec)

use my_test;

MariaDB [my_test]> show tables;
+-------------------+
| Tables_in_my_test |
+-------------------+
| name_data |
| name_data2 |
+-------------------+
2 rows in set (0.02 sec)

Now, lets check the data.

select count(*) from name_data;

MariaDB [my_test]> select count(*) from name_data;
+----------+
| count(*) |
+----------+
| 1858689 |
+----------+

MariaDB [my_test]> select * from name_data limit 3;
+------+--------+-------+
| Name | Gender | count |
+------+--------+-------+
| Mary | F | 7065 |
| Anna | F | 2604 |
| Emma | F | 2003 |
+------+--------+-------+

Now we are sure that we have data in MySQL table, lets check our HADOOP home directory.

hadoop fs -ls /user/vagrant/

[vagrant@node1 ~]$ hadoop fs -ls /user/vagrant
Found 5 items
drwx------ - vagrant hdfs 0 2016-12-20 02:56 /user/vagrant/.Trash
drwxr-xr-x - vagrant hdfs 0 2016-10-26 04:46 /user/vagrant/.hiveJars
drwx------ - vagrant hdfs 0 2016-11-13 23:44 /user/vagrant/.staging
drwxr-xr-x - vagrant hdfs 0 2016-12-06 04:13 /user/vagrant/test_files

Now we wants to move the data from MySQL to the /users/vagrant/name_data directory. Below is th sqoop command to move import data.

[vagrant@node1 ~]$ sqoop import –connect jdbc:mysql://localhost/my_test –username root –password ******* –table name_data –m 1 –target-dir /user/vagrant/my_data

Once this command is completed, data will be present in /user/vagrant/my_data.

[vagrant@node1 ~]$ hadoop fs -ls /user/vagrant/my_data
Found 2 items
-rw-r--r-- 3 vagrant hdfs 0 2016-12-20 03:20 /user/vagrant/my_data/_SUCCESS
-rw-r--r-- 3 vagrant hdfs 22125615 2016-12-20 03:20 /user/vagrant/my_data/part-m-00000

[vagrant@node1 ~]$ hadoop fs -cat /user/vagrant/my_data/part-m-00000| wc -l
1858689
[vagrant@node1 ~]$

[vagrant@node1 ~]$ hadoop fs -cat /user/vagrant/my_data/part-m-00000| head -3
Mary,F,7065
Anna,F,2604
Emma,F,2003

We can also create a config file and store the commands in it for re-usability.

[vagrant@node1 ~]$ cat sqoop_test_config.cnf
import
--connect
jdbc:mysql://localhost/my_test
--username
root

[vagrant@node1 ~]$ sqoop --options-file ./sqoop_test_config.cnf --password ***** --m 1 --table name_data --target-dir /user/vagrant/my_data

This also does the same job, but now we have the flexibility to save, edit and reuse the commands.