Import data from MySQL to hadoop using SQOOP

As a part of our job we import/move a lot of data from relational databases (Mainly from Oracle and MySQL) to Hadoop. Most of our data stores are in Oracle with a few internal data stores running on MySQL.

SQOOP (SQL for Hadoop) is an Apache tool to import data from relational databases (There are separate drivers for each database) to hadoop. Here in this blog we will try to import data from a MySQL table to Hadoop file system.

Here, I have a MySQL instance running on the local machine on which my Hadoop cluster also running. You will have to download and place the driver in appropriate directory for SQOOP to connect to that database. Drivers are already present in my machine as SQOOP offers a very extensive support for MySQL.

Below link will give you a list of available drivers and their locations if you are using a different database.

First let me login to the primary node in my 3 node cluster (Virtual/Created by Vagrant and VirtualBox).

vagrant ssh node1

Let us check the connection and data in the MySQL database.

mysql -u root -h localhost -p
Enter password: ********
MariaDB [(none)]> show databases;
| Database |
| information_schema |
| my_test |
| mysql |
| performance_schema |
| test |
5 rows in set (0.04 sec)

use my_test;

MariaDB [my_test]> show tables;
| Tables_in_my_test |
| name_data |
| name_data2 |
2 rows in set (0.02 sec)

Now, lets check the data.

select count(*) from name_data;

MariaDB [my_test]> select count(*) from name_data;
| count(*) |
| 1858689 |

MariaDB [my_test]> select * from name_data limit 3;
| Name | Gender | count |
| Mary | F | 7065 |
| Anna | F | 2604 |
| Emma | F | 2003 |

Now we are sure that we have data in MySQL table, lets check our HADOOP home directory.

hadoop fs -ls /user/vagrant/

[vagrant@node1 ~]$ hadoop fs -ls /user/vagrant
Found 5 items
drwx------ - vagrant hdfs 0 2016-12-20 02:56 /user/vagrant/.Trash
drwxr-xr-x - vagrant hdfs 0 2016-10-26 04:46 /user/vagrant/.hiveJars
drwx------ - vagrant hdfs 0 2016-11-13 23:44 /user/vagrant/.staging
drwxr-xr-x - vagrant hdfs 0 2016-12-06 04:13 /user/vagrant/test_files

Now we wants to move the data from MySQL to the /users/vagrant/name_data directory. Below is th sqoop command to move import data.

[vagrant@node1 ~]$ sqoop import –connect jdbc:mysql://localhost/my_test –username root –password ******* –table name_data –m 1 –target-dir /user/vagrant/my_data

Once this command is completed, data will be present in /user/vagrant/my_data.

[vagrant@node1 ~]$ hadoop fs -ls /user/vagrant/my_data
Found 2 items
-rw-r--r-- 3 vagrant hdfs 0 2016-12-20 03:20 /user/vagrant/my_data/_SUCCESS
-rw-r--r-- 3 vagrant hdfs 22125615 2016-12-20 03:20 /user/vagrant/my_data/part-m-00000

[vagrant@node1 ~]$ hadoop fs -cat /user/vagrant/my_data/part-m-00000| wc -l
[vagrant@node1 ~]$

[vagrant@node1 ~]$ hadoop fs -cat /user/vagrant/my_data/part-m-00000| head -3

We can also create a config file and store the commands in it for re-usability.

[vagrant@node1 ~]$ cat sqoop_test_config.cnf

[vagrant@node1 ~]$ sqoop --options-file ./sqoop_test_config.cnf --password ***** --m 1 --table name_data --target-dir /user/vagrant/my_data

This also does the same job, but now we have the flexibility to save, edit and reuse the commands.


Delete lines from multiple files (recursively) in Linux

We had a requirement to delete a line which matches a particular pattern from multiple ksh files. These lines of code was used to log execution status and we no longer needed it after an architecture change.

Opening hundreds of files and deleting the lines manually was a painful task, We achieved this by combing find and sed commands.

find . -name “*.ksh” -type f | xargs sed -i -e ‘/Search String/d’

Find command searches for ksh files recursively in the current directory and lists them. The second part, xargs and sed commands searches for the pattern in each file and delete it.

You can refer the manual pages if you need more information on these commands.

Deleting thousands of files from Linux directory

Our project team uses a script which creates thousands of files in a folder and we sometime has to manually clean up all those files.

rm command failed saying the list is too long and we had to find another method to do this. rather than writing s shell script and delete the files one by one we used the find command.

Here how we had done it.

>>>ls -lrt| wc -l

We have 250K files in this directory and we tried removing them using the rm command.

>>> rm *.env
 ksh: rm: /bin/rm: cannot execute [Argument list too long]

This issue can be easily solved by the find command and we noticed find runs faster in such situations.

>>>find . -name "*.env" -delete

Above listed find command deleted all .env files in the current directory.

Access Oracle from Python

If you are a data engineer you might have to pull data from various sources and relational databases (RDBMs) is one of them. Python has separate modules to connect to each of relational databases (Like Oracle, MySQL, PostgresSQL etc.). Today we will discuss how we can connect and query Oracle database from Python.

I hope you have working Python environment set up already. Here I’m using Python CLI with Python 2.7 for this tutorial.

1. Download and install oracle 12c database.

I have downloaded Oracle 12C database from OTN ( and installed in my CentOS 7 laptop. Since it is installed in my laptop, server name is localhost and the database name is pdborcl.localdomain

2. We need to install an python module called cx_Oracle which can help us to connect and extract data from the Oracle instance. It gave me couple of errors when I first tried to install on my machine.

[oracle@ora-c7 ~]$ sudo pip install cx_Oracle
 Collecting cx_Oracle
 Using cached cx_Oracle-5.2.1.tar.gz
 Complete output from command python egg_info:
 Traceback (most recent call last):
 File "", line 1, in 
 File "/tmp/pip-build-kkqi2h/cx-Oracle/", line 170, in 
 raise DistutilsSetupError("cannot locate an Oracle software " \
 distutils.errors.DistutilsSetupError: cannot locate an Oracle software installation

 Command "python egg_info" failed with error code 1 in /tmp/pip-build-kkqi2h/cx-Oracle/

This is because some packages are missing and we need to install them before installing cx_Oracle. Lets install setuptools and ez-setup.

[root@ora-c7 oracle]# pip install --upgrade setuptools
 Collecting setuptools
 Downloading setuptools-28.7.1-py2.py3-none-any.whl (472kB)
 100% |████████████████████████████████| 481kB 857kB/s
 Installing collected packages: setuptools
 Found existing installation: setuptools 20.10.1
 Uninstalling setuptools-20.10.1:
 Successfully uninstalled setuptools-20.10.1
 Successfully installed setuptools-28.7.1

[root@ora-c7 oracle]# pip install ez_setup
 Collecting ez_setup
 Downloading ez_setup-0.9.tar.gz
 Installing collected packages: ez-setup
 Running install for ez-setup ... done
 Successfully installed ez-setup-0.9

Now lets try to reinstall cx_Oracle.

[root@ora-c7 oracle]# pip install cx_Oracle
 Collecting cx_Oracle
 Using cached cx_Oracle-5.2.1.tar.gz
 Installing collected packages: cx-Oracle
 Running install for cx-Oracle ... done
 Successfully installed cx-Oracle-5.2.1

Lets try to run a sample program which connects and pull some data. We will connect and get the oracle database version and query a table. This is a simple program we can do much complex processing with Cx_Oracle, I will list them in another blog.

Oracle version I have is 12C and the table has 100 records in it.

[oracle@ora-c7 ~]$ python
 Python 2.7.5 (default, Sep 15 2016, 22:37:39)
 [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import cx_Oracle
 >>> from pprint import pprint
 >>> db=cx_Oracle.connect('******','*****','localhost:1521/pdborcl.localdomain')
 >>> print db.version
 >>> cursor = db.cursor()
 >>> cursor.execute('select count(*) from my_tab');
 >>> pprint(cursor.fetchall())

OEM Express in 12C

If you are looking for some basic database monitoring, Oracle 12C provides Enterprise Manager Express for that purpose. It is a light weight version of EM with limited features. I use it to monitor the oracle 12c instance installed on my laptop which runs on CentOS 7.0.

For accessing the EM express we have to enable the HTTP and HTTPS port for the database . Here is how you can check the HTTP and HTTPS port settings for your database.

SQL> select dbms_xdb.getHttpPort() from dual;


SQL> select dbms_xdb_config.getHttpsPort() from dual;


If you want to manually set up the HTTPS or HTTP port you can do that by logging into the database and running the below commands.

SQL> exec dbms_xdb_config.sethttpsport(5501);


SQL> exec dbms_xdb_config.sethttpport(8080);

In order to use EM Express, a database user needs to have been granted the EM_EXPRESS_BASIC or EM_EXPRESS_ALL role.  The DBA role includes both the EM_EXPRESS_BASIC and the EM_EXPRESS_ALL roles.

Once this is done use the below URL to access EM Express.


Login page.


Home Page.


Performance Home.


It may not have all the OEM features but a nice tool indeed.

How To Get/Set Current Database Name in Hive

It is very easy to find/display current database information in Hive Command Line Interface (CLI). Just run the below command and terminal will display the current Hive database you are connected to.

hive> set hiveconf:hive.cli.print.current.db=true;
 hive (my_db)>

This value will be overwritten if you restart the Hive CLI or you open a new one. We can set it my permanently by editing the

~/.hiverc file.

restart the hive CLI after updating the file and it will show which DB you are connected to.

hive (default)> use my_db ;
 Time taken: 3.189 seconds
 hive (my_db)>

Oracle INMEMORY option for heap table

Oracle inmemory options allows users to store their data in memory. This is how we can enable inmemory option which provides faster access to table data.

SQL> show parameter inmemory_size

 NAME                                 TYPE        VALUE
 ------------------------------------ ----------- ------------------------------
 inmemory_size                        big integer 0

Here in my database inmemory  size is set to zero. We can change it by editing this parameter, restart the database after changing this parameter.

SQL> alter system set inmemory_size=100M scope=spfile;

System altered.

SQL> shutdown immediate;
 Database closed.
 Database dismounted.
 ORACLE instance shut down.
 SQL> startup open;
 ORACLE instance started.

Total System Global Area 2432696320 bytes
 Fixed Size 2927288 bytes
 Variable Size 637535560 bytes
 Database Buffers 1660944384 bytes
 Redo Buffers 13848576 bytes
 In-Memory Area 117440512 bytes
 Database mounted.
 Database opened.
 SQL> alter pluggable database all open;

Pluggable database altered.

Check the value of inmemory_size again for the database.

SQL> show parameter inmemory_size

NAME                                  TYPE        VALUE
 ------------------------------------ ----------- ------------------------------
 inmemory_size                        big integer 112M

Now we can see the parameter is set to 112MB and this can be used to store our table data. Now inmemory store doesn’t contain any table and this can be verified by the below command.

SQL> select count(*) from v$im_segments;


Now, pin the table into inmemory store.

SQL> alter table my_tab inmemory;

Table altered.

SQL> select count(*) from v$im_segments;


Table is stored in INMEMORY store only after we query it first time.

SQL> select count(*) from v$im_segments;


SQL> select count(*) from my_tab;


SQL> select count(*) from v$im_segments;


Execution plan clearly show if the table is accessed from memory. So, lets get the execution plan for the above count statement.

 SQL> set autotrace traceonly explain;
SQL> select count(*) from my_tab;

Execution Plan
Plan hash value: 2076930067

| Id  | Operation            | Name   | Rows  | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT        |         |       1 |       1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE         |         |       1 |          |         |
|   2 |   TABLE ACCESS INMEMORY FULL| MY_TAB |       3 |       1   (0)| 00:00:01 |