Deleting thousands of files from Linux directory

Our project team uses a script which creates thousands of files in a folder and we sometime has to manually clean up all those files.

rm command failed saying the list is too long and we had to find another method to do this. rather than writing s shell script and delete the files one by one we used the find command.

Here how we had done it.

>>>ls -lrt| wc -l

We have 250K files in this directory and we tried removing them using the rm command.

>>> rm *.env
 ksh: rm: /bin/rm: cannot execute [Argument list too long]

This issue can be easily solved by the find command and we noticed find runs faster in such situations.

>>>find . -name "*.env" -delete

Above listed find command deleted all .env files in the current directory.

Access Oracle from Python

If you are a data engineer you might have to pull data from various sources and relational databases (RDBMs) is one of them. Python has separate modules to connect to each of relational databases (Like Oracle, MySQL, PostgresSQL etc.). Today we will discuss how we can connect and query Oracle database from Python.

I hope you have working Python environment set up already. Here I’m using Python CLI with Python 2.7 for this tutorial.

1. Download and install oracle 12c database.

I have downloaded Oracle 12C database from OTN ( and installed in my CentOS 7 laptop. Since it is installed in my laptop, server name is localhost and the database name is pdborcl.localdomain

2. We need to install an python module called cx_Oracle which can help us to connect and extract data from the Oracle instance. It gave me couple of errors when I first tried to install on my machine.

[oracle@ora-c7 ~]$ sudo pip install cx_Oracle
 Collecting cx_Oracle
 Using cached cx_Oracle-5.2.1.tar.gz
 Complete output from command python egg_info:
 Traceback (most recent call last):
 File "", line 1, in 
 File "/tmp/pip-build-kkqi2h/cx-Oracle/", line 170, in 
 raise DistutilsSetupError("cannot locate an Oracle software " \
 distutils.errors.DistutilsSetupError: cannot locate an Oracle software installation

 Command "python egg_info" failed with error code 1 in /tmp/pip-build-kkqi2h/cx-Oracle/

This is because some packages are missing and we need to install them before installing cx_Oracle. Lets install setuptools and ez-setup.

[root@ora-c7 oracle]# pip install --upgrade setuptools
 Collecting setuptools
 Downloading setuptools-28.7.1-py2.py3-none-any.whl (472kB)
 100% |████████████████████████████████| 481kB 857kB/s
 Installing collected packages: setuptools
 Found existing installation: setuptools 20.10.1
 Uninstalling setuptools-20.10.1:
 Successfully uninstalled setuptools-20.10.1
 Successfully installed setuptools-28.7.1

[root@ora-c7 oracle]# pip install ez_setup
 Collecting ez_setup
 Downloading ez_setup-0.9.tar.gz
 Installing collected packages: ez-setup
 Running install for ez-setup ... done
 Successfully installed ez-setup-0.9

Now lets try to reinstall cx_Oracle.

[root@ora-c7 oracle]# pip install cx_Oracle
 Collecting cx_Oracle
 Using cached cx_Oracle-5.2.1.tar.gz
 Installing collected packages: cx-Oracle
 Running install for cx-Oracle ... done
 Successfully installed cx-Oracle-5.2.1

Lets try to run a sample program which connects and pull some data. We will connect and get the oracle database version and query a table. This is a simple program we can do much complex processing with Cx_Oracle, I will list them in another blog.

Oracle version I have is 12C and the table has 100 records in it.

[oracle@ora-c7 ~]$ python
 Python 2.7.5 (default, Sep 15 2016, 22:37:39)
 [GCC 4.8.5 20150623 (Red Hat 4.8.5-4)] on linux2
 Type "help", "copyright", "credits" or "license" for more information.
 >>> import cx_Oracle
 >>> from pprint import pprint
 >>> db=cx_Oracle.connect('******','*****','localhost:1521/pdborcl.localdomain')
 >>> print db.version
 >>> cursor = db.cursor()
 >>> cursor.execute('select count(*) from my_tab');
 >>> pprint(cursor.fetchall())

OEM Express in 12C

If you are looking for some basic database monitoring, Oracle 12C provides Enterprise Manager Express for that purpose. It is a light weight version of EM with limited features. I use it to monitor the oracle 12c instance installed on my laptop which runs on CentOS 7.0.

For accessing the EM express we have to enable the HTTP and HTTPS port for the database . Here is how you can check the HTTP and HTTPS port settings for your database.

SQL> select dbms_xdb.getHttpPort() from dual;


SQL> select dbms_xdb_config.getHttpsPort() from dual;


If you want to manually set up the HTTPS or HTTP port you can do that by logging into the database and running the below commands.

SQL> exec dbms_xdb_config.sethttpsport(5501);


SQL> exec dbms_xdb_config.sethttpport(8080);

In order to use EM Express, a database user needs to have been granted the EM_EXPRESS_BASIC or EM_EXPRESS_ALL role.  The DBA role includes both the EM_EXPRESS_BASIC and the EM_EXPRESS_ALL roles.

Once this is done use the below URL to access EM Express.


Login page.


Home Page.


Performance Home.


It may not have all the OEM features but a nice tool indeed.

How To Get/Set Current Database Name in Hive

It is very easy to find/display current database information in Hive Command Line Interface (CLI). Just run the below command and terminal will display the current Hive database you are connected to.

hive> set hiveconf:hive.cli.print.current.db=true;
 hive (my_db)>

This value will be overwritten if you restart the Hive CLI or you open a new one. We can set it my permanently by editing the

~/.hiverc file.

restart the hive CLI after updating the file and it will show which DB you are connected to.

hive (default)> use my_db ;
 Time taken: 3.189 seconds
 hive (my_db)>

Oracle INMEMORY option for heap table

Oracle inmemory options allows users to store their data in memory. This is how we can enable inmemory option which provides faster access to table data.

SQL> show parameter inmemory_size

 NAME                                 TYPE        VALUE
 ------------------------------------ ----------- ------------------------------
 inmemory_size                        big integer 0

Here in my database inmemory  size is set to zero. We can change it by editing this parameter, restart the database after changing this parameter.

SQL> alter system set inmemory_size=100M scope=spfile;

System altered.

SQL> shutdown immediate;
 Database closed.
 Database dismounted.
 ORACLE instance shut down.
 SQL> startup open;
 ORACLE instance started.

Total System Global Area 2432696320 bytes
 Fixed Size 2927288 bytes
 Variable Size 637535560 bytes
 Database Buffers 1660944384 bytes
 Redo Buffers 13848576 bytes
 In-Memory Area 117440512 bytes
 Database mounted.
 Database opened.
 SQL> alter pluggable database all open;

Pluggable database altered.

Check the value of inmemory_size again for the database.

SQL> show parameter inmemory_size

NAME                                  TYPE        VALUE
 ------------------------------------ ----------- ------------------------------
 inmemory_size                        big integer 112M

Now we can see the parameter is set to 112MB and this can be used to store our table data. Now inmemory store doesn’t contain any table and this can be verified by the below command.

SQL> select count(*) from v$im_segments;


Now, pin the table into inmemory store.

SQL> alter table my_tab inmemory;

Table altered.

SQL> select count(*) from v$im_segments;


Table is stored in INMEMORY store only after we query it first time.

SQL> select count(*) from v$im_segments;


SQL> select count(*) from my_tab;


SQL> select count(*) from v$im_segments;


Execution plan clearly show if the table is accessed from memory. So, lets get the execution plan for the above count statement.

 SQL> set autotrace traceonly explain;
SQL> select count(*) from my_tab;

Execution Plan
Plan hash value: 2076930067

| Id  | Operation            | Name   | Rows  | Cost (%CPU)| Time     |
|   0 | SELECT STATEMENT        |         |       1 |       1   (0)| 00:00:01 |
|   1 |  SORT AGGREGATE         |         |       1 |          |         |
|   2 |   TABLE ACCESS INMEMORY FULL| MY_TAB |       3 |       1   (0)| 00:00:01 |

Virtualbox – Vagrant issue

I have  3 nodes Hadoop cluster installed in my laptop using virtualbox and Vagrant. I got the following error when i tried to start my cluster yesterday.

[root@ora-c7 hadoop-ops-course-master]# vagrant up node1 node2 node3
 The provider 'virtualbox' that was requested to back the machine
 'node1' is reporting that it isn't usable on this system. The
 reason is shown below:

VirtualBox is complaining that the kernel module is not loaded. Please
 run `VBoxManage --version` or open the VirtualBox GUI to see the error
 message which should contain instructions on how to fix this error.

Initially i thought this is an issue with vagrant and thought on reinstalling the whole cluster, but when I tried to start the virtualbox from the terminal i got a different error message asking me to recompile the Virtualbox kernel.

[root@ora-c7 hadoop-ops-course-master]# virtualbox --version
WARNING: The vboxdrv kernel module is not loaded. Either there is no module
 available for the current kernel (3.10.0-327.36.1.el7.x86_64) or it failed to
 load. Please recompile the kernel module and install it by

sudo /sbin/rcvboxdrv setup

You will not be able to start VMs until this problem is fixed.
START /usr/bin/firefox ""

I recompiled the kernel and everything started working fine again.

[root@ora-c7 Downloads]# /sbin/rcvboxdrv setup

Stopping VirtualBox kernel modules [ OK ]
Recompiling VirtualBox kernel modules [ OK ]
Starting VirtualBox kernel modules [ OK ]
[root@ora-c7 Downloads]#

Top-N queries in oracle 12c

Oracle 12c has a new feature to select top N queries easily. This can be done by using the FETCH clause and the syntax is as follows.

[ OFFSET offset { ROW | ROWS } ]
[ FETCH { FIRST | NEXT } [ { rowcount | percent PERCENT } ]
    { ROW | ROWS } { ONLY | WITH TIES } ]

Lets create a table with 100 rows and see how this command works. We will insert a sequence of 1-100 into the table from dual using the connect by clause.

SQL> create table my_tab as
  2  select level id from dual
  3  connect by level <= 100;

Table created.

SQL> select count(*) from my_tab;


Now lets check how this feature now. We will sort the ids in descending order and pick the top 5/6 values from the list.

SQL> select * from my_tab
  2  order by id desc
  3  fetch first 5 rows only;


SQL> select * from my_tab
  2  order by id desc
  3  fetch first 6 percent rows only;


6 rows selected.

SQL> select * from my_tab
  2  order by id desc
  3  offset 5 rows fetch next 5 percent rows only;