Month: December 2016

Automatic Big Table Caching in Oracle 12c

Oracle uses its buffer cache to store recently accessed table blocks, which lets it serve those blocks from memory if they are requested again. Since the buffer cache resides in memory and memory is expensive, the cache size is always limited. Oracle uses an LRU (Least Recently Used) algorithm to keep the most recently accessed blocks in memory. The actual buffer cache management algorithm is complex, but let's simplify it by saying Oracle keeps a queue that holds the most recently used data at the hot end. As we query more and more data, existing entries in the queue are pushed backwards and finally age out of the queue.

When you query data that has already aged out of the cache, Oracle finds it is no longer in memory and performs physical reads (from disk), which is an expensive and time-consuming operation.

One of the big issues with this kind of cache is that if you query a big table, most of the queue can be replaced by data from that table alone, and all subsequent queries may go for physical reads. Oracle can't allow this, so such reads bypass the buffer cache to maintain its balance. Oracle usually avoids moving the blocks of huge tables into the buffer cache by using direct path reads/writes, which go through the PGA (Program Global Area), a memory area that is not shared. Since the PGA is not shared among users, data read there is of no use to other sessions, and this can lead to extensive physical read operations.
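
If you want to see how much of a particular table is currently sitting in the buffer cache, you can count its buffers in V$BH. The query below is only a rough sketch; SOME_TABLE is a placeholder for the table you are interested in, and buffers with status 'free' are excluded because they are not holding data.

SQL> select count(*) as cached_blocks
 2 from v$bh b
 3 join dba_objects o on o.data_object_id = b.objd
 4 where o.object_name = 'SOME_TABLE'
 5 and b.status <> 'free';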

Oracle 12c tries to overcome this issue by identifying the big tables in the database and caching their data more effectively. This is done by reserving a part of the buffer cache for storing big tables.

Let's test this feature by creating a couple of big tables (each with more than a million rows).

SQL> create table my_test_tbl
 2 (Name varchar2(100),
 3 Emp_No integer,
 4 Dept_no integer);

Table created.

SQL> insert into my_test_tbl
 2 select 'John Doe', level, mod(level,10)
 3 from dual
 4 connect by level <= 1000000;

1000000 rows created.

SQL> commit;

Commit complete.

SQL> select count(*) from my_test_tbl;

COUNT(*)
 ----------
 1000000

We need to analyze the table so that its statistics are updated.

SQL> analyze table my_test_tbl compute statistics;

Table analyzed.
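
As a side note, on recent Oracle versions DBMS_STATS is the recommended way to gather optimizer statistics; a minimal equivalent for our table would be something like the following (a sketch, assuming the table is owned by the current user):

SQL> exec dbms_stats.gather_table_stats(user, 'MY_TEST_TBL');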

We have to set a parameter at the system level so that a part of the buffer cache (40% in our case) will be reserved for caching big tables. First, let's check the size of the buffer cache allocated for this database.

SQL> select component, current_size/power(1024,3) current_size_GB from v$memory_dynamic_components
 2 where component = 'DEFAULT buffer cache';

COMPONENT CURRENT_SIZE_GB
 -------------------- ---------------
 DEFAULT buffer cache 1.546875

We have about 1.5 GB of buffer cache; let's allocate 40% of it for caching big tables.

SQL> show parameter big_table

NAME TYPE VALUE
 ------------------------------------ ----------- ------------------------------
 db_big_table_cache_percent_target string 0

SQL> alter system set db_big_table_cache_percent_target = 40;

System altered.

SQL> show parameter big_table

NAME TYPE VALUE
 ------------------------------------ ----------- ------------------------------
 db_big_table_cache_percent_target string 40

Now, if we query the table it will be cached into the big table cache.

SQL> select count(*) from my_test_tbl;

COUNT(*)
 ----------
 1000000

Please note that we don't have to restart the database to modify this parameter. Let's check whether the table is cached and how much of it is cached.

SQL> select * from V$BT_SCAN_CACHE;

(Screenshot: V$BT_SCAN_CACHE output.) It clearly shows that 40% of the buffer cache is reserved for big tables.

(Screenshot: V$BT_SCAN_OBJ_TEMPS output showing the cached table.)

We have already queried the table once, and Oracle has identified that the table is indeed a big one. Now that the table is in the cache, we can check the size of the table on disk and how much of it is cached. Since V$BT_SCAN_OBJ_TEMPS contains the object id, we can join it with DBA_OBJECTS to find the table name. Once we have the table name, DBA_TABLES gives us the size of the table on disk (in blocks).

SQL> select object_name from dba_objects where object_id = 92742;

OBJECT_NAME
 --------------------
 MY_TEST_TBL

SQL> column table_name format a20
SQL> select table_name, blocks from dba_tables
 2 where table_name = 'MY_TEST_TBL';

TABLE_NAME BLOCKS
 -------------------- ----------
 MY_TEST_TBL 6922
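
For reference, the two lookups above can be combined into one query against V$BT_SCAN_OBJ_TEMPS. This is only a sketch; the column names DATAOBJ#, SIZE_IN_BLKS, TEMPERATURE and CACHED_IN_MEM are taken from the 12c documentation for this view, so adjust them if your version differs.

SQL> select o.object_name, t.size_in_blks, t.temperature, t.cached_in_mem
 2 from v$bt_scan_obj_temps t
 3 join dba_objects o on o.data_object_id = t.dataobj#;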

The whole table is cached now and its temperature is set to 1000. If we use this table more and more, its temperature will go up, making it hot. The code snippet below queries my_test_tbl 10,000 times, which will help us increase the temperature of the table.

SQL> declare
 2 l_count integer;
 3 begin
 4 for i in 1..10000
 5 loop
 6 select count(*) into l_count from my_test_tbl;
 7 end loop;
 8 end;
 9 /

PL/SQL procedure successfully completed.

Check V$BT_SCAN_OBJ_TEMPS again to see whether the temperature value has gone up.

(Screenshot: V$BT_SCAN_OBJ_TEMPS output showing the increased temperature.)

We can see that the temperature of the table has gone up because of the frequent querying. Now let's create another table and see whether that also gets cached. This table will have 2 million records.

SQL> create table my_test_tbl2
 2 as select * from MY_TEST_TBL;

Table created.

SQL> insert into my_test_tbl2 select * from my_test_tbl;

1000000 rows created.

SQL> analyze table my_test_tbl2 compute statistics;

Table analyzed.

SQL> select table_name, blocks from dba_tables
 where table_name = 'MY_TEST_TBL2';

TABLE_NAME BLOCKS
 -------------------- ----------
 MY_TEST_TBL2 6224

SQL> select count(*) from MY_TEST_TBL2;

COUNT(*)
 ----------
 2000000

We can see the new table in the cache with an initial temperature value of 1000.

(Screenshot: V$BT_SCAN_OBJ_TEMPS output showing MY_TEST_TBL2 with an initial temperature of 1000.)

Let's run the snippet again to query the new table; this time we will query it only 100 times.
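
The block is the same as before, just pointed at the new table and capped at 100 iterations (shown here as a sketch, without the SQL*Plus output):

SQL> declare
 2 l_count integer;
 3 begin
 4 for i in 1..100
 5 loop
 6 select count(*) into l_count from my_test_tbl2;
 7 end loop;
 8 end;
 9 /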

Query V$BT_SCAN_OBJ_TEMPS again to see the new temperature value of the second table.

(Screenshot: V$BT_SCAN_OBJ_TEMPS output showing the updated temperature of the second table.)

This temperature value helps Oracle prioritize tables in memory and identify which tables are queried frequently. Based on this information, Oracle decides which tables stay in memory and which have to move out.

Keep in mind that we currently don't have any option to move individual tables into this cache. It is completely automated and happens at Oracle's discretion: our table may or may not be moved into the cache. But if you have big tables that you think may benefit from this feature, it is worth checking out.

Import data from MySQL to Hadoop using SQOOP

As part of our job we import and move a lot of data from relational databases (mainly Oracle and MySQL) into Hadoop. Most of our data stores are in Oracle, with a few internal data stores running on MySQL.

SQOOP (SQL-to-Hadoop) is an Apache tool for importing data from relational databases (there are separate drivers for each database) into Hadoop. In this blog we will import data from a MySQL table into the Hadoop file system (HDFS).

Here, I have a MySQL instance running on the same local machine where my Hadoop cluster is running. You will have to download the JDBC driver for your database and place it in the appropriate directory for SQOOP to connect to it. The MySQL driver is already present on my machine, as SQOOP offers extensive support for MySQL.

The link below lists the available drivers and their locations if you are using a different database.

https://sqoop.apache.org/docs/1.4.0-incubating/SqoopUserGuide.html#id1773570
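
As an illustration, with a typical Hadoop distribution the MySQL JDBC connector JAR just needs to be copied into Sqoop's lib directory; the path and JAR version below are assumptions, so adjust them for your installation:

# hypothetical connector version and Sqoop lib path -- adjust for your setup
cp mysql-connector-java-5.1.40-bin.jar /usr/lib/sqoop/lib/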

First, let me log in to the primary node of my 3-node cluster (virtual, created with Vagrant and VirtualBox).

vagrant ssh node1

Let us check the connection and data in the MySQL database.

mysql -u root -h localhost -p
Enter password: ********
MariaDB [(none)]> show databases;
+--------------------+
| Database |
+--------------------+
| information_schema |
| my_test |
| mysql |
| performance_schema |
| test |
+--------------------+
5 rows in set (0.04 sec)

use my_test;

MariaDB [my_test]> show tables;
+-------------------+
| Tables_in_my_test |
+-------------------+
| name_data |
| name_data2 |
+-------------------+
2 rows in set (0.02 sec)

Now, let's check the data.

MariaDB [my_test]> select count(*) from name_data;
+----------+
| count(*) |
+----------+
| 1858689 |
+----------+

MariaDB [my_test]> select * from name_data limit 3;
+------+--------+-------+
| Name | Gender | count |
+------+--------+-------+
| Mary | F | 7065 |
| Anna | F | 2604 |
| Emma | F | 2003 |
+------+--------+-------+

Now that we are sure we have data in the MySQL table, let's check our Hadoop home directory.

[vagrant@node1 ~]$ hadoop fs -ls /user/vagrant
Found 5 items
drwx------ - vagrant hdfs 0 2016-12-20 02:56 /user/vagrant/.Trash
drwxr-xr-x - vagrant hdfs 0 2016-10-26 04:46 /user/vagrant/.hiveJars
drwx------ - vagrant hdfs 0 2016-11-13 23:44 /user/vagrant/.staging
drwxr-xr-x - vagrant hdfs 0 2016-12-06 04:13 /user/vagrant/test_files

Now we want to move the data from MySQL into the /user/vagrant/my_data directory. Below is the sqoop command to import the data.

[vagrant@node1 ~]$ sqoop import --connect jdbc:mysql://localhost/my_test --username root --password ******* --table name_data --m 1 --target-dir /user/vagrant/my_data

Once this command is completed, data will be present in /user/vagrant/my_data.

[vagrant@node1 ~]$ hadoop fs -ls /user/vagrant/my_data
Found 2 items
-rw-r--r-- 3 vagrant hdfs 0 2016-12-20 03:20 /user/vagrant/my_data/_SUCCESS
-rw-r--r-- 3 vagrant hdfs 22125615 2016-12-20 03:20 /user/vagrant/my_data/part-m-00000

[vagrant@node1 ~]$ hadoop fs -cat /user/vagrant/my_data/part-m-00000| wc -l
1858689
[vagrant@node1 ~]$

[vagrant@node1 ~]$ hadoop fs -cat /user/vagrant/my_data/part-m-00000| head -3
Mary,F,7065
Anna,F,2604
Emma,F,2003

We can also create an options file and store the common arguments in it for reusability.

[vagrant@node1 ~]$ cat sqoop_test_config.cnf
import
--connect
jdbc:mysql://localhost/my_test
--username
root

[vagrant@node1 ~]$ sqoop --options-file ./sqoop_test_config.cnf --password ***** --m 1 --table name_data --target-dir /user/vagrant/my_data

This also does the same job, but now we have the flexibility to save, edit and reuse the commands.

Delete lines from multiple files (recursively) in Linux

We had a requirement to delete a line matching a particular pattern from multiple ksh files. These lines of code were used to log execution status, and we no longer needed them after an architecture change.

Opening hundreds of files and deleting the lines manually would have been a painful task, so we achieved this by combining the find and sed commands.

find . -name "*.ksh" -type f | xargs sed -i -e '/Search String/d'

The find command searches recursively for ksh files in the current directory and lists them. The second part, xargs and sed, searches for the pattern in each file and deletes the matching lines.
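
If you want a safety net before editing files in place, GNU sed can keep a backup copy of each modified file. This variant is just a sketch of the same idea; the .bak suffix is arbitrary:

find . -name "*.ksh" -type f -exec sed -i.bak '/Search String/d' {} +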

You can refer to the manual pages if you need more information on these commands.