March 3, 2014
This is going to be another one of those posts, a bit like this one, that discuss the use of Oracle’s database product with Advanced Format devices. I wish there weren’t so many of these posts, but it seems that Oracle has a lot of issues with its implementation of 4k support.
(Before reading on, if you aren’t sure what I’m talking about here then please have a read of this page…)
In the last post I built a database which used Oracle ASM (and the Linux ASMLib kernel driver) but found that if the database used an SPFILE which was located on a 4k device (within an ASM diskgroup) it didn’t work. Today, I’m going to forgo ASM and use a filesystem instead (something I would never do in real life).
Building a 4k Filesystem
Let’s start with a single 4k LUN being presented from my Violin array. I’ve already configured the Linux device mapper multipathing so that it presents itself as a nicely-named device in the /dev/mapper directory:
[oracle@half-server4 ~]$ ls -l /dev/mapper/fs4ktest
lrwxrwxrwx 1 root root 7 Feb 25 15:53 /dev/mapper/fs4ktest -> ../dm-7
[oracle@half-server4 ~]$ fdisk -l /dev/mapper/fs4ktest
Note: sector size is 4096 (not 512)
Disk /dev/mapper/fs4ktest: 215.8 GB, 215822106624 bytes
255 heads, 63 sectors/track, 3279 cylinders
Units = cylinders of 16065 * 4096 = 65802240 bytes
Sector size (logical/physical): 4096 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 524288 bytes
We can see that this is indeed a 4k device, i.e. it has not only a 4096 byte physical blocksize, but a 4096 byte logical blocksize too. The fdisk command has even taken the time to print a special “Note” to ensure we see the sector size is not the usual 512 bytes. The next thing to do is format it with a filesystem so I’m going to use ext4:
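As a rough sketch of how those two numbers are usually interpreted, the helper below classifies a device from its logical and physical sector sizes. The function name classify_sectors and its labels are my own inventions for illustration, not anything from fdisk or Oracle.

```shell
# Hypothetical helper: classify a device from its sector sizes.
# $1 = logical sector size in bytes, $2 = physical sector size in bytes.
classify_sectors() {
    logical=$1
    physical=$2
    if [ "$logical" -eq 4096 ] && [ "$physical" -eq 4096 ]; then
        echo "4k native"
    elif [ "$logical" -eq 512 ] && [ "$physical" -eq 4096 ]; then
        echo "512e (emulated)"
    elif [ "$logical" -eq 512 ] && [ "$physical" -eq 512 ]; then
        echo "512 native"
    else
        echo "unknown"
    fi
}

# The values reported by fdisk for this LUN:
classify_sectors 4096 4096
```

On a live system the two arguments could come from blockdev --getss and blockdev --getpbsz run as root against the device, which should agree with what fdisk prints.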
[root@half-server4 ~]# mkfs.ext4 /dev/mapper/fs4ktest
mke2fs 1.41.12 (17-May-2010)
Filesystem label=
OS type: Linux
Block size=4096 (log=2)
Fragment size=4096 (log=2)
Stride=1 blocks, Stripe width=128 blocks
2097152 inodes, 8388608 blocks
419430 blocks (5.00%) reserved for the super user
First data block=0
Maximum filesystem blocks=4294967296
256 block groups
32768 blocks per group, 32768 fragments per group
8192 inodes per group
Superblock backups stored on blocks:
        32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
        4096000, 7962624
Writing inode tables: done
Creating journal (32768 blocks): done
Writing superblocks and filesystem accounting information: done
Now it needs to be mounted. I’m just going to stick it on a mount point in a new top-level directory called /fstest:
[root@half-server4 ~]# mkdir -p /fstest/fs4ktest
[root@half-server4 ~]# chown -R oracle:oinstall /fstest
[root@half-server4 ~]# mount /dev/mapper/fs4ktest /fstest/fs4ktest
Finally, I am going to create an Oracle database using this filesystem. I’m not going to cut and paste all the output for that, because it’s all a bit dull… so let’s just skip to the bit where DBCA has completed and the database is open.
Oracle On A 4k Filesystem
So the good news is, it worked. The database is up and running and the datafiles are located on the 4k filesystem:
[oracle@half-server4 fstest]$ sqlplus / as sysdba
SQL*Plus: Release 12.1.0.1.0 Production on Mon Mar 3 21:30:02 2014
Copyright (c) 1982, 2013, Oracle. All rights reserved.
Connected to:
Oracle Database 12c Enterprise Edition Release 12.1.0.1.0 - 64bit Production
With the Partitioning, OLAP, Advanced Analytics and Real Application Testing options
SQL> select name from v$datafile;
NAME
---------------------------------------------------------------------------------------
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_system_9k9wxmw6_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_sysaux_9k9ww6r6_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_undotbs1_9k9wz257_.dbf
/fstest/fs4ktest/oracle/oradata/FSTEST/datafile/o1_mf_users_9k9wz123_.dbf
Cool. We can end this post here then, right? Well, no… because there is a bit of a problem with this database. Let’s just have a quick check of the FILESYSTEMIO_OPTIONS parameter:
SQL> show parameter filesystem
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
filesystemio_options                 string      none
This parameter controls the way that I/O is performed for files located on filesystems. It isn’t relevant for databases using Oracle ASM (for which the DISK_ASYNCH_IO parameter exists instead), but here it’s making a massive difference. According to the Oracle documentation, it has four possible options:
ASYNCH: enable asynchronous I/O on file system files, which has no timing requirement for transmission.
DIRECTIO: enable direct I/O on file system files, which bypasses the buffer cache.
SETALL: enable both asynchronous and direct I/O on file system files.
NONE: disable both asynchronous and direct I/O on file system files.
Normally, when I see filesystem-based databases, I find this parameter set to SETALL. This means asynchronous and direct I/O, but here it is set to NONE which means neither. And it’s the DIRECTIO that we are interested in.
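To make the four values concrete, here is a small sketch that maps each one to the two behaviours it switches on or off, following the documentation definitions quoted above. The function name describe_fsio and its output format are made up for this post; it is not an Oracle tool.

```shell
# Hypothetical helper: print what each FILESYSTEMIO_OPTIONS value enables,
# following the four definitions from the Oracle documentation.
describe_fsio() {
    # Normalise to upper case so 'setall' and 'SETALL' behave alike.
    value=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
    case "$value" in
        ASYNCH)   echo "async=yes direct=no" ;;
        DIRECTIO) echo "async=no direct=yes" ;;
        SETALL)   echo "async=yes direct=yes" ;;
        NONE)     echo "async=no direct=no" ;;
        *)        echo "unknown value: $1" >&2; return 1 ;;
    esac
}

describe_fsio none      # what this database is using
describe_fsio setall    # what I usually see on filesystem-based databases
```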
One Buffer Cache Is Enough
As you are no doubt aware, Oracle databases have a buffer cache which is used to cache copies of database blocks. However, the Linux operating system also has its own buffer cache for filesystems. Most people would consider two levels of cache wasteful, and if that is the case, it will obviously be the Oracle buffer cache that needs to be used. So let’s set the parameter to use direct I/O and then restart the database (as the parameter is not dynamic):
SQL> alter system set filesystemio_options='directIO' scope=spfile;
System altered.
SQL> shutdown immediate
Database closed.
Database dismounted.
ORACLE instance shut down.
SQL> startup
ORACLE instance started.
Total System Global Area 1.3896E+10 bytes
Fixed Size                  4663568 bytes
Variable Size            2751465200 bytes
Database Buffers         1.1107E+10 bytes
Redo Buffers               33673216 bytes
ORA-00205: error in identifying control file, check alert log for more info
Oh dear. What happened?
[oracle@half-server4 ~]$ tail /u01/app/oracle/diag/rdbms/fstest/fstest/trace/alert_fstest.log
ORA-00210: cannot open the specified control file
ORA-00202: control file: '/fstest/fs4ktest/oracle/oradata/FSTEST/controlfile/o1_mf_9k9wzp31_.ctl'
ORA-27047: unable to read the header block of file
Linux-x86_64 Error: 22: Invalid argument
Additional information: 1
ORA-205 signalled during: ALTER DATABASE MOUNT...
The answer, which you can find in My Oracle Support note 1133713.1, is that Oracle does not support 4k devices with direct I/O. This has been the case for a long time – I remember first discovering this nearly two years ago, on 188.8.131.52, yet there is no sign of it being fixed. According to the note, “It is not yet known in which version this support will be available.” Pah.
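The "Invalid argument" (error 22) in the alert log is at least consistent with how Linux treats direct I/O in general: with O_DIRECT, the file offset and the transfer size normally have to be multiples of the device's logical block size. The sketch below only models that alignment rule (it ignores the separate alignment requirement on the memory buffer). The name direct_io_ok is hypothetical, and I am assuming, not quoting from the support note, that a 512-byte-aligned read of the control file header is what trips Oracle up here.

```shell
# Hypothetical check: would an O_DIRECT request be acceptably aligned?
# $1 = file offset in bytes, $2 = request size in bytes,
# $3 = logical block size of the underlying device in bytes.
direct_io_ok() {
    offset=$1
    size=$2
    lbs=$3
    if [ $((offset % lbs)) -eq 0 ] && [ $((size % lbs)) -eq 0 ]; then
        echo "aligned"
    else
        echo "EINVAL"
    fi
}

direct_io_ok 0 512 512      # 512-byte read on a 512-byte device
direct_io_ok 0 512 4096     # the same read on a 4k device
direct_io_ok 0 4096 4096    # a 4k read on a 4k device
```

One way to try the real thing is dd with iflag=direct against a file on the 4k filesystem: I would expect bs=512 to fail with "Invalid argument" and bs=4096 to succeed, although I have not included output for that here.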
There’s More: Diagnostic Destination on 4k
And then there’s the diagnostic destination. How about if I choose to locate this on a 4k filesystem?
SQL> show parameter diagnostic_dest
NAME                                 TYPE        VALUE
------------------------------------ ----------- ------------------------------
diagnostic_dest                      string      /u01/app/oracle
SQL> alter system set diagnostic_dest='/fstest/fs4ktest/oracle' scope=spfile;
System altered.
I’ll give it a few minutes and then go and look in some of the files… guess what I see?
ORA-48101: error encountered when attempting to read a file [block] [/fstest/fs4ktest/oracle/diag/rdbms/fstest/fstest/metadata/INCIDENT.ams]
ORA-27072: File I/O error
Linux-x86_64 Error: 22: Invalid argument
Additional information: 4
Additional information: 1
Additional information: -1
Look familiar? (This is nothing to do with direct I/O by the way, I disabled that again before this test.)
So let’s be honest, things aren’t going all that well here. There are still a lot of things that do not appear to work properly when using 4k devices. Luckily, my Violin array can present storage as 512 byte to avoid this sort of issue, but really I feel that Oracle needs to get cracking on its Advanced Format support. This is not just a flash memory thing: pretty much every major disk vendor is making Advanced Format devices now, from Western Digital through HGST to Seagate.
Time to get with the programme?