I developed some PL/SQL procedures for that using original awrextr.sql/awrload.sql scripts as a template.
That solution have some drawbacks, for example, an AWR retention for foreign databases always set to 40150 years after each load:
SQL> select dbid, retention from dba_hist_wr_control; DBID RETENTION ---------- -------------------- 1110059808 +00366 00:00:00.0 1917063347 +40150 00:00:00.0 1996649024 +40150 00:00:00.0We could change the retention but that would always be reset on the next AWR load.
To prevent the excessive AWR data growth, I executed DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE procedure once a year.
The 2015 year was no exception.
I also executed ALTER TABLE SHRINK SPACE command after I had executed DBMS_WORKLOAD_REPOSITORY.DROP_SNAPSHOT_RANGE procedures.
Usually that command takes a while to execute, because the some of AWR tables are more than 50G in size, for example, WRH$_ACTIVE_SESSION_HISTORY.
It's a base table for DBA_HIST_ACTIVE_SESS_HISTORY view and contains an ASH data that's huge.
This year things went wrong:
SQL> alter table SYS.WRH$_ACTIVE_SESSION_HISTORY shrink space; -- AFTER 4-8 hours * ERROR at line 1: ORA-03113: end-of-file on communication channel Process ID: 635 Session ID: 281 Serial number: 54363alert.log was not very helpful:
Mon Dec 21 14:37:31 2015 Process 0x3cd728e40 appears to be hung while dumping Current time = 1689592362, process death time = 1689530965 interval = 60000 Called from location UNKNOWN:UNKNOWN Attempting to kill process 0x3cd728e40 with OS pid = 17740 OSD kill succeeded for process 3cd728e40 Mon Dec 21 14:39:46 2015 Errors in file /oracle/diag/rdbms/orcl/orcl/trace/orcl_ora_25447.trc:But a relevant trace file (which was more than 4G in size) spotted a light on a problem:
*** ACTION NAME:() 2015-12-21 14:39:46.890 oer 8102.2 - obj# 1021709, rdba: 0x086ad43c(afn 33, blk# 2806844) kdk key 8102.2: ncol: 6, len: 26 key: (26): 06 c5 0c 0b 06 63 09 04 c3 08 0e 3e 02 c1 02 05 c4 62 46 05 3b 03 c2 02 3d ff mask: (4096): 81 90 20 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00Well, string "oer 8102.2" it's a sign of famous ORA-8102 error:
[oracle@localhost ~]$ oerr ora 8102 08102, 00000, "index key not found, obj# %s, file %s, block %s (%s)" // *Cause: Internal error: possible inconsistency in index // *Action: Send trace file to your customer support representative, along // with information on reproducing the errorobj# 1021709 in the trace file easily shows to us where the problem is:
SQL> select xmltype(cursor(select * from dba_objects where object_id=1021709)) xml_rec from dual; XML_REC ---------------------------------------------------------------------------------------------------------- <ROWSET> <ROW> <OWNER>SYS <OBJECT_NAME>WRH$_ACTIVE_SESSION_HISTORY_PK <SUBOBJECT_NAME>WRH$_ACTIVE_1110059808_66853 <OBJECT_ID>1021709 <DATA_OBJECT_ID>1080082 <OBJECT_TYPE>INDEX PARTITION <CREATED>27.10.2015 01:28:05 <LAST_DDL_TIME>22.12.2015 09:11:22 <TIMESTAMP>2015-10-27:01:28:05 <STATUS>VALID <TEMPORARY>N <GENERATED>N <SECONDARY>N <NAMESPACE>4 <SHARING>NONE <ORACLE_MAINTAINED>Y </ROW> </ROWSET>The index WRH$_ACTIVE_SESSION_HISTORY_PK is local partitioned on next columns:
CREATE UNIQUE INDEX "SYS"."WRH$_ACTIVE_SESSION_HISTORY_PK" ON "SYS"."WRH$_ACTIVE_SESSION_HISTORY" ("DBID", "SNAP_ID", "INSTANCE_NUMBER", "SAMPLE_ID", "SESSION_ID", "CON_DBID")The all of the above columns are numbers.
The index key:
key: (26): 06 c5 0c 0b 06 63 09 04 c3 08 0e 3e 02 c1 02 05 c4 62 46 05 3b 03 c2 02 3d ffcan be decrypted easily:
SQL> with input as ( 2 -- index key 3 select '06 c5 0c 0b 06 63 09 04 c3 08 0e 3e 02 c1 02 05 c4 62 46 05 3b 03 c2 02 3d ff' c from dual), 4 t(pos, num,c) as ( 5 -- traverse the index key recursively 6 select 0 pos, 0, replace(c, ' ') c from input union all 7 select pos+1, 8 utl_raw.cast_to_number(hextoraw(substr(c, 3, to_number(substr(c,1,2), 'xx')*2))), 9 substr(c, (to_number(substr(c, 1, 2), 'xx')+1)*2+1) 10 from t 11 where c<>'ff') 12 select pos, num 13 from t 14 where pos>0 15 order by pos 16 / POS NUM ---------- ---------- 1 1110059808 2 71361 3 1 4 97690458 5 160Ok, now we've known that: dbid=1110059808, snap_id=71361, instance_number=1, sample_id=97690458, session_id=160.
But the index was built on six columns when the problem key contains only five.
What is the CON_DBID for the problem key? That's where the trouble lies!
CON_DBID was introduced in Oracle 12.1 and added as a "DEFAULT NOT NULL" column, i.e. that column addition was a metadata-only operation without updating the table blocks (unless _add_col_optim_enabled was not set to FALSE).
SQL> select to_char(property, 'fm0xxxxxxx') 2 from sys.col$ 3 where obj# = 22800 4 and name = 'CON_DBID'; TO_CHAR(PROPERTY,'FM0XXXXXX --------------------------- 40000000Notice that property was 0x40000000 - I speculated that such property was set to columns which were added with default column optimization.
Some MOS notes prove that, i.e. Table Columns Have Wrong Default Value After Transporting Tablespaces (Doc ID 1602994.1).
I dumped the relevant table block and found that CON_DBID column was not present in it.
The pieces of that puzzle started to fit together.
I had constructed a simple test case which reproduced an ORA-8102 error:
SQL> create table t(x int, pad varchar2(100)) enable row movement; SQL> insert /*+ append*/ 2 into t 3 select level, lpad('x', 100, 'x') 4 from dual 5 connect by level<=1e5; SQL> commit; SQL> SQL> alter table t add y int default 10 not null; SQL> SQL> create index t_xy_i on t(x,y); SQL> SQL> delete t where x<=1e5/2; SQL> commit; SQL> SQL> alter table t shrink space; alter table t shrink space * ERROR at line 1: ORA-08102: index key not found, obj# 91957, file 10, block 3990 (2)The problem is present in Oracle 11.2.0.4 and more newest versions. I didn't test in earlier 11g patchsets.
Oracle Support raised a new bug: ORA-8102 ON ALTER TABLE SHRINK SPACE WITH ADD COL OPTIMIZATION (unpublished), which is still under investigation.
The tables suspected to an ORA-8102 error can be identified easily by bitand(col$.property, 1073741824)=1073741824 (0x40000000 in hex):
SQL> select o.object_name 2 from sys.col$ c, 3 dba_objects o 4 where bitand(c.property, 1073741824)=1073741824 5 and o.object_id=c.obj# 6 and o.owner='SYS' 7 order by o.object_name;I restricted above query to SYS schema. Below is the output of that query in one of database which was upgraded from 11.2.0.4 to 12.1.0.2 in the 2015:
You could see that mostly AWR tables are affected (WRH$%, WRI%). The brand new 12c database's AWR tables won't have property set to 1073741824 (0x40000000), unless you wouldn't create some.
If you set the _add_col_optim_enabled parameter to FALSE, you will be safe.
If you will face a same issue one day, the possible workaround for the issue is to rebuild table in anyway to update the problem column (MOVE table, redefine table online, or set column value equals to self).