Monday 18 May 2015

Accelerate ETL by Leveraging Delphix Virtual Databases

Many downstream database and analytics systems need to be routinely refreshed with processed versions of the latest changes from production.   Typically the new data needs go through an ETL process to prepare it for downstream use in an Analytics Engine, Warehouse, Data Mart, Reader Farm, etc.

There are several major benefits Delphix offers in this scenario.

Firstly Delphix maintains its own up-to-date copy of source data by seamlessly synchronising with the source databases and retaining this in a secure central data hub.

Second, all ETL processes can be run directly from virtual databases (VDBs) provisioned by Delphix, completely offloading production and without the need for physical copies.  You can create as many VDBs as you want to run these processes in parallel.

Third, if you wish to keep pace with new transactions you can use Delphix Live Sources, which uses an Oracle Standby to keep the Delphix dSource copy continuously updated but also enables reporting using Oracle Active Dataguard.  If synchronising with a target Hadoop cluster you can run Apache Sqoop jobs against a virtual data warehouse that is constantly being updated from the physical data warehouse.  You can run your ETL directly on the changed data stored in the virtual data warehouse and import only the processed changes into your Hadoop cluster.

In addition it is also possible to use Delphix to help identify deltas between two versions of a source database by creating two VDBs based on two point-in-time snapshots.

If the two VDBs (snap1 and snap2) are based on an Oracle source database, first you simply create a database link connecting to snap1 from snap2.

On snap2 you create tables to store:

1.     INSERTS, rows that have been added since the old version (snap1)
2.     UPDATES, rows that have been modified between the two versions
3.     DELETES, rows that no longer exist in the new version (snap2)

You can identify INSERTS, UPDATES and DELETES between the two versions using the MINUS set operator between the corresponding tables in the two VDBs.

For example to find and populate my local table “delta_inserts” with new rows, I use something like:
           
/* create empty table with correct structure */

create table delta_inserts as select * from table where 1=0;
alter table delta_inserts nologging;

/* populate table with new rows only */

insert /*+ append */
  into delta_inserts
  with new_rows as
  (
    select id from table@snap2
     minus
    select id from table@snap1
)
/* fetch entire row from latest version */
select latest.*
  from table@snap2 latest,
       new_rows delta
 where latest.c_id = delta.c_id;

The SQL is similar for UPDATES and DELETES.  However for UPDATES we need to compare the entire row in the minus operation since any column value might have changed.  For DELETES we only need to store the key or id we will use to identify the rows to delete during the apply process.

Now we have two tables containing copies of changed rows and one with a list of rows to be deleted.

We can now run our ETL processing on only the changed rows held in delta_inserts and delta_updates making this process substantially more efficient than running against the full dataset.

The final phase is to implement an apply process to merge the processed deltas with the target database.  The options for applying the changes will depend on the nature of the changes and the type of the target database which might be relational, Hadoop (HDFS), etc.

For Hadoop you can use Apache Sqoop to perform incremental imports and merges.  For relational databases there are more options.

The apply process would usually

1.     delete all the rows in the target table having the keys saved in delta_deletes
2.     insert or append all the rows to the target table having processed rows saved in delta_inserts
3.     merge or update all the rows in the target table having processed rows saved in delta_updates

If there are foreign key dependencies then you might need to disable and re-enable them before and after the apply process.

In summary, Delphix is ideal for provisioning data sets for ETL processing.  Delphix Virtual Databases (VDBs) can also underpin the process to identify changes between two versions of the same database. ETL jobs can then operate against only the changed rows and finally once processed these changes can be used to refresh the target database using a simple apply process.




Learn more at Delphix Agile Data Management

No comments:

Post a Comment