Software for Data Replication
For many application scenarios, you must use data from existing systems in SAP HANA. The process of first replicating data structures and then an existing data set (initial load) is called data replication. If the data is subsequently changed in the original system (for example, after creating a new business partner), the mirrored data is updated as well (delta load). The existing systems can be systems of the SAP Business Suite, SAP NetWeaver BW, or any other data source.
Depending on the data source and usage scenario, different mechanisms and tools can be used for data replication.
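The three steps described above (replicating the structure, the initial load, and subsequent delta loads) can be sketched in a few lines of Python. This is only an illustrative model, not how any SAP replication tool is implemented: the `Table` class, the function names, and the business partner sample data are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Table:
    """Hypothetical in-memory table: column names plus rows keyed by primary key."""
    columns: Tuple[str, ...]
    rows: Dict[int, Tuple] = field(default_factory=dict)


def replicate_structure(source: Table) -> Table:
    """Step 1: create an empty target table with the same columns."""
    return Table(columns=source.columns)


def initial_load(source: Table, target: Table) -> None:
    """Step 2: copy the existing data set once."""
    target.rows = dict(source.rows)


def delta_load(target: Table, changes: Dict[int, Optional[Tuple]]) -> None:
    """Step 3: apply later changes; a value of None marks a deleted row."""
    for key, row in changes.items():
        if row is None:
            target.rows.pop(key, None)
        else:
            target.rows[key] = row


# Original system with one business partner.
source = Table(columns=("partner_id", "name"), rows={1: (1, "Acme")})

mirror = replicate_structure(source)
initial_load(source, mirror)

# A new business partner is created in the original system ...
source.rows[2] = (2, "Globex")
# ... and only this change is transferred to the mirror.
delta_load(mirror, {2: (2, "Globex")})
```

After the delta load, `mirror.rows` and `source.rows` hold the same two business partners.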
To benefit from these hardware trends, SAP has been working in close cooperation with hardware manufacturers during the development of SAP HANA. Consequently, the SAP HANA database currently only runs on hardware certified by SAP.
Current hard disks provide 15,000 rpm. Assuming that the disk needs 0.5 rotations on average per access, two milliseconds are already needed for these 0.5 rotations. In addition to this, the times for positioning the read/write head and the transfer time must be added, which results in a total of about six to eight milliseconds.
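The two milliseconds of rotational delay follow directly from the rotation speed. A minimal sketch of the arithmetic (the function name is chosen here for illustration only):

```python
RPM = 15_000                     # rotation speed stated in the text
AVG_ROTATIONS_PER_ACCESS = 0.5   # average share of a rotation per access


def rotational_latency_ms(rpm: float, rotations: float) -> float:
    """Time in milliseconds the platter needs to turn `rotations` times."""
    ms_per_rotation = 60_000 / rpm   # 60,000 ms per minute
    return ms_per_rotation * rotations


latency = rotational_latency_ms(RPM, AVG_ROTATIONS_PER_ACCESS)
print(f"{latency:.1f} ms")  # prints "2.0 ms"
```

Head positioning and transfer time come on top of this value, which is how the total of about six to eight milliseconds arises.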
When using Flash memory, no mechanical parts need be moved. This results in access times of about 200 microseconds. In SAP HANA, performance-critical data is placed in this type of memory and then loaded into the main memory.
Access to the main memory (DRAM, dynamic random access memory) is even faster. Typical access times are 60 to 100 nanoseconds. The exact access time depends on the access location within memory. With the NUMA architecture (non-uniform memory access) used in SAP HANA, a processor can access its own local memory faster than memory that is within the same system but is managed by other processors. With the currently certified systems, this memory area has a size of up to 4 TB.
Access times to caches in the CPU are usually given in clock ticks. For a CPU with a clock speed of 2.4 GHz, a cycle takes about 0.42 nanoseconds. The hardware certified for SAP HANA uses three caches, referred to as L1 to L3 cache. L1 cache can be accessed in three to four clock ticks, L2 cache in about ten clock ticks, and L3 cache in about 40 clock ticks. L1 cache has a size of 64 KB, L2 cache of 256 KB, and L3 cache of 30 MB. Each server comprises only one L3 cache, which is used by all CPUs, while each CPU has its own L2 and L1 cache. This is illustrated in the diagram below.
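To relate clock ticks to the nanosecond figures used for main memory, the tick counts can be converted with the cycle time. The following sketch uses only the numbers given in the text; the dictionary layout and function names are illustrative.

```python
CLOCK_HZ = 2.4e9  # 2.4 GHz, as in the example above

# Access cost in clock ticks (lower bound, upper bound) from the text.
CACHE_TICKS = {"L1": (3, 4), "L2": (10, 10), "L3": (40, 40)}


def cycle_time_ns(clock_hz: float) -> float:
    """Duration of one clock cycle in nanoseconds."""
    return 1e9 / clock_hz


def cache_latency_ns(clock_hz: float) -> dict:
    """Convert the tick counts of each cache level into nanoseconds."""
    cycle = cycle_time_ns(clock_hz)
    return {level: (low * cycle, high * cycle)
            for level, (low, high) in CACHE_TICKS.items()}


print(f"cycle: {cycle_time_ns(CLOCK_HZ):.2f} ns")
for level, (low, high) in cache_latency_ns(CLOCK_HZ).items():
    print(f"{level}: {low:.2f} to {high:.2f} ns")
```

With these inputs, an L1 access costs roughly 1.3 to 1.7 nanoseconds and an L3 access roughly 17 nanoseconds.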
Main memory as the new bottleneck
When sizing an SAP HANA system, enough capacity should be assigned to place all data in the main memory so that all read accesses can usually be executed on this memory. When accessing the data for the first time (e.g., after starting the system), the data is loaded into the main memory. You can also manually or automatically unload the data from the main memory. This can be necessary if, for example, the system tries to use more than the available memory size.
In the past, access to the hard disk was usually the performance bottleneck; with SAP HANA, however, main memory access is now the bottleneck.
Even though these accesses are up to 100,000 times faster than hard-disk accesses, they are still four to 60 times slower than accesses to CPU caches, which is why the main memory is the new bottleneck for SAP HANA.
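These factors can be reproduced from the access times quoted earlier. The sketch below pairs the figures in one plausible way (6 ms disk access against 60 ns DRAM access, 60 ns DRAM against L3, and 100 ns DRAM against a four-tick L1 access); which pairs the original figures were derived from is an assumption.

```python
CLOCK_GHZ = 2.4
CYCLE_NS = 1 / CLOCK_GHZ          # about 0.42 ns per clock tick

DISK_NS = (6e6, 8e6)              # 6 to 8 milliseconds
DRAM_NS = (60.0, 100.0)           # 60 to 100 nanoseconds
L1_NS = (3 * CYCLE_NS, 4 * CYCLE_NS)
L3_NS = (40 * CYCLE_NS, 40 * CYCLE_NS)


def ratio(slower_ns: float, faster_ns: float) -> float:
    """How many times longer the slower access takes."""
    return slower_ns / faster_ns


disk_vs_dram = ratio(DISK_NS[0], DRAM_NS[0])  # 6 ms vs. 60 ns
dram_vs_l3 = ratio(DRAM_NS[0], L3_NS[0])      # 60 ns vs. 40 ticks
dram_vs_l1 = ratio(DRAM_NS[1], L1_NS[1])      # 100 ns vs. 4 ticks

print(f"disk vs. DRAM: about {disk_vs_dram:,.0f}x")
print(f"DRAM vs. L3:   about {dram_vs_l3:.1f}x")
print(f"DRAM vs. L1:   about {dram_vs_l1:.0f}x")
```

Under these assumptions, main memory is about 100,000 times faster than the hard disk, but between roughly 4 and 60 times slower than the CPU caches.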
The memory algorithms in SAP HANA are implemented in such a way that they can work directly with the L1 cache in the CPU wherever possible. Data transport from the main memory to the CPU caches must therefore be kept to a minimum, which has major effects on the software innovations described in the next section.
The software innovations in SAP HANA make optimal use of the previously described hardware. This is done in two ways: by keeping the data transport between the main memory and CPU caches to a minimum (e.g., by means of compression), and by fully leveraging the CPUs using parallel threads for data processing.
SAP HANA provides software optimizations in the following areas:
- Data layout in the main memory
Data Layout in the Main Memory
In every relational database, the entries of a database table must be stored in a certain data layout.
Let’s now take a look at the third area of software innovation: partitioning. Partitioning is used whenever very large quantities of data must be maintained and managed.
Advantages of partitioning
This technique greatly facilitates data management for database administrators. A typical task is the deletion of data (such as after an archiving operation was completed successfully). There is no need to search large amounts of information for the data to be deleted; instead, database administrators can simply remove an entire partition. Moreover, partitioning can increase application performance.
There are basically two technical variants of partitioning:
- With vertical partitioning, tables are divided into smaller sections on a column basis. For a table with seven columns, columns 1 to 5 could perhaps be stored in one partition, while columns 6 and 7 are stored in a different partition.
- With horizontal partitioning, tables are divided into smaller sections on a row basis. Rows 1 to 1,000,000 are then perhaps stored in one partition, while rows 1,000,001 to 2,000,000 are placed in another partition.
SAP HANA supports only horizontal partitioning. The data in a table is distributed across different partitions on a row basis, while the records within the partitions are stored on a column basis.
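The difference between the two variants can be illustrated with a small Python sketch. It models a table as a plain list of row tuples, which is a simplification for illustration and not SAP HANA's internal representation; the function names and sample rows are hypothetical.

```python
from typing import List, Sequence, Tuple

Row = Tuple[object, ...]


def vertical_partition(rows: List[Row],
                       column_groups: Sequence[Sequence[int]]) -> List[List[Row]]:
    """Split by columns: one partition per group of column indexes."""
    return [[tuple(row[i] for i in group) for row in rows]
            for group in column_groups]


def horizontal_partition(rows: List[Row],
                         rows_per_partition: int) -> List[List[Row]]:
    """Split by rows: consecutive ranges of rows form one partition each."""
    return [rows[start:start + rows_per_partition]
            for start in range(0, len(rows), rows_per_partition)]


table = [
    (1, "Brown", "f"),
    (2, "Smith", "m"),
    (3, "Jones", "f"),
    (4, "Miller", "m"),
]

# Vertical: columns 0 and 1 in one partition, column 2 in another.
by_columns = vertical_partition(table, [(0, 1), (2,)])

# Horizontal: rows 1-2 in one partition, rows 3-4 in another.
by_rows = horizontal_partition(table, 2)
```

In the horizontal case, removing the oldest range of rows amounts to dropping `by_rows[0]` as a whole, which mirrors the administrative advantage described above.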
The diagram below shows how horizontal partitioning is used for a table with the two columns Name and Gender in case of column-based data storage. On the left side, the table is shown with a dictionary vector (DV) and an attribute vector (AV) for both the column Name and the column Gender. On the right side, the data was partitioned using the round-robin technique, which will be explained in more detail next. The consecutive rows were distributed alternately across two partitions (the first row was stored in the first partition, the second row in the second partition, the third row again in the first partition, and so on).
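The round-robin distribution and the per-column dictionary and attribute vectors can be sketched as follows. The sample names are made up, and two details are assumptions of this sketch rather than statements from the text: each partition builds its own dictionary vector, and dictionary entries are kept in sorted order.

```python
from typing import List, Sequence, Tuple


def round_robin(rows: Sequence[tuple], partitions: int) -> List[List[tuple]]:
    """Assign row i to partition i modulo the number of partitions."""
    result: List[List[tuple]] = [[] for _ in range(partitions)]
    for i, row in enumerate(rows):
        result[i % partitions].append(row)
    return result


def encode_column(values: Sequence[str]) -> Tuple[List[str], List[int]]:
    """Return (dictionary vector, attribute vector) for one column."""
    dictionary = sorted(set(values))                  # distinct values (DV)
    position = {value: i for i, value in enumerate(dictionary)}
    attribute = [position[value] for value in values]  # one DV index per row (AV)
    return dictionary, attribute


# Hypothetical table with the columns Name and Gender.
rows = [("Brown", "f"), ("Smith", "m"), ("Jones", "f"),
        ("Miller", "m"), ("Brown", "f")]

parts = round_robin(rows, 2)

# Column-based storage inside each partition: encode Name and Gender separately.
encoded = [
    {"Name": encode_column([r[0] for r in part]),
     "Gender": encode_column([r[1] for r in part])}
    for part in parts
]
```

Here the first partition receives rows 1, 3, and 5, so its Name column is stored as the dictionary `["Brown", "Jones"]` with the attribute vector `[0, 1, 0]`.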