VTU Study Materials: Index Data Structure

Monday, 9 February 2015

Index Data Structure

Index Data Structure

One way to organize data entries is to hash data entries on the search key.
Another way to organize data entries is to build a tree-like data structure that directs a search for data entries.
The choice of hash or tree indexing techniques can be combined with any of the three alternatives for data entries.

Hash-Based Indexing:

We can organize records using a technique called hashing to quickly find records that have a given search key value.

For example, if the file of employee records is hashed on the name field, we can retrieve all records about Joe.
In this approach, the records in a file are grouped in buckets, where a bucket consists of a primary page and, possibly, additional pages linked in a chain.
The bucket to which a record belongs can be determined by applying a special function, called a hash function, to the search key.
Given a bucket number, a hash-based index structure allows us to retrieve the primary page for the bucket in one or two disk l/Os.
On inserts, the record is inserted into the appropriate bucket, with 'overflow' pages allocated as necessary.
To search for a record with a given search key value, we apply the hash function to identify the bucket to which such records belong and look at all pages in that bucket.
If we do not have the search key value for the record, for example, the index is based on sal and we want records with a given age value, we have to scan all pages in the file.

Hash-Based Indexing

Explanation of Fig 1.3

Hash indexing is illustrated in above figure,where the data is stored in a file that is hashed on age;the data entries in this first index file are the actual data records.
Applying the hash function to the age field identifies the page that the record belongs to. The hash function h for this example is quite simple; it converts the search key value to its binary representation and uses the two least significant bits as the bucket identifier.
Above figure also shows an index with search key sal that contains (sal, rid) pairs as data entries.
The rid (short for record id) component of a data entry in this second index is a pointer to a record with search key value sal (and is shown in the figure as an arrow pointing to the data record).
The Above Figure illustrates the use of Alternative (1) and (2) for data entries.
The file of employee records is hashed on age, and Alternative (1) is used for data entries.
The second index, on sal, also uses hashing to locate data entries, which are now (sal, rid of employee Record pairs; that is, Alternative (2) is used for data entries.
Note that the search key for an index can be any sequence of one or more fields, and it need not uniquely identify records.

Tree-Based Indexing:

The data entries are arranged in sorted order by search key value, and a hierarchical search data structure is maintained that directs searches to the correct page of data entries.

Tree Structured index

Fig 1.4

Explanation of Fig 1.4:

Each node in the figure 1.4 (e.g., nodes labeled A, B, L1, L2) is a physical page, and retrieving a node involves a disk I/O.
The lowest level of the tree, called the leaf level, contains the data entries; in our example, these are employee records.
To understand in a better way, we have drawn Figure 1.4 as if there were additional employee records, some with age less than 22 and some with age greater than 50.
Additional records with age less than 22 would appear in leaf pages to the left page L1,and records with age greater than 50 would appear in leaf pages to the right of page L3.

Working procedure:

All searches begin at the topmost node, called the root, and the contents of pages in non-leaf levels direct searches to the correct leaf page.
Non-leaf pages contain node pointers separated by search key values.
The node pointer to the left of a key value k points to a subtree that contains only data entries less than k.
The node pointer to the right of a key value k points to a subtree that contains only data entries greater than or equal to k.
In our example search, we look for data entries with search key value > 24, and get directed to the middle child, node A.
Again, examining the contents of this node, we are directed to node B.
Examining the contents of node B, we are directed to leaf node Ll, which contains data entries we are looking for.
Observe that leaf nodes L2 and L3 also contain data entries that satisfy our search criterion.
To facilitate retrieval of such qualifying entries during search, all leaf pages are maintained in a doubly-linked list.
Thus, we can fetch page L2 using the 'next' pointer on page L1, and then fetch page L3 using the 'next' pointer on L2.
Thus, the number of disk I/Os incurred during a search is equal to the length of a path from the root to a leaf, plus the number of leaf pages with qualifying data entries.
The B+ tree is an index structure that ensures that all paths from the root to a leaf in a given tree are of the same length, that is, the structure is always balanced in height.
Finding the correct leaf page is faster than binary search of the pages in a sorted file because each non-leaf node can accommodate a very large number of node-pointers and the height of the tree is rarely more than three or four in practice.
Fan-out (F):
The average number of children for a non-leaf node is called the fan-out of the tree.If every non-leaf node has n children, a tree of height h has n to the power h leaf pages.In practice, F (fan-out) is at least 100, which means a tree of height four contains 100 million leaf pages.

No comments:

Post a Comment

counter for blogspot