Monday, 9 February 2015

Index Data Structure

Index Data Structure

One way to organize data entries is  to hash data entries on the search key.
Another  way  to  organize  data entries is to build  a tree-like  data structure that  directs a search for data entries.
The  choice  of  hash or  tree  indexing  techniques  can  be  combined  with  any  of  the  three alternatives for data entries.


Hash-Based Indexing:

We  can  organize  records  using  a  technique  called  hashing  to  quickly  find  records   that have a given search key value.


For example, if the file of employee records is hashed on the name field, we can retrieve all records about Joe.
In this approach, the records in a file are grouped in buckets, where a bucket consists of a primary page and, possibly, additional pages linked in a chain.
The bucket to which  a  record belongs can be determined by applying a special  function, called a hash function, to the search key.
Given  a bucket  number, a hash-based  index structure  allows  us to retrieve  the  primary page for the bucket in one or two disk l/Os.
On  inserts,  the  record  is  inserted  into  the  appropriate  bucket,  with  'overflow'  pages allocated as necessary.
To  search  for  a  record  with  a  given  search  key  value,  we  apply  the  hash  function  to identify the bucket to which such records belong and look at all pages in that bucket.
If we do not have the search key value for the record, for example, the index is based on sal and we want records with a given age value, we have to scan all pages in the file.

Hash-Based Indexing




Explanation of Fig 1.3

Hash indexing is illustrated in above figure,where the data is stored in a file that is hashed on age;the data entries in this first index file are the actual data records.
Applying  the hash  function  to  the age  field identifies the  page  that  the  record  belongs to.  The hash function  h  for  this  example  is  quite  simple; it  converts  the  search  key  value  to  its  binary representation and uses the two least significant bits as the bucket identifier.
Above figure also shows an index with search key sal that contains (sal, rid) pairs as data entries.
The rid  (short for  record  id)  component  of  a data  entry  in  this  second index is  a pointer  to  a  record with search key value sal (and is shown in the figure as an arrow pointing to the data record).
The Above Figure  illustrates the use of Alternative (1) and (2) for data entries.
The file of employee records is hashed on age, and Alternative (1) is used for data entries.
The second index, on sal, also uses hashing to locate data entries, which are now (sal, rid of employee Record pairs; that is, Alternative (2) is used for data entries.
Note  that  the  search  key  for  an  index  can  be  any  sequence  of  one  or  more fields,  and  it  need  not uniquely identify records.



Tree-Based Indexing:

The  data  entries  are  arranged  in  sorted  order  by  search  key  value,  and  a hierarchical search  data  structure  is  maintained  that  directs  searches  to  the  correct  page  of  data entries.


 Tree Structured index

Fig 1.4


Explanation of Fig 1.4:

Each  node  in  the  figure  1.4  (e.g.,  nodes  labeled  A,  B,  L1,  L2)  is  a  physical  page,  and retrieving a node involves a disk I/O.
The  lowest  level  of  the  tree,  called  the  leaf  level,  contains  the  data  entries;  in  our example, these are employee records.
To  understand  in  a  better  way,  we  have  drawn  Figure  1.4  as  if  there  were  additional employee records, some with age less than 22 and some with age greater than 50.
Additional records with age  less than 22 would  appear in leaf pages to the left page L1,and records with age greater than 50 would appear in leaf pages to the right of page L3.

Working procedure:

  • All  searches  begin  at  the  topmost  node,  called  the  root,  and  the  contents  of  pages  in non-leaf levels direct searches to the correct leaf page.
  • Non-leaf pages contain node pointers separated by search key values.
  • The  node  pointer to  the left  of a key value  k  points to  a subtree  that contains only data entries less than k.
  • The node pointer to the right of a key value k points to a subtree that contains only data entries greater than or equal to k.
  • In  our  example  search,  we  look  for  data  entries  with  search  key  value  >  24,  and  get directed to the middle child, node A.
  • Again, examining the contents of this node, we are directed to node B.
  • Examining  the  contents  of node  B,  we  are directed  to  leaf node  Ll,  which  contains  data entries we are looking for.
  • Observe  that  leaf  nodes  L2  and  L3  also  contain  data  entries  that  satisfy  our  search criterion.
  • To  facilitate  retrieval  of  such  qualifying  entries  during  search,  all  leaf  pages  are maintained in a doubly-linked list.
  • Thus,  we  can  fetch page L2  using the  'next' pointer  on  page  L1,  and  then  fetch page L3 using the 'next' pointer on L2.
  • Thus,  the  number of  disk I/Os incurred during a  search is equal to the length  of  a  path from the root to a leaf, plus the number of leaf pages with qualifying data entries.
  • The  B+  tree is an index  structure  that ensures that all  paths from  the root  to a  leaf in a given tree are of the same length, that is, the structure is always balanced in height.
  • Finding  the  correct  leaf  page  is  faster  than  binary  search  of  the  pages  in  a  sorted  file because  each  non-leaf  node  can  accommodate  a  very  large  number  of  node-pointers and the height of the tree is rarely more than three or four in practice.

Fan-out (F):

The  average  number  of  children  for  a  non-leaf  node  is  called  the  fan-out  of  the tree.If every non-leaf node has n children, a tree of height h has n to the power h leaf pages.In practice, F (fan-out) is at least 100, which means a tree of height four contains 100 million leaf pages.




  



No comments:

Post a Comment