Monday, 9 February 2015

What is File Organizations & Indexing?

File Organizations & Indexing


  • The  file  of  records  is  an  important  abstraction  in  a  DBMS,  and  is  implemented  by  the files and file layers of the code.
  • A file can be created, destroyed, and have records inserted into and deleted from it.
  • It also supports scans.
  • A  scan  operation  allows  us  to  step  through  a ll  the  records  in  the  file  one  at  a  time.  A relation is typically stored as a file of records.
  • The file layer stores the records in a file in a collection of disk pages.
  • It  keeps  track  of  pages  allocated  to  each  file,  and  as  records  are  inserted  into  and deleted from the file, it also tracks available space within pages allocated to the file.

Heap file or unordered file:
  • Heap or unordered file  is  the  simplest  file  structure,  records  in a  heap  file  are stored  in random order across the pages of the file.
  • A  heap  file  organization  supports  retrieval  of  all  records,  or  retrieval  of  a  particular record  specified by  its  rid,  the file manager  must  keep  track of  the pages  allocated for the file.
Index:


  • An index is a data structure that organizes data records on disk to optimize certain kinds of retrieval operations.
  • An  index allows us to efficiently retrieve all records that satisfy search  conditions on the search key fields of the index.
  • We can  also create  additional  indexes on a given  collection of data  records, each with a different search key, to speed up search operations that are not efficiently supported by the file organization used to store the data records.
Consider our example of employee records.

We can store  the records in a file organized as  an  index on employee age;  this  is an alternative to sorting the file by age.
Additionally,  we  can  create  an  auxiliary  index  file  based  on  salary,  to  speed  up queries involving salary.
The  first  file  contains  employee  records,  and  the  second  contains  records  that allow us to locate employee records satisfying a query on salary.

We use the term data entry to refer to the records stored in an index file.
A  data  entry  with  search  key  value  k,  denoted  as k*,  contains  enough  information  to locate (one or more) data records with search key value k.


There are three main alternatives for what to store as a data entry in an index:

  1. A data entry k* is an actual data record (with search key value k).
  2. A  data  entry  is  a  (k,  rid)  pair,  where  rid  is  the  record  id  of  a  data  record  with search key value k.
  3. A  data  entry  is  a  (k,  rid-list)  pair,  where  rid-list  is  a  list  of  record  ids  of  data records with search key value k.
Alternative (1) is used for the index that stores actual data records.
Alternative  (2)  and  (3)  which  contain  data  entries  that  point  to  data  records  are independent of the file organization that is used for the indexed file.
If we want to build more than one index on a collection of data records-for example, We want to  build  indexes on  both the  age and the  salary  fields for  a  collection of  employee records,at most (maximum) one of the indexes  should use Alternative (1) because we should avoid storing data records multiple times.

1.2.1 Clustered Indexes:

“if a file is organized based on ordering of data records or close to the ordering of data entries  in  some  index,  we  say  that  the  index  as  a  clustered  index,  Otherwise,  it clustered is an unclustered index”

  • An  index  that  uses  Alternative  (1)  is  clustered,  by  definition.  An  index  that  uses
  • Alternative  (2) or (3)  can be a clustered  index only if the data records are sorted on the search key field.
Clustered indexes in Practice:

In  practice,  files  are rarely kept sorted since  this  is  too  expensive  to  maintain when the data is updated.
So,  in  practice, a  clustered  index is an  index that  uses  Alternative  (1),  and  indexes that use Alternatives (2) or (3) are unclustered.

Clustered Indexes


Unclustered Index




The Cost of using an Index:

  • The cost  of using  an index to  answer a range search  query can  vary  tremendously  based  on whether the index is clustered.
  • If the  index  is clustered,  i.e.  we  are using the  search key of a clustered  file, the “rids” in qualifying  data  entries  point  to  a  contiguous  collection  of  records,  and  we  need  to retrieve only a few data pages.” that points to
  • If the index is unclustered, each qualifying data entry could contain a “rid a distinct data page, leading to as many data page l/Os .


No comments:

Post a Comment