VTU Study Materials: What is File Organizations & Indexing?

Monday, 9 February 2015

What is File Organizations & Indexing?

File Organizations & Indexing

The file of records is an important abstraction in a DBMS, and is implemented by the files and file layers of the code.
A file can be created, destroyed, and have records inserted into and deleted from it.
It also supports scans.
A scan operation allows us to step through a ll the records in the file one at a time. A relation is typically stored as a file of records.
The file layer stores the records in a file in a collection of disk pages.
It keeps track of pages allocated to each file, and as records are inserted into and deleted from the file, it also tracks available space within pages allocated to the file.

Heap file or unordered file:
Heap or unordered file is the simplest file structure, records in a heap file are stored in random order across the pages of the file.
A heap file organization supports retrieval of all records, or retrieval of a particular record specified by its rid, the file manager must keep track of the pages allocated for the file.
Index:

An index is a data structure that organizes data records on disk to optimize certain kinds of retrieval operations.
An index allows us to efficiently retrieve all records that satisfy search conditions on the search key fields of the index.
We can also create additional indexes on a given collection of data records, each with a different search key, to speed up search operations that are not efficiently supported by the file organization used to store the data records.
Consider our example of employee records.

We can store the records in a file organized as an index on employee age; this is an alternative to sorting the file by age.
Additionally, we can create an auxiliary index file based on salary, to speed up queries involving salary.
The first file contains employee records, and the second contains records that allow us to locate employee records satisfying a query on salary.

We use the term data entry to refer to the records stored in an index file.
A data entry with search key value k, denoted as k, contains enough information to locate (one or more) data records with search key value k.

There are three main alternatives for what to store as a data entry in an index:
A data entry k is an actual data record (with search key value k).
A data entry is a (k, rid) pair, where rid is the record id of a data record with search key value k.
A data entry is a (k, rid-list) pair, where rid-list is a list of record ids of data records with search key value k.
Alternative (1) is used for the index that stores actual data records.
Alternative (2) and (3) which contain data entries that point to data records are independent of the file organization that is used for the indexed file.
If we want to build more than one index on a collection of data records-for example, We want to build indexes on both the age and the salary fields for a collection of employee records,at most (maximum) one of the indexes should use Alternative (1) because we should avoid storing data records multiple times.

1.2.1 Clustered Indexes:

“if a file is organized based on ordering of data records or close to the ordering of data entries in some index, we say that the index as a clustered index, Otherwise, it clustered is an unclustered index”

An index that uses Alternative (1) is clustered, by definition. An index that uses
Alternative (2) or (3) can be a clustered index only if the data records are sorted on the search key field.
Clustered indexes in Practice:

In practice, files are rarely kept sorted since this is too expensive to maintain when the data is updated.
So, in practice, a clustered index is an index that uses Alternative (1), and indexes that use Alternatives (2) or (3) are unclustered.

Clustered Indexes

Unclustered Index

The Cost of using an Index:

The cost of using an index to answer a range search query can vary tremendously based on whether the index is clustered.
If the index is clustered, i.e. we are using the search key of a clustered file, the “rids” in qualifying data entries point to a contiguous collection of records, and we need to retrieve only a few data pages.” that points to
If the index is unclustered, each qualifying data entry could contain a “rid a distinct data page, leading to as many data page l/Os .

No comments:

Post a Comment

counter for blogspot