Application Checkpointing in Grid Environment with Improved Checkpoint Reliability through Replication
Abstract—Grid technologies are emerging as the next generation of distributed computing, allowing the aggregation of heterogeneous resources that are geographically distributed. The heterogeneous nature of the grid makes it more vulnerable to faults, which lead either to the failure of a job or to a delay in completing its execution. Checkpointing is one of the many fault tolerance techniques used to make the Grid more efficient and reliable. In this paper we develop an application checkpointing based fault tolerance technique for the Alchemi based Grid environment. In this technique, application threads generate their checkpoints and store them in the checkpoint table at the manager node. If a thread fails, the checkpoint of that thread is used to resume execution from the point of failure. The technique introduces a slight overhead in fault-free situations but is very effective in case of a node failure. Increasing the checkpoint frequency improves a job's ability to resume, but it also increases the overhead of generating and storing checkpoints, which in turn increases the processing time of the job.
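The checkpoint-and-resume cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the Alchemi API and not the authors' implementation: `CheckpointTable`, `run_thread`, `NodeFailure`, and the `interval` and `fail_at` parameters are hypothetical names chosen for illustration, and the "thread" is a simple summation loop. In this sketch a restarted thread continues from its last stored checkpoint, so the iterations executed after that checkpoint are repeated.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


class NodeFailure(Exception):
    """Simulated failure of the node executing a thread (assumption)."""


@dataclass
class CheckpointTable:
    """Hypothetical manager-side table: thread id -> latest checkpoint."""
    entries: Dict[int, dict] = field(default_factory=dict)
    writes: int = 0  # number of stored checkpoints, a rough proxy for overhead

    def store(self, thread_id: int, state: dict) -> None:
        # Copy the state so later changes in the thread do not alter the checkpoint.
        self.entries[thread_id] = dict(state)
        self.writes += 1

    def latest(self, thread_id: int) -> Optional[dict]:
        return self.entries.get(thread_id)


def run_thread(thread_id: int, n: int, table: CheckpointTable,
               interval: int, fail_at: Optional[int] = None) -> Tuple[int, int]:
    """Sum 0..n-1, storing a checkpoint every `interval` iterations.

    Returns (total, iterations executed in this run). If a checkpoint for
    `thread_id` already exists, execution continues from it.
    """
    saved = table.latest(thread_id)
    state = dict(saved) if saved else {"next_i": 0, "total": 0}
    steps = 0
    while state["next_i"] < n:
        if fail_at is not None and state["next_i"] == fail_at:
            raise NodeFailure(f"thread {thread_id} failed at i={fail_at}")
        state["total"] += state["next_i"]
        state["next_i"] += 1
        steps += 1
        if state["next_i"] % interval == 0:
            table.store(thread_id, state)
    return state["total"], steps
```

Running `run_thread(1, 100, table, interval=10, fail_at=57)` and then calling it again without `fail_at` resumes from the checkpoint stored at iteration 50 instead of restarting from zero. A smaller `interval` reduces the work repeated after a failure but raises `table.writes`, which mirrors the frequency versus overhead trade-off noted in the abstract.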