Converting the VCF file to an efficient (HDF5) format

Todo

Relate with HPC part

In this chapter we are going to scaffold the code to convert a VCF file into an HDF5 representation. As you will recall from the previous chapter, traversing the complete VCF file can take days or even weeks, so a typical big-data workflow starts by converting this format into something that we can manipulate more efficiently for analysis. Here we will concentrate on understanding a general framework for converting big data in a fast and reliable way. We are not concerned with the intricacies of converting this specific VCF file, but with conveying the gist of the strategy for data preparation.
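To make the strategy concrete, here is a deliberately simplified sketch of such a conversion: it reads a VCF file with the cyvcf2 library and stores variant positions and genotype calls in an HDF5 file via h5py. The file names, dataset names and choice of fields are hypothetical and not part of the original text; a real preparation step would typically carry over many more fields.

# A minimal sketch, assuming cyvcf2 and h5py are installed and that a
# bgzipped, tabix-indexed VCF called example.vcf.gz (hypothetical) exists.
import h5py
import numpy as np
from cyvcf2 import VCF

def vcf_to_hdf5(vcf_path, out_path, region=None):
    """Store variant positions and genotype calls from a VCF (or one region of it) in HDF5."""
    vcf = VCF(vcf_path)
    variants = vcf(region) if region else vcf  # region queries require a tabix index
    positions, gt_types = [], []
    for variant in variants:
        positions.append(variant.POS)
        gt_types.append(variant.gt_types)  # per-sample calls: 0=hom-ref, 1=het, 2=unknown, 3=hom-alt
    with h5py.File(out_path, 'w') as h5:
        h5.create_dataset('pos', data=np.array(positions, dtype=np.int64))
        h5.create_dataset('gt', data=np.array(gt_types, dtype=np.int8), compression='gzip')
        h5.create_dataset('samples', data=np.array(vcf.samples, dtype='S'))

vcf_to_hdf5('example.vcf.gz', 'example.h5')

The optional region argument will become useful in a moment, when we convert one piece of the genome at a time rather than the whole file.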

Because a sequential conversion takes a long time, we will start by breaking the genome into smaller pieces and processing them concurrently. This problem is not completely trivial: all of our small computation programs would need to write to a shared data structure, and while concurrent reads against HDF5 files are trivial, concurrent writes are difficult at best.
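One way to organize the concurrent part is sketched below: slice each chromosome into fixed-size windows and hand the windows to a pool of worker processes, each of which writes its own small HDF5 file. The chromosome names and lengths are hypothetical, and convert_window is only a stub standing in for a real per-region conversion such as the vcf_to_hdf5 sketch above.

# A sketch of partitioning the genome into windows and processing them
# concurrently with a local process pool. All names below are hypothetical.
from concurrent.futures import ProcessPoolExecutor

CHROM_SIZES = {'2L': 49_364_325, '2R': 61_545_105}  # hypothetical chromosome lengths
WINDOW = 1_000_000  # window size in base pairs

def make_windows(chrom_sizes, window):
    """Yield (chrom, start, end, output file name) for every window of every chromosome."""
    for chrom, size in chrom_sizes.items():
        for start in range(1, size + 1, window):
            end = min(start + window - 1, size)
            yield chrom, start, end, f'part_{chrom}_{start}.h5'

def convert_window(task):
    chrom, start, end, out_path = task
    # Stub: a real worker would do something like
    #   vcf_to_hdf5('example.vcf.gz', out_path, region=f'{chrom}:{start}-{end}')
    return out_path

if __name__ == '__main__':
    tasks = list(make_windows(CHROM_SIZES, WINDOW))
    with ProcessPoolExecutor() as pool:
        part_files = list(pool.map(convert_window, tasks))
    print(f'{len(part_files)} partial HDF5 files written')

On an HPC cluster the same task list could just as well be submitted as independent jobs instead of being run in a local process pool; the important point is that each task owns its own output file, so no two processes ever write to the same HDF5 file.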

So we will make each process write a small HDF5 file per computation, and then a single final procedure will collect all the small HDF5 files and create a single HDF5 file. Note that this final procedure will be much faster than the original traversal, as it will read data not from a VCF file but from HDF5 files.
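The merge itself can then be a simple, sequential pass over the partial files, sketched below under the same hypothetical layout as above (datasets pos and gt, file names of the form part_<chrom>_<start>.h5): each partial file is read in genome order and its arrays are appended to the consolidated output.

# A sketch of the final merge step, assuming the hypothetical partial files
# written by the workers above are present in the current directory.
import glob
import h5py
import numpy as np

def window_start(path):
    # Extract the numeric window start from the hypothetical 'part_<chrom>_<start>.h5' names
    return int(path.rsplit('_', 1)[1].split('.')[0])

def merge_parts(part_paths, out_path):
    """Concatenate the 'pos' and 'gt' datasets of all partial files into one HDF5 file."""
    pos_chunks, gt_chunks = [], []
    for part in part_paths:
        with h5py.File(part, 'r') as h5:
            if h5['pos'].shape[0] == 0:
                continue  # window without any variants
            pos_chunks.append(h5['pos'][:])
            gt_chunks.append(h5['gt'][:])
    with h5py.File(out_path, 'w') as out:
        out.create_dataset('pos', data=np.concatenate(pos_chunks))
        out.create_dataset('gt', data=np.concatenate(gt_chunks), compression='gzip')

merge_parts(sorted(glob.glob('part_2L_*.h5'), key=window_start), 'chr2L.h5')

Because the partial files already hold parsed numerical arrays, this pass never touches VCF text again, which is why it is so much cheaper than the original traversal.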