Question: Using Hdf5 To Store Bio-Data
8
10.7 years ago by
France
Pierre Lindenbaum140 wrote:

Hi all, has anybody ever used the HDF5 API to store biological data (genotypes...)? I know about references such as BioHDF, but I'm looking for source code I could browse to understand how I can access data faster.

Pierre

PS: hum, I'm a new user. I'm not allowed to add the following tags: storage database hdf5 source code

hdf biohdf storage • 1.5k views
modified 10.6 years ago by User 50340 • written 10.7 years ago by Pierre Lindenbaum140
4
10.7 years ago by
Fernando Muñiz60 wrote:

What I do have is a netCDF-3 based Java application that I could show you. NetCDF-3 follows basically the same idea as HDF but is considerably more limited: among other restrictions, it cannot store compound datatypes.

But here's a small test code example to toy with:

    package netCDF;

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import ucar.ma2.*;
    import ucar.nc2.*;

    /**
     * Creates the netCDF-3 genotype file.
     *
     * @author Fernando Muñiz Fernandez
     *         IBE, Institute of Evolutionary Biology (UPF-CSIC)
     *         CEXS-UPF-PRBB
     */
    public class CreateNetcdf {

        public static NetcdfFileWriteable setDimsAndAttributes(Integer studyId, String technology, String description, String strand, int sampleSetSize, int markerSetSize) throws InvalidRangeException, IOException {

            ///////////// CREATE netCDF-3 FILE ////////////
            String genotypesFolder = "/media/data/genotypes";
            File pathToStudy = new File(genotypesFolder + "/netCDF_test");
            // constants.cNetCDF is the application's own constants class (not shown here)
            int gtSpan = constants.cNetCDF.Strides.STRIDE_GT;
            int markerSpan = constants.cNetCDF.Strides.STRIDE_MARKER_NAME;
            int sampleSpan = constants.cNetCDF.Strides.STRIDE_SAMPLE_NAME;

            String matrixName = "prototype";
            String writeFileName = pathToStudy + "/" + matrixName + ".nc";
            NetcdfFileWriteable ncfile = NetcdfFileWriteable.createNew(writeFileName, false);

            // add dimensions (the first three are read back by position in TestWriteNetcdf)
            Dimension samplesDim = ncfile.addDimension("samples", sampleSetSize);
            Dimension markersDim = ncfile.addDimension("markers", markerSetSize);
            Dimension gtSpanDim = ncfile.addDimension("span", gtSpan);

            // Fixed-width character dimensions for the string-valued variables.
            // A variable's shape must be made of Dimension objects, not bare ints;
            // the dimension names used here are illustrative.
            Dimension markerSpanDim = ncfile.addDimension("markerSpan", markerSpan);
            Dimension sampleSpanDim = ncfile.addDimension("sampleSpan", sampleSpan);
            Dimension dim32 = ncfile.addDimension("dim32", 32);
            Dimension dim16 = ncfile.addDimension("dim16", 16);
            Dimension dim8 = ncfile.addDimension("dim8", 8);
            Dimension dim2 = ncfile.addDimension("dim2", 2);
            Dimension dim1 = ncfile.addDimension("dim1", 1);

            // genotypes: samples x markers x characters per genotype
            List<Dimension> dims = new ArrayList<Dimension>();
            dims.add(samplesDim);
            dims.add(markersDim);
            dims.add(gtSpanDim);

            List<Dimension> markerGenotypeDims = new ArrayList<Dimension>();
            markerGenotypeDims.add(markersDim);
            markerGenotypeDims.add(markerSpanDim);

            List<Dimension> markerPositionDim = new ArrayList<Dimension>();
            markerPositionDim.add(markersDim);

            List<Dimension> markerPropertyDim32 = new ArrayList<Dimension>();
            markerPropertyDim32.add(markersDim);
            markerPropertyDim32.add(dim32);

            List<Dimension> markerPropertyDim16 = new ArrayList<Dimension>();
            markerPropertyDim16.add(markersDim);
            markerPropertyDim16.add(dim16);

            List<Dimension> markerPropertyDim8 = new ArrayList<Dimension>();
            markerPropertyDim8.add(markersDim);
            markerPropertyDim8.add(dim8);

            List<Dimension> markerPropertyDim2 = new ArrayList<Dimension>();
            markerPropertyDim2.add(markersDim);
            markerPropertyDim2.add(dim2);

            List<Dimension> markerPropertyDim1 = new ArrayList<Dimension>();
            markerPropertyDim1.add(markersDim);
            markerPropertyDim1.add(dim1);

            List<Dimension> sampleSetDims = new ArrayList<Dimension>();
            sampleSetDims.add(samplesDim);
            sampleSetDims.add(sampleSpanDim);

            // Define Marker Variables
            ncfile.addVariable("markerset", DataType.CHAR, markerGenotypeDims);
            ncfile.addVariableAttribute("markerset", constants.cNetCDF.Attributes.LENGTH, markerSetSize);

            ncfile.addVariable("marker_chromosome", DataType.CHAR, markerPropertyDim8);
            ncfile.addVariable("marker_position", DataType.CHAR, markerPropertyDim32);
            ncfile.addVariable("marker_position_int", DataType.INT, markerPositionDim);
            ncfile.addVariable("marker_strand", DataType.CHAR, markerPropertyDim8);

            ncfile.addVariable("marker_property_1", DataType.CHAR, markerPropertyDim1);
            ncfile.addVariable("marker_property_2", DataType.CHAR, markerPropertyDim2);
            ncfile.addVariable("marker_property_8", DataType.CHAR, markerPropertyDim8);
            ncfile.addVariable("marker_property_16", DataType.CHAR, markerPropertyDim16);
            ncfile.addVariable("marker_property_32", DataType.CHAR, markerPropertyDim32);

            // Define Sample Variables
            ncfile.addVariable("sampleset", DataType.CHAR, sampleSetDims);
            ncfile.addVariableAttribute("sampleset", constants.cNetCDF.Attributes.LENGTH, sampleSetSize);

            // Define Genotype Variables
            ncfile.addVariable("genotypes", DataType.CHAR, dims);
            ncfile.addVariableAttribute("genotypes", constants.cNetCDF.Attributes.GLOB_STRAND, strand);

            // add global attributes
            ncfile.addGlobalAttribute(constants.cNetCDF.Attributes.GLOB_STUDY, studyId);
            ncfile.addGlobalAttribute(constants.cNetCDF.Attributes.GLOB_TECHNOLOGY, technology);
            ncfile.addGlobalAttribute(constants.cNetCDF.Attributes.GLOB_DESCRIPTION, description);

            return ncfile;
        }
    }

Use the above in the following way:

    package netCDF;

    import java.io.IOException;
    import java.util.List;
    import ucar.ma2.*;
    import ucar.nc2.*;

    /**
     * Generates a netCDF-3 genotype DB.
     *
     * @author Fernando Muñiz Fernandez
     *         IBE, Institute of Evolutionary Biology (UPF-CSIC)
     *         CEXS-UPF-PRBB
     */
    public class TestWriteNetcdf {

        public static void main(String[] arg) throws InvalidRangeException, IOException {

            NetcdfFileWriteable ncfile = CreateNetcdf.setDimsAndAttributes(
                    0,                          // study id
                    "INTERNAL",                 // technology
                    "test in TestWriteNetcdf",  // description
                    "+/-",                      // strand
                    5,                          // number of samples
                    10);                        // number of markers

            // create the file
            try {
                ncfile.create();
            } catch (IOException e) {
                System.err.println("ERROR creating file " + ncfile.getLocation() + "\n" + e);
            }

            ////////////// FILL'ER UP! ////////////////
            // The first three dimensions are samples, markers and genotype span,
            // in the order in which CreateNetcdf added them.
            List<Dimension> dims = ncfile.getDimensions();
            Dimension samplesDim = dims.get(0);
            Dimension markersDim = dims.get(1);
            Dimension markerSpanDim = dims.get(2);

            ArrayChar charArray = new ArrayChar.D3(samplesDim.getLength(), markersDim.getLength(), markerSpanDim.getLength());
            int i, j;
            Index ima = charArray.getIndex();

            int method = 1;
            switch (method) {
                case 1:
                    // METHOD 1: Feed the complete genotype matrix in one go
                    for (i = 0; i < samplesDim.getLength(); i++) {
                        for (j = 0; j < markersDim.getLength(); j++) {
                            // dummy genotype: marker 0 -> "AA", marker 1 -> "BB", ...
                            char c = (char) ('A' + j);
                            String s = Character.toString(c) + Character.toString(c);
                            charArray.setString(ima.set(i, j, 0), s);
                            System.out.println("SNP: " + j);
                        }
                    }
                    break;
                case 2:
                    // METHOD 2: One SNP at a time -> feed in all samples
                    // (unfinished and never executed while method == 1; note that i
                    // runs over markers but is used as the first, i.e. samples, index)
                    for (i = 0; i < markersDim.getLength(); i++) {
                        charArray.setString(ima.set(i, 0), "s" + i + "I0");
                        System.out.println("SNP: " + i);
                    }
                    break;
                case 3:
                    // METHOD 3: One sample at a time -> feed in all SNPs (not implemented)
                    break;
            }

            // start writing at the origin of the genotypes variable: 0,0,0
            int[] offsetOrigin = new int[3];
            try {
                ncfile.write("genotypes", offsetOrigin, charArray);
            } catch (IOException e) {
                System.err.println("ERROR writing file");
            } catch (InvalidRangeException e) {
                e.printStackTrace();
            }

            // close the file
            try {
                ncfile.close();
            } catch (IOException e) {
                System.err.println("ERROR closing file " + ncfile.getLocation() + "\n" + e);
            }
        }
    }

3
10.7 years ago by
István Albert ♦♦ 310
University Park
István Albert ♦♦ 310 wrote:

In the GeneTrack software we have used HDF to store values for each genomic base. Its main advantage over other storage systems was that it was able to return consecutive values with minimal overhead.

For example, it is extremely fast (milliseconds) at retrieving, say, 100,000 consecutive values starting at a certain index. We used the Python bindings to HDF. An added advantage of these bindings is that they return the data as numpy arrays, which allow very fast numerical operations.

Here is the relevant code that deals with HDF only: hdf.py

The HDF schema is set up in a different module, but in the end it is simply something like:

from tables import IsDescription, IntCol, FloatCol  # PyTables column types

class MySchema(IsDescription):
    """
    Stores a triplet of float values for each index.
    """
    ix = IntCol(pos=1)    # index
    wx = FloatCol(pos=2)  # value on the W (forward) strand
    cx = FloatCol(pos=3)  # value on the C (reverse) strand
    ax = FloatCol(pos=4)  # weighted value on the combined W + C strands
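
To make the consecutive-read pattern concrete, here is a minimal sketch using PyTables (assumed here because `IsDescription` is a PyTables class; this is not the GeneTrack `hdf.py` code). The `track` table name and the helper functions `write_track` and `read_range` are hypothetical, and the stored values are synthetic.

```python
import os
import tempfile

import numpy as np
import tables as tb


class MySchema(tb.IsDescription):
    """One row per genomic index, mirroring the schema above."""
    ix = tb.IntCol(pos=1)    # index
    wx = tb.FloatCol(pos=2)  # value on the W (forward) strand
    cx = tb.FloatCol(pos=3)  # value on the C (reverse) strand
    ax = tb.FloatCol(pos=4)  # combined W + C value


def write_track(path, n_rows):
    """Create a table with n_rows rows of synthetic values (hypothetical helper)."""
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "track", MySchema, expectedrows=n_rows)
        # Build all rows at once as a structured array matching the table layout.
        rows = np.zeros(n_rows, dtype=table.dtype)
        rows["ix"] = np.arange(n_rows)
        rows["wx"] = 1.0
        rows["cx"] = 2.0
        rows["ax"] = rows["wx"] + rows["cx"]
        table.append(rows)
        table.flush()


def read_range(path, start, count):
    """Return `count` consecutive rows starting at row `start` as a numpy array."""
    with tb.open_file(path, mode="r") as h5:
        return h5.root.track.read(start=start, stop=start + count)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "track.h5")
    write_track(path, 1_000_000)
    block = read_range(path, 250_000, 100_000)
    print(block.shape, block["ix"][0], block["ax"].mean())
```

Only the requested rows are read from disk, and the result is a numpy structured array, so expressions such as `block["wx"] + block["cx"]` run without a Python-level loop.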
modified 10.7 years ago by Jane ♦♦ 0 • written 10.7 years ago by István Albert ♦♦ 310
2
10.7 years ago by
Fernando Muñiz60 wrote:

Hello Pierre!

I have been talking with the BioHDF guys and from what they tell me, their work will be centered around a number of command-line APIs, written in C, that will address some areas of usage which for now do not seem to overlap.

I have seen this example on their site: http://www.hdfgroup.org/projects/biohdf/biohdf_tools.html Don't know if that helps.

I have been talking with them to see if we can achieve an API for saving genotype data. Don't know yet where that will lead me.

If you are looking for something more versatile, you will probably have to delve into the official HDF5 C code ( http://www.hdfgroup.org/HDF5/Tutor/ ), which seems to be the only API that offers all the functionality and goodies of that impressive storage system.

2
10.6 years ago by
Bergen
Michael Dondrup50 wrote:

There is also a Perl binding to HDF5: PDL::IO::HDF5

http://search.cpan.org/~cerney/PDL-IO-HDF5-0.5/

This requires the Perl Data Language (PDL) package. The way data structures can be handled, sub-ranges of data can be defined, and data can be manipulated is actually very elegant in PDL, so computational code can profit from PDL's vectorized style of writing expressions.

The same is true for R and the hdf5 package: http://cran.r-project.org/web/packages/hdf5/index.html

Code examples are in the package documentation of both, although the documentation of the R hdf5 package is rather sparse.

Both of these language bindings might be a very efficient way to read and write HDF5 files.

There are also APIs in Fortran, Java, Python, Matlab, C, or C++. So it might make sense to select the language and define the type of data you wish to store first.

1
10.7 years ago by
Barcelona, Spain
Giovanni M Dall'Olio420 wrote:

Unfortunately I don't have any example to show you yet. I don't know how to program in C/C++, so I have been looking at two HDF5 wrappers for Python, PyTables and h5py.

PyTables takes a database-like approach: HDF5 is used as a sort of hierarchical database in which a column can itself be a table, which allows nested data to be stored. For example, you have a table called 'SNPs' with two columns, 'id' and 'genotypes'; the column 'genotypes' contains a nested table with the columns 'individual' and 'genotype'; and so on.
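
As a rough illustration of the nested layout, here is a minimal PyTables sketch. It uses a nested column, which holds one nested record per row, so each row here is one (SNP, individual) pair rather than a full table inside a cell. The class `SNP`, the field names, the `snps` table name, and the helpers `write_snps` and `read_calls` are all hypothetical.

```python
import os
import tempfile

import tables as tb


class SNP(tb.IsDescription):
    """One row per (SNP, individual) pair, with the genotype as a nested column."""
    snp_id = tb.StringCol(16, pos=0)

    class genotype(tb.IsDescription):
        _v_pos = 1                            # position of the nested column
        individual = tb.StringCol(16, pos=0)  # individual identifier
        call = tb.StringCol(2, pos=1)         # two-character genotype, e.g. b"AG"


def write_snps(path, records):
    """Write (snp_id, individual, call) byte-string tuples to a new file."""
    with tb.open_file(path, mode="w") as h5:
        table = h5.create_table("/", "snps", SNP)
        row = table.row
        for snp_id, individual, call in records:
            row["snp_id"] = snp_id
            # Nested fields are addressed with a slash-separated path.
            row["genotype/individual"] = individual
            row["genotype/call"] = call
            row.append()
        table.flush()


def read_calls(path):
    """Return the genotype calls of all rows as a list of byte strings."""
    with tb.open_file(path, mode="r") as h5:
        data = h5.root.snps.read()  # structured array with a nested dtype
        return data["genotype"]["call"].tolist()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "snps.h5")
    write_snps(path, [(b"rs1", b"ind1", b"AA"), (b"rs1", b"ind2", b"AG")])
    print(read_calls(path))
```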

h5py basically gives you numpy-style arrays backed by an HDF5 file: you can store and access arrays/matrices as you would with numpy (similar to arrays and matrices in MATLAB, R, and any other language with this data type), and because they are stored in an HDF5 file, access is fast.
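
A minimal h5py sketch of that array-style access, assuming a genotype matrix laid out as samples x markers with two-character calls. The dataset name `genotypes`, the chunk shape, and the helpers `make_genotypes`, `write_genotypes`, and `read_marker` are illustrative assumptions, and the data are dummy values.

```python
import os
import tempfile

import h5py
import numpy as np


def make_genotypes(n_samples, n_markers):
    """Dummy data: marker 0 is b"AA" for every sample, marker 1 is b"BB", ..."""
    calls = [[bytes([65 + j]) * 2 for j in range(n_markers)]
             for _ in range(n_samples)]
    return np.array(calls, dtype="S2")


def write_genotypes(path, genotypes):
    """Store the matrix chunked by marker so one marker column is one chunk."""
    n_samples = genotypes.shape[0]
    with h5py.File(path, "w") as f:
        dset = f.create_dataset(
            "genotypes",
            data=genotypes,
            chunks=(n_samples, 1),
            compression="gzip",
        )
        dset.attrs["strand"] = "+/-"


def read_marker(path, marker_index):
    """Read the calls of all samples for one marker without loading the rest."""
    with h5py.File(path, "r") as f:
        return f["genotypes"][:, marker_index]


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "genotypes.h5")
    write_genotypes(path, make_genotypes(5, 10))
    print(read_marker(path, 2))
```

The chunk shape is the main design choice: `(n_samples, 1)` favors reading one marker across all samples, whereas a shape like `(1, n_markers)` would favor reading one sample across all markers.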
