Molecular Fingerprints Database
1. Introduction
(a) Problem Statement
This project builds an MFPS (Molecular Signatures Server)
that stores protein metadata and uses it to
support cluster-based similarity queries.
(b) Architectural Overview
- Take a PDB file as input.
- Create the following data for the input protein:
- a. skeletal graph structure, using the TexMol software
FCC (Flexible Chain Complex)
- b. volume, using the pdb2volume software
Rawiv, RawV
- c. surface, using the Volume Rover software
Raw, Rawc, Rawn, Rawnc
We use the volume data to create the protein metadata because volume data can be generated from each of the three kinds of data files above.
- Generate unique protein metadata for the input protein by using TAQT (Topological Analysis and Quantitative Tools).
TAQT produces a labeled multi-resolution dual contour tree, which is used as the protein metadata. The labeled multi-resolution dual contour tree carries a number of metrics.
- SRB (Storage Resource Broker) will be used for storage.
The database will eventually hold a large amount of data, so storage space may become a problem. SRB addresses this limitation because it lets us use heterogeneous resources: it connects repositories at different sites behind a single logical resource, so the available storage grows as repositories are added.
2. Background
- Topological Analysis and Quantitative Tools (TAQT)
TAQT introduces an algorithm for matching 3D volumetric functions based on affine-invariant multi-resolution dual contour trees. A dual contour tree is constructed from the contour tree of a volume by dividing its functional range into segments, such that the connected contour tree edges within a segment become a node in the dual tree. Each node of the dual contour tree corresponds to a connected sub-volume bounded by contours within a certain range segment.
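The construction described above can be illustrated with a minimal sketch. It is not TAQT's implementation: the function name `dual_contour_tree`, the input layout (a dict of vertex values plus an edge list), and the use of equal-width range segments are all assumptions made for illustration. Edges that overlap a segment are grouped into connected components with a small union-find, each component becomes one dual node, and a contour tree edge that crosses a segment boundary links the dual nodes on either side.

```python
from collections import defaultdict

def dual_contour_tree(values, edges, num_segments):
    """Hypothetical sketch. values: vertex -> function value; edges: (u, v) pairs."""
    fmin, fmax = min(values.values()), max(values.values())
    width = (fmax - fmin) / num_segments   # assumption: equal-width segments
    nodes = []      # dual nodes as (segment index, frozenset of edge indices)
    node_of = {}    # (segment index, edge index) -> dual node id
    for s in range(num_segments):
        lo, hi = fmin + s * width, fmin + (s + 1) * width
        # Contour tree edges whose value interval overlaps the open segment (lo, hi).
        inside = [i for i, (u, v) in enumerate(edges)
                  if min(values[u], values[v]) < hi and max(values[u], values[v]) > lo]
        parent = {i: i for i in inside}

        def find(i):
            # Union-find lookup with path halving.
            while parent[i] != i:
                parent[i] = parent[parent[i]]
                i = parent[i]
            return i

        # Edges are connected inside the segment when they share a vertex
        # whose value lies in [lo, hi].
        at_vertex = defaultdict(list)
        for i in inside:
            for vert in edges[i]:
                if lo <= values[vert] <= hi:
                    at_vertex[vert].append(i)
        for group in at_vertex.values():
            for i in group[1:]:
                parent[find(i)] = find(group[0])

        # Each connected component of edges becomes one dual node.
        comps = defaultdict(set)
        for i in inside:
            comps[find(i)].add(i)
        for comp in comps.values():
            node_id = len(nodes)
            nodes.append((s, frozenset(comp)))
            for i in comp:
                node_of[(s, i)] = node_id

    # An edge present in segments s and s+1 links the two dual nodes.
    dual_edges = set()
    for (s, i), a in node_of.items():
        b = node_of.get((s + 1, i))
        if b is not None:
            dual_edges.add((a, b))
    return nodes, sorted(dual_edges)

# Toy contour tree: two minima joined at a saddle, then rising to one maximum.
values = {"min1": 0.0, "min2": 1.0, "saddle": 2.5, "max": 4.0}
edges = [("min1", "saddle"), ("min2", "saddle"), ("saddle", "max")]
nodes, dual_edges = dual_contour_tree(values, edges, num_segments=4)
```

In the toy example the two branches stay separate dual nodes until the segment that contains the saddle, where they merge into one node. The node attributes (volume, range, Betti numbers) are left out of this sketch.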
- Similarity Metric
A similarity metric between two volumes is computed by matching their multi-resolution dual contour trees and the corresponding attributes. Applied to biomolecular structures and their associated properties, this algorithm performs well at separating proteins into different classes. We use level 4 dual contour trees; this is the minimum level that preserves the dual contour tree properties. Similarity metrics for molecular properties such as electron density and electrostatic potential can be defined using norms. There are, however, newer metrics that are invariant under affine transformations; they are based on topological and combinatorial properties and use Morse complexes. The Morse complex is a three-dimensional embedded graph that describes the topology of the critical points of a scalar field f. Morse theory focuses on non-degenerate critical points, where the Hessian has a non-zero determinant.
- Contour Tree
Among these metrics, we use the contour tree as the topological structure of the scalar field f. The contour tree is a data structure that captures the topological characteristics of f. The contour tree itself is very complicated, however, and contains many nodes and edges because of the complexity of the critical points of f. As a result, it is very difficult to define a quantitative metric that directly measures the similarity of two contour trees.
- Dual Contour Tree
From the contour tree, we introduce a new structure called the dual contour tree, which dramatically simplifies the original contour tree and removes small noise and fluctuations in the data. A multi-resolution hierarchy of dual contour trees can be constructed easily to simplify the structure further and to assist in matching the functions. For each node of the dual contour tree, geometrical (volume), functional (range), and topological (Betti numbers) attributes are defined and used to compute the similarity score of 3D functions.
- Matching Algorithm
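- The similarity metric is defined as follows:
$$\langle m, n \rangle = w_1 \langle V(m), V(n) \rangle + w_2 \langle R(m), R(n) \rangle + w_3 \, \frac{\langle B_1(m), B_1(n) \rangle + \langle B_2(m), B_2(n) \rangle}{2}$$
where the weights $0 \le w_1, w_2, w_3 \le 1$, with $w_1 + w_2 + w_3 = 1$, control the relative importance of the different attributes in the similarity computation. The attribute terms are
$$\langle V(m), V(n) \rangle = \frac{\min(V(m), V(n))}{\max(V(m), V(n))}$$
$$\langle R(m), R(n) \rangle = \frac{\min(R(m), R(n))}{\max(R(m), R(n))}$$
$$\langle B(m), B(n) \rangle = \frac{1}{3} \sum_{i=1}^{k} \frac{\max(\min(b_i(m), b_i(n)), 1)}{\max(\max(b_i(m), b_i(n)), 1)}$$
The similarity score of a node with itself is equal to 1:
$$0 \le \langle m, n \rangle \le \langle m, m \rangle = \langle n, n \rangle = 1$$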
However, MoBIoS uses the opposite convention, so there the score is close to 0 when two nodes are similar.
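The node similarity score above can be written out as a short sketch. The class and function names (`DualNode`, `node_similarity`, `node_distance`) are hypothetical and not taken from TAQT or MoBIoS. The sketch assumes each node carries two triples of Betti numbers (k = 3, so that the 1/3 factor gives a self-similarity of 1), and it assumes that the MoBIoS-style score can be obtained as one minus the similarity; the text only states that the convention is reversed, not how.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass(frozen=True)
class DualNode:
    volume: float                    # V(m), geometrical attribute
    range_: float                    # R(m), functional attribute
    betti1: Tuple[int, int, int]     # B1(m), assumed to hold three Betti numbers
    betti2: Tuple[int, int, int]     # B2(m), assumed to hold three Betti numbers

def ratio_sim(a: float, b: float) -> float:
    """min/max ratio used for the volume and range terms."""
    hi = max(a, b)
    # Guard for two zero values is an addition of this sketch.
    return 1.0 if hi == 0 else min(a, b) / hi

def betti_sim(bm, bn) -> float:
    """Betti term: (1/3) * sum of max(min(b_i), 1) / max(max(b_i), 1)."""
    return sum(max(min(x, y), 1) / max(max(x, y), 1) for x, y in zip(bm, bn)) / 3.0

def node_similarity(m: DualNode, n: DualNode, w=(0.4, 0.3, 0.3)) -> float:
    """Weighted similarity <m, n>; the default weights are arbitrary examples."""
    w1, w2, w3 = w
    if min(w) < 0 or abs(sum(w) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    betti = (betti_sim(m.betti1, n.betti1) + betti_sim(m.betti2, n.betti2)) / 2.0
    return (w1 * ratio_sim(m.volume, n.volume)
            + w2 * ratio_sim(m.range_, n.range_)
            + w3 * betti)

def node_distance(m: DualNode, n: DualNode, w=(0.4, 0.3, 0.3)) -> float:
    """Assumed reversed convention: close to 0 when the nodes are similar."""
    return 1.0 - node_similarity(m, n, w)

m = DualNode(volume=2.0, range_=0.5, betti1=(1, 0, 0), betti2=(1, 2, 0))
n = DualNode(volume=4.0, range_=0.5, betti1=(1, 0, 0), betti2=(1, 1, 0))
score = node_similarity(m, n)   # volume term 0.5, range term 1.0, Betti term 11/12
```

With the example weights, the two nodes differ only in volume and in one Betti number, so the score stays fairly high; comparing a node with itself gives a similarity of 1 and a distance of 0.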
- SRB (Storage Resource Broker)
The SDSC Storage Resource Broker (SRB) is a client-server middleware that provides a uniform interface for connecting to heterogeneous data resources over a network and accessing replicated data sets.
(http://www.npaci.edu/DICE/SRB/)
We used SRB to store data such as PDB files, image files, and protein metadata. It can also hold the large volume data files.
3. Design Phase
A. Data Model
(This is a simplified version of the data model.)
Within PDBContour, the keyObject can be one of several signatures, such as a finer dual contour tree, the collection of Betti numbers, or possibly the contour tree itself. Among these signatures, the labeled multi-resolution dual contour tree carries a large amount of information, so we decided to use it. Whenever a new representation of a protein becomes available, we can add a new table and link it to the PDB table with a 1-1 relation. For example, FCC is another type of signature; it can represent proteins without using the volume. Because the volume can be computed from each of the three kinds of signatures above, this project uses the volume to create the protein index for each signature.
B. Database Design
- Create tables using Mckoi (a Java SQL database).
MoBIoS includes the Mckoi database.
- a. Create a volume from a PDB file.
- b. Construct the dual contour tree from the volume.
- c. Insert the path to the dual contour tree, the row id,
and the PDB table id for the PDB entry.
- d. Store all data files in SRB.
- Create data structures for each signature using MoBIoS.
Create a linked list of pairs: each element contains a row id and a protein metadata object.
- a. When the Dataloader class is invoked, it connects to the Mckoi database
and retrieves information such as the row id and the path where the protein metadata
is located in SRB.
- b. Dataloader downloads the protein metadata from SRB.
- c. Create the protein metadata object using TAQT.
- d. Insert the object into the linked list of pairs.
- Support a range query.
- a. Select a scheme for searching for similar proteins based on signatures
(surface, volume, and FCC).
- b. Produce a query object.
- c. Set a range.
- d. Query with the object and range
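The three database design steps above (tables, pair list, range query) can be sketched end to end. Everything here is a stand-in and should be read as an assumption: Python's `sqlite3` replaces Mckoi, a dict replaces SRB, plain tuples replace the metadata objects that TAQT would build, a linear scan replaces the MoBIoS index, and the table columns, paths, and distance function are invented for illustration.

```python
import sqlite3

# Stand-in for SRB: logical path -> stored metadata (hypothetical paths and values).
FAKE_SRB = {
    "/demo/proteinA.dct": (10.0, 2.0),
    "/demo/proteinB.dct": (10.5, 2.1),
    "/demo/proteinC.dct": (40.0, 9.0),
}

def create_tables(conn):
    """PDB table plus a PDBContour table linked to it 1-1 (assumed columns)."""
    conn.execute("CREATE TABLE PDB (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute(
        "CREATE TABLE PDBContour ("
        "row_id INTEGER PRIMARY KEY, "
        "pdb_id INTEGER UNIQUE REFERENCES PDB(id), "
        "path TEXT)"
    )

def load_pairs(conn, storage):
    """Dataloader step: read (row id, path), fetch the metadata, build the pair list."""
    rows = conn.execute("SELECT row_id, path FROM PDBContour ORDER BY row_id")
    return [(row_id, storage[path]) for row_id, path in rows]

def distance(a, b):
    """Placeholder distance between two metadata objects (0 means identical)."""
    return max(abs(x - y) for x, y in zip(a, b))

def range_query(pairs, query, radius):
    """Linear-scan range query: row ids whose distance to the query is <= radius."""
    return [row_id for row_id, obj in pairs if distance(obj, query) <= radius]

conn = sqlite3.connect(":memory:")
create_tables(conn)
for i, path in enumerate(FAKE_SRB, start=1):
    conn.execute("INSERT INTO PDB VALUES (?, ?)", (i, f"protein{i}"))
    conn.execute("INSERT INTO PDBContour VALUES (?, ?, ?)", (i, i, path))

pairs = load_pairs(conn, FAKE_SRB)
hits = range_query(pairs, query=(10.0, 2.0), radius=1.0)
```

Here the query object matches the first entry exactly and lies within the radius of the second, so only those two row ids are returned. A metric-space index would avoid comparing the query against every stored object; the linear scan is used only to keep the sketch short.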
C. Web Server/Interface
- Apache Tomcat/4.1.30
We used the Tomcat server, a Java-based Web application container. Tomcat runs Servlets and JavaServer Pages (JSP) in Web applications.
- Servlets
The Servlets execute the CCTestForm program, which performs index creation and query execution.
- JavaServer Pages
JSP pages provide a simple user interface for use in Web browsers.