ShortestPathMolecularGraphKernels.jl

overview

a molecular graph represents a molecule as a simple graph (nodes: atoms; edges: bonds) where (a) nodes are "labeled" with the atomic species (H, C, N, etc.) of the atoms they represent and (b) edges are "labeled" with the type (aromatic, single, double, triple) of bond they represent.

a molecular graph kernel, loosely, scores the similarity of two molecular graphs. more formally, it corresponds with an inner product of two feature vectors of the molecular graphs. so, some mapping from a molecular graph to a vector representation, theoretically, underlies a molecular graph kernel.

generally, a shortest path molecular graph kernel characterizes each molecular graph by all shortest paths between every pair of nodes in the graph. it looks at the length of and the atom-bond label sequence along these shortest paths. we implement two variants of the shortest path graph kernel. given two input molecular graphs, the kernel counts pairs of corresponding shortest paths between them having the same:

  1. length and unordered pair of atom labels on the source and destination node
  2. length and exact atom-bond label sequence.

the exact sequence matching is much more expressive of molecular similarity, but it corresponds with a sparser underlying feature vector.

this Julia package, ShortestPathMolecularGraphKernels.jl:

  • employs MolecularGraph.jl to interpret a SMILES specification of a molecular structure as a molecular graph.
  • pre-computes and stores all shortest paths along the molecular graph and the atom-bond label sequences along them.
  • computes the shortest path graph kernel between any two input molecular graphs, using either scoring criteria:

same (a) length and terminal node labels or (b) length and exact atom-bond label sequence.

  • implements a multi-threaded computation of a Gram (molecule-molecule similarity) matrix for machine learning.

example

first, we show how to construct a molecular graph representing caffeine from its SMILES representation.

mg = MolGraph(
    "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", # SMILES
    incl_hydrogen=false             # include H atoms?
)

next, we search for and store all shortest paths in this molecular graph and the atom-bond label sequences along them.

find_shortest_paths!(
    mg,                 # the molecular graph
    ℓ_max=typemax(Int)  # the max path length
)

we can visualize the molecular graph and explore any path on it:

id_spath = 39                # index of shortest path to look at
mg.spaths[id_spath]          # shows all info abt this shortest path
viz(mg, id_spath=id_spath)   # visualize graph and highlight shortest path

finally, we can compute the shortest path graph kernel between a pair of molecular graphs using either scoring criteria:

caf = MolGraph("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")   # caffeine
ibu = MolGraph("CC(C)CC1=CC=C(C=C1)C(C)C(=O)O")  # ibuprofin
find_shortest_paths!.([caf, ibu])                 

# criterion: score using atom label set on src and dst nodes paired with length
shortest_path_graph_kernel(ba, ibu, exact_seq_matching=false) # 458.0
# criterion: score using exact atom-bond label sequence (implicitly, length, too)
shortest_path_graph_kernel(ba, ibu, exact_seq_matching=true)  # 206.0

for computing a Gram matrix, we have a multi-threaded implementation:

Threads.nthreads() # to see the number of threads I have available
# list of molecular graphs
mgs = MolGraph.(
    [
        "CN1C=NC2=C1C(=O)N(C)C(=O)N2C", 
        "O=C(O)c1ccccc1",
        "CC(C)CC1=CC=C(C=C1)C(C)C(=O)O",
        "CN1C=NC2=C1C(=O)N(C)C(=O)N2C"
    ]
)

# pre-compute shortest paths
find_shortest_paths!.(mgs)

# compute Gram matrix
K = compute_Gram_matrix(mgs, false) # K[i, j] = K[j, i]: similarity btwn molecular graphs i & j

tests and more demos

run tests/runtests.jl as a Pluto.jl notebook. these tests automatically run via Github actions after every commit and pull request. the tests involve explicitly constructing the feature vectors and taking the dot product, to ensure equality with the kernel.

references

Kriege NM, Johansson FD, Morris C. A survey on graph kernels. Applied Network Science. 2020.

Borgwardt KM, Kriegel HP. Shortest-path kernels on graphs. In Fifth IEEE international conference on data mining. 2005.

Rupp M, Schneider G. Graph kernels for molecular similarity. Molecular Informatics. 2010.

docs

ShortestPathMolecularGraphKernels.MolGraphType
MolGraph(
    smiles::String; incl_hydrogen::Bool=false, verbose::Bool=false
) -> MolGraph

interpret a SMILES representation of a molecular structure as a molecular graph.

relies on MolecularGraph.jl.

fields

  • smiles::String: the SMILES of the molecule represented
  • incl_hydrogen::Bool: include H atoms?
  • g::SimpleGraph: the graph (nodes: atoms; edges: bonds)
  • atoms::Vector{AtomType}: list of labels (atom types) on the nodes
  • bonds::Vector{BondType}: list of labels (bond types) on the edges
  • spaths::Vector{ShortestPath}: list of shortest paths

example

mg = MolGraph("CN1C=NC2=C1C(=O)N(C)C(=O)N2C") # caffeine
source
ShortestPathMolecularGraphKernels.BondTypeType
BondType(bond_order::Int)

an efficient UInt8 encoding of a bond type.

example

bt = BondType(0) # BOND_AROMATIC
bt = BondType(1) # BOND_SINGLE
bt = BondType(2) # BOND_DOUBLE
bt = BondType(3) # BOND_TRIPLE
source
ShortestPathMolecularGraphKernels.ShortestPathType
ShortestPath(
    ρ::Vector{Int}, seq::Vector{UInt8}, ℓ::Int, w::Float64
) -> ShortestPath

a data structure representing a shortest path between two nodes in a molecular graph.

fields

  • ρ::Vector{Int}: the path represented as an ordered sequence of node indices
  • seq::Vector{UInt8}: the atom-bond label sequence traversed along the path
  • ℓ::Int: the length of the path (= # of hops)
  • w::Float64: path weight, which ≂̸ 1.0 only when multiple degenerate

shortest paths exist between the same pair of nodes, used to normalize contributions.

note: under the hood, we store the shortest paths in a canonical order to avoid compute-intensive reversal for matching.

source
ShortestPathMolecularGraphKernels.find_shortest_paths!Function
find_shortest_paths!(mg::MolGraph; ℓ_max::Int=typemax(Int)) -> nothing

search for all shortest paths between all pairs of nodes in a molecular graph mg, up to and including of length ℓ_max. (uses Dijkstra's algorithm.) attach the list of shortest paths and their atom-bond label sequences to the molecular graph.

example

caf = MolGraph("CN1C=NC2=C1C(=O)N(C)C(=O)N2C") # caffeine
caf.spaths # empty array
find_shortest_paths!(caf)
caf.spaths # now contains the list of shortest paths
source
ShortestPathMolecularGraphKernels.get_shortest_pathsFunction
get_shortest_paths(spaths::Vector{ShortestPath}, u::Int, v::Int) -> Vector{ShortestPath}
get_shortest_paths(mg::MolGraph, u::Int, v::Int) -> Vector{ShortestPath}
get_shortest_paths(mg::MolGraph, u::Int) -> Vector{ShortestPath}

filter the list of shortest paths to get those "u ↔ v" or "u ↔ *"

source
ShortestPathMolecularGraphKernels.shortest_path_graph_kernelFunction
shortest_path_graph_kernel(
    mg₁::MolGraph, mg₂::MolGraph; exact_seq_matching::Bool=false
) -> Float64

compute the shortest path molecular graph kernel between two input molecular graphs.

the kernel value is calculated by summing the similarity of all pairs of shortest paths between the two graphs.

if exact_seq_matching::Bool, we require exact matching of the atom-bond label sequence to note two common paths; otherwise, we note two common paths if the length and unordered pair of atom labels in the src and dst nodes are identical.

returns a numerical value representing the similarity between the two graphs.

source
ShortestPathMolecularGraphKernels.compute_Gram_matrixFunction
compute_Gram_matrix(mgs::Vector{MolGraph}, exact_seq_matching::Bool) -> Matrix{Float64}

compute the symmetric Gram (kernel) matrix for a list of molecular graphs.

Each element K[i, j] in the resulting matrix K represents the similarity between molecular graph i and j according to the shortest path graph kernel.

implementation details

  • symmetry: to reduce computation time by 50%, only the upper triangle is explicitly calculated, and values are mirrored to the lower triangle.
  • parallelization: this function utilizes Base.Threads.@threads to distribute the outer loop iterations across available CPU cores.
source
ShortestPathMolecularGraphKernels.vizFunction
viz(
    mg::MolGraph; id_spath::Union{Int, Nothing}=nothing,
    nlabels::Bool=true, size::Tuple{Int, Int}=(375, 375)
)

visualize a molecular graph and, if the index of a shortest path is provided, highlight that shortest path.

uses GraphMakie.jl under the hood and displays a plot in a Pluto.jl notebook.

example

mg = MolGraph("CN1C=NC2=C1C(=O)N(C)C(=O)N2C")
viz(mg, id_spath=39) # displays plot in Pluto
source
ShortestPathMolecularGraphKernels.normalize_Gram_matrixFunction
normalize_Gram_matrix(K::Matrix{Float64}) -> Matrix{Float64}

normalize a Gram (kernel) matrix so that all diagonal entries equal 1.

each entry K[i,j] is divided by sqrt(K[i,i]) * sqrt(K[j,j]), which is equivalent to converting kernel values into cosine similarities between the underlying feature vectors.

source
ShortestPathMolecularGraphKernels.center_Gram_matrixFunction
center_Gram_matrix(K::Matrix{Float64}) -> Matrix{Float64}

center a Gram (kernel) matrix using double centering.

double centering of a Gram matrix removes the mean along each row and each column, then adds back the grand mean to avoid over-subtraction. this is equivalent to re-computing the original feature vectors, centering them, then taking the dot product of those centered feature vectors. do this before e.g. k-PCA!

example

Xs = rand(2, 10) # 10 2D feature vectors
K = Xs' * Xs
K_centered = center_Gram_matrix(K)
source