[ ]:
!pip install parquetdb
!pip install ipykernel

03 - Graph Generators in ParquetGraphDB

In this notebook, we’ll learn how to:

  1. Create a node generator

  2. Add the node generator to the graph

  3. Create an edge generator

  4. Add the edge generator to the graph

  5. Define dependencies between generators

We’ll use the ParquetGraphDB class from parquetdb to demonstrate these features. If you haven’t already installed parquetdb, run the previous cell.

Example Scenario: Modeling Materials Data

Let’s explore how parquetdb generators can build and maintain a graph using a materials science scenario. Materials, at their core, are described by their structure and the chemical elements they contain (their composition).

We can represent this information effectively using a heterograph:

  • Nodes representing Materials (like \(H_2O\) or \(Fe\)).

  • Nodes representing Elements (like \(H\), \(O\), \(Fe\)).

  • Edges showing which Elements make up which Materials.
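
As a rough illustration of this layout, the sketch below derives element nodes and material-element edges from plain Python data. The two sample materials and the field names `formula` and `elements` are hypothetical stand-ins, not the columns of the real dataset, and no parquetdb calls are involved.

```python
# Hypothetical sample data: each material lists the element symbols it contains.
materials = [
    {"formula": "H2O", "elements": ["H", "O"]},
    {"formula": "Fe", "elements": ["Fe"]},
]

# Element nodes: one node per distinct symbol, in first-seen order.
element_symbols = []
for material in materials:
    for symbol in material["elements"]:
        if symbol not in element_symbols:
            element_symbols.append(symbol)

# Map each symbol to its node index so edges can refer to it.
element_index = {symbol: i for i, symbol in enumerate(element_symbols)}

# Edges: one (material_id, element_id) pair per element a material contains.
edges = [
    (material_id, element_index[symbol])
    for material_id, material in enumerate(materials)
    for symbol in material["elements"]
]

print(element_symbols)
print(edges)
```

Each node type gets its own index space, which is why an edge is a pair of indices into two different node lists.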

The real power of generators becomes apparent when considering how this data originates and evolves. Material definitions might come from external files or databases, and the properties of elements might be sourced separately.

Generators provide a robust mechanism to:

  • Ingest and process this source data into graph nodes and edges.

  • Establish dependencies. For instance, the creation of material-element edges depends on both Material and Element nodes existing first.

  • Automate updates. If the input file defining materials changes, or if an element’s properties are updated in its source, generators allow parquetdb to potentially rebuild the affected parts of the graph automatically, ensuring consistency.
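
The dependency idea can be sketched with the standard library alone. The generator names below are hypothetical, and this is only an illustration of ordering by dependency, not a description of how parquetdb resolves dependencies internally.

```python
from graphlib import TopologicalSorter

# Hypothetical generator names mapped to the generators they depend on.
dependencies = {
    "element": set(),
    "material": set(),
    "material_has_element": {"material", "element"},
}

# static_order() yields each name only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Whatever order the two node generators run in, the edge generator always comes last because both node types must exist before edges between them can be built.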

We’ll now set up this example, starting with the data sources for elements and materials.

Setup

[2]:
import os
import shutil
import requests
import io
from pathlib import Path

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

def download_url(url, save_path):
    # Download the parquet file
    response = requests.get(url)
    if response.status_code == 200:
        # Load the downloaded bytes into a PyArrow Table
        parquet_file = io.BytesIO(response.content)
        periodic_table = pq.read_table(parquet_file)
        print(f"Downloaded periodic table data with {len(periodic_table)} elements")
    else:
        # Raising a plain string is a TypeError in Python 3; raise an exception instead
        raise RuntimeError(f"Could not download data from {url} (status {response.status_code})")

    pq.write_table(periodic_table, save_path)


FILE_DIR = Path(".")
DATA_DIR = FILE_DIR / "data"

if DATA_DIR.exists():
    shutil.rmtree(DATA_DIR)

DATA_DIR.mkdir(parents=True, exist_ok=True)

# URL to the raw data file in the GitHub repository
elements_url = "https://github.com/lllangWV/ParquetDB/raw/GraphDB/tests/graph/data/interim_periodic_table_values.parquet"
materials_url = "https://github.com/lllangWV/ParquetDB/raw/GraphDB/tests/graph/data/materials/materials_0.parquet"

elements_file = DATA_DIR / "elements.parquet"
materials_file = DATA_DIR / "materials.parquet"

download_url(elements_url,elements_file)
download_url(materials_url,materials_file)

Downloaded periodic table data with 118 elements
Downloaded periodic table data with 1000 elements
[3]:
elements_table = pq.read_table(elements_file)
materials_table = pq.read_table(materials_file)
print(elements_table)
print(materials_table)
pyarrow.Table
long_name: string
symbol: string
abundance_universe: double
abundance_solar: double
abundance_meteor: double
abundance_crust: double
abundance_ocean: double
abundance_human: double
adiabatic_index: string
allotropes: string
appearance: string
atomic_mass: double
atomic_number: int64
block: string
boiling_point: double
classifications_cas_number: string
classifications_cid_number: string
classifications_rtecs_number: string
classifications_dot_numbers: string
classifications_dot_hazard_class: double
conductivity_thermal: double
cpk_hex: string
critical_pressure: double
critical_temperature: double
crystal_structure: string
density_stp: double
discovered_year: int64
discovered_by: string
discovered_location: string
electron_affinity: double
electron_configuration: string
electron_configuration_semantic: string
electronegativity_pauling: double
energy_levels: string
gas_phase: string
group: int64
extended_group: int64
half_life: string
heat_specific: double
heat_vaporization: double
heat_fusion: double
heat_molar: double
isotopes_known: string
isotopes_stable: string
isotopic_abundances: string
lattice_angles: string
lattice_constants: string
lifetime: string
magnetic_susceptibility_mass: double
magnetic_susceptibility_molar: double
magnetic_susceptibility_volume: double
magnetic_type: string
melting_point: double
molar_volume: double
neutron_cross_section: double
neutron_mass_absorption: double
oxidation_states: string
period: int64
phase: string
quantum_numbers: string
radius_calculated: double
radius_empirical: double
radius_covalent: double
radius_vanderwaals: double
refractive_index: double
series: string
source: string
space_group_name: string
space_group_number: double
speed_of_sound: double
summary: string
valence_electrons: double
conductivity_electric: double
electrical_resistivity: double
electrical_type: string
modulus_bulk: double
modulus_shear: double
modulus_young: double
poisson_ratio: double
coefficient_of_linear_thermal_expansion: double
hardness_vickers: double
hardness_brinell: double
hardness_mohs: double
superconduction_temperature: double
is_actinoid: bool
is_alkali: bool
is_alkaline: bool
is_chalcogen: bool
is_halogen: bool
is_lanthanoid: bool
is_metal: bool
is_metalloid: bool
is_noble_gas: bool
is_post_transition_metal: bool
is_quadrupolar: bool
is_rare_earth_metal: bool
experimental_oxidation_states: string
ionization_energies: string
----
long_name: [["Hydrogen","Helium","Lithium","Beryllium","Boron",...,"Flerovium","Moscovium","Livermorium","Tennessine","Oganesson"]]
symbol: [["H","He","Li","Be","B",...,"Fl","Mc","Lv","Ts","Og"]]
abundance_universe: [[75,23,6e-7,1e-7,1e-7,...,0,0,0,0,0]]
abundance_solar: [[75,23,6e-9,1e-8,2e-7,...,0,0,0,0,0]]
abundance_meteor: [[2.4,null,0.00017,0.0000029,0.00016,...,0,0,0,0,0]]
abundance_crust: [[0.15,5.5e-7,0.0017,0.00019,0.00086,...,0,0,0,0,0]]
abundance_ocean: [[11,7.2e-10,0.000018,6e-11,0.00044,...,0,0,0,0,0]]
abundance_human: [[10,null,0.000003,4e-8,0.00007,...,0,0,0,0,0]]
adiabatic_index: [["5-Jul","3-May",null,null,null,...,null,null,null,null,null]]
allotropes: [["Dihydrogen",null,null,null,"Alpha Rhombohedral Boron, Beta Rhombohedral Boron, Alpha Tetragonal Boron",...,null,null,null,null,null]]
...
pyarrow.Table
bonding.cutoff_method.bond_connections: list<element: list<element: int64>>
  child 0, element: list<element: int64>
      child 0, element: int64
bonding.electric_consistent.bond_connections: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
bonding.electric_consistent.bond_orders: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
bonding.geometric_consistent.bond_connections: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
bonding.geometric_consistent.bond_orders: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
bonding.geometric_electric_consistent.bond_connections: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
bonding.geometric_electric_consistent.bond_orders: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
chargemol.bond_connections: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
chargemol.bond_orders: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
chargemol.cubed_moments: list<element: double>
  child 0, element: double
chargemol.fourth_moments: list<element: double>
  child 0, element: double
chargemol.squared_moments: list<element: double>
  child 0, element: double
chemenv.coordination_environments_multi_weight: list<element: list<element: struct<ce_fraction: double, ce_symbol: string, csm: double, permutation: list<element: int64>>>>
  child 0, element: list<element: struct<ce_fraction: double, ce_symbol: string, csm: double, permutation: list<element: int64>>>
      child 0, element: struct<ce_fraction: double, ce_symbol: string, csm: double, permutation: list<element: int64>>
          child 0, ce_fraction: double
          child 1, ce_symbol: string
          child 2, csm: double
          child 3, permutation: list<element: int64>
              child 0, element: int64
chemenv.coordination_multi_connections: list<element: list<element: int64>>
  child 0, element: list<element: int64>
      child 0, element: int64
chemenv.coordination_multi_numbers: list<element: int64>
  child 0, element: int64
core.atomic_numbers: list<element: int64>
  child 0, element: int64
core.cartesian_coords: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
core.density: double
core.density_atomic: double
core.elements: list<element: string>
  child 0, element: string
core.energy_per_atom: double
core.formula: string
core.formula_pretty: string
core.frac_coords: list<element: list<element: double>>
  child 0, element: list<element: double>
      child 0, element: double
core.is_gap_direct: bool
core.is_magnetic: bool
core.is_metal: bool
core.is_stable: bool
core.lattice: extension<arrow.fixed_shape_tensor[value_type=double, shape=[3,3]]>
core.material_id: string
core.nelements: int64
core.nsites: int64
core.species: list<element: string>
  child 0, element: string
core.volume: double
dielectric.e_electronic: double
dielectric.e_ij_max: double
dielectric.e_ionic: double
dielectric.e_total: double
dielectric.n: double
elasticity.compliance_tensor_ieee_format: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.compliance_tensor_raw: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.debye_temperature: double
elasticity.elastic_tensor_ieee_format: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.elastic_tensor_raw: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.g_reuss: double
elasticity.g_voigt: double
elasticity.g_vrh: double
elasticity.homogeneous_poisson: double
elasticity.k_reuss: double
elasticity.k_voigt: double
elasticity.k_vrh: double
elasticity.order: int64
elasticity.sound_velocity_acoustic: double
elasticity.sound_velocity_longitudinal: double
elasticity.sound_velocity_optical: double
elasticity.sound_velocity_total: double
elasticity.sound_velocity_transverse: double
elasticity.state: string
elasticity.thermal_conductivity_cahill: double
elasticity.thermal_conductivity_clarke: double
elasticity.universal_anisotropy: double
elasticity.warnings: list<element: string>
  child 0, element: string
elasticity.young_modulus: null
electronic_structure.band_gap: double
electronic_structure.cbm: double
electronic_structure.dos_energy_up: null
electronic_structure.efermi: double
electronic_structure.vbm: double
feature_vectors.element_fraction: extension<arrow.fixed_shape_tensor[value_type=double, shape=[103]]>
feature_vectors.element_property: extension<arrow.fixed_shape_tensor[value_type=double, shape=[132]]>
feature_vectors.sine_coulomb_matrix: extension<arrow.fixed_shape_tensor[value_type=double, shape=[432]]>
feature_vectors.xrd_pattern: extension<arrow.fixed_shape_tensor[value_type=double, shape=[128]]>
grain_boundaries.grain_boundaries: list<element: struct<gb_energy: double, rotation_angle: double, sigma: int64, type: string>>
  child 0, element: struct<gb_energy: double, rotation_angle: double, sigma: int64, type: string>
      child 0, gb_energy: double
      child 1, rotation_angle: double
      child 2, sigma: int64
      child 3, type: string
has_props.absorption: bool
has_props.bandstructure: bool
has_props.charge_density: bool
has_props.chemenv: bool
has_props.dielectric: bool
has_props.dos: bool
has_props.elasticity: bool
has_props.electronic_structure: bool
has_props.eos: bool
has_props.grain_boundaries: bool
has_props.insertion_electrodes: bool
has_props.magnetism: bool
has_props.materials: bool
has_props.oxi_states: bool
has_props.phonon: bool
has_props.piezoelectric: bool
has_props.provenance: bool
has_props.substrates: bool
has_props.surface_properties: bool
has_props.thermo: bool
has_props.xas: bool
id: int64
magnetism.num_magnetic_sites: int64
magnetism.num_unique_magnetic_sites: int64
magnetism.ordering: string
magnetism.total_magnetization: double
magnetism.total_magnetization_normalized_vol: double
magnetism.types_of_magnetic_species: list<element: string>
  child 0, element: string
metadata.last_updated: string
metadata.theoretical: bool
oxidation_states.method: string
oxidation_states.possible_species: list<element: string>
  child 0, element: string
oxidation_states.possible_valences: list<element: double>
  child 0, element: double
structure.@class: string
structure.@module: string
structure.charge: double
structure.lattice.a: double
structure.lattice.alpha: double
structure.lattice.b: double
structure.lattice.beta: double
structure.lattice.c: double
structure.lattice.gamma: double
structure.lattice.matrix: extension<arrow.fixed_shape_tensor[value_type=double, shape=[3,3]]>
structure.lattice.pbc: extension<arrow.fixed_shape_tensor[value_type=bool, shape=[3]]>
structure.lattice.volume: double
structure.sites: list<element: struct<abc: list<element: double>, label: string, properties: struct<magmom: double>, species: list<element: struct<element: string, occu: int64>>, xyz: list<element: double>>>
  child 0, element: struct<abc: list<element: double>, label: string, properties: struct<magmom: double>, species: list<element: struct<element: string, occu: int64>>, xyz: list<element: double>>
      child 0, abc: list<element: double>
          child 0, element: double
      child 1, label: string
      child 2, properties: struct<magmom: double>
          child 0, magmom: double
      child 3, species: list<element: struct<element: string, occu: int64>>
          child 0, element: struct<element: string, occu: int64>
              child 0, element: string
              child 1, occu: int64
      child 4, xyz: list<element: double>
          child 0, element: double
surface_properties.shape_factor: double
surface_properties.surface_anisotropy: double
surface_properties.weighted_surface_energy: double
surface_properties.weighted_surface_energy_EV_PER_ANG2: double
surface_properties.weighted_work_function: double
symmetry.crystal_system: string
symmetry.number: int64
symmetry.point_group: string
symmetry.symbol: string
symmetry.symprec: double
symmetry.version: string
symmetry.wyckoffs: list<element: string>
  child 0, element: string
thermo.decomposes_to: list<element: struct<amount: double, formula: string, material_id: string>>
  child 0, element: struct<amount: double, formula: string, material_id: string>
      child 0, amount: double
      child 1, formula: string
      child 2, material_id: string
thermo.energy_above_hull: double
thermo.equilibrium_reaction_energy_per_atom: double
thermo.formation_energy_per_atom: double
thermo.uncorrected_energy_per_atom: double
----
bonding.cutoff_method.bond_connections: [[null,null,...,null,null]]
bonding.electric_consistent.bond_connections: [[[[],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[],[],...,[2],[2]],...,[[8,9,10,11,12,16,16,19,19],[8,9,10,11,13,17,17,18,18],...,[1,1,2,2,4,7,13,13,14],[0,0,3,3,5,6,12,12,15]],null]]
bonding.electric_consistent.bond_orders: [[[[],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[],[],...,[0.5889],[0.5889]],...,[[0.2795,0.2883,0.2883,0.2795,0.2481,0.307,0.3419,0.311,0.311],[0.2883,0.2795,0.2795,0.2883,0.2481,0.3419,0.307,0.311,0.311],...,[0.311,0.311,0.307,0.3419,0.5776,0.5776,0.1163,0.1163,0.4588],[0.311,0.311,0.3419,0.307,0.5776,0.5776,0.1163,0.1163,0.4588]],null]]
bonding.geometric_consistent.bond_connections: [[[[2,2,2,2,3,...,3,4,4,4,4],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[6,7,12,13,10,11],[4,5,14,15,8,9],...,[2,1],[2,1]],...,[[16,19,19,16,9,10,8,11,12],[17,18,18,17,8,11,9,10,13],...,[4,7,14,2,1,1,2],[5,6,15,3,0,0,3]],null]]
bonding.geometric_consistent.bond_orders: [[[[0.0734,0.0734,0.0734,0.0734,0.0734,...,0.0734,0.0734,0.0734,0.0734,0.0734],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[0.0842,0.0842,0.0707,0.0707,0.0619,0.0619],[0.0842,0.0842,0.0707,0.0707,0.0619,0.0619],...,[0.5889,0.0707],[0.5889,0.0707]],...,[[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],...,[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307],[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307]],null]]
bonding.geometric_electric_consistent.bond_connections: [[[[],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[],[],...,[2],[2]],...,[[16,19,19,16,9,10,8,11,12],[17,18,18,17,8,11,9,10,13],...,[4,7,14,2,1,1,2],[5,6,15,3,0,0,3]],null]]
bonding.geometric_electric_consistent.bond_orders: [[[[],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[],[],...,[0.5889],[0.5889]],...,[[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],...,[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307],[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307]],null]]
chargemol.bond_connections: [[[[0,0,0,0,0,...,3,4,4,4,4],[0,0,0,0,0,...,2,3,3,4,4],[0,0,0,0,1,...,3,4,4,4,4],[0,0,0,0,1,...,3,4,4,4,4],[0,0,0,0,1,...,3,4,4,4,4]],[[2,3,3,3,3,...,9,10,11,12,13],[2,2,2,2,3,...,9,10,11,14,15],...,[1,2,2,2,3,...,9,11,15,15,15],[1,2,2,2,3,...,9,10,14,14,14]],...,[[3,3,3,3,5,...,15,16,16,19,19],[2,2,2,2,4,...,14,17,17,18,18],...,[1,1,2,2,4,...,14,17,17,17,17],[0,0,3,3,4,...,15,16,16,16,16]],null]]
chargemol.bond_orders: [[[[0.0033,0.0033,0.0033,0.0033,0.0033,...,0.0734,0.0734,0.0734,0.0734,0.0734],[0.0155,0.0155,0.0155,0.0155,0.0155,...,0.6945,0.6945,0.6945,0.6945,0.6945],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0347,0.0347,0.0347,0.0347,0.0347],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0017,0.0347,0.0347,0.0347,0.0347],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0347,0.0017,0.0017,0.0017,0.0017]],[[0.0049,0.0011,0.0011,0.0011,0.0011,...,0.002,0.0619,0.0619,0.0707,0.0707],[0.0011,0.0011,0.0011,0.0011,0.0049,...,0.0619,0.002,0.002,0.0707,0.0707],...,[0.0707,0.5889,0.0013,0.0013,0.0046,...,0.0271,0.0435,0.0076,0.0076,0.0438],[0.0707,0.0013,0.0013,0.5889,0.0046,...,0.0259,0.0435,0.0076,0.0076,0.0438]],...,[[0.0167,0.0179,0.0167,0.0179,0.0612,...,0.002,0.307,0.3419,0.311,0.311],[0.0179,0.0167,0.0179,0.0167,0.0612,...,0.002,0.3419,0.307,0.311,0.311],...,[0.311,0.311,0.307,0.3419,0.5776,...,0.4588,0.0302,0.0405,0.0302,0.0405],[0.311,0.311,0.3419,0.307,0.0043,...,0.0265,0.0405,0.0302,0.0405,0.0302]],null]]
chargemol.cubed_moments: [[[71.579692,117.080036,41.119066,41.119066,41.119066],[4.090622,4.090516,41.32723,41.326694,22.622699,...,22.349214,22.449562,22.449562,22.449615,22.449615],...,[144.779124,144.779125,144.779138,144.779139,82.876595,...,168.458199,194.520457,194.520458,194.520409,194.52041],null]]
...

Next, we load the materials data into ParquetGraphDB.

[4]:
from parquetdb import ParquetGraphDB

# Create a temporary directory for our database
GRAPH_DB_DIR = DATA_DIR / "GraphDB"
if GRAPH_DB_DIR.exists():
    shutil.rmtree(GRAPH_DB_DIR)
GRAPH_DB_DIR.mkdir(parents=True, exist_ok=True)


# Initialize ParquetGraphDB
db = ParquetGraphDB(storage_path=GRAPH_DB_DIR)

# The data already has an id column from a previous database, so we remove it
data = pq.read_table(materials_file)
data = data.drop_columns("id")
db.add_nodes(node_type="material", data=data)

print(db.summary(show_column_names=True))

============================================================
GRAPH DATABASE SUMMARY
============================================================
Name: GraphDB
Storage path: data\GraphDB
└── Repository structure:
    ├── nodes/                 (data\GraphDB\nodes)
    ├── edges/                 (data\GraphDB\edges)
    ├── edge_generators/       (data\GraphDB\edge_generators)
    ├── node_generators/       (data\GraphDB\node_generators)
    └── graph/                 (data\GraphDB\graph)

############################################################
NODE DETAILS
############################################################
Total node types: 1
------------------------------------------------------------
• Node type: material
  - Number of nodes: 1000
  - Number of features: 136
  - Columns:
       - bonding.cutoff_method.bond_connections
       - bonding.electric_consistent.bond_connections
       - bonding.electric_consistent.bond_orders
       - bonding.geometric_consistent.bond_connections
       - bonding.geometric_consistent.bond_orders
       - bonding.geometric_electric_consistent.bond_connections
       - bonding.geometric_electric_consistent.bond_orders
       - chargemol.bond_connections
       - chargemol.bond_orders
       - chargemol.cubed_moments
       - chargemol.fourth_moments
       - chargemol.squared_moments
       - chemenv.coordination_environments_multi_weight
       - chemenv.coordination_multi_connections
       - chemenv.coordination_multi_numbers
       - core.atomic_numbers
       - core.cartesian_coords
       - core.density
       - core.density_atomic
       - core.elements
       - core.energy_per_atom
       - core.formula
       - core.formula_pretty
       - core.frac_coords
       - core.is_gap_direct
       - core.is_magnetic
       - core.is_metal
       - core.is_stable
       - core.lattice
       - core.material_id
       - core.nelements
       - core.nsites
       - core.species
       - core.volume
       - dielectric.e_electronic
       - dielectric.e_ij_max
       - dielectric.e_ionic
       - dielectric.e_total
       - dielectric.n
       - elasticity.compliance_tensor_ieee_format
       - elasticity.compliance_tensor_raw
       - elasticity.debye_temperature
       - elasticity.elastic_tensor_ieee_format
       - elasticity.elastic_tensor_raw
       - elasticity.g_reuss
       - elasticity.g_voigt
       - elasticity.g_vrh
       - elasticity.homogeneous_poisson
       - elasticity.k_reuss
       - elasticity.k_voigt
       - elasticity.k_vrh
       - elasticity.order
       - elasticity.sound_velocity_acoustic
       - elasticity.sound_velocity_longitudinal
       - elasticity.sound_velocity_optical
       - elasticity.sound_velocity_total
       - elasticity.sound_velocity_transverse
       - elasticity.state
       - elasticity.thermal_conductivity_cahill
       - elasticity.thermal_conductivity_clarke
       - elasticity.universal_anisotropy
       - elasticity.warnings
       - elasticity.young_modulus
       - electronic_structure.band_gap
       - electronic_structure.cbm
       - electronic_structure.dos_energy_up
       - electronic_structure.efermi
       - electronic_structure.vbm
       - feature_vectors.element_fraction
       - feature_vectors.element_property
       - feature_vectors.sine_coulomb_matrix
       - feature_vectors.xrd_pattern
       - grain_boundaries.grain_boundaries
       - has_props.absorption
       - has_props.bandstructure
       - has_props.charge_density
       - has_props.chemenv
       - has_props.dielectric
       - has_props.dos
       - has_props.elasticity
       - has_props.electronic_structure
       - has_props.eos
       - has_props.grain_boundaries
       - has_props.insertion_electrodes
       - has_props.magnetism
       - has_props.materials
       - has_props.oxi_states
       - has_props.phonon
       - has_props.piezoelectric
       - has_props.provenance
       - has_props.substrates
       - has_props.surface_properties
       - has_props.thermo
       - has_props.xas
       - id
       - magnetism.num_magnetic_sites
       - magnetism.num_unique_magnetic_sites
       - magnetism.ordering
       - magnetism.total_magnetization
       - magnetism.total_magnetization_normalized_vol
       - magnetism.types_of_magnetic_species
       - metadata.last_updated
       - metadata.theoretical
       - oxidation_states.method
       - oxidation_states.possible_species
       - oxidation_states.possible_valences
       - structure.@class
       - structure.@module
       - structure.charge
       - structure.lattice.a
       - structure.lattice.alpha
       - structure.lattice.b
       - structure.lattice.beta
       - structure.lattice.c
       - structure.lattice.gamma
       - structure.lattice.matrix
       - structure.lattice.pbc
       - structure.lattice.volume
       - structure.sites
       - surface_properties.shape_factor
       - surface_properties.surface_anisotropy
       - surface_properties.weighted_surface_energy
       - surface_properties.weighted_surface_energy_EV_PER_ANG2
       - surface_properties.weighted_work_function
       - symmetry.crystal_system
       - symmetry.number
       - symmetry.point_group
       - symmetry.symbol
       - symmetry.symprec
       - symmetry.version
       - symmetry.wyckoffs
       - thermo.decomposes_to
       - thermo.energy_above_hull
       - thermo.equilibrium_reaction_energy_per_atom
       - thermo.formation_energy_per_atom
       - thermo.uncorrected_energy_per_atom
  - db_path: data\GraphDB\nodes\material
------------------------------------------------------------

############################################################
EDGE DETAILS
############################################################
Total edge types: 0
------------------------------------------------------------

############################################################
NODE GENERATOR DETAILS
############################################################
Total node generators: 0
------------------------------------------------------------

############################################################
EDGE GENERATOR DETAILS
############################################################
Total edge generators: 0
------------------------------------------------------------

Generators

A Generator is a callable (function) that returns a PyArrow Table of either nodes or edges. By adding a generator to ParquetGraphDB, you can:

  1. Register the generator, so it can be re-run on demand.

  2. Optionally specify arguments/kwargs to pass into the generator.

  3. Automatically store the output in a NodeStore or EdgeStore with the same name as the generator function (or a custom name, if you prefer).

This is especially handy for generating nodes from external data sources or from computational routines.
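
To make the three points above concrete, here is a toy registry in plain Python. The names `register`, `run`, `registry`, and `store` are hypothetical and chosen for illustration; this is a sketch of the concept, not parquetdb's implementation or API.

```python
# Toy registry: generator name -> (callable, positional args, keyword args).
registry = {}
store = {}  # stands in for a NodeStore/EdgeStore keyed by name


def register(func, *args, name=None, **kwargs):
    """Remember a generator and its arguments under a name."""
    registry[name or func.__name__] = (func, args, kwargs)


def run(name):
    """Re-run a registered generator and store its latest output."""
    func, args, kwargs = registry[name]
    store[name] = func(*args, **kwargs)
    return store[name]


def element(symbols=("H", "O")):
    """Hypothetical generator returning one row per element."""
    return [{"symbol": s} for s in symbols]


# Register once with custom keyword arguments, then run on demand.
register(element, symbols=("H", "O", "Fe"))
rows = run("element")
print(rows)
```

Because the registry keeps the callable together with its arguments, calling `run("element")` again later reproduces the output without the caller having to remember how the generator was configured.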

In the following sections we will create custom node and edge generators. These can be created by wrapping existing functions with the node_generator or edge_generator decorators.

These can be imported like:

from parquetdb import node_generator, edge_generator

Element Node Generator

1. Define the Generator

In our first example, we will create a node generator that creates element nodes.

As mentioned above, to create a node generator we wrap an existing function with the node_generator decorator. The function name becomes the name of the node type.

@node_generator
def element():
    ...

For this example, we will import the periodic table data. This is a dataframe with 118 rows, one for each of the 118 elements of the periodic table. The generator also applies some transformations to the data to make it more useful for our purposes.
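
Several columns in the elements file (for example oxidation_states and ionization_energies) store lists as strings, and the transformations parse them into Python lists. The sketch below shows one way to do that parsing with `ast.literal_eval`, which accepts only Python literals and is therefore a stricter alternative to `eval`. The sample strings and the helper name `parse_space_separated` are assumptions for illustration, inferred from the bracket-stripping steps in the generator that follows.

```python
import ast


def parse_space_separated(text):
    """Turn a bracketed, space-separated string such as "[-1 1]" into a list."""
    inner = text.replace("[", "").replace("]", "")
    # Join the whitespace-separated tokens with commas to form a list literal.
    return ast.literal_eval("[" + ",".join(inner.split()) + "]")


# Hypothetical sample values resembling the string-encoded columns.
oxidation_states = parse_space_separated("[-1 1]")
ionization_energies = ast.literal_eval("[1312.0]")

print(oxidation_states)
print(ionization_energies)
```

Unlike `eval`, `ast.literal_eval` raises an error on anything that is not a plain literal, which is a safer default when the strings come from a downloaded file.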

[5]:
### Element Node Generator
from parquetdb import node_generator

# Define the generator with the @node_generator decorator
@node_generator
def element(base_file=elements_file):
    """
    Creates Element nodes from a local file (CSV or Parquet).
    Returns a Pandas DataFrame (or PyArrow Table) with one row per element.
    """

    try:
        # Read the file based on its extension, e.g. "parquet" or "csv"
        file_ext = os.path.splitext(base_file)[-1][1:].lower()
        if file_ext == "parquet":
            df = pd.read_parquet(base_file)
        elif file_ext == "csv":
            df = pd.read_csv(base_file)
        else:
            raise ValueError("base_file must be a parquet or csv file")

        # Example transformations: parse string-encoded lists into Python lists
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: x.replace("]", "").replace("[", "")
        )
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: ",".join(x.split())
        )
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: eval("[" + x + "]")
        )
        df["experimental_oxidation_states"] = df["experimental_oxidation_states"].apply(
            lambda x: eval(x)
        )
        df["ionization_energies"] = df["ionization_energies"].apply(lambda x: eval(x))

    except Exception as e:
        print(f"Error reading element file: {e}")
        return None

    return df  # Return the transformed dataframe


df = element()

print(df)
       long_name symbol  abundance_universe  abundance_solar  \
0       Hydrogen      H        7.500000e+01     7.500000e+01
1         Helium     He        2.300000e+01     2.300000e+01
2        Lithium     Li        6.000000e-07     6.000000e-09
3      Beryllium     Be        1.000000e-07     1.000000e-08
4          Boron      B        1.000000e-07     2.000000e-07
..           ...    ...                 ...              ...
113    Flerovium     Fl        0.000000e+00     0.000000e+00
114    Moscovium     Mc        0.000000e+00     0.000000e+00
115  Livermorium     Lv        0.000000e+00     0.000000e+00
116   Tennessine     Ts        0.000000e+00     0.000000e+00
117    Oganesson     Og        0.000000e+00     0.000000e+00

     abundance_meteor  abundance_crust  abundance_ocean  abundance_human  \
0            2.400000     1.500000e-01     1.100000e+01     1.000000e+01
1                 NaN     5.500000e-07     7.200000e-10              NaN
2            0.000170     1.700000e-03     1.800000e-05     3.000000e-06
3            0.000003     1.900000e-04     6.000000e-11     4.000000e-08
4            0.000160     8.600000e-04     4.400000e-04     7.000000e-05
..                ...              ...              ...              ...
113          0.000000     0.000000e+00     0.000000e+00     0.000000e+00
114          0.000000     0.000000e+00     0.000000e+00     0.000000e+00
115          0.000000     0.000000e+00     0.000000e+00     0.000000e+00
116          0.000000     0.000000e+00     0.000000e+00     0.000000e+00
117          0.000000     0.000000e+00     0.000000e+00     0.000000e+00

    adiabatic_index                                         allotropes  ...  \
0             5-Jul                                         Dihydrogen  ...
1             3-May                                               None  ...
2              None                                               None  ...
3              None                                               None  ...
4              None  Alpha Rhombohedral Boron, Beta Rhombohedral Bo...  ...
..              ...                                                ...  ...
113            None                                               None  ...
114            None                                               None  ...
115            None                                               None  ...
116            None                                               None  ...
117            None                                               None  ...

    is_halogen  is_lanthanoid  is_metal is_metalloid  is_noble_gas  \
0        False          False     False        False         False
1        False          False     False        False          True
2        False          False      True        False         False
3        False          False      True        False         False
4        False          False     False         True         False
..         ...            ...       ...          ...           ...
113      False          False     False        False         False
114      False          False     False        False         False
115      False          False     False        False         False
116      False          False     False        False         False
117      False          False     False        False          True

    is_post_transition_metal is_quadrupolar is_rare_earth_metal  \
0                      False           True               False
1                      False          False               False
2                      False           True               False
3                      False           True               False
4                      False           True               False
..                       ...            ...                 ...
113                    False          False               False
114                    False          False               False
115                    False          False               False
116                    False          False               False
117                    False          False               False

    experimental_oxidation_states                        ionization_energies
0                              []                                   [1312.0]
1                              []                           [2372.3, 5250.5]
2                             [1]                   [520.2, 7298.1, 11815.0]
3                             [2]          [899.5, 1757.1, 14848.7, 21006.6]
4                             [3]  [800.6, 2427.1, 3659.7, 25025.8, 32826.7]
..                            ...                                        ...
113                           [2]    [832.2, 1600.0, 3370.0, 4400.0, 5850.0]
114                           [3]    [538.3, 1760.0, 2650.0, 4680.0, 5720.0]
115                          [-2]    [663.9, 1330.0, 2850.0, 3810.0, 6080.0]
116                          [-1]    [736.9, 1435.4, 2161.9, 4012.9, 5076.4]
117                            []                            [860.1, 1560.0]

[118 rows x 98 columns]

2. Add the Generator to the ParquetGraphDB

Now that we have defined the generator, we can add it to the ParquetGraphDB instance by calling the add_node_generator method. We pass the generator function together with its arguments (generator_args) and keyword arguments (generator_kwargs). The run_immediately flag controls whether the generator runs as soon as it is added or is left to be run later; it defaults to True.

The node generator will be stored in the node_generator_store of the ParquetGraphDB instance.

[6]:
db.add_node_generator(
    generator_func=element,
    generator_args={},
    generator_kwargs={"base_file": elements_file},
    run_immediately=False,  # We have the option to run the generator immediately or later. Default is True.
)

# Check the node generators in the ParquetGraphDB

print(db.node_generator_store)
============================================================
GENERATOR STORE SUMMARY
============================================================
• Number of generators: 1
Storage path: data\GraphDB\node_generators


############################################################
METADATA
############################################################
• class: GeneratorStore
• class_module: parquetdb.graph.generator_store

############################################################
GENERATOR DETAILS
############################################################
• Columns:
    - generator_func
    - generator_kwargs.base_file
    - generator_name
    - id

• Generator names:
    - element

Running a Node Generator Later

Now we can run the node generator with db.run_node_generator(generator_name).

Note: We do not need to pass the arguments or kwargs again, because this information is stored in the node generator store. However, we can override the arguments or kwargs if we want to.

[7]:
table = db.run_node_generator("element")

Let's check the node store for the elements.

[8]:
element_node_store = db.get_node_store("element")
print(element_node_store)
============================================================
NODE STORE SUMMARY
============================================================
Node type: element
• Number of nodes: 118
• Number of features: 99
Storage path: data\GraphDB\nodes\element


############################################################
METADATA
############################################################
• class: NodeStore
• class_module: parquetdb.graph.nodes
• node_type: element
• name_column: id

############################################################
NODE DETAILS
############################################################

Material-Element Edge Generator

1. Define the Generator

An edge generator is similar to a node generator but returns a PyArrow Table describing edges. Each generated edge must have at least these fields:

  • source_id (int)

  • source_type (string)

  • target_id (int)

  • target_type (string)

Additionally, edge generators must take the corresponding node stores of the ParquetGraphDB instance as arguments. This ensures that the node ids are valid and come from the correct node store.

For edges we use the edge_generator decorator.

[9]:
from parquetdb import edge_generator
import pyarrow as pa


@edge_generator
def material_element_has(
    material_store, element_store
):  # We have the material_store and element_store as an argument
    try:
        connection_name = "has"

        # We select only the necessary columns from the node stores
        material_table = material_store.read_nodes(
            columns=["id", "core.material_id", "core.elements"]
        )
        element_table = element_store.read_nodes(columns=["id", "symbol"])

        # We rename for utility purposes
        material_table = material_table.rename_columns(
            {"id": "source_id", "core.material_id": "material_name"}
        )
        material_table = material_table.append_column(
            "source_type", pa.array(["material"] * material_table.num_rows)
        )

        element_table = element_table.rename_columns({"id": "target_id"})
        element_table = element_table.append_column(
            "target_type", pa.array(["element"] * element_table.num_rows)
        )

        # We convert the tables to pandas for easier manipulation
        material_df = material_table.to_pandas()
        element_df = element_table.to_pandas()

        # We create a map of the element symbols to the target_id for quick lookup
        element_target_id_map = {
            row["symbol"]: row["target_id"] for _, row in element_df.iterrows()
        }

        # We create a dictionary to store the edge data
        table_dict = {
            "source_id": [],
            "source_type": [],
            "target_id": [],
            "target_type": [],
            "edge_type": [],
            "name": [],
            "weight": [],
        }

        # We iterate over the material nodes
        for _, row in material_df.iterrows():
            # We get the elements composing the material
            elements = row["core.elements"]
            source_id = row["source_id"]
            material_name = row["material_name"]
            if elements is None:
                continue

            # We iterate over the elements
            for element in elements:
                # We get the target_id for the element
                target_id = element_target_id_map[element]

                # We append the edge data to the dictionary. Here we could also define the reverse edge as well.
                table_dict["source_id"].append(source_id)
                table_dict["source_type"].append(material_store.node_type)
                table_dict["target_id"].append(target_id)
                table_dict["target_type"].append(element_store.node_type)
                table_dict["edge_type"].append(connection_name)

                name = f"{material_name}_{connection_name}_{element}"
                table_dict["name"].append(name)
                table_dict["weight"].append(1.0)

        df = pd.DataFrame(table_dict)
    except Exception as e:
        # Report the failure, then re-raise so we never reach `return df`
        # with `df` undefined
        print(f"Error creating material-element-has relationships: {e}")
        raise

    return df
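The row-by-row loop in the generator above can also be written with pandas `explode` and `merge`. The sketch below uses small made-up tables in place of the real node stores, so the ids and material names are illustrative only. One difference to be aware of: the inner join silently drops element symbols that are missing from the element table, whereas the dictionary lookup in the loop raises a KeyError.

```python
import pandas as pd

# Toy stand-ins for the material and element node tables (illustrative values).
material_df = pd.DataFrame(
    {
        "source_id": [0, 1, 2],
        "material_name": ["mat-a", "mat-b", "mat-c"],
        "core.elements": [["H", "O"], ["Fe"], None],
    }
)
element_df = pd.DataFrame({"target_id": [0, 7, 25], "symbol": ["H", "O", "Fe"]})

# One row per (material, element) pair; materials without elements are dropped first.
pairs = (
    material_df.dropna(subset=["core.elements"])
    .explode("core.elements")
    .rename(columns={"core.elements": "symbol"})
    .astype({"symbol": str})  # make the join key a plain string column
)

# The inner join keeps only symbols that exist in the element table.
edges = pairs.merge(element_df, on="symbol", how="inner")

# Fill in the constant edge fields and build the edge name.
edges["source_type"] = "material"
edges["target_type"] = "element"
edges["edge_type"] = "has"
edges["name"] = edges["material_name"] + "_has_" + edges["symbol"]
edges["weight"] = 1.0

edges = edges[
    ["source_id", "source_type", "target_id", "target_type", "edge_type", "name", "weight"]
].reset_index(drop=True)
print(edges)
```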

2. Add the Generator to the ParquetGraphDB

Now that we have defined the generator, we can add it to the ParquetGraphDB instance. We do this by calling the add_edge_generator method.

The edge generator will be stored in the edge_generator_store of the ParquetGraphDB instance.

[10]:
element_store = db.get_node_store("element")
material_store = db.get_node_store("material")

db.add_edge_generator(
    generator_func=material_element_has,
    generator_args={
        "material_store": material_store,
        "element_store": element_store,
    },
    generator_kwargs={},
    run_immediately=True,
)

Let's check the edge generator store.

[11]:
print(db.edge_generator_store)
============================================================
GENERATOR STORE SUMMARY
============================================================
• Number of generators: 1
Storage path: data\GraphDB\edge_generators


############################################################
METADATA
############################################################
• class: GeneratorStore
• class_module: parquetdb.graph.generator_store

############################################################
GENERATOR DETAILS
############################################################
• Columns:
    - generator_args.element_store
    - generator_args.material_store
    - generator_func
    - generator_name
    - id

• Generator names:
    - material_element_has

Let’s check whether the edge generator created the edges in the edge store.

[12]:
edge_store = db.get_edge_store("material_element_has")
print(edge_store)
============================================================
EDGE STORE SUMMARY
============================================================
Edge type: material_element_has
• Number of edges: 3348
• Number of features: 8
Storage path: data\GraphDB\edges\material_element_has


############################################################
METADATA
############################################################
• class: EdgeStore
• class_module: parquetdb.graph.edges

############################################################
EDGE DETAILS
############################################################

Updates to Node Stores

By default, when node and edge generators are added, the stores passed as their arguments are registered as dependencies in the ParquetGraphDB instance. This means that when a parent store is updated, the generator runs again and updates its corresponding store.

These dependencies are stored in the ParquetGraphDB/generator_dependency.json file.
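To make the idea of dependency-triggered regeneration concrete, here is a small conceptual sketch in plain Python. It is only an illustration of the pattern: the names `register` and `notify_update` are hypothetical, and this is not how parquetdb implements its dependency tracking.

```python
from collections import defaultdict

# Conceptual illustration only. All names here are hypothetical and do not
# reflect parquetdb internals.
dependencies = defaultdict(list)  # parent store name -> dependent generator names
generators = {}  # generator name -> callable that rebuilds its store
run_log = []  # records which generators were re-run


def register(name, func, parents):
    """Register a generator and the parent stores it depends on."""
    generators[name] = func
    for parent in parents:
        dependencies[parent].append(name)


def notify_update(store_name):
    """Re-run every generator that lists store_name as a parent."""
    for name in dependencies[store_name]:
        generators[name]()


register(
    "material_element_has",
    lambda: run_log.append("material_element_has"),
    parents=["material", "element"],
)

notify_update("material")  # triggers the dependent edge generator
notify_update("unrelated_store")  # nothing depends on this store
print(run_log)
```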

[14]:
materials_df = db.read_nodes(node_type="material", columns=["id"], ids=[0]).to_pandas()
print(materials_df)

db.delete_nodes(node_type="material",ids=[0])

materials_df = db.read_nodes(node_type="material", columns=["id"], ids=[0]).to_pandas()
print(materials_df)
   id
0   0
Empty DataFrame
Columns: [id]
Index: []

As you can see, the material node with id=0 is now gone.

Let’s check the material_element_has edges to see if the update has been propagated.

[15]:
edge_store = db.get_edge_store("material_element_has")
print(edge_store)
============================================================
EDGE STORE SUMMARY
============================================================
Edge type: material_element_has
• Number of edges: 3345
• Number of features: 8
Storage path: data\GraphDB\edges\material_element_has


############################################################
METADATA
############################################################
• class: EdgeStore
• class_module: parquetdb.graph.edges

############################################################
EDGE DETAILS
############################################################

There are now 3345 material_element_has edges, down from 3348 before the deletion.

Let’s check the material_element_has dataframe.

[16]:
df = edge_store.read_edges().to_pandas()
print(df)
     edge_type    id               name  source_id source_type  target_id  \
0          has     0   mp-1222351_has_F          1    material          8
1          has     1  mp-1222351_has_Fe          1    material         25
2          has     2  mp-1222351_has_Li          1    material          2
3          has     3    mp-651087_has_F          2    material          8
4          has     4   mp-651087_has_Gd          2    material         63
...        ...   ...                ...        ...         ...        ...
3340       has  3340  mp-2714707_has_Al        999    material         12
3341       has  3341  mp-2714707_has_Na        999    material         10
3342       has  3342   mp-2714707_has_O        999    material          7
3343       has  3343   mp-2714707_has_S        999    material         15
3344       has  3344  mp-2714707_has_Zn        999    material         29

     target_type  weight
0        element     1.0
1        element     1.0
2        element     1.0
3        element     1.0
4        element     1.0
...          ...     ...
3340     element     1.0
3341     element     1.0
3342     element     1.0
3343     element     1.0
3344     element     1.0

[3345 rows x 8 columns]

As you can see, no edge has a source_id of 0.

We can double check this with the following:

[20]:
df[df["source_id"] == 0]
[20]:
edge_type id name source_id source_type target_id target_type weight

The result is empty, as expected.
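A more general check is to confirm that every edge points at a node that still exists, rather than testing a single id. The sketch below does this with pandas `isin` on small made-up frames; with the real data, the id column from the material node store and the edge dataframe `df` would take their place.

```python
import pandas as pd

# Toy stand-ins for the remaining material ids and the edge table (illustrative values).
material_ids = pd.Series([1, 2, 3], name="id")
edges = pd.DataFrame({"source_id": [1, 1, 2, 3], "target_id": [8, 25, 8, 63]})

# Edges whose source node no longer exists in the material store.
dangling = edges[~edges["source_id"].isin(material_ids)]
print(dangling.empty)

# For contrast, an edge table that still references the deleted id 0.
stale_edges = pd.DataFrame({"source_id": [0, 1], "target_id": [8, 25]})
stale_dangling = stale_edges[~stale_edges["source_id"].isin(material_ids)]
print(len(stale_dangling))
```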

[21]:
print(db)
============================================================
GRAPH DATABASE SUMMARY
============================================================
Name: GraphDB
Storage path: data\GraphDB
└── Repository structure:
    ├── nodes/                 (data\GraphDB\nodes)
    ├── edges/                 (data\GraphDB\edges)
    ├── edge_generators/       (data\GraphDB\edge_generators)
    ├── node_generators/       (data\GraphDB\node_generators)
    └── graph/                 (data\GraphDB\graph)

############################################################
NODE DETAILS
############################################################
Total node types: 2
------------------------------------------------------------
• Node type: material
  - Number of nodes: 999
  - Number of features: 136
  - db_path: data\GraphDB\nodes\material
------------------------------------------------------------
• Node type: element
  - Number of nodes: 118
  - Number of features: 99
  - db_path: data\GraphDB\nodes\element
------------------------------------------------------------

############################################################
EDGE DETAILS
############################################################
Total edge types: 1
------------------------------------------------------------
• Edge type: material_element_has
  - Number of edges: 3345
  - Number of features: 8
  - db_path: data\GraphDB\edges\material_element_has
------------------------------------------------------------

############################################################
NODE GENERATOR DETAILS
############################################################
Total node generators: 1
------------------------------------------------------------
• Generator: element
Generator Args:
  - generator_func: [<function wrapper at 0x000002CDBFC9D1B0>]
  - generator_kwargs.base_file: [WindowsPath('data/elements.parquet')]
  - generator_name: ['element']
  - id: [0]
Generator Kwargs:
  - base_file: [WindowsPath('data/elements.parquet')]
------------------------------------------------------------

############################################################
EDGE GENERATOR DETAILS
############################################################
Total edge generators: 1
------------------------------------------------------------
• Generator: material_element_has
Generator Args:
  - element_store: data\GraphDB\nodes\element
  - material_store: data\GraphDB\nodes\material
Generator Kwargs:
------------------------------------------------------------

6. Summary

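In this notebook, we showed how to define custom node and edge generators, add them to a ParquetGraphDB instance, and run them. We also saw how an update to a parent store propagates to the stores that depend on it.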