[ ]:
!pip install parquetdb
!pip install ipykernel
03 - Graph Generators in ParquetGraphDB
In this notebook, we’ll learn how to:
Create a node generator
Add the node generator to the graph
Create an edge generator
Add the edge generator to the graph
Define dependencies between generators
We’ll use the ParquetGraphDB class from parquetdb to demonstrate these features. If you haven’t already installed parquetdb, run the previous cell.
Example Scenario: Modeling Materials Data
Let’s explore how parquetdb generators can build and maintain a graph using a materials science scenario. Materials, at their core, are described by their structure and the chemical elements they contain (their composition).
We can represent this information effectively using a heterograph:
Nodes representing Materials (like \(H_2O\) or \(Fe\)).
Nodes representing Elements (like \(H\), \(O\), \(Fe\)).
Edges showing which Elements make up which Materials.
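As a rough illustration, the three components listed above can be pictured as two node tables plus one edge table. The sketch below uses plain Python structures rather than parquetdb; the ids, the `formula` and `symbol` field names, and the `has` edge label are assumptions chosen for illustration only.

```python
# Toy heterograph: two node types and one edge type.
# All ids and field names here are illustrative assumptions.
materials = [
    {"id": 0, "formula": "H2O", "elements": ["H", "O"]},
    {"id": 1, "formula": "Fe", "elements": ["Fe"]},
]
elements = [
    {"id": 0, "symbol": "H"},
    {"id": 1, "symbol": "O"},
    {"id": 2, "symbol": "Fe"},
]

# Look up element ids by symbol so each material can point at its elements.
symbol_to_id = {e["symbol"]: e["id"] for e in elements}

# One edge per (material, element) pair: material --has--> element.
edges = [
    {
        "source_id": m["id"],
        "source_type": "material",
        "target_id": symbol_to_id[s],
        "target_type": "element",
        "edge_type": "has",
    }
    for m in materials
    for s in m["elements"]
]

for edge in edges:
    print(edge)
```

Keeping the two node types in separate tables is what makes this a heterograph: every edge has to name both the id and the type of each endpoint.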
The real power of generators becomes apparent when considering how this data originates and evolves. Material definitions might come from external files or databases, and the properties of elements might be sourced separately.
Generators provide a robust mechanism to:
Ingest and process this source data into graph nodes and edges.
Establish dependencies. For instance, the creation of material-element edges depends on both Material and Element nodes existing first.
Automate updates. If the input file defining materials changes, or if an element’s properties are updated in its source, generators allow parquetdb to rebuild the affected parts of the graph automatically, keeping it consistent.
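To make the dependency idea concrete, here is a minimal sketch of how a run order could be derived from declared dependencies. It uses the standard library's `graphlib` and is not how parquetdb resolves dependencies internally; the `dependencies` mapping and the generator names in it are assumptions for illustration.

```python
from graphlib import TopologicalSorter

# Hypothetical dependency map: each key depends on the names in its set.
# The names are illustrative and not taken from parquetdb.
dependencies = {
    "material": set(),
    "element": set(),
    "material_element_has": {"material", "element"},
}

# static_order() yields every name only after all of its dependencies.
run_order = list(TopologicalSorter(dependencies).static_order())
print(run_order)
```

Whatever order the two node generators come out in, the edge generator is always last, which is the guarantee the material-element edges need.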
We’ll now set up this example, starting with the data sources for elements and materials.
Setup
[2]:
import os
import shutil
import requests
import io
from pathlib import Path
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
def download_url(url, save_path):
    # Download the parquet file
    response = requests.get(url)
    if response.status_code == 200:
        # Load the downloaded bytes into a PyArrow Table
        parquet_file = io.BytesIO(response.content)
        periodic_table = pq.read_table(parquet_file)
        print(f"Downloaded periodic table data with {len(periodic_table)} elements")
    else:
        # Exceptions must derive from BaseException; a bare string cannot be raised
        raise RuntimeError("Could not download data")
    # Save a local copy so the generators can read it later
    pq.write_table(periodic_table, save_path)
FILE_DIR = Path(".")
DATA_DIR = FILE_DIR / "data"
if DATA_DIR.exists():
    shutil.rmtree(DATA_DIR)
DATA_DIR.mkdir(parents=True, exist_ok=True)
# URL to the raw data file in the GitHub repository
elements_url = "https://github.com/lllangWV/ParquetDB/raw/GraphDB/tests/graph/data/interim_periodic_table_values.parquet"
materials_url = "https://github.com/lllangWV/ParquetDB/raw/GraphDB/tests/graph/data/materials/materials_0.parquet"
elements_file = DATA_DIR / "elements.parquet"
materials_file = DATA_DIR / "materials.parquet"
download_url(elements_url,elements_file)
download_url(materials_url,materials_file)
Downloaded periodic table data with 118 elements
Downloaded periodic table data with 1000 elements
[3]:
elements_table = pq.read_table(elements_file)
materials_table = pq.read_table(materials_file)
print(elements_table)
print(materials_table)
pyarrow.Table
long_name: string
symbol: string
abundance_universe: double
abundance_solar: double
abundance_meteor: double
abundance_crust: double
abundance_ocean: double
abundance_human: double
adiabatic_index: string
allotropes: string
appearance: string
atomic_mass: double
atomic_number: int64
block: string
boiling_point: double
classifications_cas_number: string
classifications_cid_number: string
classifications_rtecs_number: string
classifications_dot_numbers: string
classifications_dot_hazard_class: double
conductivity_thermal: double
cpk_hex: string
critical_pressure: double
critical_temperature: double
crystal_structure: string
density_stp: double
discovered_year: int64
discovered_by: string
discovered_location: string
electron_affinity: double
electron_configuration: string
electron_configuration_semantic: string
electronegativity_pauling: double
energy_levels: string
gas_phase: string
group: int64
extended_group: int64
half_life: string
heat_specific: double
heat_vaporization: double
heat_fusion: double
heat_molar: double
isotopes_known: string
isotopes_stable: string
isotopic_abundances: string
lattice_angles: string
lattice_constants: string
lifetime: string
magnetic_susceptibility_mass: double
magnetic_susceptibility_molar: double
magnetic_susceptibility_volume: double
magnetic_type: string
melting_point: double
molar_volume: double
neutron_cross_section: double
neutron_mass_absorption: double
oxidation_states: string
period: int64
phase: string
quantum_numbers: string
radius_calculated: double
radius_empirical: double
radius_covalent: double
radius_vanderwaals: double
refractive_index: double
series: string
source: string
space_group_name: string
space_group_number: double
speed_of_sound: double
summary: string
valence_electrons: double
conductivity_electric: double
electrical_resistivity: double
electrical_type: string
modulus_bulk: double
modulus_shear: double
modulus_young: double
poisson_ratio: double
coefficient_of_linear_thermal_expansion: double
hardness_vickers: double
hardness_brinell: double
hardness_mohs: double
superconduction_temperature: double
is_actinoid: bool
is_alkali: bool
is_alkaline: bool
is_chalcogen: bool
is_halogen: bool
is_lanthanoid: bool
is_metal: bool
is_metalloid: bool
is_noble_gas: bool
is_post_transition_metal: bool
is_quadrupolar: bool
is_rare_earth_metal: bool
experimental_oxidation_states: string
ionization_energies: string
----
long_name: [["Hydrogen","Helium","Lithium","Beryllium","Boron",...,"Flerovium","Moscovium","Livermorium","Tennessine","Oganesson"]]
symbol: [["H","He","Li","Be","B",...,"Fl","Mc","Lv","Ts","Og"]]
abundance_universe: [[75,23,6e-7,1e-7,1e-7,...,0,0,0,0,0]]
abundance_solar: [[75,23,6e-9,1e-8,2e-7,...,0,0,0,0,0]]
abundance_meteor: [[2.4,null,0.00017,0.0000029,0.00016,...,0,0,0,0,0]]
abundance_crust: [[0.15,5.5e-7,0.0017,0.00019,0.00086,...,0,0,0,0,0]]
abundance_ocean: [[11,7.2e-10,0.000018,6e-11,0.00044,...,0,0,0,0,0]]
abundance_human: [[10,null,0.000003,4e-8,0.00007,...,0,0,0,0,0]]
adiabatic_index: [["5-Jul","3-May",null,null,null,...,null,null,null,null,null]]
allotropes: [["Dihydrogen",null,null,null,"Alpha Rhombohedral Boron, Beta Rhombohedral Boron, Alpha Tetragonal Boron",...,null,null,null,null,null]]
...
pyarrow.Table
bonding.cutoff_method.bond_connections: list<element: list<element: int64>>
child 0, element: list<element: int64>
child 0, element: int64
bonding.electric_consistent.bond_connections: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
bonding.electric_consistent.bond_orders: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
bonding.geometric_consistent.bond_connections: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
bonding.geometric_consistent.bond_orders: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
bonding.geometric_electric_consistent.bond_connections: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
bonding.geometric_electric_consistent.bond_orders: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
chargemol.bond_connections: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
chargemol.bond_orders: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
chargemol.cubed_moments: list<element: double>
child 0, element: double
chargemol.fourth_moments: list<element: double>
child 0, element: double
chargemol.squared_moments: list<element: double>
child 0, element: double
chemenv.coordination_environments_multi_weight: list<element: list<element: struct<ce_fraction: double, ce_symbol: string, csm: double, permutation: list<element: int64>>>>
child 0, element: list<element: struct<ce_fraction: double, ce_symbol: string, csm: double, permutation: list<element: int64>>>
child 0, element: struct<ce_fraction: double, ce_symbol: string, csm: double, permutation: list<element: int64>>
child 0, ce_fraction: double
child 1, ce_symbol: string
child 2, csm: double
child 3, permutation: list<element: int64>
child 0, element: int64
chemenv.coordination_multi_connections: list<element: list<element: int64>>
child 0, element: list<element: int64>
child 0, element: int64
chemenv.coordination_multi_numbers: list<element: int64>
child 0, element: int64
core.atomic_numbers: list<element: int64>
child 0, element: int64
core.cartesian_coords: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
core.density: double
core.density_atomic: double
core.elements: list<element: string>
child 0, element: string
core.energy_per_atom: double
core.formula: string
core.formula_pretty: string
core.frac_coords: list<element: list<element: double>>
child 0, element: list<element: double>
child 0, element: double
core.is_gap_direct: bool
core.is_magnetic: bool
core.is_metal: bool
core.is_stable: bool
core.lattice: extension<arrow.fixed_shape_tensor[value_type=double, shape=[3,3]]>
core.material_id: string
core.nelements: int64
core.nsites: int64
core.species: list<element: string>
child 0, element: string
core.volume: double
dielectric.e_electronic: double
dielectric.e_ij_max: double
dielectric.e_ionic: double
dielectric.e_total: double
dielectric.n: double
elasticity.compliance_tensor_ieee_format: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.compliance_tensor_raw: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.debye_temperature: double
elasticity.elastic_tensor_ieee_format: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.elastic_tensor_raw: extension<arrow.fixed_shape_tensor[value_type=double, shape=[6,6]]>
elasticity.g_reuss: double
elasticity.g_voigt: double
elasticity.g_vrh: double
elasticity.homogeneous_poisson: double
elasticity.k_reuss: double
elasticity.k_voigt: double
elasticity.k_vrh: double
elasticity.order: int64
elasticity.sound_velocity_acoustic: double
elasticity.sound_velocity_longitudinal: double
elasticity.sound_velocity_optical: double
elasticity.sound_velocity_total: double
elasticity.sound_velocity_transverse: double
elasticity.state: string
elasticity.thermal_conductivity_cahill: double
elasticity.thermal_conductivity_clarke: double
elasticity.universal_anisotropy: double
elasticity.warnings: list<element: string>
child 0, element: string
elasticity.young_modulus: null
electronic_structure.band_gap: double
electronic_structure.cbm: double
electronic_structure.dos_energy_up: null
electronic_structure.efermi: double
electronic_structure.vbm: double
feature_vectors.element_fraction: extension<arrow.fixed_shape_tensor[value_type=double, shape=[103]]>
feature_vectors.element_property: extension<arrow.fixed_shape_tensor[value_type=double, shape=[132]]>
feature_vectors.sine_coulomb_matrix: extension<arrow.fixed_shape_tensor[value_type=double, shape=[432]]>
feature_vectors.xrd_pattern: extension<arrow.fixed_shape_tensor[value_type=double, shape=[128]]>
grain_boundaries.grain_boundaries: list<element: struct<gb_energy: double, rotation_angle: double, sigma: int64, type: string>>
child 0, element: struct<gb_energy: double, rotation_angle: double, sigma: int64, type: string>
child 0, gb_energy: double
child 1, rotation_angle: double
child 2, sigma: int64
child 3, type: string
has_props.absorption: bool
has_props.bandstructure: bool
has_props.charge_density: bool
has_props.chemenv: bool
has_props.dielectric: bool
has_props.dos: bool
has_props.elasticity: bool
has_props.electronic_structure: bool
has_props.eos: bool
has_props.grain_boundaries: bool
has_props.insertion_electrodes: bool
has_props.magnetism: bool
has_props.materials: bool
has_props.oxi_states: bool
has_props.phonon: bool
has_props.piezoelectric: bool
has_props.provenance: bool
has_props.substrates: bool
has_props.surface_properties: bool
has_props.thermo: bool
has_props.xas: bool
id: int64
magnetism.num_magnetic_sites: int64
magnetism.num_unique_magnetic_sites: int64
magnetism.ordering: string
magnetism.total_magnetization: double
magnetism.total_magnetization_normalized_vol: double
magnetism.types_of_magnetic_species: list<element: string>
child 0, element: string
metadata.last_updated: string
metadata.theoretical: bool
oxidation_states.method: string
oxidation_states.possible_species: list<element: string>
child 0, element: string
oxidation_states.possible_valences: list<element: double>
child 0, element: double
structure.@class: string
structure.@module: string
structure.charge: double
structure.lattice.a: double
structure.lattice.alpha: double
structure.lattice.b: double
structure.lattice.beta: double
structure.lattice.c: double
structure.lattice.gamma: double
structure.lattice.matrix: extension<arrow.fixed_shape_tensor[value_type=double, shape=[3,3]]>
structure.lattice.pbc: extension<arrow.fixed_shape_tensor[value_type=bool, shape=[3]]>
structure.lattice.volume: double
structure.sites: list<element: struct<abc: list<element: double>, label: string, properties: struct<magmom: double>, species: list<element: struct<element: string, occu: int64>>, xyz: list<element: double>>>
child 0, element: struct<abc: list<element: double>, label: string, properties: struct<magmom: double>, species: list<element: struct<element: string, occu: int64>>, xyz: list<element: double>>
child 0, abc: list<element: double>
child 0, element: double
child 1, label: string
child 2, properties: struct<magmom: double>
child 0, magmom: double
child 3, species: list<element: struct<element: string, occu: int64>>
child 0, element: struct<element: string, occu: int64>
child 0, element: string
child 1, occu: int64
child 4, xyz: list<element: double>
child 0, element: double
surface_properties.shape_factor: double
surface_properties.surface_anisotropy: double
surface_properties.weighted_surface_energy: double
surface_properties.weighted_surface_energy_EV_PER_ANG2: double
surface_properties.weighted_work_function: double
symmetry.crystal_system: string
symmetry.number: int64
symmetry.point_group: string
symmetry.symbol: string
symmetry.symprec: double
symmetry.version: string
symmetry.wyckoffs: list<element: string>
child 0, element: string
thermo.decomposes_to: list<element: struct<amount: double, formula: string, material_id: string>>
child 0, element: struct<amount: double, formula: string, material_id: string>
child 0, amount: double
child 1, formula: string
child 2, material_id: string
thermo.energy_above_hull: double
thermo.equilibrium_reaction_energy_per_atom: double
thermo.formation_energy_per_atom: double
thermo.uncorrected_energy_per_atom: double
----
bonding.cutoff_method.bond_connections: [[null,null,...,null,null]]
bonding.electric_consistent.bond_connections: [[[[],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[],[],...,[2],[2]],...,[[8,9,10,11,12,16,16,19,19],[8,9,10,11,13,17,17,18,18],...,[1,1,2,2,4,7,13,13,14],[0,0,3,3,5,6,12,12,15]],null]]
bonding.electric_consistent.bond_orders: [[[[],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[],[],...,[0.5889],[0.5889]],...,[[0.2795,0.2883,0.2883,0.2795,0.2481,0.307,0.3419,0.311,0.311],[0.2883,0.2795,0.2795,0.2883,0.2481,0.3419,0.307,0.311,0.311],...,[0.311,0.311,0.307,0.3419,0.5776,0.5776,0.1163,0.1163,0.4588],[0.311,0.311,0.3419,0.307,0.5776,0.5776,0.1163,0.1163,0.4588]],null]]
bonding.geometric_consistent.bond_connections: [[[[2,2,2,2,3,...,3,4,4,4,4],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[6,7,12,13,10,11],[4,5,14,15,8,9],...,[2,1],[2,1]],...,[[16,19,19,16,9,10,8,11,12],[17,18,18,17,8,11,9,10,13],...,[4,7,14,2,1,1,2],[5,6,15,3,0,0,3]],null]]
bonding.geometric_consistent.bond_orders: [[[[0.0734,0.0734,0.0734,0.0734,0.0734,...,0.0734,0.0734,0.0734,0.0734,0.0734],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[0.0842,0.0842,0.0707,0.0707,0.0619,0.0619],[0.0842,0.0842,0.0707,0.0707,0.0619,0.0619],...,[0.5889,0.0707],[0.5889,0.0707]],...,[[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],...,[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307],[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307]],null]]
bonding.geometric_electric_consistent.bond_connections: [[[[],[2,2,3,3,4,4],[1,1],[1,1],[1,1]],[[],[],...,[2],[2]],...,[[16,19,19,16,9,10,8,11,12],[17,18,18,17,8,11,9,10,13],...,[4,7,14,2,1,1,2],[5,6,15,3,0,0,3]],null]]
bonding.geometric_electric_consistent.bond_orders: [[[[],[0.6945,0.6945,0.6945,0.6945,0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945],[0.6945,0.6945]],[[],[],...,[0.5889],[0.5889]],...,[[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],[0.3419,0.311,0.311,0.307,0.2883,0.2883,0.2795,0.2795,0.2481],...,[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307],[0.5776,0.5776,0.4588,0.3419,0.311,0.311,0.307]],null]]
chargemol.bond_connections: [[[[0,0,0,0,0,...,3,4,4,4,4],[0,0,0,0,0,...,2,3,3,4,4],[0,0,0,0,1,...,3,4,4,4,4],[0,0,0,0,1,...,3,4,4,4,4],[0,0,0,0,1,...,3,4,4,4,4]],[[2,3,3,3,3,...,9,10,11,12,13],[2,2,2,2,3,...,9,10,11,14,15],...,[1,2,2,2,3,...,9,11,15,15,15],[1,2,2,2,3,...,9,10,14,14,14]],...,[[3,3,3,3,5,...,15,16,16,19,19],[2,2,2,2,4,...,14,17,17,18,18],...,[1,1,2,2,4,...,14,17,17,17,17],[0,0,3,3,4,...,15,16,16,16,16]],null]]
chargemol.bond_orders: [[[[0.0033,0.0033,0.0033,0.0033,0.0033,...,0.0734,0.0734,0.0734,0.0734,0.0734],[0.0155,0.0155,0.0155,0.0155,0.0155,...,0.6945,0.6945,0.6945,0.6945,0.6945],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0347,0.0347,0.0347,0.0347,0.0347],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0017,0.0347,0.0347,0.0347,0.0347],[0.0734,0.0734,0.0734,0.0734,0.6945,...,0.0347,0.0017,0.0017,0.0017,0.0017]],[[0.0049,0.0011,0.0011,0.0011,0.0011,...,0.002,0.0619,0.0619,0.0707,0.0707],[0.0011,0.0011,0.0011,0.0011,0.0049,...,0.0619,0.002,0.002,0.0707,0.0707],...,[0.0707,0.5889,0.0013,0.0013,0.0046,...,0.0271,0.0435,0.0076,0.0076,0.0438],[0.0707,0.0013,0.0013,0.5889,0.0046,...,0.0259,0.0435,0.0076,0.0076,0.0438]],...,[[0.0167,0.0179,0.0167,0.0179,0.0612,...,0.002,0.307,0.3419,0.311,0.311],[0.0179,0.0167,0.0179,0.0167,0.0612,...,0.002,0.3419,0.307,0.311,0.311],...,[0.311,0.311,0.307,0.3419,0.5776,...,0.4588,0.0302,0.0405,0.0302,0.0405],[0.311,0.311,0.3419,0.307,0.0043,...,0.0265,0.0405,0.0302,0.0405,0.0302]],null]]
chargemol.cubed_moments: [[[71.579692,117.080036,41.119066,41.119066,41.119066],[4.090622,4.090516,41.32723,41.326694,22.622699,...,22.349214,22.449562,22.449562,22.449615,22.449615],...,[144.779124,144.779125,144.779138,144.779139,82.876595,...,168.458199,194.520457,194.520458,194.520409,194.52041],null]]
...
Next, we load the materials data into ParquetGraphDB.
[4]:
from parquetdb import ParquetGraphDB
# Create a temporary directory for our database
GRAPH_DB_DIR = DATA_DIR / "GraphDB"
if GRAPH_DB_DIR.exists():
    shutil.rmtree(GRAPH_DB_DIR)
GRAPH_DB_DIR.mkdir(parents=True, exist_ok=True)
# Initialize ParquetGraphDB
db = ParquetGraphDB(storage_path=GRAPH_DB_DIR)
# The data already has an id column from a previous database; we have to remove it
data = pq.read_table(materials_file)
data = data.drop_columns("id")
db.add_nodes(node_type="material", data=data)
print(db.summary(show_column_names=True))
============================================================
GRAPH DATABASE SUMMARY
============================================================
Name: GraphDB
Storage path: data\GraphDB
└── Repository structure:
├── nodes/ (data\GraphDB\nodes)
├── edges/ (data\GraphDB\edges)
├── edge_generators/ (data\GraphDB\edge_generators)
├── node_generators/ (data\GraphDB\node_generators)
└── graph/ (data\GraphDB\graph)
############################################################
NODE DETAILS
############################################################
Total node types: 1
------------------------------------------------------------
• Node type: material
- Number of nodes: 1000
- Number of features: 136
- Columns:
- bonding.cutoff_method.bond_connections
- bonding.electric_consistent.bond_connections
- bonding.electric_consistent.bond_orders
- bonding.geometric_consistent.bond_connections
- bonding.geometric_consistent.bond_orders
- bonding.geometric_electric_consistent.bond_connections
- bonding.geometric_electric_consistent.bond_orders
- chargemol.bond_connections
- chargemol.bond_orders
- chargemol.cubed_moments
- chargemol.fourth_moments
- chargemol.squared_moments
- chemenv.coordination_environments_multi_weight
- chemenv.coordination_multi_connections
- chemenv.coordination_multi_numbers
- core.atomic_numbers
- core.cartesian_coords
- core.density
- core.density_atomic
- core.elements
- core.energy_per_atom
- core.formula
- core.formula_pretty
- core.frac_coords
- core.is_gap_direct
- core.is_magnetic
- core.is_metal
- core.is_stable
- core.lattice
- core.material_id
- core.nelements
- core.nsites
- core.species
- core.volume
- dielectric.e_electronic
- dielectric.e_ij_max
- dielectric.e_ionic
- dielectric.e_total
- dielectric.n
- elasticity.compliance_tensor_ieee_format
- elasticity.compliance_tensor_raw
- elasticity.debye_temperature
- elasticity.elastic_tensor_ieee_format
- elasticity.elastic_tensor_raw
- elasticity.g_reuss
- elasticity.g_voigt
- elasticity.g_vrh
- elasticity.homogeneous_poisson
- elasticity.k_reuss
- elasticity.k_voigt
- elasticity.k_vrh
- elasticity.order
- elasticity.sound_velocity_acoustic
- elasticity.sound_velocity_longitudinal
- elasticity.sound_velocity_optical
- elasticity.sound_velocity_total
- elasticity.sound_velocity_transverse
- elasticity.state
- elasticity.thermal_conductivity_cahill
- elasticity.thermal_conductivity_clarke
- elasticity.universal_anisotropy
- elasticity.warnings
- elasticity.young_modulus
- electronic_structure.band_gap
- electronic_structure.cbm
- electronic_structure.dos_energy_up
- electronic_structure.efermi
- electronic_structure.vbm
- feature_vectors.element_fraction
- feature_vectors.element_property
- feature_vectors.sine_coulomb_matrix
- feature_vectors.xrd_pattern
- grain_boundaries.grain_boundaries
- has_props.absorption
- has_props.bandstructure
- has_props.charge_density
- has_props.chemenv
- has_props.dielectric
- has_props.dos
- has_props.elasticity
- has_props.electronic_structure
- has_props.eos
- has_props.grain_boundaries
- has_props.insertion_electrodes
- has_props.magnetism
- has_props.materials
- has_props.oxi_states
- has_props.phonon
- has_props.piezoelectric
- has_props.provenance
- has_props.substrates
- has_props.surface_properties
- has_props.thermo
- has_props.xas
- id
- magnetism.num_magnetic_sites
- magnetism.num_unique_magnetic_sites
- magnetism.ordering
- magnetism.total_magnetization
- magnetism.total_magnetization_normalized_vol
- magnetism.types_of_magnetic_species
- metadata.last_updated
- metadata.theoretical
- oxidation_states.method
- oxidation_states.possible_species
- oxidation_states.possible_valences
- structure.@class
- structure.@module
- structure.charge
- structure.lattice.a
- structure.lattice.alpha
- structure.lattice.b
- structure.lattice.beta
- structure.lattice.c
- structure.lattice.gamma
- structure.lattice.matrix
- structure.lattice.pbc
- structure.lattice.volume
- structure.sites
- surface_properties.shape_factor
- surface_properties.surface_anisotropy
- surface_properties.weighted_surface_energy
- surface_properties.weighted_surface_energy_EV_PER_ANG2
- surface_properties.weighted_work_function
- symmetry.crystal_system
- symmetry.number
- symmetry.point_group
- symmetry.symbol
- symmetry.symprec
- symmetry.version
- symmetry.wyckoffs
- thermo.decomposes_to
- thermo.energy_above_hull
- thermo.equilibrium_reaction_energy_per_atom
- thermo.formation_energy_per_atom
- thermo.uncorrected_energy_per_atom
- db_path: data\GraphDB\nodes\material
------------------------------------------------------------
############################################################
EDGE DETAILS
############################################################
Total edge types: 0
------------------------------------------------------------
############################################################
NODE GENERATOR DETAILS
############################################################
Total node generators: 0
------------------------------------------------------------
############################################################
EDGE GENERATOR DETAILS
############################################################
Total edge generators: 0
------------------------------------------------------------
Generators
A Generator is a callable (function) that returns a PyArrow Table of either nodes or edges. By adding a generator to ParquetGraphDB, you can:
Register the generator, so it can be re-run on demand.
Optionally specify arguments/kwargs to pass into the generator.
Automatically store the output in a NodeStore or EdgeStore with the same name as the generator function (or a custom name, if you prefer).
This is especially handy for generating nodes from external data sources or from computational routines.
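The register, store-arguments, and re-run steps described above can be sketched with a small stand-alone registry. This is a hypothetical illustration of the idea, not parquetdb's implementation: the `GeneratorRegistry` class, its method names, and the list-of-dicts output are all assumptions.

```python
class GeneratorRegistry:
    """Hypothetical registry; NOT parquetdb's implementation."""

    def __init__(self):
        self._generators = {}  # name -> (function, args, kwargs)
        self.outputs = {}      # name -> most recent output

    def add(self, func, args=(), kwargs=None, name=None, run_immediately=True):
        # Default to the function name, mirroring the naming rule above.
        name = name or func.__name__
        self._generators[name] = (func, tuple(args), dict(kwargs or {}))
        if run_immediately:
            self.run(name)
        return name

    def run(self, name):
        # Re-run on demand using the arguments stored at registration time.
        func, args, kwargs = self._generators[name]
        self.outputs[name] = func(*args, **kwargs)
        return self.outputs[name]


def element(symbols=("H", "O")):
    """Stand-in generator returning rows as a list of dicts."""
    return [{"symbol": s} for s in symbols]


registry = GeneratorRegistry()
registry.add(element, kwargs={"symbols": ("H", "O", "Fe")}, run_immediately=False)
rows = registry.run("element")  # no arguments needed: they come from the registry
print(rows)
```

Storing the arguments next to the function is what makes a later re-run reproducible without the caller having to remember how the data was first generated.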
In the following sections we will create custom node and edge generators. These can be created by wrapping existing functions with the node_generator or edge_generator decorators.
They can be imported as follows:
from parquetdb import node_generator, edge_generator
Element Node Generator
1. Define the Generator
In our first example, we will create a node generator that creates element nodes.
As mentioned above, to create a node generator we wrap an existing function with the node_generator decorator. The function name becomes the name of the node type.
@node_generator
def element():
    ...
For this example, we will load periodic table data: a dataframe with 118 rows, one for each element of the periodic table. The generator also applies some transformations to make the data more useful for our purposes.
[5]:
### Element Node Generator
from parquetdb import node_generator

# Define the generator with the @node_generator decorator
@node_generator
def element(base_file=elements_file):
    """
    Creates Element nodes from a local file (CSV or Parquet).
    Returns a Pandas DataFrame (or PyArrow Table) with one row per element.
    """
    try:
        # Read the file
        file_ext = os.path.splitext(base_file)[-1][1:].lower()  # e.g. "parquet" or "csv"
        if file_ext == "parquet":
            df = pd.read_parquet(base_file)
        elif file_ext == "csv":
            df = pd.read_csv(base_file)
        else:
            raise ValueError("base_file must be a parquet or csv file")
        # Apply some transformations
        # Example transformations
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: x.replace("]", "").replace("[", "")
        )
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: ",".join(x.split())
        )
        df["oxidation_states"] = df["oxidation_states"].apply(
            lambda x: eval("[" + x + "]")
        )
        df["experimental_oxidation_states"] = df["experimental_oxidation_states"].apply(
            lambda x: eval(x)
        )
        df["ionization_energies"] = df["ionization_energies"].apply(lambda x: eval(x))
    except Exception as e:
        print(f"Error reading element file: {e}")
        return None
    return df  # Return the transformed dataframe

df = element()
print(df)
long_name symbol abundance_universe abundance_solar \
0 Hydrogen H 7.500000e+01 7.500000e+01
1 Helium He 2.300000e+01 2.300000e+01
2 Lithium Li 6.000000e-07 6.000000e-09
3 Beryllium Be 1.000000e-07 1.000000e-08
4 Boron B 1.000000e-07 2.000000e-07
.. ... ... ... ...
113 Flerovium Fl 0.000000e+00 0.000000e+00
114 Moscovium Mc 0.000000e+00 0.000000e+00
115 Livermorium Lv 0.000000e+00 0.000000e+00
116 Tennessine Ts 0.000000e+00 0.000000e+00
117 Oganesson Og 0.000000e+00 0.000000e+00
abundance_meteor abundance_crust abundance_ocean abundance_human \
0 2.400000 1.500000e-01 1.100000e+01 1.000000e+01
1 NaN 5.500000e-07 7.200000e-10 NaN
2 0.000170 1.700000e-03 1.800000e-05 3.000000e-06
3 0.000003 1.900000e-04 6.000000e-11 4.000000e-08
4 0.000160 8.600000e-04 4.400000e-04 7.000000e-05
.. ... ... ... ...
113 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
114 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
115 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
116 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
117 0.000000 0.000000e+00 0.000000e+00 0.000000e+00
adiabatic_index allotropes ... \
0 5-Jul Dihydrogen ...
1 3-May None ...
2 None None ...
3 None None ...
4 None Alpha Rhombohedral Boron, Beta Rhombohedral Bo... ...
.. ... ... ...
113 None None ...
114 None None ...
115 None None ...
116 None None ...
117 None None ...
is_halogen is_lanthanoid is_metal is_metalloid is_noble_gas \
0 False False False False False
1 False False False False True
2 False False True False False
3 False False True False False
4 False False False True False
.. ... ... ... ... ...
113 False False False False False
114 False False False False False
115 False False False False False
116 False False False False False
117 False False False False True
is_post_transition_metal is_quadrupolar is_rare_earth_metal \
0 False True False
1 False False False
2 False True False
3 False True False
4 False True False
.. ... ... ...
113 False False False
114 False False False
115 False False False
116 False False False
117 False False False
experimental_oxidation_states ionization_energies
0 [] [1312.0]
1 [] [2372.3, 5250.5]
2 [1] [520.2, 7298.1, 11815.0]
3 [2] [899.5, 1757.1, 14848.7, 21006.6]
4 [3] [800.6, 2427.1, 3659.7, 25025.8, 32826.7]
.. ... ...
113 [2] [832.2, 1600.0, 3370.0, 4400.0, 5850.0]
114 [3] [538.3, 1760.0, 2650.0, 4680.0, 5720.0]
115 [-2] [663.9, 1330.0, 2850.0, 3810.0, 6080.0]
116 [-1] [736.9, 1435.4, 2161.9, 4012.9, 5076.4]
117 [] [860.1, 1560.0]
[118 rows x 98 columns]
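The oxidation-state transformation in the generator strips the brackets, turns whitespace into commas, and evaluates the result as a list. The sketch below walks through the same steps on a single string, using `ast.literal_eval` as a safer stand-in for `eval`. The raw format `"[-1  1]"` (space-separated values inside brackets) is an assumption inferred from the transformation code, and `parse_number_list` is a hypothetical helper name.

```python
import ast


def parse_number_list(raw):
    """Turn a whitespace-separated string such as "[-1  1]" into [-1, 1]."""
    # Step 1: drop the surrounding brackets.
    inner = raw.replace("[", "").replace("]", "")
    # Step 2: collapse each run of whitespace into a single comma.
    comma_separated = ",".join(inner.split())
    # Step 3: parse as a Python literal; unlike eval, this rejects arbitrary code.
    return ast.literal_eval("[" + comma_separated + "]")


print(parse_number_list("[-1  1]"))
print(parse_number_list("[2 4 6]"))
print(parse_number_list("[]"))
```

This only covers whitespace-separated input. Strings that are already valid Python list literals, as the other two transformed columns appear to be, can be passed to `ast.literal_eval` directly.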
2. Add the Generator to the ParquetGraphDB
Now that we have defined the generator, we can add it to the ParquetGraphDB instance by calling the add_node_generator method, passing the function, its arguments, and its keyword arguments. The run_immediately option controls whether the generator runs right away or later; it defaults to True.
The node generator is stored in the node_generator_store of the ParquetGraphDB instance.
[6]:
db.add_node_generator(
    generator_func=element,
    generator_args={},
    generator_kwargs={"base_file": elements_file},
    run_immediately=False,  # Defer execution; the default (True) runs the generator right away
)
# Check the node generators in the ParquetGraphDB
print(db.node_generator_store)
============================================================
GENERATOR STORE SUMMARY
============================================================
• Number of generators: 1
Storage path: data\GraphDB\node_generators
############################################################
METADATA
############################################################
• class: GeneratorStore
• class_module: parquetdb.graph.generator_store
############################################################
GENERATOR DETAILS
############################################################
• Columns:
- generator_func
- generator_kwargs.base_file
- generator_name
- id
• Generator names:
- element
Running a Node Generator Later
Now we can run the node generator with db.run_node_generator(generator_name).
Note: We do not need to pass the arguments or kwargs when running the node generator, because this information is stored in the node generator store. We can, however, override them if we want to.
[7]:
table = db.run_node_generator("element")
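The notebook does not show how overrides are passed when a generator is run, so the following is only a sketch of one plausible behavior: call-time values take precedence over stored ones, and the stored values stay unchanged. The helper `resolve_kwargs` and the file paths are hypothetical and are not part of parquetdb.

```python
def resolve_kwargs(stored_kwargs, overrides=None):
    """Merge stored kwargs with call-time overrides; overrides win."""
    # Build a new dict so the stored kwargs are never mutated.
    return {**stored_kwargs, **(overrides or {})}


stored = {"base_file": "data/elements.parquet"}

# Without overrides, the stored arguments are used as-is.
print(resolve_kwargs(stored))

# With an override, only the overridden key changes for this run.
print(resolve_kwargs(stored, {"base_file": "data/elements_v2.parquet"}))
```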
Let’s check the node store for the elements.
[8]:
element_node_store = db.get_node_store("element")
print(element_node_store)
============================================================
NODE STORE SUMMARY
============================================================
Node type: element
• Number of nodes: 118
• Number of features: 99
Storage path: data\GraphDB\nodes\element
############################################################
METADATA
############################################################
• class: NodeStore
• class_module: parquetdb.graph.nodes
• node_type: element
• name_column: id
############################################################
NODE DETAILS
############################################################
Material-Element Edge Generator¶
1. Define the Generator¶
An edge generator is similar to a node generator but returns a PyArrow Table describing edges. Each generated edge must have at least these fields:
source_id (int)
source_type (string)
target_id (int)
target_type (string)
Additionally, an edge generator must take the corresponding node stores of the ParquetGraphDB instance as arguments. This ensures that the node ids it emits are valid and belong to the correct node store.
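The requirements above can be sketched as a small check in plain Python. This is an illustration only, not a parquetdb API: validate_edges, REQUIRED_FIELDS, and the toy node_ids mapping are hypothetical names, and edges are represented as a dictionary of columns rather than a PyArrow Table.

```python
from typing import Dict, List, Set

# The four fields every generated edge is expected to carry.
REQUIRED_FIELDS = ("source_id", "source_type", "target_id", "target_type")


def validate_edges(edges: Dict[str, list], node_ids: Dict[str, Set[int]]) -> List[str]:
    """Return a list of problems; an empty list means the edges look valid."""
    missing = [field for field in REQUIRED_FIELDS if field not in edges]
    if missing:
        return [f"missing required field: {field}" for field in missing]

    problems = []
    rows = zip(*(edges[field] for field in REQUIRED_FIELDS))
    for i, (src_id, src_type, tgt_id, tgt_type) in enumerate(rows):
        # Each id must exist in the node store named by its type column.
        if src_id not in node_ids.get(src_type, set()):
            problems.append(f"row {i}: no {src_type} node with id {src_id}")
        if tgt_id not in node_ids.get(tgt_type, set()):
            problems.append(f"row {i}: no {tgt_type} node with id {tgt_id}")
    return problems


# Toy node stores: node type -> set of existing ids.
node_ids = {"material": {0, 1}, "element": {0, 7, 25}}
good = {
    "source_id": [0, 1],
    "source_type": ["material", "material"],
    "target_id": [7, 25],
    "target_type": ["element", "element"],
}
bad = {**good, "target_id": [7, 99]}  # id 99 does not exist in "element"

good_problems = validate_edges(good, node_ids)
bad_problems = validate_edges(bad, node_ids)
print(good_problems)
print(bad_problems)
```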
For edges we use the edge_generator decorator.
[9]:
from parquetdb import edge_generator
import pandas as pd
import pyarrow as pa


@edge_generator
def material_element_has(material_store, element_store):
    # The material and element node stores are passed in as arguments
    try:
        connection_name = "has"
        # We select only the necessary columns from the node stores
        material_table = material_store.read_nodes(
            columns=["id", "core.material_id", "core.elements"]
        )
        element_table = element_store.read_nodes(columns=["id", "symbol"])
        # We rename for utility purposes
        material_table = material_table.rename_columns(
            {"id": "source_id", "core.material_id": "material_name"}
        )
        material_table = material_table.append_column(
            "source_type", pa.array(["material"] * material_table.num_rows)
        )
        element_table = element_table.rename_columns({"id": "target_id"})
        element_table = element_table.append_column(
            "target_type", pa.array(["element"] * element_table.num_rows)
        )
        # We convert the tables to pandas for easier manipulation
        material_df = material_table.to_pandas()
        element_df = element_table.to_pandas()
        # We create a map of the element symbols to the target_id for quick lookup
        element_target_id_map = {
            row["symbol"]: row["target_id"] for _, row in element_df.iterrows()
        }
        # We create a dictionary to store the edge data
        table_dict = {
            "source_id": [],
            "source_type": [],
            "target_id": [],
            "target_type": [],
            "edge_type": [],
            "name": [],
            "weight": [],
        }
        # We iterate over the material nodes
        for _, row in material_df.iterrows():
            # We get the elements composing the material
            elements = row["core.elements"]
            source_id = row["source_id"]
            material_name = row["material_name"]
            if elements is None:
                continue
            # We iterate over the elements
            for element in elements:
                # We get the target_id for the element
                target_id = element_target_id_map[element]
                # We append the edge data to the dictionary.
                # Here we could also define the reverse edge.
                table_dict["source_id"].append(source_id)
                table_dict["source_type"].append(material_store.node_type)
                table_dict["target_id"].append(target_id)
                table_dict["target_type"].append(element_store.node_type)
                table_dict["edge_type"].append(connection_name)
                name = f"{material_name}_{connection_name}_{element}"
                table_dict["name"].append(name)
                table_dict["weight"].append(1.0)
        df = pd.DataFrame(table_dict)
    except Exception as e:
        # Report the failure and re-raise; otherwise df would be undefined below
        print(f"Error creating material-element-has relationships: {e}")
        raise
    return df
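The generator above looks up each symbol with element_target_id_map[element], which raises a KeyError if a material lists a symbol that is missing from the element store. One possible alternative, sketched here in plain Python on toy data, is to skip unknown symbols and report them. The build_has_edges function and its inputs are hypothetical and not part of parquetdb.

```python
from typing import Dict, List, Optional, Tuple


def build_has_edges(
    materials: List[Tuple[int, Optional[List[str]]]],
    symbol_to_id: Dict[str, int],
) -> Tuple[List[Tuple[int, int]], List[str]]:
    """Return (edges, unknown_symbols) for (material_id, element_symbols) pairs."""
    edges: List[Tuple[int, int]] = []
    unknown: List[str] = []
    for material_id, symbols in materials:
        if symbols is None:
            # Materials without a composition produce no edges.
            continue
        for symbol in symbols:
            target_id = symbol_to_id.get(symbol)
            if target_id is None:
                # Record the unknown symbol instead of raising KeyError.
                unknown.append(symbol)
                continue
            edges.append((material_id, target_id))
    return edges, unknown


# Toy inputs: "Xx" is deliberately not a known element symbol.
toy_materials = [(0, ["H", "O"]), (1, None), (2, ["Fe", "Xx"])]
toy_symbol_to_id = {"H": 0, "O": 7, "Fe": 25}

toy_edges, toy_unknown = build_has_edges(toy_materials, toy_symbol_to_id)
print(toy_edges)
print(toy_unknown)
```

Whether to skip, log, or fail on an unknown symbol is a design choice; failing loudly is safer when the element store is expected to be complete.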
2. Add the Generator to the ParquetGraphDB¶
Now that we have defined the generator, we can add it to the ParquetGraphDB instance. We do this by calling the add_edge_generator method.
The edge generator will be stored in the edge_generator_store of the ParquetGraphDB instance.
[10]:
element_store = db.get_node_store("element")
material_store = db.get_node_store("material")
db.add_edge_generator(
    generator_func=material_element_has,
    generator_args={
        "material_store": material_store,
        "element_store": element_store,
    },
    generator_kwargs={},
    run_immediately=True,
)
Let’s check the edge generator store.
[11]:
print(db.edge_generator_store)
============================================================
GENERATOR STORE SUMMARY
============================================================
• Number of generators: 1
Storage path: data\GraphDB\edge_generators
############################################################
METADATA
############################################################
• class: GeneratorStore
• class_module: parquetdb.graph.generator_store
############################################################
GENERATOR DETAILS
############################################################
• Columns:
- generator_args.element_store
- generator_args.material_store
- generator_func
- generator_name
- id
• Generator names:
- material_element_has
Let’s check whether the edge generator created the edges in the edge store.
[12]:
edge_store = db.get_edge_store("material_element_has")
print(edge_store)
============================================================
EDGE STORE SUMMARY
============================================================
Edge type: material_element_has
• Number of edges: 3348
• Number of features: 8
Storage path: data\GraphDB\edges\material_element_has
############################################################
METADATA
############################################################
• class: EdgeStore
• class_module: parquetdb.graph.edges
############################################################
EDGE DETAILS
############################################################
Updates to Node Stores¶
By default, when node and edge generators are added, the stores passed as their arguments are registered as dependencies in the ParquetGraphDB instance. This means that when a parent store is updated, the dependent generators rerun and update their corresponding stores.
These dependencies are recorded in the ParquetGraphDB/generator_dependency.json file.
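The propagation described above can be sketched with a toy dependency map in plain Python. This only illustrates the idea and is not parquetdb's mechanism or the format of its dependency file; the materials dictionary, the dependencies map, and delete_material are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Toy node store: material id -> element symbols (illustrative data only).
materials: Dict[int, List[str]] = {0: ["Li", "Fe", "F"], 1: ["Gd", "F"]}
edges: List[Tuple[int, str, str]] = []


def material_element_has() -> None:
    # Rebuild the edge list from the current state of the material store.
    edges.clear()
    for material_id, symbols in materials.items():
        for symbol in symbols:
            edges.append((material_id, "has", symbol))


# Hypothetical dependency map: store name -> generators to rerun when it changes.
dependencies: Dict[str, List[Callable[[], None]]] = {
    "material": [material_element_has],
}


def delete_material(material_id: int) -> None:
    materials.pop(material_id, None)
    # Propagate the change to every generator that depends on this store.
    for generator in dependencies.get("material", []):
        generator()


material_element_has()
edges_before = len(edges)
delete_material(0)
edges_after = len(edges)
print(edges_before, edges_after)
```

In this toy setup, deleting a material with three elements removes exactly three edges once the dependent generator reruns.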
[14]:
materials_df = db.read_nodes(node_type="material", columns=["id"], ids=[0]).to_pandas()
print(materials_df)
db.delete_nodes(node_type="material", ids=[0])
materials_df = db.read_nodes(node_type="material", columns=["id"], ids=[0]).to_pandas()
print(materials_df)
id
0 0
Empty DataFrame
Columns: [id]
Index: []
As you can see, the material node with id=0 is now gone.
Let’s check the material_element_has edges to see whether the update has been propagated.
[15]:
edge_store = db.get_edge_store("material_element_has")
print(edge_store)
============================================================
EDGE STORE SUMMARY
============================================================
Edge type: material_element_has
• Number of edges: 3345
• Number of features: 8
Storage path: data\GraphDB\edges\material_element_has
############################################################
METADATA
############################################################
• class: EdgeStore
• class_module: parquetdb.graph.edges
############################################################
EDGE DETAILS
############################################################
There are now 3345 material_element_has edges, down from 3348 before the deletion.
Let’s check the material_element_has dataframe.
[16]:
df = edge_store.read_edges().to_pandas()
print(df)
edge_type id name source_id source_type target_id \
0 has 0 mp-1222351_has_F 1 material 8
1 has 1 mp-1222351_has_Fe 1 material 25
2 has 2 mp-1222351_has_Li 1 material 2
3 has 3 mp-651087_has_F 2 material 8
4 has 4 mp-651087_has_Gd 2 material 63
... ... ... ... ... ... ...
3340 has 3340 mp-2714707_has_Al 999 material 12
3341 has 3341 mp-2714707_has_Na 999 material 10
3342 has 3342 mp-2714707_has_O 999 material 7
3343 has 3343 mp-2714707_has_S 999 material 15
3344 has 3344 mp-2714707_has_Zn 999 material 29
target_type weight
0 element 1.0
1 element 1.0
2 element 1.0
3 element 1.0
4 element 1.0
... ... ...
3340 element 1.0
3341 element 1.0
3342 element 1.0
3343 element 1.0
3344 element 1.0
[3345 rows x 8 columns]
As you can see, no edge has a source_id of 0.
We can double check this with the following:
[20]:
df[df["source_id"] == 0]
[20]:
| edge_type | id | name | source_id | source_type | target_id | target_type | weight |
|---|---|---|---|---|---|---|---|
This is empty, as expected.
[21]:
print(db)
============================================================
GRAPH DATABASE SUMMARY
============================================================
Name: GraphDB
Storage path: data\GraphDB
└── Repository structure:
├── nodes/ (data\GraphDB\nodes)
├── edges/ (data\GraphDB\edges)
├── edge_generators/ (data\GraphDB\edge_generators)
├── node_generators/ (data\GraphDB\node_generators)
└── graph/ (data\GraphDB\graph)
############################################################
NODE DETAILS
############################################################
Total node types: 2
------------------------------------------------------------
• Node type: material
- Number of nodes: 999
- Number of features: 136
- db_path: data\GraphDB\nodes\material
------------------------------------------------------------
• Node type: element
- Number of nodes: 118
- Number of features: 99
- db_path: data\GraphDB\nodes\element
------------------------------------------------------------
############################################################
EDGE DETAILS
############################################################
Total edge types: 1
------------------------------------------------------------
• Edge type: material_element_has
- Number of edges: 3345
- Number of features: 8
- db_path: data\GraphDB\edges\material_element_has
------------------------------------------------------------
############################################################
NODE GENERATOR DETAILS
############################################################
Total node generators: 1
------------------------------------------------------------
• Generator: element
Generator Args:
- generator_func: [<function wrapper at 0x000002CDBFC9D1B0>]
- generator_kwargs.base_file: [WindowsPath('data/elements.parquet')]
- generator_name: ['element']
- id: [0]
Generator Kwargs:
- base_file: [WindowsPath('data/elements.parquet')]
------------------------------------------------------------
############################################################
EDGE GENERATOR DETAILS
############################################################
Total edge generators: 1
------------------------------------------------------------
• Generator: material_element_has
Generator Args:
- element_store: data\GraphDB\nodes\element
- material_store: data\GraphDB\nodes\material
Generator Kwargs:
------------------------------------------------------------
6. Summary¶
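In this notebook, we showed how to define custom node and edge generators, add them to a ParquetGraphDB instance, run them immediately or later, and rely on generator dependencies to propagate node deletions to the dependent edge store.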