parquetdb.graph.generator_store.GeneratorStore
- class GeneratorStore(storage_path: str, initial_fields: List[Field] | None = None, verbose: int = 1)
A store for managing generator functions in a graph database. This class handles serialization, storage, and loading of functions that generate edges between nodes.
- __init__(storage_path: str, initial_fields: List[Field] | None = None, verbose: int = 1)
Initialize the GeneratorStore.
- Parameters:
storage_path (str) – Path where the generator functions will be stored
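The class description above mentions three responsibilities: serializing a function, storing it under a name, and loading it back to produce edges. The sketch below illustrates that lifecycle with a toy stand-in built only on the standard library. `ToyGeneratorStore`, its `marshal`-based file format, and the `fully_connected` example function are all hypothetical and are not how parquetdb implements `GeneratorStore`. The method names mirror those listed on this page, but the real signatures and behavior may differ.

```python
import builtins
import marshal
import tempfile
import types
from pathlib import Path


class ToyGeneratorStore:
    """Hypothetical stand-in for illustration; NOT the parquetdb implementation."""

    def __init__(self, storage_path: str):
        self.storage_path = Path(storage_path)
        self.storage_path.mkdir(parents=True, exist_ok=True)

    def _file(self, generator_name: str) -> Path:
        return self.storage_path / f"{generator_name}.bin"

    def store_generator(self, generator_func, generator_name: str) -> None:
        # Assumption: only the compiled code object is saved, so closures and
        # default arguments are lost, and the bytes are tied to the Python version.
        payload = marshal.dumps(generator_func.__code__)
        self._file(generator_name).write_bytes(payload)

    def is_in(self, generator_name: str) -> bool:
        return self._file(generator_name).exists()

    def list_generators(self):
        return sorted(p.stem for p in self.storage_path.glob("*.bin"))

    def load_generator(self, generator_name: str):
        code = marshal.loads(self._file(generator_name).read_bytes())
        # Rebuild a callable with a fresh global namespace that only has builtins.
        return types.FunctionType(code, {"__builtins__": builtins}, generator_name)

    def run_generator(self, generator_name: str, *args, **kwargs):
        return self.load_generator(generator_name)(*args, **kwargs)

    def delete_generator(self, generator_name: str) -> None:
        self._file(generator_name).unlink()


def fully_connected(node_ids):
    """Example edge generator: one directed edge per ordered pair of distinct nodes."""
    edges = []
    for source in node_ids:
        for target in node_ids:
            if source != target:
                edges.append((source, target))
    return edges


with tempfile.TemporaryDirectory() as tmp:
    store = ToyGeneratorStore(tmp)
    store.store_generator(fully_connected, "fully_connected")
    names = store.list_generators()
    edges = store.run_generator("fully_connected", [0, 1, 2])
    store.delete_generator("fully_connected")
    still_there = store.is_in("fully_connected")
```

The toy keeps one file per generator name so that listing, loading, and deleting each map to a single file-system operation. A real store would more likely persist the serialized function as a record alongside its metadata.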
Methods
__init__(storage_path[, initial_fields, verbose]) – Initialize the GeneratorStore.
backup_database(backup_path) – Creates a complete backup of the current dataset.
construct_table(data[, schema, metadata, ...]) – Constructs a PyArrow Table from various input data formats.
copy_dataset(dest_name[, overwrite]) – Creates a complete copy of the current dataset under a new name.
create(data[, schema, metadata, ...]) – Adds new data to the database.
dataset_exists([dataset_name]) – Check if a dataset exists and contains data.
delete([ids, filters, columns, normalize_config]) – Deletes records or columns from the database.
delete_generator(generator_name) – Delete a generator by name.
drop_dataset() – Removes the current dataset directory and reinitializes it with an empty table.
export_dataset(file_path[, format]) – Exports the entire dataset to a single file in the specified format.
export_partitioned_dataset(export_dir, ...) – Exports the dataset to a partitioned format in the specified directory.
get_current_files() – Get a list of all Parquet files in the current dataset.
get_field_metadata([field_names, return_bytes]) – Retrieves metadata for specified fields/columns in the dataset.
get_field_names([columns, include_cols]) – Get the names of fields/columns in the dataset schema.
get_file_sizes([verbose]) – Get the size of each file in the dataset in MB.
get_metadata([return_bytes]) – Retrieves the metadata of the dataset table.
get_n_rows_per_row_group_per_file([as_dict]) – Get the number of rows in each row group for each file.
get_number_of_row_groups_per_file() – Get the number of row groups in each Parquet file in the dataset.
get_number_of_rows_per_file() – Get the number of rows in each Parquet file in the dataset.
get_parquet_column_metadata_per_file([as_dict]) – Get detailed metadata for each column in each row group in each file.
get_parquet_file_metadata_per_file([as_dict]) – Get the metadata for each Parquet file in the dataset.
get_parquet_file_row_group_metadata_per_file([...]) – Get detailed metadata for each row group in each Parquet file.
get_row_group_sizes_per_file([verbose]) – Get the size of each row group for each file.
get_schema() – Get the PyArrow schema of the dataset.
get_serialized_metadata_size_per_file() – Get the serialized metadata size for each Parquet file in the dataset.
import_dataset(file_path[, format]) – Imports data from a file into the dataset, supporting multiple file formats.
is_empty() – Check if the dataset is empty.
is_in(generator_name)
list_generators() – List all stored edge generators.
load_generator(generator_name) – Load an edge generator function by name.
load_generator_data(generator_name)
merge_datasets(source_tables, dest_table)
normalize([normalize_config]) – Normalize the dataset by restructuring files for optimal performance.
preprocess_data_without_python_objects(data) – Preprocesses data without Python objects.
preprocess_table(table[, ...]) – Preprocesses a PyArrow table by flattening nested structures and handling special field types.
process_data_with_python_objects(data[, ...]) – Processes input data and handles Python object serialization.
read([ids, columns, filters, load_format, ...]) – Reads data from the database with flexible filtering and formatting options.
rename_dataset(new_name[, remove_dest]) – Renames the current dataset directory and all contained files.
rename_fields(name_map[, normalize_config]) – Rename fields/columns in the dataset using a mapping dictionary.
restore_database(backup_path) – Restores the dataset from a previous backup.
run_generator(generator_name[, ...]) – Run a generator function by name.
set_field_metadata(fields_metadata[, update]) – Sets or updates metadata for specific fields/columns in the dataset.
set_metadata(metadata[, update]) – Sets or updates the metadata of the dataset table.
sort_fields([normalize_config]) – Sort the fields/columns of the dataset alphabetically by name.
store_generator(generator_func, generator_name) – Store an edge generator function.
summary([show_column_names]) – Generate a formatted summary string containing database information and metadata.
to_nested([nested_dataset_dir, ...]) – Converts the current dataset to a nested structure optimized for querying nested data.
transform(transform_callable[, new_db_path, ...]) – Transform the entire dataset using a user-provided callable.
update(data[, schema, metadata, ...]) – Updates existing records in the database by matching on specified key fields.
update_schema([field_dict, schema, ...]) – Updates the schema of the table in the dataset.
Attributes
basename_template – Get the template for parquet file basenames.
columns – Get the column names in the database.
dataset_name – Get the dataset name.
db_path – Get the database path.
generator_names
n_columns – Get the number of columns in the database.
n_files – Get the number of parquet files in the database.
n_generators
n_row_groups_per_file – Get the number of row groups in each parquet file.
n_rows – Get the total number of rows in the database.
n_rows_per_file – Get the number of rows in each parquet file.
n_rows_per_row_group_per_file – Get the number of rows in each row group for each file.
required_fields
serialized_metadata_size_per_file – Get the size of serialized metadata for each file.
storage_path