parquetdb.core.parquetdb.ParquetDB
- class ParquetDB(db_path: str | Path, schema: Schema | None = None, initial_fields: List[Field] | None = None, metadata: Dict[str, str] | None = None, config: ParquetDBConfig = ParquetDBConfig(serialize_python_objects=False, convert_to_fixed_shape=None, normalize_config=NormalizeConfig(load_format=table, batch_size=131072, batch_readahead=16, fragment_readahead=4, fileformat=parquet, fragment_scan_options=None, memory_pool=None, filesystem=None, file_options=None, use_threads=True, max_partitions=1024, max_open_files=1024, max_rows_per_file=10000000, min_rows_per_group=50000, max_rows_per_group=100000, file_visitor=None, existing_data_behavior=overwrite_or_ignore, create_dir=True), load_config=LoadConfig(batch_size=131072, batch_readahead=16, fragment_readahead=4, fragment_scan_options=None, use_threads=True, memory_pool=None)), verbose: int = 0)
- __init__(db_path: str | Path, schema: Schema | None = None, initial_fields: List[Field] | None = None, metadata: Dict[str, str] | None = None, config: ParquetDBConfig = ParquetDBConfig(serialize_python_objects=False, convert_to_fixed_shape=None, normalize_config=NormalizeConfig(load_format=table, batch_size=131072, batch_readahead=16, fragment_readahead=4, fileformat=parquet, fragment_scan_options=None, memory_pool=None, filesystem=None, file_options=None, use_threads=True, max_partitions=1024, max_open_files=1024, max_rows_per_file=10000000, min_rows_per_group=50000, max_rows_per_group=100000, file_visitor=None, existing_data_behavior=overwrite_or_ignore, create_dir=True), load_config=LoadConfig(batch_size=131072, batch_readahead=16, fragment_readahead=4, fragment_scan_options=None, use_threads=True, memory_pool=None)), verbose: int = 0)
Initializes the ParquetDB object.
- Parameters:
db_path (str | Path) – Path to the database.
schema (pa.Schema, optional) – PyArrow schema defining the structure and types of the data. If not provided, schema will be inferred from the data.
initial_fields (List[pa.Field], optional) – List of PyArrow fields to initialize the database schema with. An ‘id’ field of type int64 will automatically be added. Default is None (empty list).
metadata (Dict[str, str], optional) – Dictionary of key-value pairs to attach as metadata to the table. This metadata applies to the entire table.
config (ParquetDBConfig, optional) – Configuration for the ParquetDB object. Default is ParquetDBConfig().
verbose (int, optional) – Verbosity level for logging. Default is 0.
Examples
>>> from parquetdb import ParquetDB
>>> import pyarrow as pa
>>> fields = [pa.field('name', pa.string()), pa.field('age', pa.int32())]
>>> db = ParquetDB(db_path='/path/to/db', initial_fields=fields)
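The constructor example stops once the handle exists. As a minimal sketch of a likely next step, the function below writes a few records with `create` and reads them back with `read`, both listed under Methods. It assumes that `create` accepts a list of dictionaries and infers the schema, and that `read()` called without arguments returns the whole dataset (the `load_format=table` default in the signature suggests a PyArrow table). The helper name `build_sample_db` and the sample records are illustrative, not part of the library.

```python
# Illustrative records; the field names are made up for this sketch.
RECORDS = [
    {"name": "Ada", "age": 36},
    {"name": "Linus", "age": 54},
]


def build_sample_db(db_path):
    """Create a ParquetDB at db_path, insert RECORDS, and read them back."""
    # Imported inside the function so this module can be loaded
    # even on a machine where parquetdb is not installed.
    from parquetdb import ParquetDB

    db = ParquetDB(db_path=db_path)
    # Assumption: create() accepts a list of dicts and infers the schema.
    db.create(RECORDS)
    # Assumption: read() without arguments returns every row.
    table = db.read()
    return db, table


# Usage (requires parquetdb and a writable directory):
#     db, table = build_sample_db("/path/to/db")
#     print(table.num_rows)
```

No `schema` or `initial_fields` is passed here, so the column types come from the data itself, as the `schema` parameter description allows.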
Methods
__init__(db_path[, schema, initial_fields, ...]) – Initializes the ParquetDB object.
backup_database(backup_path) – Creates a complete backup of the current dataset.
construct_table(data[, schema, metadata, ...]) – Constructs a PyArrow Table from various input data formats.
copy_dataset(dest_name[, overwrite]) – Creates a complete copy of the current dataset under a new name.
create(data[, schema, metadata, ...]) – Adds new data to the database.
dataset_exists([dataset_name]) – Check if a dataset exists and contains data.
delete([ids, filters, columns, normalize_config]) – Deletes records or columns from the database.
drop_dataset() – Removes the current dataset directory and reinitializes it with an empty table.
export_dataset(file_path[, format]) – Exports the entire dataset to a single file in the specified format.
export_partitioned_dataset(export_dir, ...) – Exports the dataset to a partitioned format in the specified directory.
get_current_files() – Get a list of all Parquet files in the current dataset.
get_field_metadata([field_names, return_bytes]) – Retrieves metadata for specified fields/columns in the dataset.
get_field_names([columns, include_cols]) – Get the names of fields/columns in the dataset schema.
get_file_sizes([verbose]) – Get the size of each file in the dataset in MB.
get_metadata([return_bytes]) – Retrieves the metadata of the dataset table.
get_n_rows_per_row_group_per_file([as_dict]) – Get the number of rows in each row group for each file.
get_number_of_row_groups_per_file() – Get the number of row groups in each Parquet file in the dataset.
get_number_of_rows_per_file() – Get the number of rows in each Parquet file in the dataset.
get_parquet_column_metadata_per_file([as_dict]) – Get detailed metadata for each column in each row group in each file.
get_parquet_file_metadata_per_file([as_dict]) – Get the metadata for each Parquet file in the dataset.
get_parquet_file_row_group_metadata_per_file([...]) – Get detailed metadata for each row group in each Parquet file.
get_row_group_sizes_per_file([verbose]) – Get the size of each row group for each file.
get_schema() – Get the PyArrow schema of the dataset.
get_serialized_metadata_size_per_file() – Get the serialized metadata size for each Parquet file in the dataset.
import_dataset(file_path[, format]) – Imports data from a file into the dataset, supporting multiple file formats.
initialize_empty_db([schema, ...])
is_empty() – Check if the dataset is empty.
merge_datasets(source_tables, dest_table)
normalize([normalize_config]) – Normalize the dataset by restructuring files for optimal performance.
preprocess_data_without_python_objects(data) – Preprocesses data without python objects.
preprocess_table(table[, ...]) – Preprocesses a PyArrow table by flattening nested structures and handling special field types.
process_data_with_python_objects(data[, ...]) – Processes input data and handles Python object serialization.
read([ids, columns, filters, load_format, ...]) – Reads data from the database with flexible filtering and formatting options.
rename_dataset(new_name[, remove_dest]) – Renames the current dataset directory and all contained files.
rename_fields(name_map[, normalize_config]) – Rename fields/columns in the dataset using a mapping dictionary.
restore_database(backup_path) – Restores the dataset from a previous backup.
set_field_metadata(fields_metadata[, update]) – Sets or updates metadata for specific fields/columns in the dataset.
set_metadata(metadata[, update]) – Sets or updates the metadata of the dataset table.
sort_fields([normalize_config]) – Sort the fields/columns of the dataset alphabetically by name.
summary([show_column_names, ...]) – Generate a formatted summary string containing database information and metadata.
to_nested([nested_dataset_dir, ...]) – Converts the current dataset to a nested structure optimized for querying nested data.
transform(transform_callable[, new_db_path, ...]) – Transform the entire dataset using a user-provided callable.
update(data[, schema, metadata, ...]) – Updates existing records in the database by matching on specified key fields.
update_schema([field_dict, schema, ...]) – Updates the schema of the table in the dataset.
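The `update`, `delete`, and `read` entries above can be combined into a small edit cycle. The sketch below is hedged: the keyword names `ids` and `columns` come from the bracketed signatures in the table, but the shape of the `update` payload (a list of dictionaries matched on the automatically added `id` field) is an assumption, as are the helper name `revise_and_prune` and the `name`/`age` columns.

```python
def revise_and_prune(db, update_id, new_age, delete_id):
    """Update one record, delete another, and return the remaining rows."""
    # Assumption: update() takes a list of dicts and matches rows on "id".
    db.update([{"id": update_id, "age": new_age}])
    # delete() lists `ids` among its optional arguments.
    db.delete(ids=[delete_id])
    # read() lists `columns`; selecting columns avoids loading unused fields.
    return db.read(columns=["id", "name", "age"])
```

The function takes an already constructed `ParquetDB` handle, so it can be tried against any dataset that has `name` and `age` columns.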
Attributes
basename_template – Get the template for parquet file basenames.
columns – Get the column names in the database.
dataset_name – Get the dataset name.
db_path – Get the database path.
n_columns – Get the number of columns in the database.
n_files – Get the number of parquet files in the database.
n_row_groups_per_file – Get the number of row groups in each parquet file.
n_rows – Get the total number of rows in the database.
n_rows_per_file – Get the number of rows in each parquet file.
n_rows_per_row_group_per_file – Get the number of rows in each row group for each file.
schema – Get the schema of the database.
serialized_metadata_size_per_file – Get the size of serialized metadata for each file.
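The attributes above are read-only views of the dataset, which makes them convenient for a quick size report. A minimal sketch, assuming `columns` is an iterable of names and the counts are plain integers (the listing does not state return types); `describe` is a hypothetical helper, not a library function.

```python
def describe(db):
    """Collect a few size-related attributes of a ParquetDB-like object."""
    return {
        "path": str(db.db_path),       # database location
        "n_rows": db.n_rows,           # total rows across all files
        "n_columns": db.n_columns,     # number of columns in the schema
        "n_files": db.n_files,         # number of parquet files on disk
        "columns": list(db.columns),   # column names, copied into a list
    }
```

For a formatted report rather than a dictionary, the `summary` method listed under Methods is the documented route.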