Data Management

Manifest Platform provides a unified data layer for managing datasets, querying data from connector-backed sources, tracking data lineage, and running vector search for retrieval-augmented generation (RAG) and semantic similarity use cases.

Data Capabilities

graph TD
    DM["Data Management"] --> DS["Datasets<br/>Structured data catalog"]
    DM --> VS["Vector Search<br/>Embeddings & semantic queries"]
    DM --> LN["Lineage<br/>Data provenance tracking"]

    DS --> Sources["Connector Sources<br/>Live data from external systems"]
    DS --> Schema["Schema Management<br/>Versioned field definitions"]
    DS --> Query["Query API<br/>Filter, sort, paginate"]

    VS --> Embed["Embedding Generation"]
    VS --> Index["Vector Indexes"]
    VS --> Similarity["Similarity Search"]

Key Concepts

Datasets

A dataset is a managed collection of structured data within the platform. Datasets serve multiple purposes:

Agent evaluation -- Test datasets with input/expected_output pairs for scoring agents in the Playground
Live data sources -- Backed by connector instances that query external systems on demand
Reference data -- Static datasets for lookups, enrichment, and agent context
Pipeline output -- Results from data transformation workflows

Datasets live within a workspace and are governed by the organization's access policies.
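To make the agent-evaluation purpose listed above more concrete, here is a minimal sketch of scoring an agent against input/expected_output pairs with exact-match comparison. The record layout, the `score_agent` helper, and the stand-in `toy_agent` are assumptions for illustration only; they are not the platform's API or the Playground's actual scoring method.

```python
from typing import Callable

# Hypothetical evaluation records: each row pairs an input with the expected output.
eval_records = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
    {"input": "3 * 3", "expected_output": "9"},
]


def score_agent(agent: Callable[[str], str], records: list[dict]) -> float:
    """Return the fraction of records where the agent's answer matches exactly."""
    if not records:
        return 0.0
    hits = sum(
        1 for r in records if agent(r["input"]).strip() == r["expected_output"]
    )
    return hits / len(records)


def toy_agent(prompt: str) -> str:
    """Stand-in agent that only knows two answers."""
    answers = {"2 + 2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


accuracy = score_agent(toy_agent, eval_records)
print(f"accuracy: {accuracy:.2f}")
```

Exact match is the simplest possible scorer; a real evaluation might use fuzzy or model-graded comparison instead.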

Vector Search

Vector search enables semantic queries over embeddings stored alongside dataset records. Instead of matching exact keywords, vector search finds records that are semantically similar to a query -- powering RAG pipelines, document retrieval, and recommendation systems.
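The idea of ranking records by semantic similarity can be sketched in plain Python with cosine similarity over stored embeddings. This is an illustration under stated assumptions: the record layout, the tiny 3-dimensional vectors, and the `top_k` helper are invented here, and a brute-force scan stands in for whatever index the platform actually uses.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], records: list[dict], k: int = 2) -> list[dict]:
    """Rank records by similarity of their stored embedding to the query vector."""
    ranked = sorted(
        records,
        key=lambda r: cosine_similarity(query, r["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical records with made-up 3-dimensional embeddings.
records = [
    {"id": "doc-refunds", "embedding": [0.9, 0.1, 0.0]},
    {"id": "doc-shipping", "embedding": [0.1, 0.9, 0.0]},
    {"id": "doc-returns", "embedding": [0.8, 0.2, 0.1]},
]

query_embedding = [1.0, 0.0, 0.0]  # stands in for an embedded user question
results = top_k(query_embedding, records, k=2)
print([r["id"] for r in results])
```

In a RAG pipeline, the top-ranked records would then be passed to the model as context.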

Data Lineage

The platform tracks lineage relationships between datasets. When one dataset is derived from another (through transformation, filtering, or aggregation), the lineage graph shows the full provenance chain from source to derived dataset.
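One way to picture a provenance chain is as a graph where each derived dataset points to the datasets it was built from. The sketch below walks those links back to the source. The dataset names, the `lineage` dictionary, and the `provenance_chain` helper are hypothetical; the platform's own lineage model may differ.

```python
# Hypothetical lineage edges: each dataset maps to the datasets it was derived from.
lineage = {
    "weekly-ticket-summary": ["open-tickets-filtered"],  # aggregation
    "open-tickets-filtered": ["customer-tickets"],       # filtering
    "customer-tickets": [],                              # source dataset, no parents
}


def provenance_chain(dataset: str, edges: dict[str, list[str]]) -> list[str]:
    """Walk parent links breadth-first and return ancestors, nearest first."""
    chain: list[str] = []
    seen: set[str] = set()
    queue = list(edges.get(dataset, []))
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue  # guard against revisiting shared ancestors or cycles
        seen.add(parent)
        chain.append(parent)
        queue.extend(edges.get(parent, []))
    return chain


print(provenance_chain("weekly-ticket-summary", lineage))
```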

Datasets and Connectors

Datasets can be backed by one or more connector sources. Each source links a connector instance and operation to the dataset, with optional field mappings and filters.

graph LR
    Dataset["Dataset<br/>customer-tickets"] --> S1["Source 1<br/>Jira (open tickets)"]
    Dataset --> S2["Source 2<br/>Jira (closed tickets)"]
    Dataset --> S3["Source 3<br/>Salesforce (contacts)"]

    S1 --> CI1["Jira Instance"]
    S2 --> CI1
    S3 --> CI2["Salesforce Instance"]

    CI1 --> Jira["Jira Cloud API"]
    CI2 --> SF["Salesforce API"]

When you query a dataset, the platform routes the query to the appropriate connector source, applies field mappings, and returns unified results.
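The query flow described above (fetch from each source, apply field mappings, return unified rows) can be sketched as follows. Everything here is an assumption for illustration: the fetch functions stand in for connector operations, and the source layout, `apply_mapping`, and `query_dataset` are hypothetical names rather than platform APIs.

```python
from typing import Callable, Optional


# Hypothetical stand-ins for connector operations; real sources would call external APIs.
def fetch_open_tickets() -> list[dict]:
    return [{"key": "SUP-1", "summary": "Login fails", "status": "Open"}]


def fetch_closed_tickets() -> list[dict]:
    return [{"key": "SUP-2", "summary": "Reset email", "status": "Closed"}]


# Each source pairs an operation with a field mapping (source field -> dataset field).
TICKET_MAPPING = {"key": "ticket_id", "summary": "title", "status": "state"}
sources = [
    {"name": "jira-open", "fetch": fetch_open_tickets, "mapping": TICKET_MAPPING},
    {"name": "jira-closed", "fetch": fetch_closed_tickets, "mapping": TICKET_MAPPING},
]


def apply_mapping(row: dict, mapping: dict[str, str]) -> dict:
    """Rename source fields to the dataset's field names, dropping unmapped fields."""
    return {target: row[source] for source, target in mapping.items() if source in row}


def query_dataset(
    sources: list[dict], predicate: Optional[Callable[[dict], bool]] = None
) -> list[dict]:
    """Fetch from every source, map fields, and return rows that pass the filter."""
    results = []
    for source in sources:
        for row in source["fetch"]():
            mapped = apply_mapping(row, source["mapping"])
            mapped["_source"] = source["name"]  # record which source produced the row
            if predicate is None or predicate(mapped):
                results.append(mapped)
    return results


open_only = query_dataset(sources, lambda r: r["state"] == "Open")
print(open_only)
```

This sketch filters after fetching; a real implementation would likely push filters down to the connector where possible.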

Data Governance

Certification Levels

Datasets can be certified to indicate data quality and trustworthiness:

Level	Meaning
`none`	Uncertified -- no quality guarantees
`bronze`	Basic validation passed
`silver`	Quality checks and schema validation passed
`gold`	Fully certified with lineage, quality scores, and review
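If the levels are treated as an ordered scale from `none` up to `gold` (an assumption suggested by the table, not something stated explicitly), a consumer could gate on a minimum level as sketched below. The `meets_certification` helper is hypothetical.

```python
# Levels from the table above, assumed to be ordered from least to most trusted.
CERTIFICATION_ORDER = ["none", "bronze", "silver", "gold"]


def meets_certification(level: str, minimum: str) -> bool:
    """Return True if `level` is at least as high as `minimum` on the assumed scale."""
    return CERTIFICATION_ORDER.index(level) >= CERTIFICATION_ORDER.index(minimum)


print(meets_certification("gold", "silver"))    # True
print(meets_certification("bronze", "silver"))  # False
```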

Policy Tags

Apply policy tags to datasets for compliance and access control. Examples include:

pii -- Contains personally identifiable information
hipaa -- Subject to HIPAA regulations
financial -- Contains financial data
internal-only -- Not for external sharing
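As a rough sketch of how tags might drive a check, the example below selects datasets by tag and blocks external sharing for anything tagged `internal-only`. The dataset records and both helpers are hypothetical, and real enforcement is handled by the platform's access policies rather than client code like this.

```python
# Hypothetical dataset records carrying policy tags.
datasets = [
    {"slug": "customer-tickets", "policy_tags": ["pii", "internal-only"]},
    {"slug": "public-faq", "policy_tags": []},
    {"slug": "billing-ledger", "policy_tags": ["financial"]},
]


def with_tag(datasets: list[dict], tag: str) -> list[str]:
    """Return the slugs of datasets that carry the given policy tag."""
    return [d["slug"] for d in datasets if tag in d["policy_tags"]]


def externally_shareable(dataset: dict) -> bool:
    """Treat `internal-only` as a hard block on external sharing."""
    return "internal-only" not in dataset["policy_tags"]


print(with_tag(datasets, "pii"))
print([d["slug"] for d in datasets if externally_shareable(d)])
```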

Data Residency

Specify where a dataset's data can be stored and processed by setting `data_residency` when you create it:

Coming Soon

The Python SDK for local development is not yet publicly available.

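from flow_sdk.cli_client import CLIClient

# `config` is not defined in this snippet; supply your own client configuration.
client = CLIClient(config)

# Create a dataset pinned to an EU region and tagged for compliance.
dataset = client.datasets.create({
    "name": "EU Customer Data",
    "slug": "eu-customer-data",
    "data_residency": "eu-west-1",      # region where data is stored and processed
    "policy_tags": ["pii", "gdpr"],     # policy tags applied at creation time
})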

Next Steps

Datasets -- Create, manage, and query structured datasets.
Vector Search -- Build semantic search with embeddings and vector indexes.