
Data Management

Manifest Platform provides a unified data layer for managing datasets, querying data from connector-backed sources, and performing vector search for RAG (Retrieval-Augmented Generation) and semantic similarity use cases.


Data Capabilities

graph TD
    DM["Data Management"] --> DS["Datasets<br/>Structured data catalog"]
    DM --> VS["Vector Search<br/>Embeddings & semantic queries"]
    DM --> LN["Lineage<br/>Data provenance tracking"]

    DS --> Sources["Connector Sources<br/>Live data from external systems"]
    DS --> Schema["Schema Management<br/>Versioned field definitions"]
    DS --> Query["Query API<br/>Filter, sort, paginate"]

    VS --> Embed["Embedding Generation"]
    VS --> Index["Vector Indexes"]
    VS --> Similarity["Similarity Search"]

Key Concepts

Datasets

A dataset is a managed collection of structured data within the platform. Datasets serve multiple purposes:

  • Agent evaluation -- Test datasets with input/expected_output pairs for scoring agents in the Playground
  • Live data sources -- Backed by connector instances that query external systems on demand
  • Reference data -- Static datasets for lookups, enrichment, and agent context
  • Pipeline output -- Results from data transformation workflows

Datasets live within a workspace and are governed by the organization's access policies.
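To make the agent-evaluation use case concrete, here is a minimal sketch of a test dataset built from input/expected_output pairs and scored by exact match. The `EvalRecord` class, the `exact_match_score` function, and the toy agent are hypothetical names chosen for illustration. They are not part of the platform or its SDK, and the Playground may score agents differently.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class EvalRecord:
    """One evaluation row: an input and the output expected for it."""
    input: str
    expected_output: str


def exact_match_score(records: list[EvalRecord], agent: Callable[[str], str]) -> float:
    """Return the fraction of records where the agent's answer matches exactly."""
    if not records:
        return 0.0
    hits = sum(
        1 for r in records
        if agent(r.input).strip() == r.expected_output.strip()
    )
    return hits / len(records)


def toy_agent(prompt: str) -> str:
    """Stand-in for a real agent (assumption): it only knows one answer."""
    return {"2 + 2": "4"}.get(prompt, "unknown")


# Hypothetical evaluation dataset.
records = [
    EvalRecord(input="2 + 2", expected_output="4"),
    EvalRecord(input="capital of France", expected_output="Paris"),
]

score = exact_match_score(records, toy_agent)  # one of the two answers matches
```

Exact match is the simplest possible scorer. Real evaluations often need fuzzier comparisons, which would replace only the equality check above.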

Vector Search

Vector search enables semantic queries over embeddings stored alongside dataset records. Instead of matching exact keywords, vector search finds records that are semantically similar to a query -- powering RAG pipelines, document retrieval, and recommendation systems.
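The idea behind similarity search can be sketched in a few lines of plain Python. The example below ranks records by cosine similarity between a query embedding and each record's embedding. The record layout, the three-dimensional vectors, and the `top_k` helper are assumptions made for illustration; the platform's actual index structures and similarity metric are not described here.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query: list[float], records: list[dict], k: int = 2) -> list[str]:
    """Return the ids of the k records most similar to the query embedding."""
    scored = [(cosine_similarity(query, r["embedding"]), r["id"]) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record_id for _, record_id in scored[:k]]


# Hypothetical records with tiny hand-written embeddings.
records = [
    {"id": "doc-refunds", "embedding": [0.9, 0.1, 0.0]},
    {"id": "doc-shipping", "embedding": [0.1, 0.9, 0.0]},
    {"id": "doc-returns", "embedding": [0.8, 0.3, 0.1]},
]

query_embedding = [1.0, 0.2, 0.0]
nearest = top_k(query_embedding, records, k=2)
# nearest == ["doc-refunds", "doc-returns"]
```

This brute-force scan compares the query against every record. A vector index exists to avoid exactly that cost on large datasets.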

Data Lineage

The platform tracks lineage relationships between datasets. When one dataset is derived from another (through transformation, filtering, or aggregation), the lineage graph shows the full provenance chain from source to derived dataset.
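A provenance chain of this kind can be modelled as a walk over "derived from" edges. The sketch below assumes lineage is available as a mapping from each dataset to its direct parents. That mapping, the dataset names, and the `provenance_chain` function are hypothetical and do not reflect the platform's lineage API.

```python
from collections import deque

# Hypothetical lineage edges: dataset -> datasets it was directly derived from.
lineage = {
    "weekly-ticket-summary": ["open-tickets-filtered"],  # aggregation
    "open-tickets-filtered": ["customer-tickets"],       # filtering
    "customer-tickets": [],                              # original source
}


def provenance_chain(dataset: str, parents: dict[str, list[str]]) -> list[str]:
    """Return every upstream dataset, nearest ancestors first (breadth-first)."""
    chain: list[str] = []
    queue = deque(parents.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in chain:
            continue  # already visited via another path
        chain.append(current)
        queue.extend(parents.get(current, []))
    return chain


chain = provenance_chain("weekly-ticket-summary", lineage)
# chain == ["open-tickets-filtered", "customer-tickets"]
```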


Datasets and Connectors

Datasets can be backed by one or more connector sources. Each source links a connector instance and operation to the dataset, with optional field mappings and filters.

graph LR
    Dataset["Dataset<br/>customer-tickets"] --> S1["Source 1<br/>Jira (open tickets)"]
    Dataset --> S2["Source 2<br/>Jira (closed tickets)"]
    Dataset --> S3["Source 3<br/>Salesforce (contacts)"]

    S1 --> CI1["Jira Instance"]
    S2 --> CI1
    S3 --> CI2["Salesforce Instance"]

    CI1 --> Jira["Jira Cloud API"]
    CI2 --> SF["Salesforce API"]

When you query a dataset, the platform routes the query to the appropriate connector source, applies field mappings, and returns unified results.
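The query flow described above can be sketched as follows. This is a simplified illustration, not the platform's implementation: `Source`, `apply_mapping`, and `query_dataset` are hypothetical names, each `fetch` callable stands in for a connector instance and operation, and the sketch queries every source rather than selecting one.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Source:
    """A connector-backed source (assumed shape, for illustration only)."""
    name: str
    fetch: Callable[[], list[dict]]  # stands in for connector instance + operation
    field_mapping: dict[str, str] = field(default_factory=dict)  # external -> dataset field


def apply_mapping(row: dict, mapping: dict[str, str]) -> dict:
    """Rename external field names to the dataset's field names."""
    return {mapping.get(key, key): value for key, value in row.items()}


def query_dataset(sources: list[Source],
                  predicate: Callable[[dict], bool] = lambda row: True) -> list[dict]:
    """Fetch from each source, map fields, filter, and merge into one result list."""
    results = []
    for source in sources:
        for row in source.fetch():
            unified = apply_mapping(row, source.field_mapping)
            unified["_source"] = source.name  # record where the row came from
            if predicate(unified):
                results.append(unified)
    return results


# Two hypothetical Jira-backed sources returning canned rows.
jira_mapping = {"key": "ticket_id", "summary": "title"}
sources = [
    Source("jira-open", lambda: [{"key": "T-1", "summary": "Login fails"}], jira_mapping),
    Source("jira-closed", lambda: [{"key": "T-2", "summary": "Typo on homepage"}], jira_mapping),
]

rows = query_dataset(sources, predicate=lambda r: "Login" in r["title"])
```

Applying the mapping before the filter means predicates are written against the dataset's own field names, regardless of what each external system calls them.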


Data Governance

Certification Levels

Datasets can be certified to indicate data quality and trustworthiness:

Level     Meaning
none      Uncertified -- no quality guarantees
bronze    Basic validation passed
silver    Quality checks and schema validation passed
gold      Fully certified with lineage, quality scores, and review
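If the levels are treated as an ordered scale (none < bronze < silver < gold, which the table suggests but does not state), a consumer could enforce a minimum certification before using a dataset. The `Certification` enum and `meets_minimum` helper below are hypothetical and only illustrate that comparison.

```python
from enum import IntEnum


class Certification(IntEnum):
    """Certification levels, ordered by an assumed increasing level of trust."""
    NONE = 0
    BRONZE = 1
    SILVER = 2
    GOLD = 3


def meets_minimum(level: str, minimum: str) -> bool:
    """True if `level` is at least as high as `minimum` (names as in the table)."""
    return Certification[level.upper()] >= Certification[minimum.upper()]


# Example: a pipeline that only accepts silver-or-better datasets.
gold_ok = meets_minimum("gold", "silver")      # True
bronze_ok = meets_minimum("bronze", "silver")  # False
```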

Policy Tags

Apply policy tags to datasets for compliance and access control:

  • pii -- Contains personally identifiable information
  • hipaa -- Subject to HIPAA regulations
  • financial -- Contains financial data
  • internal-only -- Not for external sharing

Data Residency

Specify where data can be stored and processed:

Coming Soon: The Python SDK for local development is not yet publicly available.

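from flow_sdk.cli_client import CLIClient

# `config` is the SDK client configuration; how it is constructed is not shown here.
client = CLIClient(config)

# Create a dataset pinned to an EU region and tagged for compliance.
dataset = client.datasets.create({
    "name": "EU Customer Data",
    "slug": "eu-customer-data",
    "data_residency": "eu-west-1",    # region where data may be stored and processed
    "policy_tags": ["pii", "gdpr"],
})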

Next Steps