Data Management¶
Manifest Platform provides a unified data layer for managing datasets, querying data from connector-backed sources, and performing vector search for RAG (Retrieval-Augmented Generation) and semantic similarity use cases.
Data Capabilities¶
```mermaid
graph TD
    DM["Data Management"] --> DS["Datasets<br/>Structured data catalog"]
    DM --> VS["Vector Search<br/>Embeddings & semantic queries"]
    DM --> LN["Lineage<br/>Data provenance tracking"]
    DS --> Sources["Connector Sources<br/>Live data from external systems"]
    DS --> Schema["Schema Management<br/>Versioned field definitions"]
    DS --> Query["Query API<br/>Filter, sort, paginate"]
    VS --> Embed["Embedding Generation"]
    VS --> Index["Vector Indexes"]
    VS --> Similarity["Similarity Search"]
```
Key Concepts¶
Datasets¶
A dataset is a managed collection of structured data within the platform. Datasets serve multiple purposes:
- Agent evaluation -- Test datasets with input/expected_output pairs for scoring agents in the Playground
- Live data sources -- Backed by connector instances that query external systems on demand
- Reference data -- Static datasets for lookups, enrichment, and agent context
- Pipeline output -- Results from data transformation workflows
Datasets live within a workspace and are governed by the organization's access policies.
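To make the agent-evaluation use case concrete, here is a minimal sketch of scoring an agent against `input`/`expected_output` pairs. It does not use the platform's API: `stub_agent`, `exact_match_score`, and the sample records are hypothetical, and exact-match scoring is only one possible metric.

```python
# Hypothetical evaluation records shaped as input/expected_output pairs.
eval_records = [
    {"input": "2 + 2", "expected_output": "4"},
    {"input": "capital of France", "expected_output": "Paris"},
    {"input": "3 * 3", "expected_output": "9"},
]


def stub_agent(prompt):
    """Stand-in for a real agent; answers from a fixed lookup table."""
    answers = {"2 + 2": "4", "capital of France": "Paris", "3 * 3": "6"}
    return answers.get(prompt, "")


def exact_match_score(agent, records):
    """Fraction of records where the agent output equals expected_output."""
    if not records:
        return 0.0
    hits = sum(
        1 for r in records if agent(r["input"]).strip() == r["expected_output"]
    )
    return hits / len(records)


score = exact_match_score(stub_agent, eval_records)
```

The stub deliberately gets one answer wrong, so `score` is two thirds rather than a perfect 1.0.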
Vector Search¶
Vector search enables semantic queries over embeddings stored alongside dataset records. Instead of matching exact keywords, vector search finds records that are semantically similar to a query -- powering RAG pipelines, document retrieval, and recommendation systems.
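The idea behind similarity search can be sketched in a few lines of plain Python. This is an illustration only, not the platform's implementation: the records, their three-dimensional embeddings, and the `top_k` helper are hypothetical, and cosine similarity is assumed as the distance measure.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def top_k(query_vec, records, k=2):
    """Return the ids of the k records most similar to query_vec."""
    scored = [(cosine_similarity(query_vec, r["embedding"]), r["id"]) for r in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [record_id for _, record_id in scored[:k]]


# Toy embeddings; real embedding models produce far more dimensions.
records = [
    {"id": "reset-password", "embedding": [0.9, 0.1, 0.0]},
    {"id": "billing-refund", "embedding": [0.0, 0.2, 0.9]},
    {"id": "login-failure", "embedding": [0.8, 0.3, 0.1]},
]
query = [1.0, 0.2, 0.0]
nearest = top_k(query, records, k=2)
```

Here the two account-access records rank above the billing record because their vectors point in nearly the same direction as the query, even though no keywords are compared.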
Data Lineage¶
The platform tracks lineage relationships between datasets. When one dataset is derived from another (through transformation, filtering, or aggregation), the lineage graph shows the full provenance chain from source to derived dataset.
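A provenance chain can be pictured as a walk over a graph of "derived from" edges. The sketch below assumes a simple in-memory mapping; the dataset names and the `provenance` helper are hypothetical and do not reflect how the platform stores lineage.

```python
# Hypothetical lineage edges: each dataset maps to the datasets it derives from.
lineage = {
    "tickets-weekly-summary": ["tickets-open-only"],  # aggregation
    "tickets-open-only": ["customer-tickets"],        # filtering
    "customer-tickets": [],                           # source dataset
}


def provenance(dataset, edges):
    """Return every upstream dataset of `dataset`, nearest ancestors first."""
    seen = []
    queue = list(edges.get(dataset, []))
    while queue:
        parent = queue.pop(0)
        if parent in seen:
            continue  # already visited via another path
        seen.append(parent)
        queue.extend(edges.get(parent, []))
    return seen


chain = provenance("tickets-weekly-summary", lineage)
```

The breadth-first walk lists the immediate parent before the original source, which matches reading the chain from derived dataset back to source.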
Datasets and Connectors¶
Datasets can be backed by one or more connector sources. Each source links a connector instance and operation to the dataset, with optional field mappings and filters.
```mermaid
graph LR
    Dataset["Dataset<br/>customer-tickets"] --> S1["Source 1<br/>Jira (open tickets)"]
    Dataset --> S2["Source 2<br/>Jira (closed tickets)"]
    Dataset --> S3["Source 3<br/>Salesforce (contacts)"]
    S1 --> CI1["Jira Instance"]
    S2 --> CI1
    S3 --> CI2["Salesforce Instance"]
    CI1 --> Jira["Jira Cloud API"]
    CI2 --> SF["Salesforce API"]
```
When you query a dataset, the platform routes the query to the relevant connector source or sources, applies each source's field mappings, and returns unified results.
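The routing and mapping step can be sketched as follows. Everything here is an assumption for illustration: the source definitions, the `fetch` stubs standing in for connector calls, the `_source` field, and the rule that unmapped fields are dropped are not taken from the platform.

```python
def apply_mapping(row, mapping):
    """Rename source fields to dataset fields; unmapped fields are dropped."""
    return {target: row[source] for source, target in mapping.items() if source in row}


def query_dataset(sources, predicate=lambda row: True):
    """Fetch from every source, map fields, and return one combined list."""
    results = []
    for source in sources:
        for row in source["fetch"]():
            mapped = apply_mapping(row, source["mapping"])
            mapped["_source"] = source["name"]  # record where the row came from
            if predicate(mapped):
                results.append(mapped)
    return results


# Stubbed connector sources; a real source would call an external API.
sources = [
    {
        "name": "jira-open",
        "mapping": {"key": "ticket_id", "summary": "title"},
        "fetch": lambda: [{"key": "SUP-1", "summary": "Login fails", "assignee": "ana"}],
    },
    {
        "name": "jira-closed",
        "mapping": {"key": "ticket_id", "summary": "title"},
        "fetch": lambda: [{"key": "SUP-0", "summary": "Reset email missing"}],
    },
]

rows = query_dataset(sources)
open_only = query_dataset(sources, lambda row: row["_source"] == "jira-open")
```

Both sources end up with the same field names (`ticket_id`, `title`), which is what lets the caller treat the combined rows as a single dataset.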
Data Governance¶
Certification Levels¶
Datasets can be certified to indicate data quality and trustworthiness:
| Level | Meaning |
|---|---|
| `none` | Uncertified -- no quality guarantees |
| `bronze` | Basic validation passed |
| `silver` | Quality checks and schema validation passed |
| `gold` | Fully certified with lineage, quality scores, and review |
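One way to use these levels in code is to treat them as an ordered scale and require a minimum. The sketch below assumes the ordering none < bronze < silver < gold implied by the table; the `Certification` enum and `meets` helper are hypothetical, not part of the platform.

```python
from enum import IntEnum


class Certification(IntEnum):
    """Certification levels, assumed to be ordered from weakest to strongest."""
    NONE = 0
    BRONZE = 1
    SILVER = 2
    GOLD = 3


def meets(level, minimum):
    """True if `level` is at least as strong as `minimum` (names are case-insensitive)."""
    return Certification[level.upper()] >= Certification[minimum.upper()]


# Example: only accept datasets certified silver or better.
candidates = {"raw-events": "none", "crm-contacts": "silver", "finance-ledger": "gold"}
trusted = sorted(name for name, level in candidates.items() if meets(level, "silver"))
```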
Policy Tags¶
Apply policy tags to datasets for compliance and access control:
- pii -- Contains personally identifiable information
- hipaa -- Subject to HIPAA regulations
- financial -- Contains financial data
- internal-only -- Not for external sharing
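As a rough illustration of how tags could drive an access decision, the sketch below blocks external sharing for anything tagged `internal-only`. The helper names and the sample datasets are hypothetical, and the platform's actual enforcement may work differently.

```python
# Tags that forbid external sharing, per the tag descriptions above.
BLOCKS_EXTERNAL_SHARING = {"internal-only"}


def can_share_externally(policy_tags):
    """True if none of the dataset's tags forbid external sharing."""
    return not (set(policy_tags) & BLOCKS_EXTERNAL_SHARING)


def datasets_with_tag(datasets, tag):
    """Return the names of datasets carrying the given policy tag, sorted."""
    return sorted(name for name, tags in datasets.items() if tag in tags)


# Hypothetical datasets and their policy tags.
datasets = {
    "patient-intake": ["pii", "hipaa"],
    "quarterly-revenue": ["financial", "internal-only"],
    "public-faq": [],
}

shareable = sorted(name for name, tags in datasets.items() if can_share_externally(tags))
pii_datasets = datasets_with_tag(datasets, "pii")
```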
Data Residency¶
Specify where data can be stored and processed:
Coming Soon
The Python SDK for local development is not yet publicly available.
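```python
from flow_sdk.cli_client import CLIClient

# `config` is not defined in this snippet; it is assumed to be an SDK
# configuration object created elsewhere.
client = CLIClient(config)

# Create a dataset pinned to an EU region and tagged for compliance.
dataset = client.datasets.create({
    "name": "EU Customer Data",
    "slug": "eu-customer-data",
    "data_residency": "eu-west-1",
    "policy_tags": ["pii", "gdpr"],
})
```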
Next Steps¶
- Datasets -- Create, manage, and query structured datasets.
- Vector Search -- Build semantic search with embeddings and vector indexes.