Prototype: automation logic validated, refinement in progress
Build an automated pipeline that ingests website content, extracts relevant text, generates semantic embeddings, and stores structured knowledge for use in AI search and FAQ systems.
Companies publish large volumes of product documentation and support content. Transforming this text into an AI-ready knowledge base (for RAG, semantic search, and chat assistants) is often done manually, which is slow and inefficient.
The goal was to automate this process end-to-end.
I designed an n8n workflow that:
- Fetches new content from a chosen source (Stripe documentation example)
- Extracts readable text from HTML
- Cleans and splits content into semantic chunks
- Generates embeddings using a language model
- Stores text and vector data for later retrieval
This creates a scalable internal knowledge pipeline suitable for AI-powered search and automated FAQ responses.
1. Content Retrieval
Nodes: Schedule Trigger / Manual Trigger + HTTP Request
The workflow calls a public documentation URL to retrieve the raw HTML of a page.
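Outside n8n, the retrieval step can be approximated with a few lines of standard-library Python. This is a minimal sketch, not the workflow's actual node configuration: the helper name `fetch_html`, the timeout, and the User-Agent string are assumptions made for illustration.

```python
import urllib.request


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Return the decoded body of `url` (hypothetical helper, not an n8n API)."""
    request = urllib.request.Request(
        url,
        # Assumed User-Agent; some documentation sites reject requests without one.
        headers={"User-Agent": "kb-pipeline-sketch/0.1"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        # Fall back to UTF-8 when the server does not declare a charset.
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Calling `fetch_html` with any public documentation URL returns the raw HTML string that the parsing step consumes.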
2. HTML Parsing
Node: HTML Extract
CSS selectors extract the relevant parts of the page:
- h1 as the article title
- h2, h3, and p elements as content sections
Selecting only these elements drops navigation, banners, code formatting, and other irrelevant elements.
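The selector logic can be illustrated in plain Python with the standard `html.parser` module. This is a rough sketch of the idea rather than what the HTML Extract node does internally: the class name `SectionExtractor` is hypothetical, and unlike a scoped CSS selector it would also pick up a p element placed inside a navigation block.

```python
from html.parser import HTMLParser


class SectionExtractor(HTMLParser):
    """Collect text from h1, h2, h3 and p elements (illustrative only)."""

    WANTED = {"h1", "h2", "h3", "p"}

    def __init__(self):
        super().__init__()
        self.title = None      # text of the first h1
        self.sections = []     # (tag, text) pairs in document order
        self._current = None   # wanted tag currently open, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in self.WANTED and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        # Text inside nested inline tags (a, b, code) is kept as well.
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if text:
            if tag == "h1" and self.title is None:
                self.title = text
            else:
                self.sections.append((tag, text))
        self._current = None


def extract_sections(html: str):
    parser = SectionExtractor()
    parser.feed(html)
    parser.close()
    return parser.title, parser.sections
```

Text outside the four wanted tags, such as a link inside a nav element, never enters the buffer, which is how the unwanted page furniture is filtered out.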
3. Text Chunking
Node: Text Splitter
Content is split into overlapping text segments so that context spanning a chunk boundary is preserved for semantic search.
- Chunk size: ~1,000 characters
- Overlap: ~150 characters
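To make the size and overlap parameters concrete, here is a simplified character-window splitter. It is only a sketch of the idea: the function name `chunk_text` is hypothetical, and the n8n Text Splitter node may additionally prefer splitting on separators such as paragraph breaks, which this version does not attempt.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 150) -> list:
    """Split `text` into windows of `size` characters sharing `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap  # each new chunk starts this far after the previous one
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # the last window already reached the end of the text
        start += step
    return chunks
```

With the defaults, each chunk starts 850 characters after the previous one, so the last 150 characters of one chunk are repeated at the start of the next.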
4. Embedding Generation
Node: OpenAI embeddings
Each chunk is converted into a numerical vector using an embedding model (text-embedding-3-small).
This enables semantic retrieval instead of keyword matching.
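The difference from keyword matching shows up at query time: the query is embedded with the same model and chunks are ranked by vector similarity. The sketch below uses cosine similarity, a common choice but an assumption here, since the write-up does not say which metric the vector store uses. The three-dimensional vectors are made-up toy values standing in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec, indexed, k=3):
    """Return the k (score, text) pairs most similar to the query vector."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in indexed]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Toy index: (chunk text, pretend embedding). Real embeddings are far longer.
toy_index = [
    ("How to refund a payment", [0.9, 0.1, 0.0]),
    ("Setting up webhooks", [0.0, 0.2, 0.9]),
]
# Pretend embedding of "give money back to a customer": no shared keywords,
# but it points in nearly the same direction as the refund chunk.
toy_query = [0.8, 0.2, 0.1]
best_match = top_k(toy_query, toy_index, k=1)[0][1]
```

In this toy setup the refund chunk ranks first even though the query shares no words with it, which is the behavior keyword search cannot provide.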
5. Data Storage
Nodes: Airtable + Vector Store
For each chunk, the process stores the text in Airtable and the embedding in the vector store.
This keeps the data both human-readable and machine-retrievable.
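One way to picture the split between the two stores is a single record type that produces both a table row and a vector entry. Everything below is a hypothetical layout: the field names, the hashed chunk ID, and the entry shape are assumptions for illustration, not the schema used in the workflow.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class ChunkRecord:
    source_url: str
    title: str
    chunk_index: int
    text: str
    embedding: list

    @property
    def chunk_id(self) -> str:
        # Deterministic ID so the table row and vector entry can be joined later.
        key = f"{self.source_url}#{self.chunk_index}".encode("utf-8")
        return hashlib.sha256(key).hexdigest()[:16]

    def to_table_row(self) -> dict:
        """Human-readable part, shaped as a fields mapping (assumed column names)."""
        return {
            "fields": {
                "Chunk ID": self.chunk_id,
                "Title": self.title,
                "Source URL": self.source_url,
                "Chunk Index": self.chunk_index,
                "Text": self.text,
            }
        }

    def to_vector_entry(self) -> dict:
        """Machine-retrievable part: the vector plus enough metadata to trace it."""
        return {
            "id": self.chunk_id,
            "vector": self.embedding,
            "metadata": {"title": self.title, "source_url": self.source_url},
        }
```

Sharing one ID across both stores means a vector search hit can be resolved back to its readable text without duplicating the text in the index.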
The workflow automatically converts live documentation into a structured, searchable knowledge base that can power:
- AI FAQ systems
- Internal support bots
- Developer assistants
- Context-aware search
The system scales with new content and requires no manual intervention once deployed.
n8n: workflow orchestration
OpenAI: embedding generation
Airtable: structured text storage
Vector storage engine: embedding indexing
CSS selectors + HTML parsing techniques
RAG pipeline design
Automation architecture
Working with vector search logic
Data extraction and transformation
API integration and model configuration
Practical application of LLM infrastructure in no-code environments
Refine question extraction accuracy
Improve automated formatting of chatbot responses
Integrate confidence scoring for extracted FAQs
Test the pipeline on a larger knowledge base dataset
This prototype validated the feasibility of automating FAQ extraction from knowledge bases. It highlighted key challenges around semantic question detection and structured output generation, informing the next iteration.