AI – Multi-Modal WhatsApp Conversational Agent (Voice, Vision & PDF)

Key Features

  • Voice-to-Voice Interaction: The agent transcribes incoming voice notes and can reply with an AI-generated voice message, so users can interact hands-free.

  • Visual Intelligence: Powered by GPT-4o-mini, the bot can “see” and describe images, identify objects, and answer specific questions about photos sent by the user.

  • Instant Document Processing: Automatically extracts and summarizes text from PDF documents, allowing for quick information retrieval without human intervention.

  • Short-Term Memory: Retains the last 10 interactions of each conversation, so the AI keeps context and doesn’t ask repetitive questions.
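
The short-term memory feature can be pictured as a fixed-size buffer per chat. The snippet below is a minimal sketch, not the agent's actual implementation: the class name ConversationMemory, its methods, and the choice to count each message as one interaction are assumptions made for illustration.

```python
from collections import defaultdict, deque

MAX_INTERACTIONS = 10  # window size stated in the feature list


class ConversationMemory:
    """Hypothetical helper: keeps only the most recent messages per chat."""

    def __init__(self, max_interactions=MAX_INTERACTIONS):
        # One bounded deque per chat; once full, appending drops the oldest entry.
        self._history = defaultdict(lambda: deque(maxlen=max_interactions))

    def add(self, chat_id, role, text):
        """Record one message (assumed here to count as one interaction)."""
        self._history[chat_id].append({"role": role, "content": text})

    def context(self, chat_id):
        """Return the remembered messages for a chat, oldest first."""
        return list(self._history[chat_id])


if __name__ == "__main__":
    memory = ConversationMemory()
    for i in range(1, 13):
        memory.add("chat-1", "user", f"message {i}")
    # Only the 10 most recent messages remain for this chat.
    print(len(memory.context("chat-1")))
```

A bounded deque keeps memory use constant per chat and needs no manual trimming. If an interaction is meant to be a full question-and-answer pair, the window size would have to be doubled or the stored unit changed.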

Target Audience

  • Medical Clinics & Laboratories: To handle voice-recorded symptoms from patients, read digital prescriptions (PDF), or analyze reports.

  • Automobile & Industrial Repair Shops: For technicians who need to send photos of damaged parts for instant identification or troubleshooting.

  • HR & Recruitment Agencies: To automate the screening of resumes (PDFs) and handle initial voice-based candidate inquiries.

  • E-commerce Retailers: To let customers send photos of products they are looking for or place orders by voice note.

Information Required

  • WhatsApp Business API Credentials: Access to the client’s Meta Developer App, or a phone number to register with the WhatsApp Cloud API.

  • OpenAI API Key: To power the vision, voice, and text intelligence (or we can provide it as part of the monthly fee).

  • Business Instruction Document: A clear description of the bot’s role, FAQ list, and preferred tone of voice (Formal vs. Friendly).

  • Escalation Contact: A specific phone number or department name to mention if the AI cannot resolve a query.

  • PDF Knowledge Base: Any specific manuals, price lists, or brochures the client wants the AI to “read” and remember.
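
The checklist above could be verified before setup with a small pre-flight check. This is only a sketch under assumptions: the setting names other than OPENAI_API_KEY are hypothetical, the function missing_settings is invented for illustration, and the PDF knowledge base is left out because it is optional.

```python
import os

# Hypothetical setting names; the listing only says which kinds of information are needed.
REQUIRED_SETTINGS = {
    "WHATSAPP_ACCESS_TOKEN": "WhatsApp Business API credentials",
    "WHATSAPP_PHONE_NUMBER_ID": "phone number for the WhatsApp Cloud API",
    "OPENAI_API_KEY": "OpenAI API key",
    "BUSINESS_INSTRUCTIONS_PATH": "business instruction document",
    "ESCALATION_CONTACT": "escalation phone number or department",
}


def missing_settings(env=None):
    """Return a description of every required setting that is absent or blank."""
    if env is None:
        env = os.environ
    return [
        f"{name} ({label})"
        for name, label in REQUIRED_SETTINGS.items()
        if not env.get(name, "").strip()
    ]


if __name__ == "__main__":
    problems = missing_settings()
    if problems:
        print("Still needed from the client:")
        for item in problems:
            print("  -", item)
    else:
        print("All required information has been provided.")
```

If the OpenAI key is supplied as part of the monthly fee, that entry would be filled in by the provider rather than the client.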