MiniCPM-o 4.5 Streaming Duplex Audio-Video Multimodal AI: What if we integrated real-time AI workflows for SMBs?
Recently, I came across MiniCPM-o 4.5, a streaming duplex multimodal AI system that processes video and audio in real time and generates speech and text responses, primarily in English and Chinese. It’s built from several components, including SigLIP2, Whisper-medium, and Qwen3-8B, for a total of 9 billion parameters. While the tech sounds impressive, the demo shows it’s still rough around the edges.
From a Norwegian SMB perspective, especially for companies with 10 to 50 employees, this kind of AI could be a game changer if adapted correctly. Businesses here juggle multiple systems like Tripletex for accounting, Vipps for payments, and Altinn for reporting, all while needing to keep admin costs down. Paying 400–600 NOK/hour for admin and over 1000 NOK/hour for specialist tasks quickly adds up. Imagine if some of those repetitive communication tasks could be automated by a real-time AI that understands audio and video streams.
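To make the cost argument concrete, here is a rough back-of-the-envelope sketch in Python. The hourly rate range is the admin figure quoted above; the number of hours automated per month is a purely hypothetical input, not a measured result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RateRange:
    """Hourly rate range in NOK."""
    low: int
    high: int


# Admin rate range quoted in the article (NOK per hour).
ADMIN_RATE = RateRange(low=400, high=600)


def monthly_savings(hours_automated: float, rate: RateRange) -> tuple[float, float]:
    """Return the (low, high) estimate of NOK saved per month."""
    if hours_automated < 0:
        raise ValueError("hours_automated must be non-negative")
    return hours_automated * rate.low, hours_automated * rate.high


if __name__ == "__main__":
    # Hypothetical assumption: 20 admin hours per month get automated.
    low, high = monthly_savings(20, ADMIN_RATE)
    print(f"Estimated savings: {low:.0f} to {high:.0f} NOK per month")
```

The estimate ignores the cost of building and running the automation, so treat it as an upper bound on the benefit rather than a business case.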
What if you could have a system that listens to video meetings or customer calls, understands the content, and generates summaries or action points instantly? Or a solution that could handle voice cloning to simulate different roles in customer support without needing multiple employees on calls? This would free up time and reduce costs, especially when integrated with existing Norwegian tools.
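One way to make summaries and action points usable by other systems is to ask the language model for JSON and validate the reply before anything downstream acts on it. The sketch below assumes a hypothetical reply shape (`summary` plus `action_points` with `owner` and `task`); the names are mine for illustration, and no specific model is called here.

```python
import json
from dataclasses import dataclass


@dataclass
class ActionPoint:
    owner: str
    task: str


@dataclass
class MeetingNotes:
    summary: str
    action_points: list[ActionPoint]


def build_prompt(transcript: str) -> str:
    """Build an instruction asking the model for a fixed JSON shape (assumed format)."""
    return (
        "Summarize the meeting transcript below. Reply with JSON only, using the keys "
        '"summary" (string) and "action_points" (list of objects with "owner" and "task").\n\n'
        f"Transcript:\n{transcript}"
    )


def parse_notes(model_reply: str) -> MeetingNotes:
    """Parse and validate the model's JSON reply; raise ValueError on bad input."""
    try:
        data = json.loads(model_reply)
    except json.JSONDecodeError as exc:
        raise ValueError(f"Model reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict) or not isinstance(data.get("summary"), str):
        raise ValueError("Missing or invalid 'summary'")
    raw_points = data.get("action_points", [])
    if not isinstance(raw_points, list):
        raise ValueError("'action_points' must be a list")
    points = []
    for item in raw_points:
        if not isinstance(item, dict) or "task" not in item:
            raise ValueError("Each action point needs a 'task'")
        # Fall back to "unassigned" when the model names no owner.
        owner = str(item.get("owner", "unassigned"))
        points.append(ActionPoint(owner=owner, task=str(item["task"])))
    return MeetingNotes(summary=data["summary"], action_points=points)
```

Validating the reply first means a malformed model answer fails loudly instead of silently creating a wrong task or document.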
Here’s how this can be done with the tools I work with. Combining hosted APIs like Azure OpenAI or Claude for language processing with locally run AI models can help with GDPR compliance, because the most sensitive data can stay on infrastructure you control. I would build a prototype that captures audio or video streams, sends them through Whisper-medium for transcription, and then uses Qwen3-8B to generate context-aware responses or summaries. Integration with the Tripletex or Fiken APIs could automate invoice generation or document handling based on the conversation. n8n workflows would orchestrate these tasks, pushing notifications to Telegram bots or updating databases in Supabase. This approach keeps development lean and focuses on practical automations that SMBs can test within weeks.
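The flow described above (transcribe, summarize, hand off to an automation tool) can be sketched as a small pipeline with pluggable steps. Everything here is an assumption for illustration: the `transcribe` and `summarize` callables stand in for Whisper and a language model, and the webhook URL is a placeholder for an n8n Webhook node. No real Tripletex, Fiken, Telegram, or Supabase API is called.

```python
import json
import urllib.request
from typing import Callable

# Hypothetical step signatures: audio bytes -> transcript, transcript -> summary.
Transcriber = Callable[[bytes], str]
Summarizer = Callable[[str], str]


def process_recording(audio: bytes, transcribe: Transcriber, summarize: Summarizer) -> dict:
    """Run the two model steps and collect the results in one payload."""
    transcript = transcribe(audio)
    summary = summarize(transcript)
    return {"transcript": transcript, "summary": summary}


def build_webhook_request(url: str, payload: dict) -> urllib.request.Request:
    """Prepare a JSON POST for a workflow webhook without sending it."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Stand-ins so the sketch runs offline; swap in real model calls in a prototype.
def fake_transcribe(audio: bytes) -> str:
    return f"(transcript of {len(audio)} bytes of audio)"


def fake_summarize(transcript: str) -> str:
    return transcript[:40]


if __name__ == "__main__":
    result = process_recording(b"\x00" * 16000, fake_transcribe, fake_summarize)
    # Placeholder URL, not a real endpoint.
    request = build_webhook_request("https://n8n.example.invalid/webhook/meeting-notes", result)
    # urllib.request.urlopen(request) would send it; omitted to keep the sketch offline.
    print(request.get_method(), request.full_url)
```

Keeping the model steps behind plain callables makes it easy to start with a hosted API and later move a step to a local model without touching the rest of the workflow.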
This setup fits well for SMBs looking to cut down on repetitive admin and customer interaction costs while maintaining compliance with Norwegian regulations. However, it’s not for enterprises needing high concurrency or custom ML model training. Also, because MiniCPM-o 4.5 responds primarily in English and Chinese, companies that rely heavily on other languages, Norwegian included, might find its current language support restrictive.
Have you considered how real-time AI-powered audio-video processing could reshape your daily workflows and reduce your overhead?