Analyze a Scene with an ESP32-CAM


Using the Microsoft Computer Vision API with an ESP32-CAM to describe a scene in text and audio.

This project uses the ‘describe’ API of Microsoft Azure Cognitive Services to describe an entire scene rather than individual objects. The response also lists the objects found in the scene.

The project connects an ESP32-CAM to an LCD display, and optionally to a DAC amplifier and speaker, so that the text description of a scene returned from Microsoft’s server can be displayed and read out.
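A returned description is usually a sentence longer than one LCD row, so it has to be split into rows before it is displayed. The following is a minimal sketch of greedy word wrapping in plain C++, not the code from the project sketch. The function name wrapForLcd and the 16-column width used in the note below are assumptions for illustration; use the column count of your own LCD.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: split a description into rows of at most `width`
// characters, breaking on spaces where possible.
std::vector<std::string> wrapForLcd(const std::string& text, std::size_t width) {
    std::vector<std::string> lines;
    if (width == 0) return lines;  // nothing can be displayed

    std::istringstream words(text);
    std::string word;
    std::string line;
    while (words >> word) {
        // A word wider than the display is hard-split across rows.
        while (word.size() > width) {
            if (!line.empty()) {
                lines.push_back(line);
                line.clear();
            }
            lines.push_back(word.substr(0, width));
            word.erase(0, width);
        }
        if (line.empty()) {
            line = word;
        } else if (line.size() + 1 + word.size() <= width) {
            line += ' ';
            line += word;
        } else {
            // The word does not fit: finish this row and start a new one.
            lines.push_back(line);
            line = word;
        }
    }
    if (!line.empty()) lines.push_back(line);
    return lines;
}
```

Called with a width of 16, the text "a man sitting at a table with a laptop" comes back as three rows: "a man sitting at", "a table with a" and "laptop". On a display with fewer rows than that, the rows would need to be shown a page at a time.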

Scene Analyzer Assembled Front
Scene Analyzer Assembled Rear

Video Demonstration


You need the following items:

LCD Screen with 3.3V backpack
Capacitive Touch Button
Max98357 DAC amplifier (optional)
Small speaker (optional)

Project Components

A 5V power supply for the ESP32-CAM is needed. I used a USB power bank as shown above.

The 3D printed parts and assembly look like this:

3D Prints Front
3D Prints Back
3D Prints Assembled

Fully assembled, the project looks like this:

Assembled Project Front
Assembled Project Rear

Microsoft Azure Cognitive Services

First, sign up for 12 free months of Azure Cognitive Services. You need a Microsoft account (Hotmail, Outlook, Live, etc.) and a credit card, but the card won’t be charged.

A possible alternative is signing up for the free Basic account with RapidAPI (Microsoft Computer Vision API, listed as microsoft-azure-org-microsoft-cognitive-services), where no card is needed. I’ll try to test this option and update the tutorial to work with it.

Once the Azure account is set up, create the instance on the server. Fill in details similar to the screenshot below:

Resource Sign Up Screen

When completed, click ‘Go to resource’. In the resource, click Overview at the top of the left menu, then click ‘Click here to manage keys’ in the panel that opens on the right. The key and endpoint shown there are the details of your server that the sketch needs.

Arduino IDE Setup

In the Arduino IDE Library Manager (Sketch > Include Library > Manage Libraries… ), install the following libraries:
ArduinoJson (v6 or greater) by Benoit Blanchon
Extensible hd44780 LCD Library by Bill Perry

Optionally, for the audio output version:
ESP8266Audio by Earle F. Philhower
ESP8266SAM library, installed either by unzipping the download into the IDE library folder or by importing the zip file with ‘Sketch > Include Library > Add ZIP Library…’

Copy or download the sketch from GitHub (Plus members can download the version with audio at the end of the tutorial). You need to make the following changes to the code:

  • Change the Wi-Fi ssid and password to match your router
  • Change host to the hostname of your server at Microsoft
  • Change the POST URL (around line 152) to the correct URL for your server
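To show how the host and the POST URL fit together, here is a minimal sketch of the HTTP header block for a describe request, written in plain C++ so it can be checked on a desktop. It is not the code from the project sketch. The function name buildDescribeRequest, the API version v3.2 in the path, and the example hostname in the note below are assumptions; use the endpoint and key shown on your own resource’s keys page.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: build the HTTP/1.1 header block for posting a raw
// JPEG to the Computer Vision 'describe' endpoint.
// Assumption: API version v3.2; your resource may use a different version.
std::string buildDescribeRequest(const std::string& host,
                                 const std::string& apiKey,
                                 std::size_t jpegLength) {
    std::string req;
    // Request line: the path and query part of the POST URL.
    req += "POST /vision/v3.2/describe?maxCandidates=1 HTTP/1.1\r\n";
    // Host: the hostname of your resource endpoint, without "https://".
    req += "Host: " + host + "\r\n";
    // The subscription key authenticates the request.
    req += "Ocp-Apim-Subscription-Key: " + apiKey + "\r\n";
    // The image is sent as binary data in the request body.
    req += "Content-Type: application/octet-stream\r\n";
    req += "Content-Length: " + std::to_string(jpegLength) + "\r\n";
    // A blank line ends the headers; the JPEG bytes follow it.
    req += "Connection: close\r\n\r\n";
    return req;
}
```

For example, buildDescribeRequest("your-resource-name.cognitiveservices.azure.com", "YOUR_KEY", 12345) returns the headers for a 12345-byte image. On the ESP32 this string would typically be written to a TLS client connected to the host on port 443, followed by the bytes of the captured camera frame.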

Upload the sketch to the ESP32-CAM.

Wiring Diagram

Connect the pins as below for the LCD-only version. The wiring diagram for the audio version is in the Plus members section below.

ESP32-CAM LCD Scene Analyser Wiring


A possible improvement: investigate using a cloud service to convert the text to speech as a sound file, and play that file to improve the sound quality.

Plus Members Files

For full access to this tutorial including the code, 3D STL files and the wiring diagram, please click here to subscribe to be a Plus member. More information about Plus membership can be found here.


More Voices for SAM:
More info about SAM Text to Speech:
