Using the Microsoft Computer Vision API with an ESP32-CAM to describe a scene with audio.
This project uses the ‘describe’ API of the Microsoft Cognitive AI system to describe an entire scene rather than individual objects. Objects in the scene are also provided as part of the response.
The project connects an ESP32-CAM to an LCD display and optionally a DAC amplifier and speaker to read out and display a text description of a scene returned from Microsoft’s server.
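To give an idea of what comes back from the ‘describe’ call, here is a rough, dependency-free sketch that pulls the first caption out of a response. The sample JSON is hand-written in the shape Microsoft documents (a description object holding tags and captions), not real output, and the helper names stringValueAfter and firstCaption are my own. On the ESP32-CAM you would normally let ArduinoJson do this; the naive string search below is only for illustration and does not handle escaped quotes.

```cpp
#include <string>

// Hand-written sample in the documented response shape (not real output).
const std::string kSampleResponse =
    R"({"description":{"tags":["person","outdoor","bench"],"captions":[{"text":"a man sitting on a bench","confidence":0.48}]},"requestId":"example"})";

// Returns the string value that follows "key": at or after position `from`,
// or an empty string if it is not found. Illustrative only: escaped quotes
// inside the value are not handled.
std::string stringValueAfter(const std::string& json, const std::string& key,
                             std::size_t from = 0) {
    const std::string quotedKey = "\"" + key + "\"";
    std::size_t pos = json.find(quotedKey, from);
    if (pos == std::string::npos) return "";
    pos = json.find(':', pos + quotedKey.size());
    if (pos == std::string::npos) return "";
    const std::size_t open = json.find('"', pos + 1);
    if (open == std::string::npos) return "";
    const std::size_t close = json.find('"', open + 1);
    if (close == std::string::npos) return "";
    return json.substr(open + 1, close - open - 1);
}

// Returns the text of the first caption, i.e. the whole-scene description.
std::string firstCaption(const std::string& json) {
    const std::size_t captions = json.find("\"captions\"");
    if (captions == std::string::npos) return "";
    return stringValueAfter(json, "text", captions);
}
```

The caption is the sentence that ends up on the LCD and gets read out, while the tags array is where the individual objects in the scene are listed.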
You need the following items:
LCD Screen with 3.3V backpack: https://es.aliexpress.com/item/32774955921.html
Capacitive Touch Button: https://es.aliexpress.com/item/32964219843.html
Max98357 DAC amplifier (optional): https://es.aliexpress.com/item/33043664469.html
Small speaker (optional): https://s.click.aliexpress.com/e/_AenhyU
A 5V power supply for the ESP32-CAM is also needed. I used a USB power bank as shown above.
The 3D printed parts and assembly look like this:
Fully assembled, the project looks like this:
Microsoft Azure Cognitive Services
First, sign up for 12 free months of Azure Cognitive Services here – https://azure.microsoft.com/en-gb/free/cognitive-services/. You need a Microsoft account (Hotmail, Outlook, Live, etc.) and a credit card, but it won’t be charged.
A possible alternative is signing up for the free Basic account for the Microsoft Computer Vision API on RapidAPI, where no card is needed. I’ll try to test and update the tutorial to work with this option.
When you have the Azure account set up click here: https://portal.azure.com/#create/Microsoft.CognitiveServicesComputerVision to set up the instance on the server. Fill in the details similar to the screenshot below:
When completed, click ‘Go to resource’. In the resource, click Overview at the top of the left menu and then in the panel on the right that opens, ‘Click here to manage keys’.
Arduino IDE Setup
In the Arduino IDE Library Manager (Sketch > Include Library > Manage Libraries… ), install the following libraries:
ArduinoJson (v6 or greater) by Benoit Blanchon
Extensible hd44780 LCD Library by Bill Perry
Optionally for the audio output version:
ESP8266Audio by Earle F. Philhower
ESP8266SAM library from https://github.com/earlephilhower/ESP8266SAM, installed either by unzipping the download into the IDE library folder or by importing the zip file with ‘Sketch > Include Library > Add ZIP Library…’
Copy or download the Sketch from Github here: https://github.com/robotzero1/esp32cam-cognitive-scene. You need to make the following changes to the code:
- Change the Wi-Fi ssid and password to those of your router
- Change host to the hostname of your server at Microsoft
- Change the POST URL (around line 152) to the correct URL for your server
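As a rough guide to what those values look like, here is a sketch of the settings and the HTTP request headers for a ‘describe’ call. Everything here is a placeholder or an assumption: the resource name, the key, the v3.1 API version in the path and the buildDescribeRequestHeaders helper are mine, and the downloaded sketch may name and arrange things differently. The Ocp-Apim-Subscription-Key header and the application/octet-stream content type are how the Computer Vision REST API accepts a key and a raw JPEG.

```cpp
#include <string>

// Placeholder settings: replace with your own values. None of these are real.
const std::string kHost = "your-resource-name.cognitiveservices.azure.com";  // hypothetical resource name
const std::string kSubscriptionKey = "YOUR_KEY_1";                           // from the 'manage keys' page
const std::string kDescribePath = "/vision/v3.1/describe?maxCandidates=1";   // API version is an assumption

// Builds the header block for a POST that sends a JPEG of jpegLength bytes.
// The JPEG bytes themselves are written to the connection after this string.
std::string buildDescribeRequestHeaders(const std::string& host,
                                        const std::string& path,
                                        const std::string& key,
                                        std::size_t jpegLength) {
    std::string req;
    req += "POST " + path + " HTTP/1.1\r\n";
    req += "Host: " + host + "\r\n";
    req += "Ocp-Apim-Subscription-Key: " + key + "\r\n";
    req += "Content-Type: application/octet-stream\r\n";
    req += "Content-Length: " + std::to_string(jpegLength) + "\r\n";
    req += "Connection: close\r\n";
    req += "\r\n";  // blank line ends the headers
    return req;
}
```

If your resource shows a regional endpoint instead of a named one, use that hostname for the host value and keep the path the same.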
Upload the sketch to the ESP32-CAM.
Connect the pins as below for the LCD-only version. The wiring diagram for the audio version is in the section below.
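A scene description is usually longer than one row of a character LCD, so the text has to be split across rows. The following is a minimal word-wrap sketch, assuming a 16-column display (check your own LCD) and using a helper name, wrapForLcd, that I made up for this example. It is plain C++ and not taken from the project sketch.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Splits text into rows of at most `columns` characters, breaking on spaces.
// Words longer than a full row are hard-split. Returns no rows if columns is 0.
std::vector<std::string> wrapForLcd(const std::string& text, std::size_t columns) {
    std::vector<std::string> rows;
    if (columns == 0) return rows;

    std::istringstream words(text);
    std::string word;
    std::string row;
    while (words >> word) {
        // Break up any word that cannot fit on a row by itself.
        while (word.size() > columns) {
            if (!row.empty()) {
                rows.push_back(row);
                row.clear();
            }
            rows.push_back(word.substr(0, columns));
            word.erase(0, columns);
        }
        if (row.empty()) {
            row = word;
        } else if (row.size() + 1 + word.size() <= columns) {
            row += ' ';
            row += word;
        } else {
            rows.push_back(row);  // current row is full, start a new one
            row = word;
        }
    }
    if (!row.empty()) rows.push_back(row);
    return rows;
}
```

With the hd44780 library, each returned row could then be written with setCursor(0, rowIndex) followed by print, showing as many rows at a time as the display has.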
3D printer files: https://robotzero.one/wp-content/uploads/2021/02/Scene-Analyzer-STL-Files.zip
Arduino Sketch with audio: https://pastebin.com/GCLqEFp0
Investigate using a cloud service to convert the text to speech as a sound file, and play that file to improve the quality of the sound.
More Voices for SAM: https://github.com/earlephilhower/ESP8266SAM/issues/13
More info about SAM Text to Speech: https://simulationcorner.net/index.php?page=sam