
Mohammad Farhan Habib
3 min read · Jul 13, 2021


The Google Cloud Vision API lets developers understand the content of an image by encapsulating powerful machine learning models in an easy-to-use REST API. It quickly classifies images into thousands of categories (e.g., “sailboat”, “lion”, “Eiffel Tower”), detects individual objects and faces within images, and finds and reads printed words contained within images.

Getting Started

The Vision API has a very broad scope, so we will discuss only two features and their API implementation here:

  1. Text Detection (OCR)
  2. Label Detection

Let's look at the API implementation schema first to understand how the API is used.

API Implementation Schema


The Vision API consists of a single endpoint:

that supports one HTTP request method (annotate):



If your client application does not use OAuth 2.0, then it must include an API key when it calls an API that’s enabled within a Google Cloud Platform project. The application passes this key in every API request as a key=API_key query parameter, such as:
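
As a minimal sketch of the key=API_key convention, the Python snippet below appends the key to the annotate URL as a query parameter. The endpoint URL is an assumption filled in from the public Vision API (the article's own example is missing from this copy), and build_annotate_url and YOUR_API_KEY are hypothetical names used only for illustration.

```python
from urllib.parse import urlencode

# Assumed endpoint for the annotate method; not taken from the article text.
ANNOTATE_URL = "https://vision.googleapis.com/v1/images:annotate"


def build_annotate_url(api_key: str) -> str:
    """Return the annotate URL with the API key as a `key` query parameter."""
    # urlencode escapes any characters that are not safe inside a query string.
    return f"{ANNOTATE_URL}?{urlencode({'key': api_key})}"


# YOUR_API_KEY is a placeholder, not a real credential.
print(build_annotate_url("YOUR_API_KEY"))
```

Keeping the key out of source control (for example, by reading it from an environment variable) is usually a safer choice than hard-coding it.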


Since we are only testing these APIs in the Google Vision API console, we can work without authentication.


The body of your POST request is a JSON object with a requests list, and each item in that list is an object of type AnnotateImageRequest, such as:

{
  "image": {...},
  "features": [...],
  "imageContext": {...}
}

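The request shape described above can be sketched in Python as a small helper that assembles the image, features, and optional imageContext fields and wraps them in a requests list. The helper name build_annotate_body and the placeholder image content are hypothetical; this is an illustration of the structure, not an official client.

```python
import json
from typing import Optional


def build_annotate_body(image: dict, feature_types: list,
                        image_context: Optional[dict] = None) -> dict:
    """Wrap one image and its requested features in a `requests` list."""
    request = {
        "image": image,
        "features": [{"type": feature_type} for feature_type in feature_types],
    }
    # imageContext is optional, so it is only included when supplied.
    if image_context is not None:
        request["imageContext"] = image_context
    return {"requests": [request]}


# Hypothetical placeholder image; a real request needs actual image data.
body = build_annotate_body({"content": "...base64-encoded-image-content..."},
                           ["TEXT_DETECTION"])
print(json.dumps(body, indent=2))
```

Because features is a list, the same helper can request several feature types for one image in a single call.
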

You can try the requests shown below in the Google Vision API console.

1. Text Detection (OCR)

The Vision API can detect and extract text from images. For example, a photograph might contain a street sign or traffic sign. The JSON response includes the entire extracted string, as well as the individual words and their bounding boxes.
Here we discuss only one of the features that support OCR, TEXT_DETECTION.


Input Image

Sample Request

The image can be sent as a base64-encoded string, a Google Cloud Storage file location, or a publicly accessible URL. ImageContext is optional and rarely needed, depending on your request parameters, so we are skipping it.

{
  "requests": [
    {
      "image": {"content": "/9j/7QBEUGhvdG9zaG9...base64-encoded-image-content...fXNWzvDEeYxxxzj/Coa6Bax//Z"},
      "features": [{"type": "TEXT_DETECTION"}]
    }
  ]
}

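The three ways of supplying an image mentioned above can be sketched in Python as follows. The helper names are hypothetical, and the gcsImageUri and imageUri field names are taken from the public Vision API reference rather than from this article, so treat them as assumptions to verify against the documentation.

```python
import base64


def image_from_bytes(data: bytes) -> dict:
    """Build the `image` object from raw bytes as a base64-encoded string."""
    return {"content": base64.b64encode(data).decode("ascii")}


def image_from_uri(uri: str) -> dict:
    """Build the `image` object from a Cloud Storage location or a public URL."""
    # Assumption: gs:// locations go in gcsImageUri, other URLs in imageUri.
    if uri.startswith("gs://"):
        return {"source": {"gcsImageUri": uri}}
    return {"source": {"imageUri": uri}}


# Hypothetical inputs, used only to show the three resulting shapes.
print(image_from_bytes(b"hi"))
print(image_from_uri("gs://my-bucket/sign.jpg"))
print(image_from_uri("https://example.com/sign.jpg"))
```
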
Sample Response

{
  "responses": [
    {
      "textAnnotations": [
        {
          "locale": "en",
          "description": "ABBEY\nROAD NW8\nCITY OF WESTMINSTER\n",
          "boundingPoly": {
            "vertices": [
              {"x": 45, "y": 43},
              {"x": 269, "y": 43},
              {"x": 269, "y": 178},
              {"x": 45, "y": 178}
            ]
          }
        },
        {"description": "ABBEY", "boundingPoly": {"vertices": [...]}},
        {"description": "ROAD", "boundingPoly": {"vertices": [...]}},
        {"description": "NW8", "boundingPoly": {"vertices": [...]}},
        {"description": "CITY", "boundingPoly": {"vertices": [...]}},
        {"description": "OF", "boundingPoly": {"vertices": [...]}},
        {"description": "WESTMINSTER", "boundingPoly": {"vertices": [...]}}
      ]
    }
  ]
}

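To make the response structure concrete, here is a minimal Python sketch that separates the full extracted string from the per-word annotations and collapses a boundingPoly into a simple rectangle. The function names are hypothetical, and the sample data contains only the first annotation from the response above, since the word-level vertices are not shown in this article.

```python
def split_text_annotations(response: dict) -> tuple:
    """Return (full_text, word_annotations) from one entry of `responses`."""
    annotations = response.get("textAnnotations", [])
    if not annotations:
        return "", []
    # In the sample response, the first annotation carries the whole string
    # and the remaining ones carry the individual words.
    return annotations[0]["description"], annotations[1:]


def bounding_box(annotation: dict) -> tuple:
    """Collapse boundingPoly vertices into (left, top, right, bottom)."""
    vertices = annotation["boundingPoly"]["vertices"]
    # Assumption: a coordinate missing from a vertex is treated as 0.
    xs = [vertex.get("x", 0) for vertex in vertices]
    ys = [vertex.get("y", 0) for vertex in vertices]
    return min(xs), min(ys), max(xs), max(ys)


# Only the first annotation of the sample response is reproduced here.
sample = {
    "textAnnotations": [
        {
            "locale": "en",
            "description": "ABBEY\nROAD NW8\nCITY OF WESTMINSTER\n",
            "boundingPoly": {"vertices": [
                {"x": 45, "y": 43}, {"x": 269, "y": 43},
                {"x": 269, "y": 178}, {"x": 45, "y": 178},
            ]},
        }
    ]
}

full_text, words = split_text_annotations(sample)
print(full_text.splitlines())
print(bounding_box(sample["textAnnotations"][0]))
```
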
2. Label Detection

The Vision API can detect and extract information about entities within an image, across a broad group of categories.
Labels can identify objects, locations, activities, animal species, products, and more.
We need to specify LABEL_DETECTION as the value of features.type, just as we did above for TEXT_DETECTION.


The API implementation is the same as described above, so let's move on to the next step:

Input Image

Sample Request

{
  "requests": [
    {
      "image": {"source": {"imageUri": ""}},
      "features": [{"type": "LABEL_DETECTION"}]
    }
  ]
}

Sample Response

{
  "responses": [
    {
      "labelAnnotations": [
        {"mid": "/m/017rgb", "description": "ferris wheel", "score": 0.84832066},
        {"mid": "/m/010jjr", "description": "amusement park", "score": 0.8101249},
        {"mid": "/m/01d74z", "description": "night", "score": 0.8036025},
        {"mid": "/m/05b0n7k", "description": "outdoor recreation", "score": 0.68825835},
        {"mid": "/m/02jf28", "description": "fair", "score": 0.6566326}
      ]
    }
  ]
}

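One common way to use the label scores is to keep only the labels above a confidence threshold. The Python sketch below does this with the values from the sample response; the function name labels_above and the 0.7 cutoff are arbitrary choices for illustration, not recommendations from the API.

```python
def labels_above(response: dict, min_score: float = 0.7) -> list:
    """Return (description, score) pairs whose score is at least min_score."""
    return [
        (label["description"], label["score"])
        for label in response.get("labelAnnotations", [])
        if label["score"] >= min_score
    ]


# Values copied from the sample response; `mid` is omitted for brevity.
sample = {
    "labelAnnotations": [
        {"description": "ferris wheel", "score": 0.84832066},
        {"description": "amusement park", "score": 0.8101249},
        {"description": "night", "score": 0.8036025},
        {"description": "outdoor recreation", "score": 0.68825835},
        {"description": "fair", "score": 0.6566326},
    ]
}

for description, score in labels_above(sample):
    print(f"{description}: {score:.2f}")
```

With a 0.7 cutoff, "outdoor recreation" and "fair" are dropped and the three higher-scoring labels remain.
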
Please reach out to me if you have any questions, and don’t forget to applaud if you like this article. Thanks!



Mohammad Farhan Habib

With 10+ years of experience developing Android mobile apps, RESTful APIs in Spring Boot, and dapps on the Ethereum blockchain