How to Create a Web Extension for AI Image Generation


Midjourney, DALL-E, and Stable Diffusion image generators are very popular right now. According to EveryPixel, more than 15 billion AI-generated images were created in roughly 1.5 years. For comparison, it took humanity 149 years to take that many photographs.

No wonder tools that help generate better images come in handy for designers and generative art enthusiasts.

That’s why we had the following idea for a project during the Internal AI Hackathon: build a browser extension that helps users write better prompts.

Image Wizard generating a prompt based on the “Capybara Galaxy Capibara” poster. Image by FavoritePlates via Displate
I know it’s a capybara, not a rat, but still 😅

How does an Image Wizard work?

Right-click any image and select “What’s the prompt?”. In a pop-up window, Image Wizard will give you a prompt you can use to generate a similar image with AI. For example:

Image Wizard generates a prompt based on “The Legend of the Wild West”. Image by @Ralgory_digital via Pinterest

Now we can copy and paste the prompt into Bing Image Creator, which is powered by DALL-E 3. The prompt would also work with Midjourney or Stable Diffusion. Here’s the result:

Generated results using Bing Image Creator

Looks cool, right? Let’s break down how we created this extension, starting with the manifest.

As you can see, there’s also a Try prompt in Image Wizard button. When clicked, it opens a macOS app for generating prompts. But in this article, let’s focus solely on the extension.

Anatomy of a web extension

The manifest

Every extension requires a JSON-formatted file, named manifest.json, that provides important information. This file must be located in the extension's root directory.

We can start with the following manifest:

{
  // Required
  "manifest_version": 3,
  "name": "Image Wizard",
  "version": "1.0",
  // Recommended
  "description": "Get a prompt for selected image"
}

The code above won’t be enough for our extension, but we will complete it throughout the development process. We will also revisit the manifest in more detail after going through the other crucial parts of the extension.
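For reference, here’s roughly what the extension’s directory looks like once everything described below is in place (the file names come from the manifest snippets in this article; the exact layout is up to you):

image-wizard/
├── manifest.json
├── content.js            # content script: popup markup and logic
├── content.css           # popup styles
├── background-worker.js  # service worker
├── images/               # wizard.png, close-icon.svg, copy-icon.svg, open-icon.svg
├── fonts/                # FixelText-ExtraLight.woff2, …
└── icons/                # icon-128.png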

Content scripts

Content scripts are JavaScript files that run in the context of a webpage, allowing you to interact with and manipulate the DOM.

We used a content script as our main entry point for creating the popup and adding logic to it.

// Identity template tag: String.raw keeps the string as-is while letting editors highlight the markup
const html = String.raw;

const main = () => {
  const popup = document.createElement("div");
  popup.classList.add("popup", "image-wizard");

  const imgSrc = chrome.runtime.getURL("images/wizard.png");
  const closeIcon = chrome.runtime.getURL("images/close-icon.svg");
  const copyIcon = chrome.runtime.getURL("images/copy-icon.svg");
  const openIcon = chrome.runtime.getURL("images/open-icon.svg");

  popup.innerHTML = html`
   <div class="top-bar">
     <img id="close" width="20" height="20" class="close" src=${closeIcon} alt="Close" />
   </div>
   <div class="content">
     <img width="32" height="32" class="logo" src=${imgSrc} alt="" />
     <div>
       <div class="text-container">
         <div id="loadingText" class="text"></div>
         <div id="resultText" class="text"></div>
       </div>
       <div class="button-container">
         <button id="copy-text" class="copy-button">
           <img src=${copyIcon} alt="Copy" />
         </button>
         <button id="open-imagewizard" class="open-button">
           Try prompt in Image Wizard
           <img src=${openIcon} class="open-icon" alt="" />
         </button>
       </div>
     </div>
   </div>
 `;

  document.body.appendChild(popup);

  const close = document.querySelector("#close");
  close.addEventListener("click", function () {
    popup.style.display = "none";
  });

  const openWizard = document.querySelector("#open-imagewizard");
  openWizard.addEventListener("click", function () {
    const resultText = document.querySelector("#resultText");
    // Encode the prompt so spaces and commas survive the custom URL scheme
    window.open(`imagewizard://?prompt=${encodeURIComponent(resultText.textContent)}`);
  });
};

main();

This code is mostly basic HTML/JS manipulation, except for this block:

const imgSrc = chrome.runtime.getURL("images/wizard.png");
const closeIcon = chrome.runtime.getURL("images/close-icon.svg");
const copyIcon = chrome.runtime.getURL("images/copy-icon.svg");
const openIcon = chrome.runtime.getURL("images/open-icon.svg");

Here we dynamically get the URL of an image to use later inside our markup, like <img src=${openIcon} />.

To use it, we first need to declare accessible resources in our manifest:

"web_accessible_resources": [{
   "resources": [
     "images/*",
     "fonts/*"
   ],
   "matches": ["<all_urls>"]
 }]

resources contains an array of relative paths to a given resource from the extension's root directory, and matches is an array of match patterns that specify which sites can access this set of resources. Here we use <all_urls> because we want users to be able to get prompts for images wherever they go.

If you’re building an extension for a specific marketplace or a helper for a particular website, you should set a stricter URL pattern.

We have also declared fonts as accessible resources and can use them like this in our *.css files:

@font-face {
  font-family: "FixelText";
  font-style: normal;
  font-display: swap;
  font-weight: 200;
  src: url(/fonts/FixelText-ExtraLight.woff2) format("woff2");
}

To add content scripts to your extension, you should declare them in your manifest:

"content_scripts": [
   {
     "matches": ["<all_urls>"],
     "js": ["content.js"],
     "css": ["content.css"]
   }
 ],

See content scripts in isolated worlds: Chrome Developer Docs

According to the Chrome docs, content scripts live in an isolated world – a private execution environment that isn’t accessible to the page or to other extensions. A practical consequence of this isolation is that JavaScript variables in an extension’s content scripts are not visible to the host page or to other extensions’ content scripts.

You might think extensions are fully isolated, but there is a catch: your CSS can affect the websites it is injected into.

If we have a CSS class with some pretty common naming:

.mainContainer {
  background: indianred;
}

It will easily apply to the page the user is visiting.

Common CSS class influences the background color of Pinterest’s “Home” page. Image: Pinterest
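One way to avoid such leaks is to give your classes a unique prefix; another is to render the popup inside a Shadow DOM, so that neither your styles nor the page’s styles cross the boundary. A minimal sketch of the Shadow DOM approach (not what Image Wizard does):

const host = document.createElement("div");
const shadow = host.attachShadow({ mode: "open" });

// Styles defined here apply only inside the shadow tree
const style = document.createElement("style");
style.textContent = ".popup { background: white; }";
shadow.appendChild(style);

const popup = document.createElement("div");
popup.classList.add("popup");
shadow.appendChild(popup);

document.body.appendChild(host);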

List of APIs that content scripts can access
  • i18n
  • storage
  • runtime:
    • connect
    • getManifest
    • getURL
    • id
    • onConnect
    • onMessage
    • sendMessage

As you can see, there are generally three groups of APIs: i18n for translations, storage for working with extension storage, and runtime for accessing resources or communicating with service workers (more on that later).
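For example, a content script could remember the last generated prompt with chrome.storage and read localized strings with chrome.i18n. A quick sketch, not part of Image Wizard (storage requires the "storage" permission, and i18n needs a _locales folder with messages; the key and message names here are hypothetical):

// Persist a small piece of state
chrome.storage.local.set({ lastPrompt: "a capybara floating through a galaxy" });

// Read it back later
chrome.storage.local.get("lastPrompt", ({ lastPrompt }) => {
  console.log("Previous prompt:", lastPrompt);
});

// Look up a translated string from _locales/<lang>/messages.json
const title = chrome.i18n.getMessage("popup_title");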

All of that would not be enough to create a useful experience for the user, because we also need access to more sophisticated browser APIs. That’s why we have service workers.

Service workers (background scripts)

Service workers are a special kind of JavaScript file that runs independently in the background, separate from web pages, and handles tasks not directly associated with any user interface.

We add a service worker to our extension by declaring it in manifest.json:

"background": {
   "service_worker": "background-worker.js"
 },

To allow the use of import statements, we set the service worker’s type to “module”:

"background": {
   "service_worker": "background-worker.js",
   "type": "module"
 },
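With the type set to module, the worker can use ES imports – for instance, to keep the API call in a separate helper file (a hypothetical split, not how our code is organized):

// api.js
export const describeImage = (url) =>
  fetch("https://image-prompting-server/describe", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ url }),
  }).then((response) => response.json());

// background-worker.js
import { describeImage } from "./api.js";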

Service workers have access to a much broader list of APIs than the content scripts. That’s why they are useful for managing caching, offline mode, push notifications, background sync, and other tasks that do not require direct access to a web page's DOM (Document Object Model).

For example, we use the contextMenus API to create a new context menu option like this:

chrome.runtime.onInstalled.addListener(() => {
  chrome.contextMenus.create(
    {
      id: "click-img",
      title: "What the prompt?",
      contexts: ["image"],
    },
    () => void chrome.runtime.lastError
  );
});

After that, we add a click listener for context menu items:

chrome.contextMenus.onClicked.addListener((info) => {
  if (info.menuItemId === "click-img") {
    // as we set `contexts: ["image"]` above, we can assume that `info.srcUrl` is an image URL
    console.log("clicked on image", info.srcUrl);
    // other code ...
  }
});

Long-lived communication

Service workers do not have direct access to the DOM, so they can’t interact with web pages directly. Instead, they communicate with the other parts of the extension through message passing.

Content scripts and service workers must communicate to maximize their individual strengths and mitigate limitations:

  1. Content scripts deal with the page and its elements, then send events to service workers.
  2. Service workers process the data, call the necessary browser APIs, and respond back with the result.

You can achieve this communication with one-off messages, but I don’t recommend this approach: in practice it is surprisingly easy to get wrong, as the example below shows.

One-off messages and their drawbacks

                  In content script                 In background script
Send message      browser.runtime.sendMessage()     browser.tabs.sendMessage()
Receive message   browser.runtime.onMessage         browser.runtime.onMessage
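For completeness, here is a minimal sketch of the one-off flow in the content-script-to-background direction (imageUrl and doSomething are hypothetical placeholders):

// content-script.js – ask the background script for a prompt
chrome.runtime.sendMessage({ type: "describe", url: imageUrl }, (response) => {
  console.log("prompt:", response.value);
});

// background script – answer the message
chrome.runtime.onMessage.addListener((message, sender, sendResponse) => {
  if (message.type === "describe") {
    doSomething(message.url).then((value) => sendResponse({ value }));
    return true; // keep the channel open for the async response
  }
});

The fragile part is the reverse direction: if the background script calls browser.tabs.sendMessage() before the tab’s content script has registered its listener, the call fails with the “Receiving end does not exist” error shown below.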

Here’s an example gist:
https://gist.github.com/DashBarkHuss/4da860d395cea57dd502e8978df1c488

Like another user in the comments, I got the error: Could not establish connection. Receiving end does not exist.

Comments section. Image: GitHub Inc

Luckily, there is another approach: long-lived communication.

Sometimes it's useful to have a conversation that lasts longer than a single request and response. In this case, you can open a long-lived channel from your content script to an extension page or vice versa using runtime.connect or tabs.connect, respectively. The channel can optionally have a name, allowing you to distinguish between different types of connections.

// draft-service-worker.js

let portFromCS;

function onConnected(port) {
  portFromCS = port;
}

chrome.runtime.onConnect.addListener(onConnected);

// Once a content script has connected, we can push messages through the port:
// portFromCS.postMessage({ type: "finish", value: data.value });
// portFromCS.postMessage({ type: "openPopup" });

And from the content script we can connect and listen for that message:

// content-script.js

let myPort = chrome.runtime.connect({ name: "port-from-cs" });

myPort.onMessage.addListener((m) => {
  if (m.type === "finish") {
    // ...
  } else if (m.type === "error") {
    // ...
  }
});
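For context, here’s roughly how those handlers can tie back to the popup markup from earlier – a simplified sketch with placeholder texts, covering the message types our worker sends ("openPopup", "finish", "error", "timeout"):

// content-script.js (simplified sketch)
const popupEl = document.querySelector(".image-wizard");
const loadingText = document.querySelector("#loadingText");
const resultText = document.querySelector("#resultText");

myPort.onMessage.addListener((m) => {
  if (m.type === "openPopup") {
    popupEl.style.display = "block";
    loadingText.textContent = "Conjuring a prompt…"; // placeholder copy
  } else if (m.type === "finish") {
    loadingText.textContent = "";
    resultText.textContent = m.value;
  } else if (m.type === "error" || m.type === "timeout") {
    loadingText.textContent = "";
    resultText.textContent = "Something went wrong, please try again."; // placeholder copy
  }
});

// The copy button from the popup markup
document.querySelector("#copy-text").addEventListener("click", () => {
  navigator.clipboard.writeText(resultText.textContent);
});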

Below is the final version of our service worker code, which implements long-lived communication with the content script:

// service-worker.js

chrome.runtime.onInstalled.addListener(() => {
  chrome.contextMenus.create(
    {
      id: "click-img",
      title: "What the prompt?",
      contexts: ["image"],
    },
    () => void chrome.runtime.lastError
  );
});

let portFromCS;

function onConnected(port) {
  portFromCS = port;
}

chrome.runtime.onConnect.addListener(onConnected);

chrome.contextMenus.onClicked.addListener((info) => {
  if (info.menuItemId === "click-img") {
    portFromCS.postMessage({ type: "openPopup" });

    const timeoutId = setTimeout(() => {
      portFromCS.postMessage({ type: "timeout" });
    }, 29000);

    return fetch("https://image-prompting-server/describe", {
      method: "POST",
      headers: {
        "Content-Type": "application/json",
      },
      body: JSON.stringify({ url: info.srcUrl }),
    })
      .then((response) => {
        response.json().then((data) => {
          return portFromCS.postMessage({ type: "finish", value: data.value });
        });
      })
      .catch((error) => {
        return portFromCS.postMessage({ type: "error", value: error.message });
      })
      .finally(() => {
        clearTimeout(timeoutId);
      });
  }
});

There is also a caveat with service workers – they can be terminated after inactivity.

Chrome stops a service worker if:

  • it has been inactive for 30 seconds (events and API calls reset the timer)
  • a single request, such as an event or API call, takes longer than 5 minutes to process
  • a fetch() response takes more than 30 seconds to arrive

That’s why we had to add a timeout:

const timeoutId = setTimeout(() => {
  portFromCS.postMessage({ type: "timeout" });
}, 29000);

We could keep postponing the termination by periodically calling an extension API, but for the user experience it is better to ask the user to reload the page and try again – the server might have a cold start, and the second request will usually be faster.
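If you do need to keep the worker alive during a long request, the usual trick is to call a cheap extension API on an interval, since (as noted above) any API call resets the 30-second idle timer. A sketch of that alternative, not something Image Wizard does:

// Reset the service worker's idle timer every 20 seconds while the request is in flight
const keepAlive = setInterval(() => chrome.runtime.getPlatformInfo(() => {}), 20000);

fetch("https://image-prompting-server/describe", { method: "POST" })
  .then(/* handle the response as before */)
  .finally(() => clearInterval(keepAlive));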

Now, back to the manifest

The final version of our manifest.json is as follows:

{
  "description": "Get a prompt for selected image",
  "manifest_version": 3,
  "name": "Image Wizard",
  "version": "1.0",
  "background": {
    "service_worker": "background-worker.js"
  },
  "content_scripts": [
    {
      "matches": ["<all_urls>"],
      "js": ["content.js"],
      "css": ["content.css"]
    }
  ],
  "web_accessible_resources": [
    {
      "resources": ["images/*", "fonts/*"],
      "matches": ["<all_urls>"]
    }
  ],
  "permissions": ["contextMenus"],
  "icons": {
    "128": "icons/icon-128.png"
  }
}

We’ve added declarations for content scripts, service workers, and web-accessible resources in our configuration. Additionally, we've specified the icons and permissions properties.

  • The icons property is very straightforward: it sets the icon of your extension. The icon is visible on the menu bar, extension list, etc.
  • The permissions property is more fun. With permissions, we can add more features to our extension, which is great. But you must understand why you need each of them, as you’ll have to justify every permission during the review process (provided you plan to publish the extension on the Chrome Web Store).

See the full list of permissions: Chrome Developer Docs

Browser compatibility: V2 vs V3

From Firefox to Chrome. Initially, I wrote this extension for Firefox, because that’s the browser I use most in my everyday life. I had also discovered that browser compatibility could be achieved with Manifest V2 and a simple polyfill.

The plan was set – build the extension for one browser and use a polyfill to make it work in another.

Manifest V3. My hopes were crushed when I discovered that the Chrome Web Store now accepts only Manifest V3 extensions. Moreover, the polyfill wouldn’t help, because Manifest V3 introduces breaking changes. Chrome is much more popular than other browsers these days, so I had to stick with its requirements.

The problem is that Firefox and Chrome build extensions differently. Chrome uses service workers for background work, while Firefox uses background scripts. Essentially they do the same thing – work with a broad list of browser APIs – but they are declared differently in the manifest and used through slightly different APIs. As of now, Firefox doesn’t support service workers in extensions, and it doesn’t seem to plan on deprecating background pages any time soon.
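To make the difference concrete, the same background entry point is declared like this in the two manifest versions:

// Firefox, Manifest V2 – background scripts
"background": {
  "scripts": ["background-worker.js"]
}

// Chrome, Manifest V3 – service worker
"background": {
  "service_worker": "background-worker.js"
}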

The main steps in migrating from V2 to V3:

  • Update the manifest fields and API calls.
  • Migrate to a service worker – it replaces the background page and keeps background code off the main thread, where it could hurt performance. This change also requires moving DOM access and some extension API calls into offscreen documents.
  • Replace blocking web request listeners – blocking or modifying network requests in Manifest V2 could significantly degrade performance and required excessive access to sensitive user data.
  • Improve extension security – remotely hosted code and execution of arbitrary strings are no longer allowed.

See: Chrome Migration Guide

In our case, rewriting the Firefox Manifest V2 extension as a Chrome Manifest V3 extension was not that hard – it’s a relatively small extension.

First of all, we just had to rename all browser occurrences to chrome.

From this:

browser.contextMenus.create(
  {
    id: "click-img",
    title: "What's the prompt?",
    contexts: ["image"],
  },
  () => void browser.runtime.lastError
);

To that:

chrome.contextMenus.create(
  {
    id: "click-img",
    title: "What's the prompt?",
    contexts: ["image"],
  },
  () => void chrome.runtime.lastError
);
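For simple cases, an alternative to a blanket rename is a tiny alias at the top of each script, though it doesn’t hide the deeper differences between the two APIs:

// Firefox exposes `browser`, Chrome exposes `chrome`
const api = globalThis.browser ?? globalThis.chrome;

api.contextMenus.create({
  id: "click-img",
  title: "What's the prompt?",
  contexts: ["image"],
});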

Next, we also had to update manifest.json, but we’ve already shown you what’s inside.

Backend stuff

Okay, building the extension was fun – but how does it work internally? Where is all the magic hidden, you might ask?

Our backend can be split into three key components:

  1. A model that describes the submitted image.
  2. A carefully designed prompt that includes examples of the kind of creative image prompts we’re aiming for.
  3. An OpenAI model that combines the description and the prompt to craft the final result.

How to make AI see stuff

Give AI its eyes: first, we need a textual description of the image. We used the Salesforce/blip-image-captioning-base model, which was enough to get started. We could have opted for the bigger Salesforce/blip-image-captioning-large model, but it requires more capable hardware.

Setup is pretty simple thanks to the Hugging Face transformers library. First, we downloaded the model files to have them locally. Here’s what’s inside the save_model.py script:

import os
from transformers import BlipProcessor, BlipForConditionalGeneration

models_path = 'models/'

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base", local_files_only=False)
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", local_files_only=False)

processor.save_pretrained(models_path + 'processor')
model.save_pretrained(models_path + 'model')

Now we can use these files to process an image:

import requests
from PIL import Image

processor = BlipProcessor.from_pretrained(models_path + 'processor', local_files_only=True)
model = BlipForConditionalGeneration.from_pretrained(models_path + 'model', local_files_only=True)

img_url = 'https://some-image.url'
raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

inputs = processor(raw_image, return_tensors="pt")
out = model.generate(**inputs)
image_desc = processor.decode(out[0], skip_special_tokens=True)

image_desc is a string like “a soccer game with a player jumping to catch the ball” or “a woman sitting on the beach with her dog”.

Adding necessary creativity to GPT

Normally, when we ask GPT-3.5 for a prompt based on a bare image description, the results vary a lot from run to run – sometimes good, sometimes bad.

There is already plenty of popular advice on writing good prompts, including a fantastic introductory article by Maksym Vatralik, a graphic designer and colleague of mine.

So we gathered several examples of what are deemed high-quality and creative prompts for Midjourney.

With this approach, we formed our own prompt:

chatPrompt = """Act as a professional prompt engineer, who does the best prompts for Midjourney. Your prompts should short and concise. Different modifications for image should be separated with a comma. Your prompts should be in a style of the following prompts:

- a person in a boat in the middle of a forest, a detailed matte painting by Cyril Rolando, cgsociety, fantasy art, matte painting, mystical, nightscape
- a man riding on the back of a white horse through a forest, a detailed matte painting, cgsociety, space art, matte painting, redshift, concept art.
- The Moon Knight dissolving into swirling sand, volumetric dust, cinematic lighting, close up portrait
- a stunning image of a female face with light in it, in the style of unreal engine, richard phillips, animated gifs, macro lens, kelly sue deconnick, guillem h. pongiluppi, photo-realistic hyperbol
- Hyper detailed movie still that fuses the iconic tea party scene from Alice in Wonderland showing the hatter and an adult alice. Some playcards flying around in the air. Captured with a Hasselblad medium format camera.
- a Shakespeare stage play, yellow mist, atmospheric, set design by Michel Crête, Aerial acrobatics design by André Simard, hyperrealistic
- venice in a carnival picture, in the style of fantastical compositions, colorful, eye-catching compositions, symmetrical arrangements, navy and aquamarine, distinctive noses, gothic references, spiral group

Describe prompt for picture that will include: "{image_desc}".

Generated prompt:

"""

Finally, we wrap chatPrompt in a Flask server script:

import os
from flask import Flask, request, jsonify
from flask_cors import CORS
import openai
import re
import requests
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

app = Flask(__name__)
CORS(app)

openai.api_key = os.environ.get('OPENAI_API_KEY')

models_path = 'models/'

processor = BlipProcessor.from_pretrained(models_path + 'processor', local_files_only=True)
model = BlipForConditionalGeneration.from_pretrained(models_path + 'model', local_files_only=True)

chatPrompt = """Act as a … Generated prompt: """

@app.route('/describe', methods=['POST'])
def query():
   # parse the request body
   data = request.get_json()
   img_url = data['url']

   raw_image = Image.open(requests.get(img_url, stream=True).raw).convert('RGB')

   inputs = processor(raw_image, return_tensors="pt")

   out = model.generate(**inputs)
  
   image_desc = processor.decode(out[0], skip_special_tokens=True)

   messages = [
     { "role": "user", "content": chatPrompt.format(image_desc=image_desc) },
   ]

   response = openai.ChatCompletion.create(
     model="gpt-3.5-turbo-instruct",
     messages=messages,
     max_tokens=512,
   )

   return jsonify({ "value": response['choices'][0]['message']['content'] })

if __name__ == '__main__':
   port = int(os.environ.get('PORT', 8080))
   app.run(debug=False, host='0.0.0.0', port=port)

Publishing the extension

Product details form in Chrome Web Store. Image: Google LLC
This part of the process is the most straightforward – you just have to fill in the information about your extension.

Extension icon requirements in Chrome Web Store. Image: Google LLC

Getting the right icon was easy – I just forwarded these requirements to our designers. Kudos to Maksym Vatralik and Anastasiia Satarenko for their stellar work!

Psst, a little secret: our Image Wizard icon was made by AI. So even before you use it to generate AI prompts, there’s already a bit of AI magic in the icon itself!

Single purpose description in Chrome Web Store. Image: Google LLC

Next, we write the purpose of our extension. "Single purpose" can refer to one of two aspects of an extension:

  • Single purpose limited to a narrow focus area or subject matter ("news headlines", "weather", "comparison shopping").
  • Single purpose limited to a narrow browser function ("new tab page", "tab management", or "search provider")

In our case, we used this single purpose description: “Generate AI Art Prompts: Use Image Wizard to effortlessly create initial prompts for Midjourney and DALL-E AI art models based on any image you can find online. Right-click, select 'What's the prompt?', and get prompt.”

Permission justification in Chrome Web Store. Image: Google LLC

The hard part for me was the Permission justification. This is why I advocate for using only the essential permissions. Firstly, with fewer permissions, users have fewer concerns about the extension. Secondly, it streamlines your review process – you have less to explain and a reduced risk of being rejected by the Web Store team.

Conclusions

Image Wizard was developed as part of our internal hackathon. In the end, this journey was not only a valuable learning experience for us but also resulted in a tool that our design team genuinely appreciated and used.

This article aims to serve as a helpful guide for developers, walking you through the entire process of creating a browser extension from scratch – from the manifest and backend development to browser compatibility and, ultimately, publishing.



This is an independent publication and it has not been authorized, sponsored, or otherwise approved by FavoritePlates, @Ralgory_digital, Pinterest, GitHub Inc, Google LLC.
