Convert any Document to PDF using Serverless GCP

21 Apr

There are many ways to convert documents to PDF in the cloud. In this article, let’s see how to do it with a very simple solution built on serverless GCP products: Cloud Run, Cloud Storage, and Pub/Sub.

The solution converts any document type that LibreOffice can open, such as txt, doc, and docx, to PDF.

Cloud Run

Cloud Run is the main service we are using where our solution is hosted. Cloud Run is serverless, so it abstracts away all infrastructure management and lets you focus on building your application instead of worrying about overhead. As a Google serverless product, it is able to scale to zero, meaning it won’t incur cost when not used. It also lets you use custom binary packages based on containers, which means building consistent isolated artifacts is now feasible.

Google Cloud Storage (GCS)

Cloud Storage is our object store, with effectively unlimited capacity, where we keep all our files. Documents waiting to be processed go into one bucket, from which our application picks them up for conversion. After converting a document, the application puts the resulting PDF into a separate bucket.

Pub/Sub

Pub/Sub allows services to communicate asynchronously, with latencies on the order of 100 milliseconds. We can use it as a messaging-oriented middleware for service integration or as a queue to parallelize tasks. Pub/Sub enables you to create systems of event producers and consumers, called publishers and subscribers. Publishers communicate with subscribers asynchronously by broadcasting events.

 

Let’s start building the application

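The solution depicted in the architecture diagram is simple and largely self-explanatory: an upload to the first bucket triggers a Pub/Sub notification, Pub/Sub pushes that message to the Cloud Run service, and the service converts the document and writes the PDF to the second bucket.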

You can refer to the final code with all the commands we have used in this article, using this repository on GitHub.

Step 1:

Enable the required APIs (Cloud Run, Cloud Build, Pub/Sub, and Cloud Storage) if they are not already enabled.

I’m assuming you are running these steps from inside the Cloud Shell. If you don’t want to create the following files, you can clone the git repository mentioned above and use it.

Step 2:

Create a Dockerfile.

FROM node:12
RUN apt-get update -y \
&& apt-get install -y libreoffice \
&& apt-get clean
WORKDIR /usr/src/app
COPY package*.json ./
RUN npm install --only=production
COPY . .
CMD [ "npm", "start" ]

We will use Node.js and LibreOffice for this solution.

In the Dockerfile we perform the following steps:

  • Updating the system and installing the LibreOffice package
  • Copying the package.json file (and package-lock.json, if present)
  • Running npm install so that it installs the required packages from the package.json file
  • Copying the application source into the image
  • Starting the application

Step 3:

Create a package.json file.

{
    "name": "lab03",
    "version": "1.0.0",
    "description": "Convert Document to PDF Serverless using GCP",
    "main": "index.js",
    "scripts": {
      "start": "node index.js",
      "test": "echo \"Error: no test specified\" && exit 1"
    },
    "keywords": [],
    "author": "jinaldesai.com",
    "license": "GPL",
    "dependencies": {
      "@google-cloud/storage": "^3.3.1",
      "body-parser": "^1.19.0",
      "express": "^4.17.1"
    }
  }

This file declares the dependencies our application needs, names index.js as the main file, and defines the start command of the application.

Step 4:

Create our main application file index.js.

const {promisify} = require('util');
const express     = require('express');
const bodyParser  = require('body-parser');
const {Storage}   = require('@google-cloud/storage');
const exec        = promisify(require('child_process').exec);
const storage     = new Storage();
const app         = express();

app.use(bodyParser.json());

const port = process.env.PORT || 8080;
app.listen(port, () => {
  console.log('Listening on port', port);
});

app.post('/', async (req, res) => {
  try {
    const file = decodeBase64Json(req.body.message.data);
    await downloadFile(file.bucket, file.name);
    const pdfFileName = await convertFile(file.name);
    await uploadFile(process.env.PDF_BUCKET, pdfFileName);
    await deleteFile(file.bucket, file.name);
  }
  catch (ex) {
    console.log(`Error: ${ex}`);
  }
  res.set('Content-Type', 'text/plain');
  res.send('\n\nOK\n\n');
})

function decodeBase64Json(data) {
  return JSON.parse(Buffer.from(data, 'base64').toString());
}

async function downloadFile(bucketName, fileName) {
  const options = {destination: `/tmp/${fileName}`};
  await storage.bucket(bucketName).file(fileName).download(options);
}

async function convertFile(fileName) {
  const cmd = 'libreoffice --headless --convert-to pdf --outdir /tmp ' + 
              `"/tmp/${fileName}"`;
  console.log(cmd);
  const { stdout, stderr } = await exec(cmd);
  if (stderr) {
    throw stderr;
  }
  console.log(stdout);
  const pdfFileName = fileName.replace(/\.\w+$/, '.pdf');
  return pdfFileName;
}

async function deleteFile(bucketName, fileName) {
  await storage.bucket(bucketName).file(fileName).delete();
}

async function uploadFile(bucketName, fileName) {
  await storage.bucket(bucketName).upload(`/tmp/${fileName}`);
}

The file defines the following functions:

  • decodeBase64Json
    • Decodes the Base64-encoded JSON message that arrives from Pub/Sub
  • downloadFile
    • Downloads the document from the first bucket to /tmp
  • convertFile
    • Converts the document to PDF by running LibreOffice in headless mode
  • uploadFile
    • Uploads the PDF file to the second bucket; the bucket name comes from the PDF_BUCKET environment variable, which we set when deploying the Cloud Run service
  • deleteFile
    • Deletes the original document from the first bucket
  • app.post
    • Registers the handler for POST requests; whenever a new document is uploaded, our Pub/Sub subscription sends a POST request to this endpoint
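Two details of convertFile are worth a closer look. The regular expression leaves a name without an extension unchanged, so the upload step would then send the original file instead of the PDF. The command is also built as a single string and run through a shell, so unusual object names can break it. The following is a minimal sketch of one alternative, not the code from the repository. The helper names localPathFor, pdfNameFor, and convertArgs are hypothetical, and the sketch assumes LibreOffice names its output by replacing the last extension with .pdf (or appending .pdf when there is none).

```javascript
const path = require('path');
const {promisify} = require('util');
const execFile = promisify(require('child_process').execFile);

// Local scratch path for a bucket object; drops any "folder/" prefix
// so the file always lands directly in /tmp.
function localPathFor(objectName) {
  return path.posix.join('/tmp', path.posix.basename(objectName));
}

// Assumed output name: base name without its last extension, plus ".pdf".
function pdfNameFor(objectName) {
  return path.posix.parse(objectName).name + '.pdf';
}

// Arguments are passed as an array, so no shell quoting is involved.
function convertArgs(objectName) {
  return ['--headless', '--convert-to', 'pdf', '--outdir', '/tmp',
          localPathFor(objectName)];
}

// execFile rejects when LibreOffice exits with a non-zero status.
async function convertFile(objectName) {
  await execFile('libreoffice', convertArgs(objectName));
  return pdfNameFor(objectName);
}
```

If you adopt this, downloadFile would also need to write to localPathFor(fileName) so that both functions agree on where the file lives.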

Step 5:

Let’s build this application.

gcloud builds submit --tag gcr.io/$GOOGLE_CLOUD_PROJECT/pdf-converter

Step 6:

Let’s deploy it to Cloud Run.

gcloud run deploy pdf-converter \
--image gcr.io/$GOOGLE_CLOUD_PROJECT/pdf-converter \
--platform managed \
--region us-central1 \
--memory=2Gi \
--no-allow-unauthenticated \
--max-instances=1 \
--set-env-vars PDF_BUCKET=$GOOGLE_CLOUD_PROJECT-pdf

Note: LibreOffice needs a good amount of RAM, so we gave 2Gi memory.

Step 7:

Let’s test the service URL.

SERVICE_URL=$(gcloud beta run services describe pdf-converter --platform managed --region us-central1 --format="value(status.url)")
curl -X POST -H "Authorization: Bearer $(gcloud auth print-identity-token)" $SERVICE_URL

If you get the response "OK", you have successfully deployed the Cloud Run service.
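An empty POST only proves that the service is reachable: with no message in the body, the handler falls into its catch block and still answers OK. To see the shape of a body that Pub/Sub push would deliver, a small sketch can build one. The function name makePushBody, the message ID, the bucket name, and the subscription path are placeholders; the handler in index.js reads only message.data.

```javascript
// Build a request body shaped like a Pub/Sub push delivery.
function makePushBody(bucket, name) {
  const payload = JSON.stringify({bucket, name});
  return {
    message: {
      // Pub/Sub delivers the payload Base64-encoded in "data".
      data: Buffer.from(payload).toString('base64'),
      messageId: '1', // placeholder value
    },
    // Placeholder subscription path.
    subscription: 'projects/my-project/subscriptions/pdf-conv-sub',
  };
}

const body = makePushBody('my-project-upload', 'report.docx');
console.log(JSON.stringify(body));
```

The printed JSON can be sent with curl’s -d option together with -H "Content-Type: application/json". A real conversion only succeeds once the named bucket exists and contains that object, and the handler deletes the source object afterwards.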

 

 

Let’s create buckets and notifications

Step 1:

Create the upload bucket

gsutil mb gs://$GOOGLE_CLOUD_PROJECT-upload

Step 2:

Create the PDF bucket

gsutil mb gs://$GOOGLE_CLOUD_PROJECT-pdf

Step 3:

Create a notification on the upload bucket so that it publishes a Pub/Sub message to the new-doc topic whenever a new document is uploaded

gsutil notification create -t new-doc -f json \
-e OBJECT_FINALIZE gs://$GOOGLE_CLOUD_PROJECT-upload
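With -f json, the message data is the object’s metadata as JSON (including bucket and name), and the message attributes carry fields such as eventType and objectId. Here is a hedged sketch of a guard the handler could apply before converting. shouldConvert is a hypothetical helper, and skipping files that are already PDFs is an assumption about what you want, not something the original code does.

```javascript
// Decide whether a Pub/Sub message from the upload bucket needs converting.
function shouldConvert(message) {
  const attrs = (message && message.attributes) || {};
  // Ignore anything that is not a newly finalized object.
  if (attrs.eventType !== 'OBJECT_FINALIZE') {
    return false;
  }
  // Ignore files that are already PDFs (assumed policy).
  const objectId = attrs.objectId || '';
  return !objectId.toLowerCase().endsWith('.pdf');
}
```

Because the notification above is created with -e OBJECT_FINALIZE, the event-type check is purely defensive; it matters only if the notification configuration is later widened.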

 

 

Let’s wire up Pub/Sub and Service Accounts

Step 1:

Create a service account that Pub/Sub will use to invoke the Cloud Run service

gcloud iam service-accounts create pubsub-cloud-run-invoker \
--display-name "PubSub Cloud Run Invoker"

Step 2:

Grant that service account the role that allows it to invoke the PDF Converter service.

gcloud beta run services add-iam-policy-binding pdf-converter \
--member=serviceAccount:pubsub-cloud-run-invoker@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com \
--role=roles/run.invoker --platform managed --region us-central1

Step 3:

Get the current project number.

PROJECT=$(gcloud config get-value project)
PROJECT_NUMBER=$(gcloud projects list --filter="$PROJECT" --format="value(PROJECT_NUMBER)")

Step 4:

Allow the project’s Pub/Sub service agent to create authentication tokens.

gcloud projects add-iam-policy-binding $GOOGLE_CLOUD_PROJECT \
--member=serviceAccount:service-$PROJECT_NUMBER@gcp-sa-pubsub.iam.gserviceaccount.com \
--role=roles/iam.serviceAccountTokenCreator

Step 5:

Create a Pub/Sub push subscription so that the PDF Converter service runs whenever a new message is published to the new-doc topic.

gcloud beta pubsub subscriptions create pdf-conv-sub \
--topic new-doc \
--push-endpoint=$SERVICE_URL \
--push-auth-service-account=pubsub-cloud-run-invoker@$GOOGLE_CLOUD_PROJECT.iam.gserviceaccount.com
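For a push subscription, Pub/Sub treats a success status code from the endpoint as an acknowledgement and redelivers the message otherwise. The handler in index.js always answers with 200, so a failed conversion is never retried. The following is a minimal sketch of one way to separate the two cases. It assumes that malformed messages (which surface as SyntaxError or TypeError in the handler) should be dropped and that every other failure should be retried; statusFor is a hypothetical helper.

```javascript
// Map the outcome of a conversion attempt to an HTTP status for Pub/Sub.
function statusFor(error) {
  // No error: acknowledge the message.
  if (!error) {
    return 200;
  }
  // Malformed message: retrying cannot help, so acknowledge and drop it.
  if (error instanceof SyntaxError || error instanceof TypeError) {
    return 200;
  }
  // Anything else: ask Pub/Sub to redeliver.
  return 500;
}
```

In the handler, the catch block would record the error and the response would use res.status(statusFor(ex)). Keep in mind that a document LibreOffice can never convert would then be redelivered repeatedly, so a setup like this is usually paired with a retry limit such as a dead-letter topic.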

 

 

Conclusion

That’s it. We are done, and now it’s time to test.

Upload some documents to the upload bucket. The service converts them to PDF, places the results in the PDF bucket, and deletes the originals from the upload bucket.


