
Transferring Large Files From Cloud Storage to Google Drive

Google Drive (Drive) and Google Cloud Storage (GCS) both belong to Google's portfolio of cloud services, so one might assume that transferring data between them should be seamless. However, it's not quite that straightforward. In this blog post, we introduce a cloud-native concept for Google Drive resumable uploads combined with Google Cloud Storage and Google Cloud Functions in Node.js.

Challenge

Google Cloud Storage is a Google Cloud Platform service intended for machine access, whereas Google Drive is a Google Workspace service used by humans. This is probably the reason the two are barely integrated with each other. While files can be copied back and forth between buckets or within Drive with a simple call, there is no Google service that copies or transfers files between the two (internally). Services like the Storage Transfer Service or Dataflow have many connectors, but none for Google Drive.

To transfer files, objects always need to be downloaded from GCS and then uploaded to Drive. In the worst-case scenario, they leave the Google network only to be sent back there immediately.

In the case at hand, the goal was to move files of up to 10 GiB in order to take advantage of the unlimited Google Drive storage included in the Google Workspace Enterprise plan.

That's when the challenges began. While simple examples and code snippets can be quickly found through a search, there's a scarcity of concrete implementations or guidance for handling large files.

Solution: The cloud-native approach

Although there are several tools and scripts that already follow the approach described above, they almost always require substantial resources, depending on the maximum amount of data. Every runtime environment has its limits, whether computational power, bandwidth, or available memory, and that is exactly what makes scalability necessary.

The concept of a cloud-native application "is a distributed, observable, elastic, and horizontally scalable service-of-services system that isolates its state into (a minimum of) stateful components." [Wikipedia]

However, even though scalability is part of the cloud-native concept, the issue of these limitations persists. Without adapting the transfer concept, you would need to scale an individual instance vertically rather than the entire system horizontally. The microservices concept encourages not vertical scaling until it is no longer feasible, but performing a task in the smallest possible increments with the most minimal resources.

No matter how much you scale vertically, either the file being transferred is too large, storage space is insufficient, or the bandwidth is inadequate.

In serverless environments like Apps Script, Cloud Functions, or Cloud Run, there is also no persistent storage; the file system resides in memory, limiting files to the size of the available memory.

Solution ideas

Our first instinct might be to run the script on a Compute Engine instance with, say, 4 vCPUs and 16 GB of RAM. Add a 1 TB disk and the problem is gone. Is this cloud-native? No!

The premise must be that the solution is serverless in the cloud (i.e. close to the source and the target) and does not need completely oversized resources that have to be constantly managed and updated.

Cloud Functions

As it's a straightforward use case with simple steps and the performance is rarely needed, the decision was made to use Cloud Functions as the runtime and push them to their limits.

Why? Cloud Functions only run when needed, they are cost-effective, fully managed, cloud-native, easy to invoke, and come with integrated authentication, among other benefits. However, Cloud Functions have limitations, and these stem not from the size of the file we want to transfer but from the architecture itself!

So, how can I transfer relatively large files using a highly constrained system?

Byte ranges and resumable uploads

Why download the entire file at once and then upload it again? Aren't there already solutions for this? Yes, there are. Both services and their APIs offer options for working with file sections. This doesn't let you influence the total size of the file or the total runtime, but it does let you precisely control the maximum size, and even the maximum runtime, per run, and that's exactly what we need in order to use a Cloud Function.

Google Cloud Storage API

The Google Cloud Storage API offers the possibility to retrieve only certain "byte ranges" of an object. This allows an object to be retrieved and stored in chunks.

javascript
async function gcs_download_file(bucketName, fileName, targetFile, startByte, endByte) {
  await gcs_init_client();

  async function downloadFile() {
    const options = {
      destination: targetFile,
      start: startByte,
      end: endByte,
    };

    // Downloads the requested byte range of the object to the target file
    let content = await storage.bucket(bucketName).file(fileName).download(options);

    console.log(
      `gs://${bucketName}/${fileName} downloaded to ${targetFile}.`
    );

    return content;
  }

  var destFileName = await downloadFile().catch(console.error);
  return destFileName;
}
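
The helper gcs_init_client() and the storage object are not shown above; a minimal sketch of how they could look, assuming the official @google-cloud/storage client and the Cloud Function's runtime service account:

javascript
const { Storage } = require("@google-cloud/storage");

let storage;

// Hypothetical helper, not part of the original listing: lazily creates the GCS client.
async function gcs_init_client() {
  if (!storage) {
    // Inside a Cloud Function, credentials come from the runtime service account.
    storage = new Storage();
  }
}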

Google Drive API

In the Google Drive REST API, I took advantage of the "resumable uploads" feature. With resumable uploads, a file can be uploaded in several steps. To do this, a "session" is started with an initial "empty" upload call. This call returns an "upload URL" for completing the started upload. This URL can be "pumped" with data until the upload is complete. The upload URL is valid for 7 days, so you have up to a week to finish the upload.

Unfortunately, the upload only works sequentially, so you cannot upload several file sections in parallel.

1. Initialization of the upload session

javascript
async function drive_upload_resumable_init(fileName, folderId, totalSize) {
  let result = await client.request({
    method: "POST",
    url: "https://www.googleapis.com/upload/drive/v3/files?uploadType=resumable&supportsAllDrives=true",
    headers: {
      "Content-Type": "application/json",
      "X-Upload-Content-Length": totalSize
    },
    body: JSON.stringify({
      name: fileName,
      parents: [folderId],
      supportsTeamDrives: true,
    })
  });

  // The resumable session URL is returned in the Location header
  return result.headers.location;
}
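
The authenticated client used above is likewise not shown in the article; a minimal sketch of how it could be initialized, assuming google-auth-library and a service account with access to the target Drive folder:

javascript
const { GoogleAuth } = require("google-auth-library");

let client;

// Hypothetical helper, not part of the original listing: creates the authenticated
// HTTP client whose request() method is used in the functions above.
async function drive_init_client() {
  if (!client) {
    const auth = new GoogleAuth({
      scopes: ["https://www.googleapis.com/auth/drive"],
    });
    client = await auth.getClient();
  }
}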

2. Upload of the file sections

javascript
async function drive_upload_resumable(url, sourceFile, size, startByte, endByte) {
  var result = null;

  try {
    result = await client.request({
      method: "PUT",
      url: url,
      headers: { "Content-Range": `bytes ${startByte}-${endByte}/${size}` },
      body: fse.readFileSync(sourceFile)
    });
  } catch (e) {
    // Status 308 ("Resume Incomplete") is expected for every chunk except the last one
    if (e.response && e.response.status == 308) {
      console.log("Status:", e.response.status, e.response.statusText);
      return e.response;
    } else {
      console.error("Caught error", e);
      return null;
    }
  }
  console.log("Status:", result.status, result.statusText);
  return result.data;
}

Cloud-Native Architecture

Based on this information, I came up with the following simple cloud-native architecture concept.

  • Define a chunk size.
  • Determine the file size of the object in Cloud Storage.
  • Calculate the number of chunks needed and their byte ranges.
  • Loop as long as HTTP status 308 is received:
    • Download the data of one byte range to a temporary file.
    • Upload the temporary file via a Google Drive resumable upload.
    • Delete the temporary file.
  • End the loop when HTTP status 200 is received.
  • Delete the source file (optional).

The key point here is that the size of the current chunk must not exceed the memory available to the Cloud Function.
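
Putting the pieces together, a minimal sketch of the transfer loop could look like this. It assumes the helper functions shown above plus a hypothetical gcs_get_file_size() that reads the object size from its GCS metadata; the 256 MiB chunk size is only an example and, as explained below, must be a multiple of 256 KiB.

javascript
const fse = require("fs-extra"); // same library as used in the upload function above

// Example chunk size: 256 MiB, a multiple of 256 KiB (see the "Insights" section below)
const CHUNK_SIZE = 256 * 1024 * 1024;

// Sketch of the transfer loop; gcs_get_file_size() is a hypothetical helper.
async function transfer_file(bucketName, fileName, folderId) {
  const totalSize = await gcs_get_file_size(bucketName, fileName);
  const uploadUrl = await drive_upload_resumable_init(fileName, folderId, totalSize);
  const tmpFile = "/tmp/chunk.bin"; // /tmp lives in memory inside a Cloud Function

  for (let startByte = 0; startByte < totalSize; startByte += CHUNK_SIZE) {
    const endByte = Math.min(startByte + CHUNK_SIZE, totalSize) - 1; // inclusive range end

    // Download one byte range to the temporary file ...
    await gcs_download_file(bucketName, fileName, tmpFile, startByte, endByte);

    // ... upload it to the resumable session ...
    await drive_upload_resumable(uploadUrl, tmpFile, totalSize, startByte, endByte);

    // ... and free the in-memory file system again.
    fse.removeSync(tmpFile);
  }
}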

Insights

Theory and practice are famously quite different, which is why I'd like to share my experiences with the solution and common mistakes. I also recommend a thorough reading of the documentation.

First and foremost: It works very precisely if done right. Every error is therefore a sign that something wasn't done correctly, rather than an indication that the Google APIs aren't functioning properly.

HTTP Status 308: Status 308 is our friend and not actually an error. It means that the partial upload was successful but the overall upload is not yet complete. For our HTTP client, however, any non-2xx response is an error and throws an exception. The exception must therefore be caught and the 308 status handled there.

Invalid Range: Suppose you have uploaded 2,000,000 bytes, but the request returns the following headers:

Request header: 'Content-Range': 'bytes 0-1999999/5528489'
Response header: 'range': 'bytes=0-1835007'

This means that fewer bytes arrived than were declared, or that fewer bytes were processed; the number of bytes actually processed is visible in the response header. In this case, the next request would result in the following error:

Invalid request. According to the Content-Range header, the upload offset is 2000000 byte(s), which exceeds the already uploaded size of 1835008 byte(s).

Although the temporary source file had exactly 2,000,000 bytes and the range was specified as 0-1999999, the Drive API processed only 1,835,008 of them.
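
As a side note, the number of bytes Drive has actually committed can be read from that range response header, for example with a small helper like this (a sketch, not part of the original code):

javascript
// Reads the last committed byte from a 308 response; the header has the form 'bytes=0-<last>'.
function committed_bytes(response) {
  const range = response.headers.range; // e.g. 'bytes=0-1835007'
  if (!range) return 0;
  return parseInt(range.split("-")[1], 10) + 1; // 1,835,008 in the example above
}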

One might now - as I did - come up with the idea of adjusting the start range to the end value of the return so that the upload accepts the next range. This works fine. However, in the end, you'll have a corrupted file, as the unprocessed bytes are simply missing. Depending on the file format, this issue may become apparent sooner or later.

The correct approach is to consistently work with the calculated byte offsets. If you encounter an error, it's because something was done wrong somewhere. In the above case, this becomes apparent at the latest in the next chunk.

The cause was something I had simply overlooked: it is very important that the upload chunk size is a multiple of 256 KiB (1024 * 256 * X bytes), because otherwise the Drive API simply truncates the bytes above that boundary, leading to the problem described above. The GCS API, in contrast, can return any arbitrary range.
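
A small sketch of how a desired chunk size could be rounded down to that granularity (the helper name is made up for this example):

javascript
// Drive resumable uploads process chunks only in multiples of 256 KiB
const DRIVE_CHUNK_GRANULARITY = 256 * 1024;

function align_chunk_size(desiredBytes) {
  return Math.floor(desiredBytes / DRIVE_CHUNK_GRANULARITY) * DRIVE_CHUNK_GRANULARITY;
}

console.log(align_chunk_size(500 * 1024 * 1024)); // 524288000 - already aligned
console.log(align_chunk_size(2000000));           // 1835008 - matches the example above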

Failed to parse Content-Range header.

If the last upload returns a "Failed to parse Content-Range header." error, this is because the end of the range has the same value as the size. For example:

'Content-Range': 'bytes 0-262144/262144'.

Here you have to distinguish between the number of bytes (counted from 1) and the byte range (counted from 0). For 10 bytes the range is 0-9 and the size is 10, so the header must be defined as follows: 'Content-Range': 'bytes 0-9/10'.
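
As a small illustration (the helper name is made up for this example):

javascript
// Builds the Content-Range header; endByte is inclusive, so it is at most totalSize - 1.
function content_range_header(startByte, endByte, totalSize) {
  return `bytes ${startByte}-${endByte}/${totalSize}`;
}

console.log(content_range_header(0, 9, 10));          // 'bytes 0-9/10'
console.log(content_range_header(0, 262143, 262144)); // correct form of the failing example above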

After identifying and fixing these problems, I was able to work with the calculated byte offsets without having to re-verify or recalculate them on each loop.

Insight: If I download 1,048,576 bytes from GCS, then the temporary file also has 1,048,576 bytes, and if I upload those 1,048,576 bytes to Drive, then 1,048,576 bytes are also processed.

Verification

Both services provide checksums in the metadata without requiring the file to be downloaded. This allows the files to be compared exactly after the transfer. One value that both services use is the MD5 hash.

The Google Cloud Storage API returns the hash in Base64 format, while the Drive API provides a hex value. To compare the two, you can convert the GCS hash as follows:

javascript
let md5Hash = Buffer.from(object_metadata.md5Hash, 'base64').toString("hex");
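
A minimal comparison sketch, assuming object_metadata comes from the GCS getMetadata() call and drive_metadata from a Drive files.get request with fields=md5Checksum:

javascript
let gcsMd5 = Buffer.from(object_metadata.md5Hash, "base64").toString("hex");
if (gcsMd5 === drive_metadata.md5Checksum) {
  console.log("Transfer verified: MD5 checksums match.");
} else {
  console.error("Checksum mismatch - the transfer is corrupt!");
}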

Performance and limits

The runtime limit of a Cloud Function is 9 minutes (1st generation) or 60 minutes (2nd generation). The data that can be transferred in this time is our first limit.

The memory of a Cloud Function can be set to up to 8 GB (1st generation) or up to 32 GB (2nd generation). However, this memory is always allocated in full, regardless of whether the file is 200 KB or 20 GB. The available memory is our second limit.

The bandwidth of the Cloud Function is unfortunately not very high, so I checked what transfer rates a Cloud Function can achieve.

For this, the transfer of a 10 GiB file in 500 MiB chunks was tested. Each transfer consists of the download followed by the upload, so in fact twice the amount of data is sent over the network.

1st generation vs. 2nd generation

Google Cloud Functions come in two generations: a native 1st generation and a 2nd generation based on Cloud Run.

A Cloud Function of the first generation with 4 GB of memory manages about 1 GiB per minute. The 10 GiB file could therefore not be transferred reliably this way, as the Cloud Function runs for a maximum of 9 minutes. For smaller files up to about 7 GiB, the transfer was successful.

The second generation of Cloud Functions is based on Cloud Run. The advantage is that the time limit, analogous to Cloud Run, is 60 minutes, and that you can allocate up to 32 GB of memory and up to 8 CPUs. On Compute Engine, network throughput increases with the performance of the instance, so there was hope for faster transfer speeds here. Unfortunately, I could not reproduce such scaling with Cloud Functions: whether with 1 or 8 CPUs, the transfer rate stays at around 1 GiB per minute.

Our 10 GiB file could be transferred successfully with a Cloud Function of the 2nd generation.

Costs

Since the upload to Drive is Google-internal traffic, you only pay for the egress from GCS (the download) and the runtime of the Cloud Function 🙂

Outlook – Even more Cloud-Native

If you want to keep a serverless architecture that stays within the available limits, you have to distribute the chunks across separate executions yourself. Conceivable here would be a Cloud Function that determines the number of required chunks and their offsets and then delegates the work in one of two ways (a sketch of the first variant follows the list):

  1. Cloud Tasks are created and executed one after the other. It is important that the Drive upload takes place sequentially and in the correct order.
  2. A task list is created and processed item by item by a regularly executed script.
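
A minimal sketch of the first variant, assuming the @google-cloud/tasks client and a queue configured to dispatch at most one task at a time so that the chunks arrive in order; the project, queue, and URL values are placeholders:

javascript
const { CloudTasksClient } = require("@google-cloud/tasks");
const tasksClient = new CloudTasksClient();

// Enqueues one HTTP task per chunk; the target URL would be the transfer Cloud Function.
async function enqueue_chunk_task(project, location, queue, functionUrl, chunkPayload) {
  const parent = tasksClient.queuePath(project, location, queue);
  const task = {
    httpRequest: {
      httpMethod: "POST",
      url: functionUrl,
      headers: { "Content-Type": "application/json" },
      body: Buffer.from(JSON.stringify(chunkPayload)).toString("base64"),
    },
  };
  const [response] = await tasksClient.createTask({ parent, task });
  return response.name;
}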

If we remember the 7-day validity of the upload URL, file sizes of up to about 10 TB would thus be mathematically possible with purely sequential execution in Cloud Functions (roughly 1 GiB per minute over 7 × 24 × 60 minutes ≈ 10,000 GiB).

However, there are various other limits in Google and its services, such as the maximum file size in Drive (5 TB) or the upload limit (750 GB per day), which means that even this solution cannot scale infinitely.

And once again it has been shown that "there is no such thing as infinite scaling."

Source code

The entire source code of the Cloud Function can be downloaded from our public GitHub repository.

Contact now

Do you have any questions about the article? Do you want to get started or continue to optimize cloud-native applications? Contact us now.

