Numpy and Tensor Interoperability

numpy.ndarray and tensors are at the core of machine learning development. Rikai makes it effortless to work with both, automatically converting arrays to the appropriate tensor format (i.e., torch.Tensor or tf.Tensor).

Work with numpy directly

Rikai makes it super easy to work with numpy arrays in Spark. At its core, rikai.numpy.view() enables transparent SerDe (serialization/deserialization) of numpy arrays.

import numpy as np
import PIL.Image
from pyspark.sql.functions import udf

from rikai.numpy import view
from rikai.spark.types import NDArrayType
from rikai.types import Image

@udf(returnType=NDArrayType())
def resize_mask(arr: np.ndarray) -> np.ndarray:
    """Directly work with native numpy array"""
    img = PIL.Image.fromarray(arr)
    resized_img = img.resize((32, 32))
    return view(np.asarray(resized_img))

df = spark.createDataFrame([{
    "id": 1,
    "image": Image("s3://foo/bar/1.png"),
    # Make a view of a native numpy array in the Spark DataFrame
    "mask": view(np.random.rand(256, 256)),
}]).withColumn("resized", resize_mask("mask"))

df.write.format("rikai").save("s3a://bucket/path/to/dataset")
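
The saved dataset can also be read back into Spark. A minimal sketch, assuming the "rikai" data source supports reads symmetrical to the write call above:

# Read the dataset back with the same "rikai" data source used for writing
df2 = spark.read.format("rikai").load("s3a://bucket/path/to/dataset")
df2.printSchema()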

Automatic tensor conversion for TensorFlow and PyTorch

Conveniently, Rikai offers native PyTorch and TensorFlow datasets that automatically convert numpy arrays into torch.Tensor or tf.Tensor.

For example, using rikai.torch.data.Dataset in PyTorch:

import torch

from rikai.torch.data import Dataset

dataset = Dataset("s3://bucket/path/to/dataset")
# Compatible with the official pytorch DataLoader
loader = torch.utils.data.DataLoader(
    dataset,
    batch_size=8,
    num_workers=8
)

model = ...
model.eval()
for batch in loader:
    # data has already been converted from numpy array
    # to torch.Tensor
    print(batch)
    predictions = model(batch)

# Sample output:
# {'mask': tensor([[[0.9037, 0.9284, 0.6832, 0.5378], ..., dtype=torch.float64),
#  'id': tensor([997]),
#  'image': tensor([[[  5,   7,  52,  ...,  35,  74,  16],
#  [110,  12,  45,  ..., 101,  35,  97],
#   ...
#  [ 25,  62,  91,  ..., 114,  71,  27]]], dtype=torch.uint8)},

Rikai supports TensorFlow too:

import tensorflow as tf
import tensorflow_hub

from rikai.tf.data import from_rikai
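
# `pre_processing` is not provided by Rikai; this hypothetical placeholder
# simply passes the (id, image) pair through unchanged. Replace it with
# your own transformation.
def pre_processing(id, img):
    return id, img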

dataset = (
    from_rikai(
        "s3://bucket/to/dataset",
        output_signature=(
            tf.TensorSpec(shape=(), dtype=tf.uint8),
            tf.TensorSpec(shape=(None, None), dtype=tf.uint8),
        ),
    )
    .map(pre_processing)
    .batch(1)
    .prefetch(tf.data.AUTOTUNE)
)

model = tensorflow_hub.load("https://tfhub.dev/...")
for id, img in dataset:
    print(id, img)
    predictions = model(img)

# Sample output:
# tf.Tensor(99, shape=(), dtype=uint8) tf.Tensor(
# [[ 81  39   4 ... 111  16  80]
# ...
# [ 15  53 121 ...   5 115  18]], shape=(128, 128), dtype=uint8)

Semantic types are tensor-convertible

You might have already noticed that semantic types like Image are automatically converted to tensors in the examples above. This is because many of the semantic types implement the ToNumpy interface.

Rikai first converts a ToNumpy object to a numpy.ndarray, and then the framework-specific dataset classes (rikai.torch.data.Dataset and rikai.tf.data.from_rikai) convert that array into a framework-specific tensor.

To give a few examples:

  • rikai.types.Image.to_numpy() converts an image into a np.ndarray(..., shape=(height, width, channel), dtype=np.uint8).

  • rikai.types.Box2d.to_numpy() converts a 2-D bounding box to np.ndarray([xmin, ymin, xmax, ymax], dtype=np.float32).

  • rikai.types.Mask.to_numpy() converts a 2-D mask array (usually used for segmentation) into np.ndarray(..., shape=(height, width), dtype=np.uint8).
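
This two-step conversion is roughly what the dataset classes do under the hood. A minimal sketch using the Image type from the earlier example (reusing the image URI from the first code block):

import torch

from rikai.types import Image

img = Image("s3://foo/bar/1.png")
# Step 1: ToNumpy object -> numpy.ndarray with shape (height, width, channel)
arr = img.to_numpy()
# Step 2: numpy.ndarray -> framework-specific tensor
tensor = torch.from_numpy(arr)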

How to develop your own tensor-convertible types

To allow Rikai datasets to automatically convert your type into numpy.ndarray or tensors, your class should implement the rikai.mixin.ToNumpy mixin.

import numpy as np

from rikai.mixin import ToNumpy

class MyDataType(ToNumpy):

    # Spark user-defined type (UDT) that describes how this type is serialized
    __UDT__ = ...

    def to_numpy(self) -> np.ndarray:
        ...
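
For illustration, a minimal, hypothetical 2-D point type could implement the mixin as follows (the Spark UDT assigned to __UDT__ is omitted for brevity, so this sketch only covers the tensor-conversion side):

import numpy as np

from rikai.mixin import ToNumpy

class Point2d(ToNumpy):
    """A hypothetical 2-D point that Rikai datasets can convert to a tensor."""

    def __init__(self, x: float, y: float):
        self.x = x
        self.y = y

    def to_numpy(self) -> np.ndarray:
        # The returned array is what becomes torch.Tensor / tf.Tensor
        return np.array([self.x, self.y], dtype=np.float32)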