.. toctree::
   :maxdepth: 2

Quickstart
==========

In this quickstart, we illustrate a user journey from data cleaning through
model training to model evaluation using `PyTorch`_ and Rikai.

Installation
------------

.. code-block:: bash

    pip install 'rikai[torch]'

Step 1. Feature Engineering
---------------------------

Let's start feature engineering using `Spark`_.

.. code-block:: python

    import numpy as np
    from pyspark.sql import Row, SparkSession

    from rikai.numpy import wrap
    from rikai.types import Box2d, Image

    spark = (
        SparkSession
        .builder
        .appName("rikai-quickstart")
        .getOrCreate()
    )

    df = spark.createDataFrame(
        [
            {
                "id": 1,
                "image": Image("s3://foo/bar/1.png"),
                "annotations": [
                    Row(
                        text="cat",
                        label=2,
                        mask=wrap(np.random.rand(256, 256)),
                        bbox=Box2d(xmin=1.0, ymin=2.0, xmax=3.0, ymax=4.0),
                    )
                ],
            },
        ]
    )
    df.write.format("rikai").save("my_dataset")

The magic here is that Rikai maintains commonly used :doc:`Semantic Types`,
takes care of ``SerDe``, and handles visualization in notebooks. Additionally,
the Rikai community maintains a set of pre-baked connectors, such as `COCO`_
and `ROS Bag`_.

Once the script is ready, we can submit it via ``spark-submit``:

.. code-block:: bash

    spark-submit \
        --master yarn \
        script.py

Step 2. Inspect Dataset
-----------------------

We can then inspect the dataset in a `Jupyter Notebook`_.

.. code-block:: python

    df = spark.read.format("rikai").load("my_dataset")
    df.printSchema()
    df.show(5)

Step 3. Train the Model
-----------------------

Use this dataset in `PyTorch`_:

.. code-block:: python

    import torch
    import torchvision
    from torch.utils.data import DataLoader

    from rikai.torch.vision import Dataset

    device = torch.device("cuda") if torch.cuda.is_available() \
        else torch.device("cpu")

    dataset = Dataset(
        "my_dataset",
        image_column="image",
        target_column="annotations",
        transform=torchvision.transforms.ToTensor(),
    )
    data_loader = DataLoader(
        dataset,
        batch_size=8,
        num_workers=4,
        # Detection models take lists of images and targets, so keep each
        # batch as a pair of tuples instead of stacking into one tensor.
        collate_fn=lambda batch: tuple(zip(*batch)),
    )

    # As an example, use a torchvision SSD detector, matching the "ssd"
    # model type registered with MLflow below.
    model = torchvision.models.detection.ssd300_vgg16(pretrained=True)
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.005, momentum=0.9)

    model.train()
    for epoch in range(10):
        for imgs, annotations in data_loader:
            loss_dict = model(imgs, annotations)
            losses = sum(loss_dict.values())

            optimizer.zero_grad()
            losses.backward()
            optimizer.step()

Rikai offers `MLflow`_ integration. Once a model is registered with
`MLflow`_, it is directly available to SQL ML.

.. code-block:: python

    import mlflow

    import rikai.mlflow

    with mlflow.start_run() as run:
        # training loop
        for epoch in range(10):
            for imgs, annotations in data_loader:
                ...

        rikai.mlflow.pytorch.log_model(
            model,
            "model",
            model_type="ssd",
            registered_model_name="my_ssd",
        )

Once training finishes, the model ``my_ssd`` is available for
:doc:`SQL ML` to use:

.. code-block:: sql

    SELECT id, ML_PREDICT(my_ssd, image) AS detections, annotations
    FROM my_dataset
    WHERE split = 'eval'
    LIMIT 10

.. _Spark: https://spark.apache.org/
.. _Jupyter Notebook: https://jupyter.org/
.. _PyTorch: https://pytorch.org/
.. _MLflow: https://mlflow.org/
.. _COCO: https://cocodataset.org/
.. _ROS Bag: http://wiki.ros.org/Bags
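
For reference, the same query can be issued from PySpark with ``spark.sql``.
The sketch below is illustrative: it assumes the Rikai SQL extensions are
enabled on the Spark session, and the extension class name passed to
``spark.sql.extensions`` is an assumption rather than a verified value, so
check the Rikai documentation for the exact configuration.

.. code-block:: python

    from pyspark.sql import SparkSession

    # Illustrative sketch: the extension class name below is an assumption;
    # consult the Rikai docs for the exact value.
    spark = (
        SparkSession
        .builder
        .appName("rikai-sql-ml")
        .config(
            "spark.sql.extensions",
            "ai.eto.rikai.sql.spark.RikaiSparkSessionExtensions",
        )
        .getOrCreate()
    )

    # Expose the saved dataset to SQL under the name used in the query.
    spark.read.format("rikai").load("my_dataset") \
        .createOrReplaceTempView("my_dataset")

    predictions = spark.sql("""
        SELECT id, ML_PREDICT(my_ssd, image) AS detections, annotations
        FROM my_dataset
        WHERE split = 'eval'
        LIMIT 10
    """)
    predictions.show()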