.. toctree::
   :maxdepth: 1

Work with Spark
===============

URI Scheme
----------

Remember to prefix your S3 paths with :code:`s3a` instead of :code:`s3` or
:code:`s3n`. A short read/write sketch using the :code:`s3a` scheme is shown at
the end of this page.

Local Spark Setup
-----------------

If you're running Spark locally, you'll need to add the rikai jar when creating
the Spark session:

.. code-block:: python

    from pyspark.sql import SparkSession

    spark = (
        SparkSession
        .builder
        .appName('rikai-quickstart')
        .config('spark.jars.packages', 'ai.eto:rikai:0.0.5')
        .master('local[*]')
        .getOrCreate()
    )

If you want to read/write data from/to S3, you will need additional setup:

1. Set up `AWS credentials `_ or specify them directly as Spark config.
2. Add :code:`hadoop-aws` and :code:`aws-java-sdk` jars to your Spark classpath.
   Make sure you download versions that match. For example, if you have Apache
   Spark 3.0.1 built against Hadoop 2.7.4, you should use :code:`hadoop-aws v2.7.4`.
   You can then see on Maven that this pairs with :code:`aws-java-sdk v1.7.4`.
3. Specify additional options when creating the Spark session:

   .. code-block:: python

       from pyspark.sql import SparkSession

       spark = (
           SparkSession
           .builder
           .appName('rikai-quickstart')
           .config('spark.jars.packages', 'ai.eto:rikai:0.0.5')
           # Enable AWS Signature Version 4 for S3 requests
           .config("spark.driver.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
           .config("spark.executor.extraJavaOptions", "-Dcom.amazonaws.services.s3.enableV4=true")
           .master("local[*]")
           .getOrCreate()
       )

Note that for Hadoop 2.7.x you may need to configure the AWS endpoints. See the
`hadoop-aws `_ documentation for details.

Databricks
----------

If you are using Databricks, you shouldn't need to manually configure the Spark
options and classpath. Please follow the `Databricks documentation `_ and
install both the `Python package from PyPI `_ and the `jar from Maven `_.
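
Whether you set things up locally or on Databricks, a quick way to confirm the
S3 configuration is to round-trip a small DataFrame through an :code:`s3a://`
path. The sketch below is a minimal example, not part of the rikai API; the
bucket and prefix are placeholders and it assumes the :code:`spark` session
configured above:

.. code-block:: python

    # Assumes the `spark` session configured above; the bucket is a placeholder.
    df = spark.createDataFrame([(1, "cat"), (2, "dog")], ["id", "label"])

    # Note the s3a:// scheme rather than s3:// or s3n://.
    df.write.mode("overwrite").parquet("s3a://my-bucket/rikai-demo/labels")

    spark.read.parquet("s3a://my-bucket/rikai-demo/labels").show()

If the round trip fails, double-check the credential setup and, for Hadoop
2.7.x, the endpoint configuration mentioned above.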