How to use Modin in Azure ML?

Hello Modin team!

I am a Data Scientist. Currently I am performing a deployment in Azure ML, and my data set is growing monthly. It would be great to scale my pandas implementation with Modin; indeed, the Azure ML documentation mentions this as a solution.

However, I do not know of any tutorial or documentation that illustrates the implementation.

Could anyone suggest a how-to, tutorial, or related documentation?

Best

Hi @Jaimemosg, it is great to hear you are coming to Modin for scaling pandas. You can use all the pandas functionality, scaled by Modin, just by changing the import statement.

# import pandas as pd
import modin.pandas as pd

Modin is able to distribute work both within a single node and across multiple nodes in a cluster. You might want to look at the use of Modin with XGBoost, both on a single node and in a cluster, here.
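For a rough idea of what that looks like, here is a minimal sketch based on the modin.experimental.xgboost interface (the toy data and parameter values are made up, and if I recall correctly this integration currently targets the Ray engine; please treat the linked material as the source of truth):

import modin.pandas as pd
import modin.experimental.xgboost as xgb

# Toy data, just for illustration.
X = pd.DataFrame({"f0": range(100), "f1": range(100, 200)})
y = pd.Series([i % 2 for i in range(100)])

dtrain = xgb.DMatrix(X, label=y)
params = {"objective": "binary:logistic", "tree_method": "hist"}

# Training is distributed across the partitions Modin manages.
booster = xgb.train(params, dtrain, num_boost_round=10)
preds = booster.predict(xgb.DMatrix(X))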

Please, let us know if it makes sense to you.

Dear @YarShev, thanks for replying.

What you said makes sense, but in fact it did not work for me. I launched a Python script step using modin[dask] or modin[ray] (I tried both), and the execution time is huge. I have a cluster of E-series virtual machines.

I was wondering if I am missing some important configuration parameters.

Your suggestion about XGBoost is great! Do you know if there is any way to fit LightGBM?

Thanks for your help :slight_smile:

@Jaimemosg thanks for your response, I have a couple of questions about your issue:

1.) Did you set up a Ray or Dask execution environment on the cluster first?
2.) What’s the dataset size?
3.) What types of operations are you doing on the dataset?

Hello @devin-petersohn.

I am going to answer your questions below:

  1. Well, I am using Azure ML pipelines, so I set up a virtual environment as follows:
from azureml.core import Environment
from azureml.core.conda_dependencies import CondaDependencies

# Environment for the Azure ML pipeline step, with Modin (Dask engine) installed via pip.
env = Environment(name="dataprep_regression_environment")
conda_deps = CondaDependencies.create(
    conda_packages=['prophet',
                    'pystan',
                    'plotly'],
    pip_packages=['modin[dask]',
                  'sklearn',
                  'azureml-defaults',
                  'azure-storage-blob',
                  'azure-keyvault-secrets',
                  'azure-identity'])
conda_deps.set_python_version("3.7")
env.python.conda_dependencies = conda_deps

However, I do not know if additional configuration is needed, or how it should be done.

  2. The data size is approximately 28 GB, and this data set is growing.

  3. A great variety of operations is performed. I am wondering whether certain methods work, such as the following (a quick sketch follows this list):

  • pd.Series.to_frame()
  • groupby
  • pd.DataFrame.to_flat_index()
  • pd.DataFrame.set_index() / pd.DataFrame.sort_index()
  • pd.to_datetime()
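For reference, here is roughly what I mean, on made-up toy data (note that to_flat_index() is called on the column Index):

import modin.pandas as pd

df = pd.DataFrame({
    "ts": ["2021-01-01", "2021-01-02", "2021-01-02"],
    "site": ["a", "b", "a"],
    "value": [1.0, 2.0, 3.0],
})

df["ts"] = pd.to_datetime(df["ts"])            # pd.to_datetime()
single_col = df["value"].to_frame()            # pd.Series.to_frame()
totals = df.groupby("site")["value"].sum()     # groupby
df = df.set_index("ts").sort_index()           # set_index() / sort_index()
flat_cols = df.columns.to_flat_index()         # to_flat_index() (on the column Index)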

These questions are really useful for investigating why Modin is not working for me in Azure ML. Thank you so much.

Thanks for the context @Jaimemosg, it does seem like additional configuration is needed.

In order to use multiple VMs, you’ll need to set up a Dask execution cluster (it looks like you’re using Dask). The Dask cluster can be set up via the commands in the Dask documentation: Microsoft Azure — Dask Cloud Provider 2021.6.0+9.ge1e6a0f documentation. Once Dask execution is set up, Modin can leverage it.

This is only required for running in a cluster. If this isn’t set up, Modin will only know about the node you log into and only be able to use those resources.
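A minimal sketch of what that setup could look like with dask-cloudprovider is below; the resource group, vnet, and security group names are placeholders for your own, and the parameter names should be double-checked against the documentation linked above:

from dask_cloudprovider.azure import AzureVMCluster
from distributed import Client

# Placeholders: substitute your own Azure resources.
cluster = AzureVMCluster(
    resource_group="<your-resource-group>",
    vnet="<your-vnet>",
    security_group="<your-security-group>",
    vm_size="Standard_E8s_v3",  # E-series, matching your setup
    n_workers=4,
)
client = Client(cluster)

# With a Dask client already running in the session, Modin picks it up and
# distributes work across the cluster's workers instead of a single node.
import modin.pandas as pd
df = pd.read_csv("<path-to-your-data>.csv")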

Does that make sense?

I provided a link to the distributed implementation of XGBoost that we introduced to Modin some time ago. As for LightGBM, we have not yet made any effort to integrate it with Modin. Feel free to open an issue if you would like to see a distributed LightGBM in Modin. You may also want to look at integrations of LightGBM with other distributed frameworks (link).
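For example, LightGBM itself ships Dask estimators; a rough sketch (not Modin-specific, with made-up data) looks like this:

import dask.array as da
import lightgbm as lgb
from distributed import Client, LocalCluster

# A local cluster just for illustration; in Azure you would connect the
# client to the Dask cluster you already stood up.
client = Client(LocalCluster(n_workers=2))

X = da.random.random((10_000, 5), chunks=(1_000, 5))
y = da.random.random((10_000,), chunks=(1_000,))

model = lgb.DaskLGBMRegressor(n_estimators=50)
model.fit(X, y)
preds = model.predict(X)  # returns a lazy Dask array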