Bug when adding a new column to a dataframe as a Series

Adding a new column as a Series to a dataframe in Modin somehow corrupts the dataframe, so a subsequent groupby.apply call on it raises an error. I have written a simple script to reproduce the error, included in this post. Running the same code with pandas works fine. The workaround I found is to convert the new column from a Series to a list before adding it to the dataframe. Am I doing something wrong here?

Modin version: 0.15.1
Pandas version: 1.4.2

import modin.pandas as pd
import ray
ray.init()

# works fine with pandas
# import pandas as pd

d = {'col1': ['a', 'b'], 'col2': [1, 2]}
df = pd.DataFrame(data=d)

# this works fine
df.groupby(by='col1').apply(lambda x: x['col2'] + 1)

# this works fine:
# df['col3'] = [3, 4]
# so the workaround for a similar situation would be to convert the new column to a list
# df['col3'] = list(pd.Series([3, 4]))

df['col3'] = pd.Series([3, 4])

# this gives error
df.groupby(by='col1').apply(lambda x: x['col2'] + 1)

ray.shutdown()

@curious thank you for describing your issue in detail. This is a known bug, Modin issue 3435: the user-defined function in groupby.apply sometimes can’t access a column. Please watch that issue for a fix.

The workarounds you describe happen to work here because they preserve the property that the Modin dataframe has only one column partition. In other words, all the columns are available in every partition. More generally, it's probably best to do this step in pandas:

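import modin.pandas

pdf = modin_df._to_pandas()  # convert to a plain pandas dataframe
pdf = pdf.groupby('col1').apply(lambda x: x['col2'] + 1)  # 'col1' is the grouping column in your script
modin_df = modin.pandas.DataFrame(pdf)  # convert the result back to Modin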
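
If you need this in more than one place, the same fallback could be wrapped in a small helper. This is only a sketch under a few assumptions: groupby_apply_via_pandas is a hypothetical name (it is not part of Modin or pandas), and it relies on the same _to_pandas() conversion used in the snippet above. Plain pandas dataframes are passed through unchanged, so the example below runs with pandas alone.

```python
import pandas


def groupby_apply_via_pandas(df, by, func):
    """Hypothetical helper: run groupby.apply in plain pandas.

    Accepts either a Modin dataframe or a pandas dataframe.
    """
    # Modin dataframes expose _to_pandas(); plain pandas dataframes do not,
    # so fall back to using the object as-is in that case.
    to_pandas = getattr(df, "_to_pandas", None)
    pdf = to_pandas() if callable(to_pandas) else df
    # The groupby and the user-defined function now run entirely in pandas.
    return pdf.groupby(by).apply(func)


# Same data as the reproduction script, built with plain pandas here.
pdf = pandas.DataFrame({'col1': ['a', 'b'], 'col2': [1, 2]})
pdf['col3'] = pandas.Series([3, 4])
result = groupby_apply_via_pandas(pdf, 'col1', lambda x: x['col2'] + 1)
```

The helper returns a pandas object. To keep working in Modin afterwards, wrap the result with modin.pandas.DataFrame(...) as in the snippet above.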

Please let me know whether that helps!

@mahesh Thank you for your quick response and the workaround. I will follow the issue on the Modin repo.