Transform and apply a function¶
There are many APIs that allow users to apply a function against pandas-on-Spark DataFrame such as
DataFrame.transform()
, DataFrame.apply()
, DataFrame.pandas_on_spark.transform_batch()
,
DataFrame.pandas_on_spark.apply_batch()
, Series.pandas_on_spark.transform_batch()
, etc. Each has a distinct
purpose and works differently internally. This section describes the differences among
them where users are confused often.
transform
and apply
¶
The main difference between DataFrame.transform()
and DataFrame.apply()
is that the former requires
to return the same length of the input and the latter does not require this. See the example below:
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pser):
... return pser + 1 # should always return the same length as input.
...
>>> psdf.transform(pandas_plus)
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[5,6,7]})
>>> def pandas_plus(pser):
... return pser[pser % 2 == 1] # allows an arbitrary length
...
>>> psdf.apply(pandas_plus)
In this case, each function takes a pandas Series, and pandas API on Spark computes the functions in a distributed manner as below.
In case of ‘column’ axis, the function takes each row as a pandas Series.
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pser):
... return sum(pser) # allows an arbitrary length
...
>>> psdf.apply(pandas_plus, axis='columns')
The example above calculates the summation of each row as a pandas Series. See below:
In the examples above, the type hints were not used for simplicity but it is encouraged to use them to avoid performance penalty. Please refer the API documentations.
pandas_on_spark.transform_batch
and pandas_on_spark.apply_batch
¶
In DataFrame.pandas_on_spark.transform_batch()
, DataFrame.pandas_on_spark.apply_batch()
, Series.pandas_on_spark.transform_batch()
, etc., the batch
postfix means each chunk in pandas-on-Spark DataFrame or Series. The APIs slice the pandas-on-Spark DataFrame or Series, and
then apply the given function with pandas DataFrame or Series as input and output. See the examples below:
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
... return pdf + 1 # should always return the same length as input.
...
>>> psdf.pandas_on_spark.transform_batch(pandas_plus)
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pdf):
... return pdf[pdf.a > 1] # allow arbitrary length
...
>>> psdf.pandas_on_spark.apply_batch(pandas_plus)
The functions in both examples take a pandas DataFrame as a chunk of pandas-on-Spark DataFrame, and output a pandas DataFrame. Pandas API on Spark combines the pandas DataFrames as a pandas-on-Spark DataFrame.
Note that DataFrame.pandas_on_spark.transform_batch()
has the length restriction - the length of input and output should be
the same - whereas DataFrame.pandas_on_spark.apply_batch()
does not. However, it is important to know that
the output belongs to the same DataFrame when DataFrame.pandas_on_spark.transform_batch()
returns a Series, and
you can avoid a shuffle by the operations between different DataFrames. In case of DataFrame.pandas_on_spark.apply_batch()
, its output is always
treated as though it belongs to a new different DataFrame. See also
Operations on different DataFrames for more details.
In case of Series.pandas_on_spark.transform_batch()
, it is also similar with DataFrame.pandas_on_spark.transform_batch()
; however, it takes
a pandas Series as a chunk of pandas-on-Spark Series.
>>> psdf = ps.DataFrame({'a': [1,2,3], 'b':[4,5,6]})
>>> def pandas_plus(pser):
... return pser + 1 # should always return the same length as input.
...
>>> psdf.a.pandas_on_spark.transform_batch(pandas_plus)
Under the hood, each batch of pandas-on-Spark Series is split to multiple pandas Series, and each function computes on that as below:
There are more details such as the type inference and preventing its performance penalty. Please refer the API documentations.