Updating a dataframe column in spark

Looking at the new spark dataframe api, it is unclear whether it is possible to modify dataframe columns.

How would I go about changing a value in row x column y of a dataframe?

In pandas this would be df.ix[x,y] = new_value

Edit: Consolidating what was said below, you can't modify the existing dataframe as it is immutable, but you can return a new dataframe with the desired modifications.

If you just want to replace a value in a column based on a condition, like np.where:

    from pyspark.sql import functions as F

    update_func = (F.when(F.col('update_col') == replace_val, new_value)
    df = df.withColumn('new_column_name', update_func)

If you want to perform some operation on a column and create a new column that is added to the dataframe:

    import pyspark.sql.functions as F
    import pyspark.sql.types as T

    def my_func(col):
        do stuff to column here
        return transformed_value

    # if we assume that my_func returns a string
    my_udf = F.UserDefinedFunction(my_func, T.StringType())

    df = df.withColumn('new_column_name', my_udf('update_col'))

If you want the new column to have the same name as the old column, you could add the additional step:

    df = df.drop('update_col').withColumnRenamed('new_column_name', 'update_col')

While you cannot modify a column as such, you may operate on a column and return a new DataFrame reflecting that change. For that you'd first create a UserDefinedFunction implementing the operation to apply and then selectively apply that function to the targeted column only. In Python:

    from pyspark.sql.functions import UserDefinedFunction
    from pyspark.sql.types import StringType

    name = 'target_column'
    udf = UserDefinedFunction(lambda x: 'new_value', StringType())
    new_df = old_df.select(*[udf(column).alias(name) if column == name else column for column in old_df.columns])

new_df now has the same schema as old_df (assuming that old_df.target_column was of type StringType as well) but all values in column target_column will be new_value.

From: stackoverflow.com/q/29109916