Create Spark DataFrame. Can not infer schema for type: <type 'float'>

Could someone help me solve this problem I have with Spark DataFrame?

When I call myFloatRdd.toDF() I get an error:

TypeError: Can not infer schema for type: <type 'float'>

I don't understand why...

Example:

    myFloatRdd = sc.parallelize([1.0,2.0,3.0])
    df = myFloatRdd.toDF()

Thanks

SparkSession.createDataFrame, which is used under the hood, requires an RDD or list of Row, tuple, list, or dict* objects, or a pandas.DataFrame, unless a schema with a DataType is provided. Try converting each float to a one-element tuple like this:

    myFloatRdd.map(lambda x: (x, )).toDF()

or even better:

    from pyspark.sql import Row

    row = Row("val") # Or some other column name
    myFloatRdd.map(row).toDF()

To create a DataFrame from a list of scalars you'll have to use SparkSession.createDataFrame directly and provide a schema***:

    from pyspark.sql.types import FloatType

    df = spark.createDataFrame([1.0, 2.0, 3.0], FloatType())

    df.show()

    ## +-----+
    ## |value|
    ## +-----+
    ## |  1.0|
    ## |  2.0|
    ## |  3.0|
    ## +-----+

but for a simple range it would be better to use SparkSession.range:

    from pyspark.sql.functions import col

    spark.range(1, 4).select(col("id").cast("double"))

* No longer supported.

** Spark SQL also provides limited support for schema inference on Python objects exposing __dict__.

*** Supported only in Spark 2.0 or later.

From: stackoverflow.com/q/32742004