How to link PyCharm with PySpark?

I'm new with apache spark and apparently I installed apache-spark with homebrew in my macbook:

    Last login: Fri Jan  8 12:52:04 on console
    user@MacBook-Pro-de-User-2:~$ pyspark
    Python 2.7.10 (default, Jul 13 2015, 12:05:58)
    [GCC 4.2.1 Compatible Apple LLVM 6.1.0 (clang-602.0.53)] on darwin
    Type "help", "copyright", "credits" or "license" for more information.
    Using Spark's default log4j profile: org/apache/spark/
    16/01/08 14:46:44 INFO SparkContext: Running Spark version 1.5.1
    16/01/08 14:46:46 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
    16/01/08 14:46:47 INFO SecurityManager: Changing view acls to: user
    16/01/08 14:46:47 INFO SecurityManager: Changing modify acls to: user
    16/01/08 14:46:47 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(user); users with modify permissions: Set(user)
    16/01/08 14:46:50 INFO Slf4jLogger: Slf4jLogger started
    16/01/08 14:46:50 INFO Remoting: Starting remoting
    16/01/08 14:46:51 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkDriver@]
    16/01/08 14:46:51 INFO Utils: Successfully started service 'sparkDriver' on port 50199.
    16/01/08 14:46:51 INFO SparkEnv: Registering MapOutputTracker
    16/01/08 14:46:51 INFO SparkEnv: Registering BlockManagerMaster
    16/01/08 14:46:51 INFO DiskBlockManager: Created local directory at /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/blockmgr-769e6f91-f0e7-49f9-b45d-1b6382637c95
    16/01/08 14:46:51 INFO MemoryStore: MemoryStore started with capacity 530.0 MB
    16/01/08 14:46:52 INFO HttpFileServer: HTTP File server directory is /private/var/folders/5x/k7n54drn1csc7w0j7vchjnmc0000gn/T/spark-8e4749ea-9ae7-4137-a0e1-52e410a8e4c5/httpd-1adcd424-c8e9-4e54-a45a-a735ade00393
    16/01/08 14:46:52 INFO HttpServer: Starting HTTP Server
    16/01/08 14:46:52 INFO Utils: Successfully started service 'HTTP file server' on port 50200.
    16/01/08 14:46:52 INFO SparkEnv: Registering OutputCommitCoordinator
    16/01/08 14:46:52 INFO Utils: Successfully started service 'SparkUI' on port 4040.
    16/01/08 14:46:52 INFO SparkUI: Started SparkUI at
    16/01/08 14:46:53 WARN MetricsSystem: Using default name DAGScheduler for source because is not set.
    16/01/08 14:46:53 INFO Executor: Starting executor ID driver on host localhost
    16/01/08 14:46:53 INFO Utils: Successfully started service '' on port 50201.
    16/01/08 14:46:53 INFO NettyBlockTransferService: Server created on 50201
    16/01/08 14:46:53 INFO BlockManagerMaster: Trying to register BlockManager
    16/01/08 14:46:53 INFO BlockManagerMasterEndpoint: Registering block manager localhost:50201 with 530.0 MB RAM, BlockManagerId(driver, localhost, 50201)
    16/01/08 14:46:53 INFO BlockManagerMaster: Registered BlockManager
    Welcome to
          ____              __
         / __/__  ___ _____/ /__
        _\ \/ _ \/ _ `/ __/  '_/
       /__ / .__/\_,_/_/ /_/\_\   version 1.5.1

    Using Python version 2.7.10 (default, Jul 13 2015 12:05:58)
    SparkContext available as sc, HiveContext available as sqlContext.

I would like start playing in order to learn more about MLlib. However, I use Pycharm to write scripts in python. The problem is: when I go to Pycharm and try to call pyspark, Pycharm can not found the module. I tried adding the path to Pycharm as follows:

cant link pycharm with spark

Then from a blog I tried this:

    import os
    import sys

    # Path for spark source folder

    # Append pyspark  to Python Path

        from pyspark import SparkContext
        from pyspark import SparkConf
        print ("Successfully imported Spark Modules")

    except ImportError as e:
        print ("Can not import Spark Modules", e)

And still can not start using PySpark with Pycharm, any idea of how to "link" PyCharm with apache-pyspark?.


Then I search for apache-spark and python path in order to set the environment variables of Pycharm:

apache-spark path:

    user@MacBook-Pro-User-2:~$ brew info apache-spark
    apache-spark: stable 1.6.0, HEAD
    Engine for large-scale data processing
    /usr/local/Cellar/apache-spark/1.5.1 (649 files, 302.9M) *
      Poured from bottle

python path:

    user@MacBook-Pro-User-2:~$ brew info python
    python: stable 2.7.11 (bottled), HEAD
    Interpreted, interactive, object-oriented programming language
    /usr/local/Cellar/python/2.7.10_2 (4,965 files, 66.9M) *

Then with the above information I tried to set the environment variables as follows:

configuration 1

Any idea of how to correctly link Pycharm with pyspark?

Then when I run a python script with the above configuration I have this exception:

    /usr/local/Cellar/python/2.7.10_2/Frameworks/Python.framework/Versions/2.7/bin/python2.7 /Users/user/PycharmProjects/spark_examples/
    Traceback (most recent call last):
      File "/Users/user/PycharmProjects/spark_examples/", line 1, in <module>
        from pyspark import SparkContext
    ImportError: No module named pyspark

UPDATE: Then I tried this configurations proposed by @zero323

Configuration 1:


conf 1


     user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1$ ls
    CHANGES.txt           NOTICE                libexec/
    LICENSE               bin/

Configuration 2:


enter image description here


    user@MacBook-Pro-de-User-2:/usr/local/Cellar/apache-spark/1.5.1/libexec$ ls
    R/        bin/      data/     examples/ python/
    RELEASE   conf/     ec2/      lib/      sbin/

With PySpark package (Spark 2.2.0 and later)

With SPARK-1267 being merged you should be able to simplify the process by pip installing Spark in the environment you use for PyCharm development.

  1. Go to File -> Settings -> Project Interpreter
  2. Click on install button and search for PySpark

enter image description here

  1. Click on install package button.

Manually with user provided Spark installation

Create Run configuration :

  1. Go to Run -> Edit configurations
  2. Add new Python configuration
  3. Set Script path so it points to the script you want to execute
  4. Edit Environment variables field so it contains at least:

    • SPARK_HOME - it should point to the directory with Spark installation. It should contain directories such as bin (with spark-submit, spark-shell, etc.) and conf (with spark-defaults.conf,, etc.)
    • PYTHONPATH - it should contain $SPARK_HOME/python and optionally $SPARK_HOME/python/lib/ if not available otherwise. some-version should match Py4J version used by a given Spark installation ( - 1.5, 0.9 - 1.6, 0.10.3 - 2.0, 0.10.4 - 2.1, 0.10.4 - 2.2, 0.10.6 - 2.3)

enter image description here

  1. Apply the settings

Add PySpark library to the interpreter path (required for code completion) :

  1. Go to File -> Settings -> Project Interpreter
  2. Open settings for an interpreter you want to use with Spark
  3. Edit interpreter paths so it contains path to $SPARK_HOME/python (an Py4J if required)
  4. Save the settings


  1. Install type annotations matching installed Spark version to get better completion and static error detection (Disclaimer - I am an author of the project).


Use newly created configuration to run your script.