Unexpected behaviour with a conditional generator expression

I was running some code that unexpectedly produced a logic error in one part of the program. While investigating, I created a test file to run that set of statements in isolation and found a bug that seems very odd.

I tested this simple code:

    array = [1, 2, 2, 4, 5] # Original array
    f = (x for x in array if array.count(x) == 2) # Filters original
    array = [5, 6, 1, 2, 9] # Updates original to something else

    print(list(f)) # Outputs filtered

And the output was:

    >>> []

Yes, nothing. I was expecting the generator expression to pick out the items in the array with a count of 2 and output them, but I didn't get that:

    # Expected output
    >>> [2, 2]

When I commented out the third line to test it once again:

    array = [1, 2, 2, 4, 5] # Original array
    f = (x for x in array if array.count(x) == 2) # Filters original
    ### array = [5, 6, 1, 2, 9] # Ignore line

    print(list(f)) # Outputs filtered

The output was correct (you can test it for yourself):

    >>> [2, 2]

At one point I outputted the type of the variable f:

    array = [1, 2, 2, 4, 5] # Original array
    f = (x for x in array if array.count(x) == 2) # Filters original
    array = [5, 6, 1, 2, 9] # Updates original

    print(type(f)) # Outputs the type of f
    print(list(f)) # Outputs filtered

And I got:

    >>> <class 'generator'>
    >>> []

Why is updating a list in Python changing the output of another generator variable? This seems very odd to me.

Python's generator expressions are late binding (what the other answers call "lazy"); see PEP 289 -- Generator Expressions:

Early Binding versus Late Binding

After much discussion, it was decided that the first (outermost) for-expression [of the generator expression] should be evaluated immediately and that the remaining expressions be evaluated when the generator is executed.

[...] Python takes a late binding approach to lambda expressions and has no precedent for automatic, early binding. It was felt that introducing a new paradigm would unnecessarily introduce complexity.

After exploring many possibilities, a consensus emerged that binding issues were hard to understand and that users should be strongly encouraged to use generator expressions inside functions that consume their arguments immediately. For more complex applications, full generator definitions are always superior in terms of being obvious about scope, lifetime, and binding.

That means only the outermost for is evaluated when the generator expression is created. So the iterable named array in the "subexpression" in array is bound at that point (in fact, what gets bound is the equivalent of iter(array)). But when you iterate over the generator, the array.count call in the if clause refers to whatever is currently named array.
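As a rough sketch of that rule, the generator expression behaves much like the hand-written generator function below. The helper name `_genexpr` is made up for illustration; this is only an approximation of the binding behaviour, not what the interpreter literally generates.

```python
array = [1, 2, 2, 4, 5]

def _genexpr(iterator):
    # `iterator` was created from the old list when the generator was built.
    for x in iterator:
        # `array` is looked up each time this line runs, i.e. during iteration.
        if array.count(x) == 2:
            yield x

# Roughly equivalent to: f = (x for x in array if array.count(x) == 2)
f = _genexpr(iter(array))  # iter(array) is evaluated right here

array = [5, 6, 1, 2, 9]    # rebinds the name; the iterator still walks the old list

result = list(f)
print(result)  # []
```

The iteration walks the original list, but every count is taken in the list that array names at the time of iteration, so no element reaches a count of 2.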

Since it's actually a list, not an array, I changed the variable names in the rest of the answer to be more accurate.

In your first case the list you iterate over and the list you count in will be different. It's as if you used:

    list1 = [1, 2, 2, 4, 5]
    list2 = [5, 6, 1, 2, 9]
    f = (x for x in list1 if list2.count(x) == 2)

So you check for each element in list1 if its count in list2 is two.

You can easily verify this by modifying the second list:

    >>> lst = [1, 2, 2]
    >>> f = (x for x in lst if lst.count(x) == 2)
    >>> lst = [1, 1, 2]
    >>> list(f)
    [1]

If it iterated over the first list and counted in the first list, it would have returned [2, 2] (because the first list contains two 2s). If it iterated over and counted in the second list, the output would be [1, 1]. But since it iterates over the first list (containing one 1) and counts in the second list (which contains two 1s), the output is just a single 1.

Solution using a generator function

There are several possible solutions. I generally prefer not to use "generator expressions" if they aren't iterated over immediately. A simple generator function is enough to make it work correctly:

    def keep_only_duplicated_items(lst):
        for item in lst:
            if lst.count(item) == 2:
                yield item

And then use it like this:

    lst = [1, 2, 2, 4, 5]
    f = keep_only_duplicated_items(lst)
    lst = [5, 6, 1, 2, 9]

    >>> list(f)
    [2, 2]

Note that PEP 289 (quoted above) also states that for anything more complicated a full generator definition is preferable.
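Another option, in line with the PEP's advice to consume generator expressions immediately, is to build the result on the spot when it is small enough to hold in memory. A minimal sketch, assuming a plain list is acceptable instead of a lazy generator (the name `filtered` is just for illustration):

```python
lst = [1, 2, 2, 4, 5]

# A list comprehension runs to completion right here, so both the iteration
# and every lst.count() call see the same, original list.
filtered = [x for x in lst if lst.count(x) == 2]

lst = [5, 6, 1, 2, 9]  # rebinding afterwards no longer matters

print(filtered)  # [2, 2]
```

This gives up laziness in exchange for predictable binding, which is usually a fair trade for short lists.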

A better solution using a generator function with a Counter

A better solution is to count the elements once with collections.Counter and then do each lookup in constant time, which results in linear time overall. This avoids the quadratic runtime of the previous version, which iterates over the whole list for each element in the list:

    from collections import Counter

    def keep_only_duplicated_items(lst):
        cnts = Counter(lst)
        for item in lst:
            if cnts[item] == 2:
                yield item
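Used the same way as before, the Counter version should give the same result. The sketch below repeats the function so that it runs on its own; the variable name `result` is only for illustration.

```python
from collections import Counter

def keep_only_duplicated_items(lst):
    # The parameter `lst` is bound to the list object passed in at call time.
    cnts = Counter(lst)
    for item in lst:
        if cnts[item] == 2:
            yield item

lst = [1, 2, 2, 4, 5]
f = keep_only_duplicated_items(lst)
lst = [5, 6, 1, 2, 9]  # rebinding the outer name does not affect the generator

result = list(f)
print(result)  # [2, 2]
```

Note that the function body only starts running on the first iteration, so the counts are taken at that point. Rebinding the outer name is harmless, but mutating the original list object in place before iterating would still be visible to the generator.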

Appendix: Using a subclass to "visualize" what happens and when it happens

It's quite easy to create a list subclass that prints when specific methods are called, so one can verify that it really works like that.

In this case I just override the methods __iter__ and count because I'm interested in which list the generator expression iterates over and in which list it counts. The method bodies just delegate to the superclass and print something (since this uses super without arguments and f-strings it requires Python 3.6 or later, but it should be easy to adapt for other Python versions):

    class MyList(list):
        def __iter__(self):
            print(f'__iter__() called on {self!r}')
            return super().__iter__()

        def count(self, item):
            cnt = super().count(item)
            print(f'count({item!r}) called on {self!r}, result: {cnt}')
            return cnt

This simple subclass just prints when the __iter__ and count methods are called:

    >>> lst = MyList([1, 2, 2, 4, 5])

    >>> f = (x for x in lst if lst.count(x) == 2)
    __iter__() called on [1, 2, 2, 4, 5]

    >>> lst = MyList([5, 6, 1, 2, 9])

    >>> print(list(f))
    count(1) called on [5, 6, 1, 2, 9], result: 1
    count(2) called on [5, 6, 1, 2, 9], result: 1
    count(2) called on [5, 6, 1, 2, 9], result: 1
    count(4) called on [5, 6, 1, 2, 9], result: 0
    count(5) called on [5, 6, 1, 2, 9], result: 1
    []

From: stackoverflow.com/q/54245618