Reading a huge .csv file

I'm currently trying to read data from .csv files in Python 2.7, with up to 1 million rows and 200 columns (the files range from 100 MB to 1.6 GB). I can do this (very slowly) for the files with under 300,000 rows, but once I go above that I get memory errors. My code looks like this:

    def getdata(filename, criteria):
        data=[]
        for criterion in criteria:
            data.append(getstuff(filename, criterion))
        return data

    def getstuff(filename, criterion):
        import csv
        data=[]
        with open(filename, "rb") as csvfile:
            datareader=csv.reader(csvfile)
            for row in datareader: 
                if row[3]=="column header":
                    data.append(row)
                elif len(data)<2 and row[3]!=criterion:
                    pass
                elif row[3]==criterion:
                    data.append(row)
                else:
                    return data

The reason for the else clause in the getstuff function is that all the elements which fit the criterion will be listed together in the csv file, so I leave the loop when I get past them to save time.

My questions are:

  1. How can I manage to get this to work with the bigger files?

  2. Is there any way I can make it faster?

My computer has 8 GB of RAM, runs 64-bit Windows 7, and has a 3.40 GHz processor (not certain what information you need).

Thanks very much for any help!

You are reading all rows into a list, then processing that list. Don't do that.

Process your rows as you produce them. If you need to filter the data first, use a generator function:

    import csv

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            yield next(datareader)  # yield the header row
            count = 0
            for row in datareader:
                if row[3] == criterion:
                    yield row
                    count += 1
                elif count:
                    # stop once we are past the consecutive block of matching rows
                    return

I also simplified your filter test; the logic is the same but more concise.

Because you only want a single consecutive run of rows matching the criterion, you could also use:

    import csv
    from itertools import dropwhile, takewhile

    def getstuff(filename, criterion):
        with open(filename, "rb") as csvfile:
            datareader = csv.reader(csvfile)
            yield next(datareader)  # yield the header row
            # skip rows until the first match, then yield the matching rows,
            # then stop reading altogether
            # Python 2: use `for row in takewhile(...): yield row`
            # instead of `yield from takewhile(...)`
            yield from takewhile(
                lambda r: r[3] == criterion,
                dropwhile(lambda r: r[3] != criterion, datareader))
            return

You can now loop over getstuff() directly. Do the same in getdata():

    def getdata(filename, criteria):
        for criterion in criteria:
            for row in getstuff(filename, criterion):
                yield row

Now loop directly over getdata() in your code:

    for row in getdata(somefilename, sequence_of_criteria):
        # process row

You now only hold one row in memory, instead of thousands of rows per criterion.
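
For example, if all you want to do with the matching rows is write them out to a new, smaller file, a row-at-a-time loop over getdata() is enough. This is just a sketch (the write_matches name and output filename are made up here); note that getstuff() yields the header row once per criterion, so it gets written too:

    import csv

    def write_matches(infilename, outfilename, criteria):
        # stream matching rows straight to disk; no list is ever built
        with open(outfilename, "wb") as outfile:  # Python 2; use newline='' in Python 3
            writer = csv.writer(outfile)
            for row in getdata(infilename, criteria):
                writer.writerow(row)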

yield makes a function a generator function, which means it won't do any work until you start looping over it.
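
As a quick illustration of that laziness (the filename and criterion here are just placeholders):

    rows = getstuff("huge.csv", "some value")  # returns a generator immediately; the file is not opened yet
    header = next(rows)                        # now the file is opened and the header row is read
    for row in rows:                           # the remaining matching rows are read one at a time
        print(row)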

From: stackoverflow.com/q/17444679
