Fix Python – Load CSV file with PySpark

Question

Asked By – Kernael

I’m new to Spark and I’m trying to read CSV data from a file. Here’s what I am doing:

sc.textFile('file.csv') \
    .map(lambda line: (line.split(',')[0], line.split(',')[1])) \
    .collect()

I would expect this call to give me a list of the first two columns of my file, but I’m getting this error:

File "", line 1, in
IndexError: list index out of range

although my CSV file has more than one column.

Now we will look at the solution for the issue: Load CSV file with PySpark


Answer

Are you sure that all the lines have at least 2 columns? Just to check, can you try something like this?

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)>1) \
    .map(lambda line: (line[0],line[1])) \
    .collect()
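
A common culprit is a trailing empty line, which splits to a single empty field; the filter above drops it. Note also that a plain line.split(",") breaks on quoted fields that contain commas. Here is a minimal sketch using Python’s standard csv module instead (the file name and column count are just the ones from the question):

import csv

def parse_line(line):
    # csv.reader understands quoting, so a field like "a,b" stays
    # one column, unlike a naive line.split(",")
    return next(csv.reader([line]))

sc.textFile("file.csv") \
    .map(parse_line) \
    .filter(lambda cols: len(cols) > 1) \
    .map(lambda cols: (cols[0], cols[1])) \
    .collect()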

Alternatively, you could print the culprit lines (if any):

sc.textFile("file.csv") \
    .map(lambda line: line.split(",")) \
    .filter(lambda line: len(line)<=1) \
    .collect()
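
If you are on Spark 2.0 or later, a simpler route is the built-in DataFrame CSV reader, which handles quoting and malformed rows for you. A sketch, assuming a SparkSession named spark and a header-less file:

# Spark 2.0+ ships a CSV reader on the DataFrame API; malformed rows
# can be dropped instead of raising an error
df = spark.read.csv("file.csv", mode="DROPMALFORMED")

# with no header, columns are auto-named _c0, _c1, ...
df.select("_c0", "_c1").collect()

On Spark 1.x, the external spark-csv package (com.databricks.spark.csv) offered a similar interface.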

This question is answered by – G Quintana

This answer is collected from Stack Overflow, reviewed by FixPython community admins, and is licensed under CC BY-SA 2.5, CC BY-SA 3.0, and CC BY-SA 4.0.