Fix Python – Best way to get the max value in a Spark dataframe column

I’m trying to figure out the best way to get the largest value in a Spark dataframe column.
Consider the following example:
df = spark.createDataFrame([(1., 4.), (2., 5.), (3., 6.)], ["A", "B"])
df.show()

Which creates:
+---+---+
|  A|  B|
+---+---+
|1.0|4.0|
|2.0|5.0|
|3.0|6.0|
+---+---+

My goal is to find the largest value in column A (by insp….
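For reference, a minimal sketch of the usual answer (not part of the truncated excerpt above, and assuming the df defined in the example): aggregate with the built-in max function instead of collecting the whole column.

from pyspark.sql import functions as F

# Returns a one-row dataframe holding the aggregate; pull the scalar out of it
max_val = df.agg(F.max("A")).collect()[0][0]
print(max_val)  # 3.0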

Fix Python – Load CSV file with PySpark

I’m new to Spark and I’m trying to read CSV data from a file with Spark.
Here’s what I am doing:
sc.textFile('file.csv')
.map(lambda line: (line.split(',')[0], line.split(',')[1]))
.collect()

I would expect this call to give me a list of the first two columns of my file, but I’m getting this error:

File "", line 1, in
IndexError: list ….
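The IndexError is what you get when a line has fewer than two comma-separated fields (an empty trailing line is enough). A hedged sketch of two common fixes, the first assuming Spark 2.x+ with a SparkSession named spark:

# Preferred: let Spark parse the CSV into a dataframe
df = spark.read.csv('file.csv', inferSchema=True)
df.select(df.columns[:2]).show()

# RDD route: split once, then skip short lines before indexing into them
result = (sc.textFile('file.csv')
            .map(lambda line: line.split(','))
            .filter(lambda fields: len(fields) >= 2)
            .map(lambda fields: (fields[0], fields[1]))
            .collect())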

Fix Python – Sort in descending order in PySpark

I’m using PySpark (Python 2.7.9/Spark 1.3.1) and have a dataframe GroupObject which I need to filter and sort in descending order. I’m trying to achieve it with this piece of code:
group_by_dataframe.count().filter("`count` >= 10").sort('count', ascending=False)

But it throws the following error.
sort() got an unexpected keyword argument 'ascending'….
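The sort() method in that Spark release does not accept the ascending keyword, hence the error. A minimal sketch of the usual fix, reusing the names from the question: pass a descending column expression instead.

from pyspark.sql.functions import desc

# desc("count") builds a descending sort expression that old releases accept
group_by_dataframe.count().filter("`count` >= 10").sort(desc("count"))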

Fix Python – Spark Dataframe distinguish columns with duplicated name

As far as I know, in a Spark DataFrame multiple columns can have the same name, as shown in the dataframe snapshot below:
[
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0})),
Row(a=107831, f=SparseVector(5, {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 0.0}), a=125….
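A hedged sketch of the standard workaround (df, the join key a, and the output names here are assumptions for illustration): alias each side of the join, then address the duplicated columns through the aliases.

from pyspark.sql.functions import col

df1 = df.alias("df1")
df2 = df.alias("df2")

joined = df1.join(df2, col("df1.a") == col("df2.a"))

# The aliases make the two f columns unambiguous
result = joined.select(col("df1.f").alias("f_left"), col("df2.f").alias("f_right"))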

Fix Python – How to change a dataframe column from String type to Double type in PySpark?

I have a dataframe with a column of String type.
I wanted to change the column type to Double type in PySpark.
Here is how I did it:
toDoublefunc = UserDefinedFunction(lambda x: x, DoubleType())
changedTypedf = joindf.withColumn("label", toDoublefunc(joindf['show']))

Just wanted to know: is this the right way to do it, as while running
through Logis….
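Note that the UDF above declares a Double return type but the lambda returns x unchanged, so no real conversion happens. The idiomatic approach, as a sketch using the joindf and column names from the question, is to cast the column:

from pyspark.sql.types import DoubleType

# Cast directly; no UDF required
changedTypedf = joindf.withColumn("label", joindf["show"].cast(DoubleType()))

# Equivalent short form using the SQL type name
changedTypedf = joindf.withColumn("label", joindf["show"].cast("double"))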

Fix Python – Filter Pyspark dataframe column with None value

I’m trying to filter a PySpark dataframe that has None as a row value:
df.select('dt_mvmt').distinct().collect()

[Row(dt_mvmt=u'2016-03-27'),
 Row(dt_mvmt=u'2016-03-28'),
 Row(dt_mvmt=u'2016-03-29'),
 Row(dt_mvmt=None),
 Row(dt_mvmt=u'2016-03-30'),
 Row(dt_mvmt=u'2016-03-31')]

and I can filter correctly with a string value:
df[df.dt_mvmt == '20….
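A minimal sketch of the usual answer (not part of the truncated excerpt): comparisons like df.dt_mvmt == None follow SQL NULL semantics and never evaluate to true, so use the dedicated null predicates instead.

# Rows where dt_mvmt is NULL
df.filter(df.dt_mvmt.isNull()).count()

# Rows where dt_mvmt is not NULL
df.filter(df.dt_mvmt.isNotNull()).count()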

Fix Python – How do I add a new column to a Spark DataFrame (using PySpark)?

I have a Spark DataFrame (using PySpark 1.5.1) and would like to add a new column.
I’ve tried the following without any success:
type(randomed_hours) # => list

# Create in Python and transform to RDD

new_col = pd.DataFrame(randomed_hours, columns=['new_col'])

spark_new_col = sqlContext.createDataFrame(new_col)

my_df_spark.withColumn("hours", s….
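For context, a hedged sketch of why the pandas round trip fails and what works instead: withColumn expects a Column expression built from the same dataframe, not a Python list or a second dataframe. A constant or a derived expression does the job (the minutes column below is hypothetical):

from pyspark.sql.functions import lit

# Add a constant column
df_with_hours = my_df_spark.withColumn("hours", lit(0))

# Or derive the new column from an existing one (hypothetical "minutes" column)
df_with_hours = my_df_spark.withColumn("hours", my_df_spark["minutes"] / 60)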

Fix Python – How to change dataframe column names in pyspark?

I come from a pandas background and am used to reading data from CSV files into a dataframe and then changing the column names to something useful with this simple command:
df.columns = new_column_name_list

However, the same doesn’t work in pyspark dataframes created using sqlContext.
The only solution I could figure out to do this easily i….
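A minimal sketch of the common answers (new_column_name_list is the name from the question; the rest is assumed): either rename every column at once with toDF, or rename columns individually.

# Rename all columns in one shot
df = df.toDF(*new_column_name_list)

# Or rename a single column
df = df.withColumnRenamed("old_name", "new_name")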