Creating a Spark job using PySpark and executing it in AWS EMR

Spark is a data processing engine that is well suited to a wide range of situations. Data scientists and application developers integrate Spark into their own implementations to transform, analyze, and query data at scale. Workloads commonly associated with Spark include interactive queries over huge data sets, machine learning problems, and processing of streaming data from various sources.

PySpark is the interface that provides access to Spark from the Python programming language; it is essentially the Python API for Spark.
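As a minimal sketch of what that looks like, a Spark session can be created and queried from plain Python (the app name and values below are placeholders for illustration, not taken from this article):

from pyspark.sql import SparkSession

# starting a Spark session from Python
spark = SparkSession.builder.appName("example").getOrCreate()

# building a small dataframe and filtering it
df = spark.createDataFrame([(1, "a"), (2, "b")], ["id", "label"])
df.filter(df.id > 1).show()

spark.stop()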

Amazon Elastic MapReduce, also known as EMR, is an Amazon Web Services offering for big data processing and analysis. It is built on Apache Hadoop, a Java-based framework that supports the processing of huge data sets in a distributed computing environment. EMR covers a broad set of big data use cases, such as bioinformatics, scientific simulation, machine learning, and data transformation.
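For context, a cluster with Spark installed can be launched from the AWS CLI along these lines; the cluster name, release label, instance sizes, and key pair below are placeholder values, not ones from this article:

aws emr create-cluster \
  --name "pyspark-test-cluster" \
  --release-label emr-5.30.0 \
  --applications Name=Spark \
  --instance-type m5.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --ec2-attributes KeyName=my-key-pair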


I’ve been experimenting with PySpark for the last few days, and I was able to build a simple Spark application and execute it as a step on an AWS EMR cluster. The following functionalities were covered within this use case:

- reading two CSV files from an S3 bucket into dataframes
- joining the two dataframes on a common column
- writing the result back to S3 in Parquet format
- submitting the job to an EMR cluster as a step

Let me explain each one of the above by providing the appropriate snippets.

First, two files are retrieved from an S3 bucket, each of which will be stored in its own dataframe.

# importing the necessary libraries
from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.sql.functions import col
from itertools import islice

# creating the Spark context and the SQL context
sc = SparkContext.getOrCreate()
sqlContext = SQLContext(sc)

# reading the first csv file from S3 and storing it in an RDD
rdd1 = sc.textFile("s3n://pyspark-test-kula/test.csv").map(lambda line: line.split(","))

# removing the first row, as it contains the header
rdd1 = rdd1.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)

# converting the RDD into a dataframe; the column names here are
# assumed for illustration, since the article does not list them
df1 = rdd1.toDF(['policyID', 'county', 'construction', 'line'])

# printing the dataframe
df1.show()

# dropping rows that contain null values; this step is implied by the
# df1WithoutNullVal variable used in the join below
df1WithoutNullVal = df1.na.drop()
df1WithoutNullVal.show()

# reading the second csv file from S3 and removing its header row
rdd2 = sc.textFile("s3n://pyspark-test-kula/test2.csv").map(lambda line: line.split(","))

rdd2 = rdd2.mapPartitionsWithIndex(
    lambda idx, it: islice(it, 1, None) if idx == 0 else it
)

# converting the second RDD into a dataframe with named columns
df2 = rdd2.toDF(['policyID', 'zip', 'region', 'state'])

df2.show()

# inner-joining the two dataframes on the common policyID column
innerjoineddf = df1WithoutNullVal.alias('a').join(
    df2.alias('b'), col('b.policyID') == col('a.policyID')
).select(
    [col('a.' + xx) for xx in df1WithoutNullVal.columns]
    + [col('b.zip'), col('b.region'), col('b.state')]
)

innerjoineddf.show()

# writing the joined dataframe back to S3 in Parquet format
innerjoineddf.write.parquet("s3n://pyspark-transformed-kula/test.parquet")
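As a quick sanity check (not part of the original script), the Parquet output can be read back into a dataframe to confirm the write succeeded:

# reading the transformed data back from S3 to verify the write
verifydf = sqlContext.read.parquet("s3n://pyspark-transformed-kula/test.parquet")
verifydf.printSchema()
verifydf.show(5)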

Once we’re done with the above steps, we’ve created a working Python script that retrieves two CSV files from S3, stores them in separate dataframes, and merges them into one based on a common column.
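Before submitting the job to EMR, the script can also be sanity-checked against a local Spark installation; this assumes the script is saved as pyspark.py and that S3 credentials are configured locally:

spark-submit --master "local[*]" pyspark.py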

Thereafter, we can submit this Spark job to an EMR cluster as a step. This is done with a single AWS CLI command:

aws emr add-steps --cluster-id j-3H6EATEWWRWS --steps Type=Spark,Name=ParquetConversion,Args=[--deploy-mode,cluster,--master,yarn,--conf,spark.yarn.submit.waitAppCompletion=true,s3a://test/script/pyspark.py],ActionOnFailure=CONTINUE

If the above command executes successfully, it should start the step on the EMR cluster you specified. It normally takes a few minutes to produce a result, whether success or failure. If it fails, you can debug the step logs and see where things went wrong; otherwise you’ve achieved your end goal.
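To follow up on a submitted step from the CLI, its status can be queried as shown below; the step ID is a placeholder for the ID returned by the add-steps call:

# listing the steps on the cluster to find the step ID
aws emr list-steps --cluster-id j-3H6EATEWWRWS

# checking the status of a specific step
aws emr describe-step --cluster-id j-3H6EATEWWRWS --step-id s-XXXXXXXXXXXX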

Complete source-code:
