HowTo: Debugging Amazon Web Services session-related applications (AWS CLI, Boto3) in Pycharm 

It’s not hard to write a Python app or script that uses AWS infrastructure: one can call the AWS CLI tool from Python or use the Boto3 library. But because the AWS calls rely on a particular AWS session, debugging such a script in a common IDE like Pycharm is not straightforward.

The problem is that the AWS session lives inside a particular terminal session and is not shared across terminals. So even if you already have a working session in your terminal, starting a debug process in Pycharm will by default spawn a new terminal session that has no AWS session.

The solution is to make AWS session creation in a given terminal easy, and then use Pycharm’s remote debugging functionality, which lets the debugged script keep running inside that terminal session.
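Under the hood the “session” is just a set of environment variables (AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, AWS_SESSION_TOKEN) exported in that shell, so only processes started from that shell inherit them. A minimal check, assuming Boto3 is installed:

import boto3

# succeeds only in the terminal where the session variables are exported
sts = boto3.client('sts')
print(sts.get_caller_identity()['Arn'])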

1. AWS session creation 

To make activating an AWS session easier, one can use the bash script below:

#!/bin/bash
# Creates a temporary, MFA-protected AWS session and exports its credentials
# into the current shell. Run it with: source <script file> [profile]

PROFILE=${1:-$AWS_PROFILE}
ARGS=''
if [ -n "$PROFILE" ]; then
        ARGS="--profile $PROFILE"
fi

# drop any credentials left over from a previous session
unset AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY AWS_SESSION_TOKEN

# resolve the account id and IAM user name to build the MFA device ARN
IDENTITY_JSON=$(aws $ARGS sts get-caller-identity)
USER_JSON=$(aws $ARGS iam get-user)

ACCOUNT=$(echo "$IDENTITY_JSON" | jq -r '.Account')
IAMUSER=$(echo "$USER_JSON" | jq -r '.User.UserName')

MFA_ARN="arn:aws:iam::$ACCOUNT:mfa/$IAMUSER"

echo -n "MFA for $MFA_ARN: " >&2
read -r MFA_TOKEN_CODE
echo ""

# exchange the MFA token for temporary session credentials
SESSION_JSON=$(aws $ARGS sts get-session-token --serial-number "$MFA_ARN" --token-code "$MFA_TOKEN_CODE")
if [ $? -ne 0 ]; then
        # 'return' (not 'exit') so a failure does not close the sourced shell
        return 1 2>/dev/null || exit 1
fi

export AWS_ACCESS_KEY_ID=$(echo "$SESSION_JSON" | jq -r '.Credentials.AccessKeyId')
export AWS_SECRET_ACCESS_KEY=$(echo "$SESSION_JSON" | jq -r '.Credentials.SecretAccessKey')
export AWS_SESSION_TOKEN=$(echo "$SESSION_JSON" | jq -r '.Credentials.SessionToken')

echo "--- ACTIVATED ---"

Copy it into a file and run ‘source <script file>’. The script will ask for an MFA token, and if authorisation succeeds it will create a session in the current terminal.

2. Setting up Pycharm remote debugger

a. Open the “Edit Configurations…” menu (the same menu where one chooses what to run).

b. Create (+) a new configuration of the “Python Remote Debug” type.

Here you will see the Hostname and Port parameters. The Hostname is best left as ‘localhost’, but it’s better to change the Port to something specific. For this example, I’ll use port 5525.

After changing the port number, the upper field will show the command that should be used to connect to the debug server, like:

import pydevd_pycharm
pydevd_pycharm.settrace('localhost', port=5525, stdoutToServer=True, stderrToServer=True)
 

This call can be wrapped in a try-except statement and should be added as the first call in the script.
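For example, a minimal sketch (the port must match the one chosen in the run configuration), where the connection attempt is made optional so the script still runs when no debug server is listening:

try:
    import pydevd_pycharm
    # connect to the Pycharm debug server started in step 2
    pydevd_pycharm.settrace('localhost', port=5525,
                            stdoutToServer=True, stderrToServer=True)
except Exception:
    # no debug server listening -- continue without debugging
    pass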

3. Starting the debugger with an AWS session

When you run the Remote Debug configuration in Pycharm, it opens a server on the configured port and waits until a client (i.e. your python script) connects to it.

Once the script has connected to the remote debugger, the debugger takes over running it: it manages breakpoints, shows variables, etc. But the script itself keeps running in the context of the terminal from which it was started.

a. In the terminal session, create an AWS session using the script from the 1st step. One can use the terminal inside Pycharm.

b. Start the Remote debug configuration in Pycharm that was set up in the 2nd step.

c. In the terminal with the active session, run the python script prepared in the 2nd step with a plain run, like “python scriptname.py”.

d. The script will automatically connect to the Pycharm debug server, and one can debug the script while it runs inside the existing AWS session. A minimal example of such a script is sketched below.
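For illustration, a minimal sketch of such a script (the S3 listing is just a stand-in for your own logic; assumes Boto3 is installed):

import boto3

try:
    import pydevd_pycharm
    pydevd_pycharm.settrace('localhost', port=5525,
                            stdoutToServer=True, stderrToServer=True)
except Exception:
    # no debug server listening -- run without debugging
    pass

# Boto3 picks up the temporary MFA credentials exported in this terminal
s3 = boto3.client('s3')
for bucket in s3.list_buckets()['Buckets']:
    print(bucket['Name'])

Run it in the terminal with the active session (“python scriptname.py”), and Pycharm will stop on any breakpoints you have set.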

And that’s it!

Metalib to upload Pandas data frames as CSV or Parquet files to AWS S3 + create a Hive external table on top of this S3 bucket

https://github.com/elegantwist/uploader_s3_hive

At my work I do this stuff a lot:

  1. Read the data into a Pandas data frame
  2. Save the data into an AWS S3 bucket in CSV or Parquet format
  3. Create an external Hive table that reads from those files in S3

To help myself with this job, I’ve created a small meta-library that contains the basic methods I use to implement this pipeline.

Issues this lib addresses:

  • A lot of manual work preparing column types for a Pandas data frame
  • Automating the process of saving the data from pandas to a file, sending the file to S3, and then deleting it locally
  • Automating the generation of the ‘create’ statement for an external Hive table, with column types derived from the basic types of the data frame (a sketch of the idea follows this list)
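To illustrate the last point, here is a rough sketch of the idea (not the library’s actual code, and the mapping details may differ):

import pandas as pd

# hypothetical dtype-to-Hive-type mapping, for illustration only
PANDAS_TO_HIVE = {
    'int64': 'BIGINT',
    'float64': 'DOUBLE',
    'bool': 'BOOLEAN',
    'datetime64[ns]': 'TIMESTAMP',
    'object': 'VARCHAR(512)',
}

def hive_columns(df: pd.DataFrame) -> str:
    # build the column list for a CREATE EXTERNAL TABLE statement
    cols = [f"`{name.lower()}` {PANDAS_TO_HIVE.get(str(dtype), 'VARCHAR(512)')}"
            for name, dtype in df.dtypes.items()]
    return ',\n  '.join(cols)

df = pd.DataFrame({'Col1': [11, 21, 31], 'Col2': ['Test1', 'Test2', 'Test3']})
print(hive_columns(df))  # -> `col1` BIGINT,  `col2` VARCHAR(512)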

Usage:

1. The data should be in a Pandas data frame

2. Init the lib as

upl = Upload_S3_HIVE(df, export_type='csv')

where

‘df’ is a pandas data frame

‘export_type’ is the format of the file saved to S3. It can be ‘csv’ or ‘parquet’ (Parquet files are written via the arrow engine).
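In pandas terms this corresponds roughly to the following (a sketch; the file names are just examples and the library’s exact calls may differ):

import pandas as pd

df = pd.DataFrame({'Col1': [11, 21, 31], 'Col2': ['Test1', 'Test2', 'Test3']})

# CSV export; the ';' separator matches the FIELDS TERMINATED BY ';' clause
# in the generated create statement shown below
df.to_csv('export_test.csv', sep=';', index=False)

# Parquet export via the (py)arrow engine
df.to_parquet('export_test.parquet', engine='pyarrow', index=False)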

3. To generate a ‘create’ statement for an external Hive table, use the method

upl.script_hive_ext_table_create_statement(s3_bucket, dbtablename)

where

‘s3_bucket’ is the full path to the bucket where the file will be saved

‘dbtablename’ is the name of the external table that will use this AWS S3 bucket as a data source

The generated ‘create’ statements differ between the ‘csv’ and ‘parquet’ formats.

E.g., the pandas data frame:

Col1  Col2
11    Test1
21    Test2
31    Test3

With s3_bucket = ‘s3://export_test/test’ and dbtablename = ‘test_table’, it will be scripted as:

CREATE EXTERNAL TABLE ext.test_table
  ( `col1` BIGINT,
    `col2` VARCHAR(512))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION
  's3://export_test/test'

4. To copy a data frame into an AWS S3 bucket in the chosen format, use this method:

upl.upload_to_s3(s3_bucket)

where

‘s3_bucket’ is the full path to the bucket where the file will be saved

The method will fix the column types of the data frame, export the data frame to a local file (based on the base filename), copy this file to the given S3 bucket, and delete the local file afterwards.
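Putting it together, a usage sketch (the import path is an assumption based on the repository name; bucket and table names are just examples):

import pandas as pd
from uploader_s3_hive import Upload_S3_HIVE  # module name assumed

df = pd.DataFrame({'Col1': [11, 21, 31],
                   'Col2': ['Test1', 'Test2', 'Test3']})

upl = Upload_S3_HIVE(df, export_type='csv')

# generate the 'create' statement for the external Hive table
print(upl.script_hive_ext_table_create_statement('s3://export_test/test',
                                                 'test_table'))

# export the data frame and copy it to the S3 bucket
upl.upload_to_s3('s3://export_test/test')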