Metalib to upload Pandas data frames as CSV or Parquet files to AWS S3 and create a Hive external table pointing to this S3 bucket

https://github.com/elegantwist/uploader_s3_hive

At work I do this a lot:

  1. Read data into a Pandas data frame
  2. Save the data to an AWS S3 bucket in CSV or Parquet format
  3. Create an external Hive table that reads from those files in S3

To make this job easier, I’ve created a small meta-library that contains the basic methods I use to implement this pipeline.

Issues this library addresses:

  • A lot of manual work preparing column types in the Pandas data frame
  • Automating the process of saving the data frame to a file, sending the file to S3, and then deleting it locally
  • Generating the ‘create’ statement for the external Hive table, with column types derived from the basic types of the data frame (a minimal sketch follows below)
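
As an illustration of that last point, here is a sketch of the kind of dtype-to-Hive-type mapping the library automates. The function name and the exact mapping below are assumptions made for illustration, not the library’s internal code; they are grounded only in the BIGINT/VARCHAR(512) types that appear in the generated statement further down.

    import pandas as pd

    def hive_type_for(dtype) -> str:
        # Illustrative assumption: integers -> BIGINT, floats -> DOUBLE,
        # everything else -> a generic VARCHAR(512).
        if pd.api.types.is_integer_dtype(dtype):
            return "BIGINT"
        if pd.api.types.is_float_dtype(dtype):
            return "DOUBLE"
        return "VARCHAR(512)"

    df = pd.DataFrame({"Col1": [11, 21, 31], "Col2": ["Test1", "Test2", "Test3"]})
    for col in df.columns:
        print(col.lower(), hive_type_for(df[col].dtype))
    # col1 BIGINT
    # col2 VARCHAR(512)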

Usage:

1. The data should be in a Pandas data frame

2. Initialize the library:

upl = Upload_S3_HIVE(df, export_type='csv')

where

‘df’ is a Pandas data frame

‘export_type’ is the format of the file saved to S3. It can be ‘csv’ or ‘parquet’ (Parquet files are written using the Arrow engine)
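
For example, with a small data frame (the import path below is an assumption; adjust it to wherever Upload_S3_HIVE lives in your checkout):

    import pandas as pd
    from uploader_s3_hive import Upload_S3_HIVE  # assumed import path

    df = pd.DataFrame({"Col1": [11, 21, 31], "Col2": ["Test1", "Test2", "Test3"]})

    # CSV export, as used throughout this README
    upl = Upload_S3_HIVE(df, export_type='csv')

    # Or Parquet output via Arrow:
    # upl = Upload_S3_HIVE(df, export_type='parquet')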

3. To script a ‘create’ statement for an external Hive table, use the method

upl.script_hive_ext_table_create_statement(s3_bucket, dbtablename)

where

‘s3_bucket’ is the full path to the bucket where the file will be saved

‘dbtablename’ is the name of the external table that will use this AWS S3 bucket as a data source

The generated ‘create’ statements differ between the ‘csv’ and ‘parquet’ formats.

For example, take this Pandas data frame:

Col1  Col2
11    Test1
21    Test2
31    Test3

With s3_bucket = “s3://export_test/test” and dbtablename = ‘test_table’,

the statement will be scripted as:

CREATE EXTERNAL TABLE ext.test_table
  ( `col1` BIGINT,
    `col2` VARCHAR(512))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION
  's3://export_test/test'
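
A hedged sketch of using the result, assuming the method returns the generated DDL as a string (check the repository for the exact behaviour); the PyHive call is only an example of how you might then run it, with placeholder connection details:

    ddl = upl.script_hive_ext_table_create_statement("s3://export_test/test",
                                                     "test_table")
    print(ddl)

    # Execute the DDL with the Hive client of your choice, e.g. PyHive:
    # from pyhive import hive
    # cursor = hive.connect(host="your-hive-host").cursor()
    # cursor.execute(ddl)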

4. To copy a data frame to an AWS S3 bucket in the chosen format, use this method:

upl.upload_to_s3(s3_bucket)

where

‘s3_bucket’ is the full path to the bucket where the file will be saved

The method fixes the column types of the data frame, exports the data to a local file (based on the base filename), copies the file to the specified S3 bucket, and deletes the local file afterwards.
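
Putting the pieces together, a typical end-to-end run might look like the sketch below. The import path, bucket, and table name are placeholders, and executing the scripted DDL against your Hive metastore is still a separate step:

    import pandas as pd
    from uploader_s3_hive import Upload_S3_HIVE  # assumed import path

    df = pd.DataFrame({"Col1": [11, 21, 31], "Col2": ["Test1", "Test2", "Test3"]})
    upl = Upload_S3_HIVE(df, export_type='csv')

    # Script the DDL for the external table (assuming it is returned as a string)
    ddl = upl.script_hive_ext_table_create_statement("s3://export_test/test",
                                                     "test_table")

    # Save the data frame locally, copy it to the bucket, and clean up the local file
    upl.upload_to_s3("s3://export_test/test")

    # Finally, run `ddl` against Hive with your client of choice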
