https://github.com/elegantwist/uploader_s3_hive
At my work I do this stuff a lot:
- Read data to Pandas data frame
- Save the data into AWS S3 bucket in CSV or parquet format
- Create an external Hive table, which should read from those files in S3
To help myself do this job, I’ve created a small meta-library, which contains basic methods which I’m using to implement this pipeline.
Issues this lib addressed:
- A lot of handwork for type preparation for Pandas data frame
- Automation for the process saving data from pandas to file, send the file, and then delete it locally
- Automate script creation for ‘create’ statement for adding an external table in Hive, with types based on the basic types of the data frame
Usage:
1. The data should be in Pandas data frame
2. Init the lib as
upl = Upload_S3_HIVE(df, export_type='csv')
where
‘df’ – is a pandas data frame
‘export_type’ is a format of the saved file in s3. It could be ‘csv’ or ‘parquet’ (for saving in parquet file the arrow method is used)
3. To script a ‘create’ statement for creation of an external Hive table, use method
upl.script_hive_ext_table_create_statement(s3_bucket, dbtablename)
where
‘s3_bucket’ is the full path to the bucket where the file will be saved
‘dbtablename’ – the name of the external table which will be using this AWS S3 bucket as a data source
For ‘csv’ format and ‘parquet’ format create statements will be different
I.e. pandas data frame:
Col1 | Col2 |
11 | Test1 |
21 | Test2 |
31 | Test3 |
With s3_bucket = “s3://export_test/test”, and dbtablename = ‘test_table’
Will be scripted as:
CREATE EXTERNAL TABLE ext.test_table
( `col1` BIGINT,
`col2` VARCHAR(512))
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ';'
STORED AS TEXTFILE
LOCATION
's3://export_test/test'
4. To copy a data frame into AWS S3 bucket in a particular format, use this method:
upl.upload_to_s3(s3bucket)
where
‘s3_bucket’ is the full path to the bucket where the file will be saved
The method will fix type for the columns of the data frame, export the data from pandas data frame to a local file (base on the base filename), copy this file to the mentioned S3 bucket, and delete the local file afterwards.