Thursday, November 26, 2020

Receiving values from an AWS Glue Job

Recently I wanted to use AWS Glue heavily for some development work at the office. AWS Glue is a managed service from AWS that comes in handy for processing large amounts of data. My use case was to implement an ETL (Extract - Transform - Load) workflow, so there were multiple Glue jobs doing different tasks, ranging from decompression and data processing to validation and loading. The Glue jobs in each workflow were managed by a single AWS Step Function. Everything was straightforward until there was a requirement to read some values from an AWS Glue job and include them in an SNS notification. As everybody does, I called Google for help. I started going through documentation and Stack Overflow posts. Then it finally broke my heart when I read this post.



It was evident that AWS Glue Jobs are designed not to return values.

"By definition, AWS glue is expected to work on huge amount of data and hence it is expected that output will also be huge amount of data."

So the expectation is to store the data somewhere at the end of the processing. But my use case was to read back only a small set of values. The following are the approaches I figured out.

1. Saving the values to a file using a Lambda function and using them when required

You can create a Lambda function that accepts the values you want to store and either writes them to a file in S3 or puts them into a database for future reference. That Lambda function can then be invoked from within the Glue job with the required values passed in the payload.


Lambda function

import json
import boto3

def lambda_handler(event, context):
    s3_client = boto3.client('s3')
    bucket_name = 'myBucket'
    s3_path = 'path/to/dir/values.txt'

    # The event carries the values passed from the Glue job
    values_str = json.dumps(event)
    print("Received values: %s" % values_str)

    try:
        # If the file already exists, append the new values to its content
        response = s3_client.get_object(Bucket=bucket_name, Key=s3_path)
        current_content = response['Body'].read().decode('utf-8')
        print("Reading content from file at s3:%s key:%s" % (bucket_name, s3_path))
        content = '{}, {}'.format(current_content, values_str)
    except s3_client.exceptions.NoSuchKey:
        # Otherwise start a new file
        print("Created a new file at s3:%s key:%s" % (bucket_name, s3_path))
        content = values_str

    encoded_string = content.encode("utf-8")

    response = s3_client.put_object(Bucket=bucket_name, Key=s3_path, Body=encoded_string)
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }

Lambda invocation


import boto3
import json

# GLUE JOB CODE

lambda_client = boto3.client('lambda')

response = lambda_client.invoke(FunctionName='myLambdaFnName', Payload=json.dumps({
    "key1": 'val',
    "key2": a_value_from_glue
}))
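For completeness, here is a minimal sketch of how a later step (for example, the one building the SNS notification) could read the stored values back. The bucket and key are assumed to match the hypothetical ones used in the Lambda function above.

import json
import boto3

s3_client = boto3.client('s3')

response = s3_client.get_object(Bucket='myBucket', Key='path/to/dir/values.txt')
raw = response['Body'].read().decode('utf-8')

# The Lambda appends entries as '{...}, {...}', so wrapping the content in
# brackets lets us parse the whole file as a JSON array
values = json.loads('[{}]'.format(raw))
print(values)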


2. Logging the values with a special pattern and reading them from CloudWatch

You can simply log your values. When you log them, make sure to include a special pattern so that you can easily extract those values later. Any other service can then fetch those log entries using the CloudWatch Logs API. Refer to the following example:

Logging within the Glue Job

# GLUE JOB CODE

print("[MY_SERVICE] key1: val1")
print("[MY_SERVICE] key2: %s" % a_value_from_glue)

Reading the values

import time
from datetime import datetime, timedelta

import boto3

logs_client = boto3.client('logs')

# Query the Glue output log group for entries containing the special pattern.
# Here the query looks back over the last hour; in practice, use the time the
# Glue job started as startTime.
response = logs_client.start_query(
    logGroupName='/aws-glue/jobs/output',
    startTime=int((datetime.now() - timedelta(hours=1)).timestamp()),
    endTime=int(datetime.now().timestamp()),
    queryString='fields @timestamp, @message | filter @message like /MY_SERVICE/'
)

query_id = response['queryId']

# Poll until the query finishes
query_response = None
while query_response is None or query_response['status'] in ('Scheduled', 'Running'):
    time.sleep(1)
    query_response = logs_client.get_query_results(queryId=query_id)

print('Received results: {}'.format(query_response['results']))

# Each result is a list of {'field': ..., 'value': ...} dicts
results = []
for result in query_response['results']:
    timestamp = next(ele for ele in result if ele['field'] == '@timestamp')['value']
    message = next(ele for ele in result if ele['field'] == '@message')['value'].replace('\n', '')
    results.append('{}: {}'.format(timestamp, message))

print(results)
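My original goal was to include these values in an SNS notification. As a hedged sketch of that final step, the collected results can be published like this (the topic ARN below is a placeholder):

import boto3

sns_client = boto3.client('sns')

# Publish the collected log entries; replace the placeholder ARN with yours
sns_client.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:myTopic',
    Subject='Glue job results',
    Message='\n'.join(results)
)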


Please note that there can be a small delay before logs are pushed to CloudWatch, so make sure to give the logs enough time to arrive before querying them; one way to handle that is sketched below.
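If the query runs too early, it simply returns no results. A small sketch (not battle-tested) that retries the query a few times before giving up:

import time

def query_logs_with_retries(run_query, attempts=5, delay_seconds=10):
    # run_query is a callable wrapping the start_query / get_query_results
    # logic above; it returns the list of matched log entries
    for _ in range(attempts):
        entries = run_query()
        if entries:
            return entries
        time.sleep(delay_seconds)  # give CloudWatch time to ingest the logs
    return []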

Happy coding folks!

