Recently I wanted to use AWS Glue heavily for some development work at the office. AWS Glue is a managed service from AWS which comes in handy when processing or computing large amounts of data. My use case was to implement an ETL (Extract - Transform - Load) workflow, so there were multiple Glue jobs doing different tasks, varying from decompression to data processing, validation, loading, etc. Those Glue jobs were managed by a single AWS Step Function in each workflow. Everything was straightforward until there was a requirement to read some values from an AWS Glue job and include them in an SNS notification. As everybody does, I called Google for help. I started going through documentation and Stack Overflow posts. Then it finally broke my heart when I read this post.
It was evident that AWS Glue Jobs are designed not to return values.
"By definition, AWS glue is expected to work on huge amount of data and hence it is expected that output will also be huge amount of data."
So the expectation was to store the data at the end of the processing. But my use case was to read back a small set of values. The following are the approaches I figured out.
1. Passing the values to a Lambda function
You can create a Lambda function that accepts the values you want to store, and it can either write them to a file in S3 or put them into a DB for future reference. That Lambda function can then be invoked from within the Glue job with the required values passed in the payload.
import json
import boto3

def lambda_handler(event, context):
    # Appends the received values to a file in S3 (creates the file on first call)
    s3_client = boto3.client('s3')
    bucket_name = 'myBucket'
    s3_path = 'path/to/dir/values.txt'
    values_str = json.dumps(event)
    print("Received values: %s" % values_str)
    try:
        # If the file already exists, append the new values to its current content
        response = s3_client.get_object(Bucket=bucket_name, Key=s3_path)
        current_content = response['Body'].read().decode('utf-8')
        print("Reading content from file at s3:%s key:%s" % (bucket_name, s3_path))
        content = '{}, {}'.format(current_content, values_str)
    except s3_client.exceptions.NoSuchKey:
        # First invocation: no file yet, so start with just the new values
        print("Creating a new file at s3:%s key:%s" % (bucket_name, s3_path))
        content = values_str
    encoded_string = content.encode("utf-8")
    response = s3_client.put_object(Bucket=bucket_name, Key=s3_path, Body=encoded_string)
    return {
        'statusCode': 200,
        'body': json.dumps(response)
    }
import boto3
import json

# GLUE JOB CODE
lambda_client = boto3.client('lambda')
response = lambda_client.invoke(FunctionName='myLambdaFnName', Payload=json.dumps({
    "key1": 'val',
    "key2": a_value_from_glue
}))
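Once the values are in S3, any downstream step (for example, a Lambda that composes the SNS notification) can read them back. Here is a rough sketch, assuming the same bucket and key as above; note that the file holds comma-joined JSON snippets rather than one valid JSON document, so I read it as plain text:

import boto3

# Read the accumulated values back from S3 (same bucket/key the Lambda writes to)
s3_client = boto3.client('s3')
response = s3_client.get_object(Bucket='myBucket', Key='path/to/dir/values.txt')
stored_values = response['Body'].read().decode('utf-8')
print("Stored values: %s" % stored_values)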
2. Logging the values with a special pattern and reading them from CloudWatch
# GLUE JOB CODE
print("[MY_SERVICE] key1: val1")
print("[MY_SERVICE] key2: %s" % a_value_from_glue)
import time
from datetime import datetime

import boto3

logs_client = boto3.client('logs')
# 'timestamp' should hold the epoch time (in seconds) captured just before the Glue job started
response = logs_client.start_query(
    logGroupName='/aws-glue/jobs/output',
    startTime=timestamp,
    endTime=int(datetime.now().timestamp()),
    queryString='fields @timestamp, @message | filter @message like /MY_SERVICE/'
)
query_id = response['queryId']
query_response = None
# Poll until the query finishes; results are not available while it is scheduled or running
while query_response is None or query_response['status'] in ('Scheduled', 'Running'):
    time.sleep(1)
    query_response = logs_client.get_query_results(
        queryId=query_id
    )
print('Received results: {}'.format(query_response['results']))
results = []
for result in query_response['results']:
    timestamp = next(ele for ele in result if ele['field'] == '@timestamp')['value']
    message = next(ele for ele in result if ele['field'] == '@message')['value'].replace('\n', '')
    results.append('{}: {}'.format(timestamp, message))
print(results)
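And since the whole point was to get these values into an SNS notification, the final publish step could look something like this sketch (the topic ARN below is a placeholder, not a real one):

import boto3

# Publish the collected lines to SNS; the topic ARN is a placeholder
sns_client = boto3.client('sns')
sns_client.publish(
    TopicArn='arn:aws:sns:us-east-1:123456789012:myTopic',
    Subject='Values from the Glue job',
    Message='\n'.join(results)  # 'results' is built by the CloudWatch query snippet above
)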