How to build training datasets using SageMaker Ground Truth

Amazon SageMaker Ground Truth is a fully managed data labeling service that helps you build high-quality training datasets for your machine learning models. It simplifies the process of labeling images, videos, and text data, making it easier to train accurate and reliable models.

Before we dive in, let's clarify what SageMaker Workforce is for. Think of model training like training a dog. To teach a dog a new trick, you need to show it what to do and reward it when it gets it right. Similarly, a model needs labeled data to learn how to recognize patterns, and somebody has to produce those labels.

SageMaker Workforce is a managed capability within Amazon SageMaker that lets you bring in human labelers to annotate your data. It provides a platform for distributing labeling tasks to those labelers, which makes it easier to build high-quality training datasets for machine learning models.

SageMaker Workforce offers three distinct workforce types to suit various labeling needs:

  1. Private Workforce: For organizations with internal teams of labelers, Private Workforce provides maximum control over the labeling process. This is ideal for projects involving sensitive data or requiring specialized domain expertise.
  2. Vendor Workforce: When external expertise is needed, Vendor Workforce offers access to a pool of skilled labelers from third-party vendors. This option is suitable for large-scale projects or tasks that require specialized knowledge.
  3. Amazon Mechanical Turk (https://www.mturk.com/): A crowdsourcing marketplace that connects businesses with a global workforce of individuals who can complete small, discrete tasks. This is ideal for simple labeling tasks that don't require deep domain knowledge.

Let's see how the private workforce works.

Define Private Workforce

1) Log in to the AWS Console and navigate to Amazon SageMaker.

2) Expand "Ground Truth" on the left menu and click on "Labeling workforce".

3) Click on the "Private" tab and then on "Create private team".

4) Make sure that "Create a private team with AWS Cognito" is selected. Give the team a name, for example, "DemoTeam".

5) In the "Add workers" section, list the email addresses of up to 50 workers you would like to invite. Enter your organization name and the contact email of the person responsible for managing the tasks.

6) (Optional) Click on "Preview invitation" to see the email template that will be sent to the workers.

7) (Optional) Pick an SNS topic to notify workers about available work.

8) Click on "Create private team". When the creation process is finished, you will be redirected to the "Private" tab of the Labeling workforce, where you can see the summary and team data.

9) If you open the email sent by AWS in the previous step, you will see a link and worker login credentials. Click the link and log in as a worker. You will be asked to set a new password. As soon as the password is updated, you will see a screen with available jobs. There are no jobs defined yet, so it will be empty.

10) Go back to the labeling workforce screen. If you refresh it, the worker will be marked as verified.
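
The console steps above can also be scripted. Below is a minimal sketch using boto3, not a definitive implementation. It assumes that a Cognito user pool, user group, and app client already exist (the console creates these for you in step 4), and every ID passed in is a hypothetical placeholder you would replace with your own. The request is built by a plain function so it can be inspected without touching AWS; only create_private_team actually calls the SageMaker create_workteam API.

```python
def build_workteam_request(team_name, user_pool_id, user_group, client_id,
                           description, sns_topic_arn=None):
    """Build the keyword arguments for SageMaker's create_workteam call.

    user_pool_id, user_group and client_id identify an existing Cognito
    setup; the values you pass here are your own (assumed to exist).
    """
    request = {
        "WorkteamName": team_name,
        "Description": description,
        "MemberDefinitions": [
            {
                "CognitoMemberDefinition": {
                    "UserPool": user_pool_id,
                    "UserGroup": user_group,
                    "ClientId": client_id,
                }
            }
        ],
    }
    # Optional, mirrors step 7: notify workers through an SNS topic.
    if sns_topic_arn:
        request["NotificationConfiguration"] = {
            "NotificationTopicArn": sns_topic_arn
        }
    return request


def create_private_team(request, region="us-east-1"):
    """Send the request to SageMaker and return the new team's ARN."""
    import boto3  # imported lazily so the builder works without boto3

    client = boto3.client("sagemaker", region_name=region)
    return client.create_workteam(**request)["WorkteamArn"]
```

A call such as build_workteam_request("DemoTeam", "<user-pool-id>", "<user-group>", "<client-id>", "Demo labeling team") returns a dictionary you can review before passing it to create_private_team. Inviting individual workers by email is still done through Cognito or the console.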

Define Labeling Job

11) To create a labeling job, we need data to process. Let's make an S3 bucket and upload a few photos of raccoons. S3 bucket names must be unique across all AWS accounts, so you will need to pick your own name; for the sake of example, let's call it "labeling-demo".

12) Expand "Ground Truth" on the left menu and click on "Labeling jobs".

13) Click on "Create labeling job", then give the job a name (for example, "DemoJob") and pick the S3 bucket created in step #11 for the input dataset. Under "Data type", select "Image".

14) Select or create a new IAM Role with access to the S3 bucket for SageMaker.

15) Click on "Complete data setup".

16) Under "Task Type", keep "Image Classification (Single Label)". Click "Next" at the bottom of the screen.

17) In the "Workers" section, select "Private" and pick "DemoTeam" created in step #4. Update task timeouts and expiration times if the default values are not suitable.

18) In "Image classification," enter the task description, such as "Find all raccoons." The more detail you share with your workers, the better the results will be.

19) Click on "Create" to create "DemoJob".

20) Go back to the workers' view. The new job will be listed here. You may need to wait a few minutes because AWS does not always make jobs available instantly.

21) Click on "Start working" and start labeling images from the job.
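
For reference, here is a rough sketch of what the job definition from steps 11 to 19 looks like when expressed through the API instead of the console. It builds the three pieces a labeling job needs: an input manifest (one JSON line per image with a "source-ref" key), a label category file, and the request for the SageMaker create_labeling_job call. The file names "input.manifest" and "labels.json", the labels, and the task title are assumptions made for illustration. The UI template location and the two Lambda ARNs (pre-annotation and annotation consolidation) are region-specific values you have to look up in the AWS documentation, so they are left as parameters rather than guessed.

```python
import json


def build_input_manifest(bucket, image_keys):
    """Return manifest text: one JSON object per line, one line per image."""
    return "\n".join(
        json.dumps({"source-ref": f"s3://{bucket}/{key}"}) for key in image_keys
    )


def build_label_category_config(labels):
    """Return the label category document listing the allowed labels."""
    return {
        "document-version": "2018-11-28",
        "labels": [{"label": label} for label in labels],
    }


def build_labeling_job_request(job_name, bucket, role_arn, workteam_arn,
                               ui_template_s3_uri, pre_lambda_arn,
                               consolidation_lambda_arn,
                               task_title="Raccoon classification",
                               task_description="Find all raccoons.",
                               manifest_key="input.manifest",
                               label_config_key="labels.json"):
    """Build keyword arguments for create_labeling_job.

    manifest_key and label_config_key are hypothetical object names; the
    files must be uploaded to the bucket before the job is submitted.
    """
    return {
        "LabelingJobName": job_name,
        # Key under which each label appears in the output manifest.
        "LabelAttributeName": job_name,
        "InputConfig": {
            "DataSource": {
                "S3DataSource": {"ManifestS3Uri": f"s3://{bucket}/{manifest_key}"}
            }
        },
        "OutputConfig": {"S3OutputPath": f"s3://{bucket}/"},
        "RoleArn": role_arn,
        "LabelCategoryConfigS3Uri": f"s3://{bucket}/{label_config_key}",
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_s3_uri},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": task_title,
            "TaskDescription": task_description,
            "NumberOfHumanWorkersPerDataObject": 1,
            "TaskTimeLimitInSeconds": 300,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


def submit_labeling_job(request, region="us-east-1"):
    """Submit the job and return its ARN (requires AWS credentials)."""
    import boto3  # imported lazily so the builders work without boto3

    client = boto3.client("sagemaker", region_name=region)
    return client.create_labeling_job(**request)["LabelingJobArn"]
```

The console hides most of this: its automated data setup generates the manifest, and picking a built-in task type fills in the template and Lambda functions for you. Scripting it mainly pays off when you create many similar jobs.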

When the labeling is finished, open the S3 bucket ("labeling-demo") from step #11. Here, you can find that a DemoJob folder has been created and contains three subfolders. The result of the labeling job is in DemoJob/manifests/output/output.manifest.
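
The output manifest is a JSON Lines file: each line repeats the "source-ref" of one image and adds the label under the job's label attribute name, plus a "<name>-metadata" object. The sketch below shows one way to read it for a single-label image classification job. It assumes the label attribute name equals the job name ("DemoJob"), which is what the console uses by default, and it only reads the "class-name" and "confidence" metadata fields.

```python
import json


def parse_output_manifest(manifest_text, label_attribute):
    """Return a list of (image_uri, class_name, confidence) tuples."""
    results = []
    for line in manifest_text.splitlines():
        if not line.strip():
            continue  # tolerate blank lines
        record = json.loads(line)
        # Metadata lives under "<label attribute>-metadata".
        metadata = record.get(f"{label_attribute}-metadata", {})
        results.append(
            (
                record["source-ref"],
                metadata.get("class-name"),
                metadata.get("confidence"),
            )
        )
    return results
```

To try it, download the manifest first, for example with boto3's S3 client (get_object on the bucket from step #11, then read and decode the body), and pass the text to parse_output_manifest(text, "DemoJob"). Images that workers have not labeled yet will come back with a class name of None.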

Try experimenting with various task types. Ground Truth supports the following task categories:

  • Image - Image Classification (Single-label), Image Classification (Multi-label), Bounding box, Semantic segmentation
  • Text - Text Classification (Single-label), Text Classification (Multi-label), Named entity recognition
  • Video - Clip Classification, Bounding box, Polygon, Polyline, Key Point
  • Point cloud - Advanced identification of objects in LIDAR point cloud frames
  • Custom - When you need labeling of objects not covered by standard options.

As you can see, Ground Truth is easy to set up and use. However, processing hundreds of thousands or even millions of images or videos is still a big challenge. If you are at the point where the private workforce cannot handle expected volumes, then it is time to consider Vendor Workforce or Amazon Mechanical Turk.


Looking for help? Reach me any time.

Subscribe to AWS by Vlad Frantskevich
