Originally posted as an answer to "How can I migrate data from Google cloud storage into AWS S3 buckets?"
Hadoop DistCp is a good way to move large amounts of data between different file systems. Here are the steps I used to transfer data from Google Cloud Storage (GCS) to S3 using DistCp.
- I created a 3-node Hadoop cluster using Google Dataproc. Dataproc clusters come with the GCS connector preinstalled, so as long as the cluster's service account can read your bucket, it should have access to your GCS files without any additional configuration.
- Once the cluster finishes initializing, SSH into the master node and run the following command.
    hadoop distcp \
      gs://<bucket-name>/<folder>/ \
      s3a://<aws_access_key>:<aws_secret_key>@<s3-bucket>/<folder>
This should copy the data from GCS to S3.
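
If you would rather not put the AWS keys in the destination URI, they can be passed as the S3A properties `fs.s3a.access.key` and `fs.s3a.secret.key` instead. The S3A documentation discourages embedding credentials in the URI, and newer Hadoop 3.x releases may reject it. Below is a minimal sketch of that variant, not something I ran as part of the original steps. The function name `gcs_to_s3` and the `HADOOP_BIN` override are hypothetical names made up for this example, and it assumes the keys are exported as `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` on the master node.

```shell
#!/usr/bin/env bash
# Sketch: copy a GCS path to S3 with DistCp, passing the AWS credentials as
# Hadoop properties instead of embedding them in the s3a:// URI.
# gcs_to_s3 and HADOOP_BIN are hypothetical names used only in this example.
gcs_to_s3() {
  local src="$1"   # e.g. gs://<bucket-name>/<folder>/
  local dst="$2"   # e.g. s3a://<s3-bucket>/<folder>

  # Abort with a message if either key is missing from the environment.
  "${HADOOP_BIN:-hadoop}" distcp \
    "-Dfs.s3a.access.key=${AWS_ACCESS_KEY_ID:?set AWS_ACCESS_KEY_ID}" \
    "-Dfs.s3a.secret.key=${AWS_SECRET_ACCESS_KEY:?set AWS_SECRET_ACCESS_KEY}" \
    "$src" "$dst"
}
```

After sourcing the snippet on the master node, it would be called as `gcs_to_s3 gs://<bucket-name>/<folder>/ s3a://<s3-bucket>/<folder>`. Setting `HADOOP_BIN=echo` prints the DistCp arguments instead of running the copy, which is handy for checking the paths first. The keys are still visible in the process list while the job is being submitted, so treat this as a convenience rather than a hardened setup.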