Copying files between buckets in S3

There is a time in every man's life when he gets involved with big data. It recently happened to me, not only in my pet projects but in a real-life scenario: suddenly your computer and network are not enough and everything is slow again. An obvious choice for storing data is S3 on AWS, but even there you have to be smart to move data around. I have written a small snippet of code to move data between buckets efficiently.

It sometimes happens that the data you need sits in a different bucket than the one you need it in. You cannot rename a bucket, sometimes the data is in a different account, and so on. Downloading the files to your drive and uploading them back is not only very slow, but you also pay for the bandwidth. It is much more efficient to use Amazon's APIs to copy the files around. I think there are some commercial tools that help with this, but you can easily roll your own.
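Here is a minimal sketch of such a tool in Python with boto3; the bucket names are placeholders and the batch size of 20 matches what I used. The copy happens server-side inside AWS, so nothing is downloaded to your machine.

```python
from concurrent.futures import ThreadPoolExecutor

import boto3

SOURCE_BUCKET = "my-source-bucket"  # hypothetical bucket name
DEST_BUCKET = "my-dest-bucket"      # hypothetical bucket name
BATCH_SIZE = 20

s3 = boto3.resource("s3")

def copy_key(key):
    # Server-side copy: the bytes never leave AWS, so it is fast
    # and you do not pay for download bandwidth.
    s3.Object(DEST_BUCKET, key).copy_from(
        CopySource={"Bucket": SOURCE_BUCKET, "Key": key}
    )

# List every key in the source bucket.
keys = [obj.key for obj in s3.Bucket(SOURCE_BUCKET).objects.all()]

# Copy the keys in batches of 20 in parallel, waiting for each
# whole batch to finish before starting the next one.
for i in range(0, len(keys), BATCH_SIZE):
    batch = keys[i:i + BATCH_SIZE]
    with ThreadPoolExecutor(max_workers=BATCH_SIZE) as pool:
        list(pool.map(copy_key, batch))
```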

This copies all the files in several batches, moving 20 files in parallel in each batch. My files were all of similar size, so I did not need any mechanism smarter than simply waiting for the whole batch to finish.

Just remember that if you intend to use EMR later, name your buckets using only alphanumeric characters, dashes, and dots (as in the example), or you will be copying again soon, because EMR does not like anything else one bit.
