Recently, I tried to use Amazon S3 to store input/output files for Hadoop jobs and got confused by the different S3 filesystems available on top of Apache Hadoop. Here are some notes that helped me:
The first S3-backed Hadoop filesystem was introduced in Hadoop 0.10.0 (HADOOP-574). It was called the S3 block filesystem and it was assigned the URI scheme s3://. In this implementation, files are stored as blocks, just like they are in HDFS. The files stored by this filesystem are not interoperable with other S3 tools: if you go to the AWS console and look for files written by this filesystem, you won’t find them; instead, you will find objects with names like block_-1212312341234512345. Similarly, this filesystem cannot read pre-existing files on S3, since it assumes a block-based format.
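To give a rough idea of how the block filesystem was driven, here is a minimal Java sketch using the Hadoop FileSystem API; the fs.s3.awsAccessKeyId/fs.s3.awsSecretAccessKey property names are my assumption about the era's configuration, and the bucket and key are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3BlockFsWrite {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Classic credential properties for the block filesystem
            // (assumed names; adjust for your Hadoop version).
            conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");

            // The s3:// scheme selects the block filesystem; in the AWS
            // console this write shows up as block_* objects, not as
            // a file named "part-00000".
            Path out = new Path("s3://my-bucket/output/part-00000");
            FileSystem fs = out.getFileSystem(conf);
            try (FSDataOutputStream stream = fs.create(out)) {
                stream.writeUTF("hello from the S3 block filesystem");
            }
        }
    }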
To overcome these limitations, another S3-backed filesystem was introduced in Hadoop 0.18.0 (HADOOP-930). It was called the S3 native filesystem and it was assigned the URI scheme s3n://. This filesystem lets you access files on S3 that were written with other tools. Conversely, other tools can access files written using Hadoop.
When this filesystem was introduced, S3 had a file size limit of 5GB and hence this filesystem could only operate on files smaller than 5GB. In late 2010, Amazon S3 introduced the Multipart Upload API and raised the file size limit from 5GB to 5TB. A lot of Hadoop distributions and Hadoop-as-a-service (HaaS) providers added support for this quite early, and it landed in Apache Hadoop in version 2.4.0 (HADOOP-9454).
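For illustration, here is a similar sketch against the native filesystem; the fs.s3n.* credential properties and the fs.s3n.multipart.uploads.enabled flag are my assumptions about the configuration names (check the docs for your Hadoop version), and the bucket and prefix are placeholders:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class S3NativeFsList {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Credential properties for the native filesystem (assumed names).
            conf.set("fs.s3n.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3n.awsSecretAccessKey", "YOUR_SECRET_KEY");
            // On Hadoop 2.4.0+ multipart uploads can be switched on so that
            // writes larger than 5GB work (property name is an assumption).
            conf.setBoolean("fs.s3n.multipart.uploads.enabled", true);

            // Objects written by other S3 tools appear here as ordinary files.
            Path dir = new Path("s3n://my-bucket/logs/");
            FileSystem fs = dir.getFileSystem(conf);
            for (FileStatus status : fs.listStatus(dir)) {
                System.out.println(status.getPath() + "\t" + status.getLen());
            }
        }
    }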
Using the S3 block filesystem is no longer recommended. Various Hadoop-as-a-service providers like Qubole and Amazon EMR go as far as mapping both the s3:// and the s3n:// URI schemes to the S3 native filesystem so that the block filesystem is never used.
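On a plain Apache Hadoop install you can get a similar mapping yourself by pointing the s3:// scheme at the native implementation; this is a sketch under the assumption that the fs.s3.impl property and the org.apache.hadoop.fs.s3native.NativeS3FileSystem class name apply to your version:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class MapS3ToNative {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // Route s3:// URIs to the native filesystem instead of the block
            // filesystem; normally this setting would live in core-site.xml.
            conf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem");
            conf.set("fs.s3.awsAccessKeyId", "YOUR_ACCESS_KEY");
            conf.set("fs.s3.awsSecretAccessKey", "YOUR_SECRET_KEY");

            // With the mapping in place, an s3:// path is served by the native
            // filesystem, so its objects stay readable by other S3 tools.
            Path p = new Path("s3://my-bucket/data/");
            FileSystem fs = p.getFileSystem(conf);
            System.out.println(fs.getClass().getName());
        }
    }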