There is often a need to access files or otherwise interact with HDFS from Java applications or libraries. Hadoop ships with many tools for working with HDFS; however, when you want to read the content of a file remotely (e.g. retrieve the headers of a CSV/TSV file), seemingly random exceptions can occur. One of these remote exceptions, coming from the HDFS NameNode, is a java.io.IOException: File /user/abc/xyz/ could only be replicated to 0 nodes, instead of 1.
Such an exception can be reproduced by the following code snippet:
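The snippet below is a minimal sketch of such a remote read; the NameNode URI and the file name are placeholders (only the truncated /user/abc/xyz/ path is known from the exception):

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsHeaderReader {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder NameNode URI - point it at your own cluster
        conf.set("fs.defaultFS", "hdfs://namenode:8020");
        FileSystem fs = FileSystem.get(conf);
        // Placeholder file name under the path from the exception
        Path path = new Path("/user/abc/xyz/data.tsv");
        try (BufferedReader reader =
                 new BufferedReader(new InputStreamReader(fs.open(path)))) {
            // Read the first line only, i.e. the CSV/TSV headers
            System.out.println(reader.readLine());
        }
    }
}
```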
The exception looks like this:
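Only the message itself, as thrown by the NameNode, is reproduced here:

```
java.io.IOException: File /user/abc/xyz/ could only be replicated to 0 nodes, instead of 1
```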
Note: in fact, all HDFS operations fail if the underlying input stream does not have a readable channel (see the java.nio.channels package); RemoteBlockReader2 needs channel-based input streams in order to deal with direct buffers.
Digging into the details and checking the Hadoop 2.2 source code, we find the following:
In org.apache.hadoop.hdfs.BlockReaderFactory you get access to a BlockReader implementation such as org.apache.hadoop.hdfs.RemoteBlockReader2, which is responsible for reading a single block from a single datanode.
The block reader is retrieved in the following way:
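Abridged from the Hadoop 2.2 sources (the parameter lists are shortened to the ones relevant here), the factory method boils down to a single branch on conf.useLegacyBlockReader:

```java
// org.apache.hadoop.hdfs.BlockReaderFactory (Hadoop 2.2), abridged
public static BlockReader newBlockReader(DFSClient.Conf conf,
    String file, ExtendedBlock block, /* ... */
    boolean verifyChecksum, String clientName, Peer peer,
    DatanodeID datanodeID, PeerCache peerCache) throws IOException {
  if (conf.useLegacyBlockReader) {
    // The legacy reader works on plain, stream-based input streams
    return RemoteBlockReader.newBlockReader(file, block, /* ... */
        verifyChecksum, clientName, peer, datanodeID, peerCache);
  } else {
    // RemoteBlockReader2 needs channel-based streams for direct buffers
    return RemoteBlockReader2.newBlockReader(file, block, /* ... */
        verifyChecksum, clientName, peer, datanodeID, peerCache);
  }
}
```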
In order to avoid the exception and use the right version of the block reader, the conf.useLegacyBlockReader property should be TRUE.
Long story short, the job's configuration should be set like this:
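A minimal sketch, assuming dfs.client.use.legacy.blockreader is the configuration key backing conf.useLegacyBlockReader (as in the Hadoop 2.x DFSConfigKeys):

```java
Configuration conf = new Configuration();
// Fall back to the stream-based legacy block reader
// (this key backs DFSClient.Conf.useLegacyBlockReader)
conf.setBoolean("dfs.client.use.legacy.blockreader", true);
FileSystem fs = FileSystem.get(conf);
```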
Unfortunately, whenever you interact with HDFS and the underlying input stream does not have a readable channel, you can't use the RemoteBlockReader2 implementation; you have to fall back to the old legacy RemoteBlockReader.
Hope this helps, SequenceIQ