Here is a simple way to move a large data set from a source system to HDFS without making an intermediate copy of the data. Hadoop ecosystem tools like Flume or Sqoop would normally be preferred, but in this scenario my data nodes can't see my source system because of network zoning, so my edge node is the only way of connecting back to my Hadoop cluster.
Here are the steps to complete this:
1) Set up passwordless SSH keys between the source and target systems (if you don't know how to do it, read this). This is optional, but if you are automating the transfer, it makes things much easier.
2) On your source system, run the following command:
cat /dir/to/your/file.txt | ssh firstname.lastname@example.org "hadoop fs -put - /your/hdfs/location/file.txt"
(Notice the - after -put: it tells hadoop fs -put to read from standard input, which here is the output of the cat command.)
3) To verify the copy, you can then run md5sum on both your source and destination files. Hint: a simple way to checksum an HDFS file is to pipe its contents into md5sum, like so:
hadoop fs -cat /your/hdfs/location/file.txt | md5sum
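If you are automating the transfer, the checksum comparison in step 3 could be wrapped in a small helper. What follows is only a sketch under a few assumptions: the function name verify_copy is made up for illustration, md5sum and awk are assumed to be available on both machines, and the HDFS path is assumed to contain no single quotes. It runs the hadoop fs -cat | md5sum pipeline on the remote side over SSH, so the file is not pulled back across the network.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): compare the md5 of a
# local file with the md5 of its copy in HDFS, reached over SSH.
# Usage: verify_copy LOCAL_FILE SSH_TARGET HDFS_PATH
verify_copy() {
    local src="$1" remote="$2" dest="$3"
    local src_sum dest_sum

    # Hash the local file; reading from stdin keeps the output to "<hash>  -".
    src_sum=$(md5sum < "$src" | awk '{print $1}')

    # Hash the HDFS copy on the remote side, so only the hash travels back.
    dest_sum=$(ssh "$remote" "hadoop fs -cat '$dest' | md5sum" | awk '{print $1}')

    # An empty hash means the file was unreadable or SSH produced no output.
    if [ -z "$src_sum" ] || [ -z "$dest_sum" ]; then
        echo "ERROR: could not compute one of the checksums" >&2
        return 2
    fi

    if [ "$src_sum" = "$dest_sum" ]; then
        echo "OK $src_sum"
    else
        # Note: a failed 'hadoop fs -cat' hashes empty input, which also lands here.
        echo "MISMATCH source=$src_sum hdfs=$dest_sum" >&2
        return 1
    fi
}

# Example call, using the placeholder names from the steps above:
# verify_copy /dir/to/your/file.txt firstname.lastname@example.org /your/hdfs/location/file.txt
```

The helper returns 0 on a match, 1 on a mismatch, and 2 if either checksum could not be computed, so it can be used directly in an if statement or a cron job.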
You can also use gzip or another form of compression on both ends of this stream, which can be helpful if network bandwidth is your biggest bottleneck.
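One way the compressed variant might look is sketched below. It assumes gzip is installed on the source and gunzip on the edge node, and the function name push_compressed is hypothetical. The data is compressed before it enters the SSH stream and decompressed on the far side before hadoop fs -put reads it, so the file lands in HDFS uncompressed and the md5sum check from step 3 still applies unchanged.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not from the original post): stream a file to HDFS
# through SSH, compressing it only while it is on the wire.
# Usage: push_compressed LOCAL_FILE SSH_TARGET HDFS_PATH
push_compressed() {
    local src="$1" remote="$2" dest="$3"

    # gzip -c writes compressed data to stdout; on the remote side,
    # gunzip -c restores it and 'hadoop fs -put -' reads it from stdin.
    # Assumes the HDFS path contains no single quotes.
    gzip -c "$src" | ssh "$remote" "gunzip -c | hadoop fs -put - '$dest'"
}

# Example call, using the placeholder names from the steps above:
# push_compressed /dir/to/your/file.txt firstname.lastname@example.org /your/hdfs/location/file.txt
```

A lighter alternative is to leave the original command as it is and add ssh's -C option, which asks SSH itself to compress the connection. Whether either approach helps depends on how compressible the data is and on how much CPU both machines can spare.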