byte-stream to key feasibility/feature request
I’m doing some benchmarking to help decide whether umobj will be an appropriate tool for the CRoCCo lab to use for storing, archiving, and moving large CFD/turbulence data sets. As they are currently being written, these data sets consist of many, many small and medium-sized files (medium = 10–100 MB). Without delving into whether a more optimal storage scheme is appropriate, I was wondering:
Would it be feasible to create a utility that can read a byte stream from stdin (the output of `tar` being sent to stdout) and then write it as a file/key to a UMIACS object store bucket?
My thinking is that `tar` is pretty good at reading a whole bunch of files and shoving them into a streaming archive. If this streaming archive can be sent and checksummed as it’s being created, you could amortize some of the network/disk time associated with reading the files from disk and sending them over the network, and you would no longer need to checksum a million different small files, just a few larger ones.
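As a runnable sketch of the upload side (assuming GNU coreutils are available): here `tee` stands in for the hypothetical stdin-reading utility, which would forward the bytes to the bucket instead of a local file, and `md5sum` checksums the archive in flight rather than file by file:

```shell
# Build a small sample tree, then stream it through tar.
# tee is a placeholder for the proposed upload utility: it passes the
# byte stream through while keeping a copy, and md5sum checksums the
# stream as it flows -- no second pass over the individual files.
workdir=$(mktemp -d)
mkdir -p "$workdir/src"
echo "u velocity field" > "$workdir/src/u.dat"
echo "v velocity field" > "$workdir/src/v.dat"

tar -C "$workdir/src" -cf - . \
  | tee "$workdir/archive.tar" \
  | md5sum > "$workdir/inflight.md5"

# The in-flight checksum matches a checksum of the stored archive.
md5sum < "$workdir/archive.tar" > "$workdir/stored.md5"
diff "$workdir/inflight.md5" "$workdir/stored.md5" && echo "checksums match"
```

The file names and the use of MD5 here are just illustrative; the point is that the checksum is computed over the single archive stream while it is in transit.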
To fetch the files from the bucket, `catobj` could be piped into `tar`… This workflow would be similar to a back-to-back `tar`, with the intermediate archive living in the bucket. What do you think?
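For the fetch direction, a local pipe can stand in for the bucket hop to show the back-to-back `tar` shape; only the left half of the pipeline would change in the real workflow:

```shell
# Round-trip: stream a tree through a pipe and extract it on the other
# side. In the proposed workflow, the producing tar would be replaced
# by catobj reading the archive back out of the bucket.
workdir=$(mktemp -d)
mkdir -p "$workdir/src" "$workdir/dst"
echo "pressure field" > "$workdir/src/p.dat"
echo "restart file"   > "$workdir/src/restart.chk"

tar -C "$workdir/src" -cf - . | tar -C "$workdir/dst" -xf -

# Verify the extracted tree matches the original byte for byte.
diff -r "$workdir/src" "$workdir/dst" && echo "round-trip OK"
```

Nothing ever lands on an intermediate local archive file, which is the appeal: the disk only sees the original files on one end and the extracted copies on the other.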