1. parquet-tools
First, consider parquet-tools. According to reference documents 0 and 1:
parquet-tools version 1.8.2 supports merge command.
The command is used as follows:
[root@emr-header-1 ~]# hadoop jar parquet-tools-1.10.1.jar merge -help
parquet-merge:
Merges multiple Parquet files into one. The command doesn't merge row groups,
just places one after the other. When used to merge many small files, the
resulting file will still contain small row groups, which usually leads to
bad query performance.

usage: parquet-merge [option...] <input> [<input> ...] <output>
where option is one of:
    --debug     Enable debug output
 -h,--help     Show this help string
    --no-color  Disable color output even if supported
where <input> is the source parquet files/directory to be merged
      <output> is the destination parquet file
[root@emr-header-1 ~]#
Reference document 2 also states explicitly:
we strongly recommend *not* to use parquet-tools merge unless you really know what you’re doing. It is known to cause some pretty bad performance problems in some cases. The problem is that it takes the row groups from the existing file and moves them unmodified into a new file – it does *not* merge the row groups from the different files. This can actually give you the worst of both worlds – you lose parallelism because the files are big, but you have all the performance overhead of processing many small row groups.
Therefore, using parquet-tools merge is not advisable.
2. Spark
Use Spark directly:
val parquetFileDF = spark.read.parquet("hdfs://emr-header-1.cluster-149038:9000/path_to_parquet_files/")
val rows = parquetFileDF.coalesce(1)
rows.write.parquet("hdfs://emr-header-1.cluster-149038:9000/all-in-one-parquet-directory")
Note: all-in-one-parquet-directory is a directory; the generated parquet file is inside it!
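One caveat: coalesce(1) gives up all write parallelism and can produce a single very large file. A common heuristic (my assumption, not from the original post) is to target roughly 128 MB per output file and derive the partition count from the total input size, for example:

```python
# Hypothetical helper (not from the original post): estimate how many output
# partitions to coalesce to, targeting ~128 MB per parquet file.
def target_partitions(total_bytes: int,
                      target_file_bytes: int = 128 * 1024 * 1024) -> int:
    # Ceiling division, with a floor of one partition.
    return max(1, -(-total_bytes // target_file_bytes))

print(target_partitions(1_000_000_000))  # → 8 (about 1 GB / 128 MB, rounded up)
```

The Scala job above could then coalesce to this count instead of 1, keeping files large enough to avoid the small-file problem without losing all parallelism.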
References:
0、https://stackoverflow.com/questions/44400331/merge-two-parquet-files-in-hdfs
1、https://community.cloudera.com/t5/Support-Questions/Merging-many-Parquet-files/td-p/48892
2、https://community.cloudera.com/t5/Support-Questions/combine-small-parquet-files/td-p/33525/page/2
When reprinting, please retain the source: 进城务工人员小梅 » 合并多个parquet文件