
Integrating CDH with MinIO: copying data between HDFS and MinIO

The storage layer is being migrated from HDFS to MinIO; the data migration itself is done with Hadoop's distcp.

Dependencies

Update the AWS SDK version (optional)

# Remove the old AWS jars
find /opt/cdh -name '*aws*.jar' | grep hadoop | xargs -n1 rm

# Download the new AWS dependencies
cd /opt/cdh/lib/hadoop
wget https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk/1.7.4/aws-java-sdk-1.7.4.jar
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/2.7.7/hadoop-aws-2.7.7.jar
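To confirm the swap took effect, check which AWS jars actually land on the Hadoop classpath (a quick sanity check; the exact paths depend on your CDH layout):

# Expand the classpath and expect only aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.7.jar
hadoop classpath --glob | tr ':' '\n' | grep -i aws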

Configuration

Add the following properties to core-site.xml:

<property>
  <name>fs.s3a.access.key</name>
  <value>DYaDwXsj8VRtWYPSbr7A</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>z7HAEhdyseNX9AVyzDLAJzEjZChJsnAf1f7VehE</value>
</property>

<property>
  <name>fs.s3a.endpoint</name>
  <value>http://10.199.150.160:32030</value>
</property>

<property>
  <name>fs.s3a.path.style.access</name>
  <value>true</value>
</property>

<property>
  <name>fs.s3a.impl</name>
  <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>

<property>
  <name>fs.s3a.fast.upload</name>
  <value>false</value>
</property>

<property>
  <name>fs.s3a.multipart.size</name>
  <value>104857600</value>
</property>

<property>
  <name>fs.s3a.multipart.threshold</name>
  <value>268435456</value>
</property>

<property>
  <name>fs.s3a.fast.buffer.size</name>
  <value>1048576</value>
</property>

<property>
  <name>fs.s3a.threads.core</name>
  <value>15</value>
</property>

<property>
  <name>fs.s3a.threads.max</name>
  <value>256</value>
</property>

<property>
  <name>fs.s3a.block.size</name>
  <value>33554432</value>
</property>
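With the properties in place, it is worth smoke-testing credentials, endpoint, and path-style access from the shell before launching a large copy (the bucket name here is a placeholder; substitute your own):

# Write, list, and clean up a marker directory on MinIO
hadoop fs -mkdir -p s3a://bucket/smoke-test
hadoop fs -ls s3a://bucket/
hadoop fs -rm -r s3a://bucket/smoke-test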

hadoop distcp

Copy from HDFS to S3A:
hadoop distcp hdfs://ha/user/geosmart/spark s3a://bucket/spark

2021-12-02 06:09:11,482 INFO  [main] tools.OptionsParser (OptionsParser.java:parseBlocksPerChunk(205)) - parseChunkSize: blocksperchunk false
2021-12-02 06:09:13,651 INFO [main] security.TokenCache (TokenCache.java:obtainTokensForNamenodesInternal(144)) - Got dt for hdfs://ha; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:ha, Ident: (token for geosmart: HDFS_DELEGATION_TOKEN owner=geosmart@HADOOP.COM, renewer=yarn, realUser=, issueDate=1638425353607, maxDate=1639030153607, sequenceNumber=222397, masterKeyId=432)
2021-12-02 06:09:14,068 INFO [main] tools.SimpleCopyListing (SimpleCopyListing.java:printStats(594)) - Paths (files+dirs) cnt = 28; dirCnt = 7
2021-12-02 06:09:14,068 INFO [main] tools.SimpleCopyListing (SimpleCopyListing.java:doBuildListing(389)) - Build file listing completed.
2021-12-02 06:09:14,474 INFO [main] tools.DistCp (CopyListing.java:buildListing(94)) - Number of paths in the copy list: 28
2021-12-02 06:09:15,407 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:submitJobInternal(202)) - number of splits:4
2021-12-02 06:09:15,732 INFO [main] mapreduce.JobSubmitter (JobSubmitter.java:printTokens(291)) - Submitting tokens for job: job_1634007232783_42646
2021-12-02 06:09:16,287 INFO [main] impl.YarnClientImpl (YarnClientImpl.java:submitApplication(260)) - Submitted application application_1634007232783_42646
2021-12-02 06:09:16,362 INFO [main] mapreduce.Job (Job.java:submit(1311)) - The url to track the job: http://hadoop-test-40:8088/proxy/application_1634007232783_42646/
2021-12-02 06:09:16,363 INFO [main] tools.DistCp (DistCp.java:execute(193)) - DistCp job-id: job_1634007232783_42646
2021-12-02 06:09:16,364 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1356)) - Running job: job_1634007232783_42646
2021-12-02 06:09:23,757 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1377)) - Job job_1634007232783_42646 running in uber mode : false
2021-12-02 06:09:23,761 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1384)) - map 0% reduce 0%
2021-12-02 06:09:37,366 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1384)) - map 25% reduce 0%
2021-12-02 06:11:53,468 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1384)) - map 100% reduce 0%
2021-12-02 06:12:08,928 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1395)) - Job job_1634007232783_42646 completed successfully
2021-12-02 06:12:09,010 INFO [main] mapreduce.Job (Job.java:monitorAndPrintJob(1402)) - Counters: 38
  File System Counters
    FILE: Number of bytes read=0
    FILE: Number of bytes written=641240
    FILE: Number of read operations=0
    FILE: Number of large read operations=0
    FILE: Number of write operations=0
    HDFS: Number of bytes read=728115665
    HDFS: Number of bytes written=0
    HDFS: Number of read operations=115
    HDFS: Number of large read operations=0
    HDFS: Number of write operations=8
    S3A: Number of bytes read=0
    S3A: Number of bytes written=728108596
    S3A: Number of read operations=261
    S3A: Number of large read operations=0
    S3A: Number of write operations=5991
  Job Counters
    Launched map tasks=4
    Other local map tasks=4
    Total time spent by all maps in occupied slots (ms)=1583700
    Total time spent by all reduces in occupied slots (ms)=0
    Total time spent by all map tasks (ms)=395925
    Total vcore-milliseconds taken by all map tasks=395925
    Total megabyte-milliseconds taken by all map tasks=1621708800
  Map-Reduce Framework
    Map input records=28
    Map output records=0
    Input split bytes=464
    Spilled Records=0
    Failed Shuffles=0
    Merged Map outputs=0
    GC time elapsed (ms)=514
    CPU time spent (ms)=40900
    Physical memory (bytes) snapshot=2998263808
    Virtual memory (bytes) snapshot=22389567488
    Total committed heap usage (bytes)=9028239360
  File Input Format Counters
    Bytes Read=6605
  File Output Format Counters
    Bytes Written=0
  DistCp Counters
    Bytes Copied=728108596
    Bytes Expected=728108596
    Files Copied=28
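For a repeatable migration, the same job can be re-run incrementally with standard distcp flags; a sketch (the mapper count and bandwidth cap are illustrative, not tuned values):

# -update copies only new/changed files; -m bounds mappers, -bandwidth caps MB/s per map
hadoop distcp -update -m 20 -bandwidth 100 \
  hdfs://ha/user/geosmart/spark s3a://bucket/spark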

Problems

Remote debugging

  1. Add the agentlib remote-debug option in /bin/hdfs (a check that the agent is listening follows this list):

    DEBUG_OPTS=" -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=19999 "
    exec "$JAVA" $DEBUG_OPTS -Dproc_$COMMAND $JAVA_HEAP_MAX $HADOOP_OPTS $CLASS "$@"
  2. In IDEA, add the relevant HDFS jars to the project, then attach a Remote JVM Debug configuration and set breakpoints:

    -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=19999
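
Before attaching from IDEA, you can verify that the JDWP agent is actually listening on the HDFS host (port as configured above):

# The debug agent should be bound on port 19999
ss -tlnp | grep 19999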

Name or service not known

java.net.UnknownHostException: ThinkT14: ThinkT14: Name or service not known
Fix this by adding an "IP hostname" mapping for the machine to /etc/hosts, e.g. 10.199.121.12 ThinkT14.
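
A one-line fix on the affected machine, using the IP/hostname pair from the example above:

# Append the mapping so the JVM can resolve the local hostname
echo "10.199.121.12 ThinkT14" | sudo tee -a /etc/hosts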