How to Install GBase Cloud Data Warehouse (GCDW) and Configure HDFS with Kerberos Authentication

The GBase database provides robust capabilities for modern data infrastructure. This guide walks through the step-by-step installation of GCDW (GBase Cloud Data Warehouse) and the configuration of HDFS-related settings, ensuring seamless integration and solid performance. By following it, you will set up a secure and efficient data warehouse environment that leverages both GBase and Hadoop technologies.

Table of Contents

  1. Prerequisites
  2. Resource Limit Configuration
  3. Hosts File Configuration
  4. Configuring Trust for gbase User
  5. Installing GBase Cloud Data Warehouse (GCDW)
  6. Configuring HDFS with Kerberos Authentication
  7. Updating GCluster and GNode Configuration
  8. Enabling Auto-Startup on Boot
  9. Creating a Warehouse
  10. Testing the Setup
  11. Conclusion

Prerequisites

  • Operating System: CentOS 7 or later, Red Hat Enterprise Linux 7 or later
  • Hardware Requirements:
    • CPU: Quad-core processor or higher
    • Memory: At least 16 GB RAM
    • Storage: Minimum of 500 GB free disk space
  • Network Configuration:
    • Static IP addresses for all nodes
    • Proper hostname resolution
  • User Permissions: Root or sudo access
  • Software Packages:
    • Java JDK 1.8 or higher
    • SSH installed and configured for password-less login between nodes
  • Kerberos: KDC server installed and configured
  • Hadoop Cluster: Installed and configured with Kerberos authentication

Resource Limit Configuration

Step 1: Edit /etc/security/limits.conf

Add the following configurations to set resource limits for the gbase user:

gbase soft nofile 65536
gbase hard nofile 65536
gbase soft nproc unlimited
gbase hard nproc unlimited
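
The limits take effect only for new login sessions. As a quick check (assuming the gbase user already exists), open a fresh session and confirm the values:

su - gbase -c "ulimit -n"   # expected: 65536
su - gbase -c "ulimit -u"   # expected: unlimited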

Step 2: Distribute the limits.conf File Across All Nodes

Use a tool like scp or a cluster management script to distribute the file:

scp /etc/security/limits.conf root@<node_ip>:/etc/security/limits.conf

Repeat this for all nodes in the cluster.
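
If you keep a list of node addresses, a small loop avoids repeating the command by hand. This sketch assumes a hypothetical nodes.txt file with one IP or hostname per line and root SSH access to each node:

# Copy the limits file to every node listed in nodes.txt
while read -r node; do
    scp /etc/security/limits.conf root@"$node":/etc/security/limits.conf
done < nodes.txt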

Hosts File Configuration

Step 1: Edit /etc/hosts

Add the IP addresses and hostnames for all nodes, including:

  • Primary and secondary NameNodes in the Hadoop cluster
  • All DataNodes
  • KDC domain and its corresponding hostname

Example:

192.168.1.10 namenode1.hadoop.com namenode1
192.168.1.11 namenode2.hadoop.com namenode2
192.168.1.20 datanode1.hadoop.com datanode1
192.168.1.21 datanode2.hadoop.com datanode2
192.168.1.30 kdc.hadoop.com kdc

Step 2: Distribute the /etc/hosts File Across All Nodes

Again, use scp or a cluster management script:

scp /etc/hosts root@<node_ip>:/etc/hosts

Repeat for all nodes.
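
Before moving on, confirm that every entry resolves on every node. A minimal check using the example hostnames above:

# Each lookup should return the IP configured in /etc/hosts
getent hosts namenode1.hadoop.com namenode2.hadoop.com
getent hosts datanode1.hadoop.com datanode2.hadoop.com
getent hosts kdc.hadoop.com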

Configuring Trust for gbase User

Set up SSH password-less authentication for the gbase user across all nodes.

Step 1: Generate SSH Keys for gbase User

On the primary node:

su - gbase
ssh-keygen -t rsa -b 2048

Press Enter to accept the default file location and leave the passphrase empty.

Step 2: Distribute the Public Key

Copy the public key to all nodes:

for node in node1 node2 node3; do
    ssh-copy-id -i ~/.ssh/id_rsa.pub gbase@$node
done
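
To confirm that the trust relationship works, run a non-interactive command against each node as the gbase user. With BatchMode enabled, ssh fails instead of prompting, so every node should simply print its hostname:

for node in node1 node2 node3; do
    ssh -o BatchMode=yes gbase@$node hostname
done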

Installing GBase Cloud Data Warehouse (GCDW)

Step 1: Extract the Installation Package

Assuming you have the installation package at /opt/tools/gcdw-NoLicense-9.8.0.7.6-redhat8-x86_64.tar.bz2:

sudo mkdir -p /opt/gcdw
sudo tar -xf /opt/tools/gcdw-NoLicense-9.8.0.7.6-redhat8-x86_64.tar.bz2 -C /opt/gcdw/

Step 2: Set Permissions for the GCDW Directory

sudo chown -R gbase:gbase /opt/gcdw/
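
A quick listing confirms that the package extracted where expected and that the gbase user now owns the files (the gcinstall directory and SetSysEnv.py are used in the next step):

ls -ld /opt/gcdw/gcinstall
ls -l /opt/gcdw/gcinstall/SetSysEnv.py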

Step 3: Distribute and Execute SetSysEnv.py

Distribute the Script

Copy the SetSysEnv.py script to all nodes:

scp /opt/gcdw/gcinstall/SetSysEnv.py gbase@<node_ip>:/opt/gcdw/SetSysEnv.py

Execute the Script on All Nodes

Log in to each node and run:

su - gbase
python /opt/gcdw/SetSysEnv.py --dbaUser=gbase --installPrefix=/opt/gcdw --cgroup
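
For a larger cluster, the copy-and-run steps can be combined into a single loop. This is only a sketch, assuming the same hypothetical nodes.txt file (one host per line) and password-less SSH for the gbase user:

# Distribute SetSysEnv.py and run it on each node
while read -r node; do
    scp /opt/gcdw/gcinstall/SetSysEnv.py gbase@"$node":/opt/gcdw/SetSysEnv.py
    ssh gbase@"$node" "python /opt/gcdw/SetSysEnv.py --dbaUser=gbase --installPrefix=/opt/gcdw --cgroup"
done < nodes.txt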

Configuring HDFS with Kerberos Authentication

Step 1: Configure demo.options for Kerberos Authentication

Create a configuration file named demo.options in /opt/gcdw/gcinstall/ with the following content:

gcluster_instance_name=instance_name
instance_root_name=root
instance_root_password=root_password
gcdw_STORAGE_STYLE=hdfs
gcdw_HDFS_URI=hdfs://namenode.hadoop.com:8020/
gcdw_HDFS_AUTH_MODE=kerberos
gcdw_HDFS_PRINCIPAL=hdfs/<host>@<REALM>
gcdw_HDFS_KEYTAB=/etc/hdfs.keytab
gcdw_HDFS_KERBEROS_CONFIG=/etc/krb5.conf

  • Replace instance_name with your desired instance name.
  • Update namenode.hadoop.com with your NameNode’s hostname.
  • Set gcdw_HDFS_PRINCIPAL to your actual Kerberos principal, replacing the <host> and <REALM> placeholders.
  • Ensure the hdfs.keytab file and krb5.conf are correctly placed and accessible; a quick verification sketch follows below.
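
Before running the installer, it is worth confirming that the keytab and principal actually authenticate against the KDC. A minimal check using the keytab path from the options file above (substitute your real principal for the placeholder):

# Show the principals contained in the keytab
klist -kt /etc/hdfs.keytab

# Request a ticket using the keytab, then confirm it was granted
kinit -kt /etc/hdfs.keytab hdfs/<host>@<REALM>
klist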

Step 2: Execute the Deployment Script

Navigate to the installation directory:

cd /opt/gcdw/gcinstall

Run the installation script:

./gcinstall.py --silent=demo.options

Follow any on-screen prompts to complete the installation.

Step 3: Validate Installation

After installation, validate with the following command:

gcadmin account --show

Ensure that the account information displays correctly.

Updating GCluster and GNode Configuration

Step 1: Update GCluster Configuration

On all nodes, edit /opt/<node_ip>/gcluster/config/gbase_8a_gcluster.cnf:

[gbased]
gcdw_hdfs_client_timeout=600
_t_gcluster_support_cte=1
table_definition_cache=5120
table_open_cache=1280
gcluster_random_insert=1
gcluster_send_client_data_timeout=1800
group_concat_max_len=10240
gbase_hdfs_auth_mode=kerberos
gbase_hdfs_protocol=rpc
gbase_hdfs_keytab=/etc/hdfs.keytab
gbase_hdfs_principal=hdfs/<host>@<REALM>
gcdw_hdfs_namenodes=192.168.1.10,192.168.1.11|namenode1.hadoop.com,namenode2.hadoop.com

  • Replace the IP addresses and hostnames with your actual NameNode IPs and hostnames.
  • Ensure the keytab path and the principal (including the <host> and <REALM> placeholders) match your Kerberos setup.

Step 2: Restart GCluster Service

su - gbase
gcluster_services all restart

Step 3: Update GNode Configuration

Edit /opt/<node_ip>/gnode/config/gbase_8a_gbase.cnf:

[gbased]
gbase_loader_parallel_degree=4
gbase_parallel_degree=4
gbase_parallel_max_thread_in_pool=512
gbase_loader_read_timeout=5000
gbase_loader_max_line_length=32M
thread_stack=524288
gbase_hdfs_auth_mode=kerberos
gbase_hdfs_protocol=rpc
gbase_hdfs_keytab=/etc/hdfs.keytab
gbase_hdfs_principal=hdfs/<host>@<REALM>
gcdw_hdfs_namenodes=192.168.1.10,192.168.1.11|namenode1.hadoop.com,namenode2.hadoop.com

Step 4: Restart GBase Service

su - gbase
gbase_services all restart

Enabling Auto-Startup on Boot

Enable GCDW services to start automatically on system boot.

Step 1: Edit /etc/rc.d/rc.local

Add the following lines:

su - gbase -c "gcluster_services all start"
su - gbase -c "gcware_services all start"

Step 2: Make the Script Executable

sudo chmod +x /etc/rc.d/rc.local
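
On systemd-based systems such as CentOS 7, /etc/rc.d/rc.local is run by the rc-local service, which is triggered only when the file is executable. After a reboot, you can check whether the startup commands ran (a sketch; unit behaviour can vary slightly between distributions):

sudo systemctl status rc-local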

Creating a Warehouse

Step 1: Use gcadmin to Create a Warehouse Template

su - gbase
gcadmin createwh e wh.xml

Step 2: Edit wh.xml

Modify the wh.xml file to include:

  • Node IPs in the <NodeList> section
  • Warehouse name and comment

Example:

<Warehouse>
    <Name>my_warehouse</Name>
    <Comment>Production Warehouse</Comment>
    <NodeList>
        <Node>192.168.1.20</Node>
        <Node>192.168.1.21</Node>
    </NodeList>
</Warehouse>

Step 3: Create the Warehouse

gcadmin createwh wh.xml

Verify that the warehouse is created successfully.
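
As a basic sanity check, gcadmin can display the current cluster state, which should now reflect the nodes listed in wh.xml (the exact output, and whether your GCDW version offers a dedicated warehouse listing subcommand, may vary):

gcadmin showcluster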

Testing the Setup

Step 1: Connect to GBase

/home/gbase/GCDW/bin/gcdw_client -h localhost -p 5432 -U gbase -W

Step 2: Create an External Table Pointing to HDFS

CREATE EXTERNAL TABLE hdfs_data (
    id INT,
    name VARCHAR(100)
)
LOCATION ('hdfs://namenode.hadoop.com:8020/user/gbase/data.csv')
FORMAT 'CSV' (DELIMITER ',');

Step 3: Load Data into HDFS

Put your data file into HDFS:

hdfs dfs -mkdir -p /user/gbase
hdfs dfs -put data.csv /user/gbase/
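
If you do not yet have a data file, the sketch below creates a tiny CSV matching the (id, name) columns of the external table defined earlier, uploads it, and confirms it landed in HDFS (the row values are arbitrary illustrations):

# Sample rows matching the external table definition
printf '1,alice\n2,bob\n3,carol\n' > data.csv
hdfs dfs -put -f data.csv /user/gbase/

# Confirm the upload
hdfs dfs -ls /user/gbase/data.csv
hdfs dfs -cat /user/gbase/data.csv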

Step 4: Query the External Table

SELECT * FROM hdfs_data LIMIT 10;

Verify that the data is correctly retrieved from HDFS.

Step 5: Perform Data Analytics

Run aggregate functions or joins to test performance:

SELECT COUNT(*) FROM hdfs_data;

Step 6: Monitor Logs and Performance

  • Check the GBase logs under /var/log/gcdw for any errors; a quick tail sketch follows below.
  • Use Hadoop’s web interfaces to monitor HDFS and resource usage.
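
A simple way to watch for Kerberos or HDFS connectivity errors while the test queries run (assuming the log files under /var/log/gcdw carry a .log suffix):

tail -F /var/log/gcdw/*.log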

Conclusion

With these steps complete, you have installed GBase Cloud Data Warehouse and configured it to use the Hadoop Distributed File System with Kerberos authentication. This setup provides a secure and scalable data management solution, ready for enterprise-level demands.
