Menu Bar

Thursday, March 24, 2016

Understanding Hadoop framework and related terms

Hadoop is a distributed computing framework which is more efficient and optimal solution to solve the big-data problems.

Apache Parquet format. It is also creating tables to represent the HDFS files in Impala / Apache Hive with matching schema.

Parquet is a format designed for analytical applications on Hadoop. Instead of grouping your data into rows like typical data formats, it groups your data into columns. This is ideal for many analytical queries where instead of retrieving data from specific records, you're analyzing relationships between specific variables across many records. Parquet is designed to optimize data storage and retriveval in these scenarios. Source:http://quickstart.cloudera/#/tutorial/ingest_structured_data

Apache Sqoop [Sql + hadoop] is a tool that uses MapReduce to transfer data between Hadoop clusters and relational databases very efficiently.

 Sqoop to import the data into Hive but used Impala to query the data.
This is because Hive and Impala can share both data files and the table metadata. Hive works by compiling SQL queries into MapReduce jobs, which makes it very flexible, whereas Impala executes queries itself and is built from the ground up to be as fast as possible, which makes it better for interactive analysis. We'll use Hive later for an ETL (extract-transform-load) workload.
Sqoop structured data into HDFS, transform it into Avro file format(as Avro is a Hadoop optimized file format), and import the schema files for use when we query this data.

No comments:

Post a Comment