Reading and Writing HDFS Parquet Data

Use the PXF HDFS connector to read and write Parquet-format data. This section describes how to read and write HDFS files that are stored in Parquet format, including how to create, query, and insert into external tables that reference files in the HDFS data store.

PXF supports reading or writing Parquet files compressed with these codecs: snappy, gzip, and lzo.

PXF currently supports reading and writing primitive Parquet data types, as well as one-dimensional LIST arrays of certain primitive types (described below).

Prerequisites

Ensure that you have met the PXF Hadoop Prerequisites before you attempt to read data from or write data to HDFS.

Data Type Mapping

To read and write Parquet primitive data types in Greenplum Database, map Parquet data values to Greenplum Database columns of the same type.

Parquet supports a small set of primitive data types, and uses metadata annotations to extend the data types that it supports. These annotations specify how to interpret the primitive type. For example, Parquet stores both INTEGER and DATE types as the INT32 primitive type. An annotation identifies the original type as a DATE.

Read Mapping

PXF uses the following data type mapping when reading Parquet data:

Parquet Physical Type | Parquet Logical Type | PXF/Greenplum Data Type
boolean               |                      | Boolean
binary (byte_array)   |                      | Bytea
binary (byte_array)   | Date                 | Date
binary (byte_array)   | Timestamp_millis     | Timestamp
binary (byte_array)   | UTF8                 | Text
double                |                      | Float8
fixed_len_byte_array  | Decimal              | Numeric
float                 |                      | Real
int32                 | int_8                | Smallint
int32                 | Date                 | Date
int32                 | Decimal              | Numeric
int32                 |                      | Integer
int64                 | Decimal              | Numeric
int64                 |                      | Bigint
int96                 |                      | Timestamp

Note: PXF supports filter predicate pushdown on all Parquet data types listed above, except the fixed_len_byte_array and int96 types.
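
For example, to read a Parquet file whose schema contains an int64 field, a binary (byte_array) field with the UTF8 annotation, and an int32 field with the Date annotation, declare Greenplum columns of the matching types. The table name, column names, and HDFS path below are illustrative only:

    CREATE EXTERNAL TABLE parquet_read_types_example (
        id        bigint,   -- Parquet int64
        name      text,     -- Parquet binary (byte_array), UTF8 annotation
        sale_date date      -- Parquet int32, Date annotation
    )
    LOCATION ('pxf://data/pxf_examples/types_demo?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');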

PXF can read a Parquet LIST nested type when it represents a one-dimensional array of certain Parquet types. The supported mappings follow:

Parquet Data Type                        | PXF/Greenplum Data Type
list of <boolean>                        | Boolean[]
list of <binary>                         | Bytea[]
list of <binary> (Date)                  | Date[]
list of <binary> (Timestamp_millis)      | Timestamp[]
list of <binary> (UTF8)                  | Text[]
list of <double>                         | Float8[]
list of <fixed_len_byte_array> (Decimal) | Numeric[]
list of <float>                          | Real[]
list of <int32> (int_8)                  | Smallint[]
list of <int32> (Date)                   | Date[]
list of <int32> (Decimal)                | Numeric[]
list of <int32>                          | Integer[]
list of <int64> (Decimal)                | Numeric[]
list of <int64>                          | Bigint[]
list of <int96>                          | Timestamp[]
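
For instance, a Parquet list of <int64> and a list of <binary> (UTF8) map to Bigint[] and Text[] columns, respectively. A minimal sketch, with hypothetical names and path:

    CREATE EXTERNAL TABLE parquet_read_lists_example (
        order_ids bigint[],  -- Parquet list of <int64>
        tags      text[]     -- Parquet list of <binary> (UTF8)
    )
    LOCATION ('pxf://data/pxf_examples/lists_demo?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');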

Write Mapping

PXF uses the following data type mapping when writing Parquet data:

PXF/Greenplum Data Type | Parquet Physical Type | Parquet Logical Type
Bigint                  | int64                 |
Boolean                 | boolean               |
Bpchar [1]              | binary (byte_array)   | UTF8
Bytea                   | binary (byte_array)   |
Date                    | int32                 | Date
Float8                  | double                |
Integer                 | int32                 |
Numeric/Decimal         | fixed_len_byte_array  | Decimal
Real                    | float                 |
SmallInt                | int32                 | int_8
Text                    | binary (byte_array)   | UTF8
Timestamp [2]           | int96                 |
Timestamptz [3]         | int96                 |
Varchar                 | binary (byte_array)   | UTF8
OTHERS                  | UNSUPPORTED           |


[1] Because Parquet does not save the field length, a Bpchar that PXF writes to Parquet becomes text of undefined length.
[2] PXF localizes a Timestamp to the current system time zone and converts it to universal time (UTC) before finally converting to int96.
[3] PXF converts a Timestamptz to a UTC timestamp and then converts to int96. PXF loses the time zone information during this conversion.
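
For example, the following writable external table sketch (table name, column names, and path are illustrative) exercises several of these mappings, including the Bpchar and Timestamp footnote behavior:

    CREATE WRITABLE EXTERNAL TABLE parquet_write_types_example (
        code       bpchar(2),  -- written as binary (byte_array) UTF8; declared length is not preserved (see [1])
        amount     numeric,    -- written as fixed_len_byte_array with the Decimal annotation
        created_at timestamp   -- converted to UTC and written as int96 (see [2])
    )
    LOCATION ('pxf://data/pxf_examples/write_types_demo?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');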

PXF can write a one-dimensional LIST of certain Parquet data types. The supported mappings follow:

PXF/Greenplum Data Type | Parquet Data Type
Bigint[]                | list of <int64>
Boolean[]               | list of <boolean>
Bpchar[] [1]            | list of <binary> (UTF8)
Bytea[]                 | list of <binary>
Date[]                  | list of <int32> (Date)
Float8[]                | list of <double>
Integer[]               | list of <int32>
Numeric[]/Decimal[]     | list of <fixed_len_byte_array> (Decimal)
Real[]                  | list of <float>
SmallInt[]              | list of <int32> (int_8)
Text[]                  | list of <binary> (UTF8)
Timestamp[] [2]         | list of <int96>
Timestamptz[] [3]       | list of <int96>
Varchar[]               | list of <binary> (UTF8)
OTHERS                  | UNSUPPORTED

About Parquet Schemas and Data

Parquet is a columnar storage format. A Parquet data file contains a compact binary representation of the data. The schema defines the structure of the data, and is composed of the same primitive and complex types identified in the data type mapping section above.

A Parquet data file includes an embedded schema. You can choose to provide the schema that PXF uses to write the data to HDFS via the SCHEMA custom option in the CREATE WRITABLE EXTERNAL TABLE LOCATION clause (described below):

External Table Type | SCHEMA Specified? | Behavior
writable            | yes               | PXF uses the specified schema.
writable            | no                | PXF creates the Parquet schema based on the external table definition.

When you provide the Parquet schema file to PXF, you must specify the absolute path to the file; the file may reside on the Greenplum host or on HDFS.
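
For example, the following writable external table sketch supplies a schema file via the SCHEMA custom option; the HDFS paths, file name, and columns are hypothetical:

    CREATE WRITABLE EXTERNAL TABLE pxf_parquet_with_schema (location text, total_sales float8)
        LOCATION ('pxf://data/pxf_examples/schema_demo?PROFILE=hdfs:parquet&SCHEMA=/data/pxf_examples/schemas/sales_schema')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');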

Creating the External Table

The PXF HDFS connector hdfs:parquet profile supports reading and writing HDFS data in Parquet format. When you insert records into a writable external table, the block(s) of data that you insert are written to one or more files in the directory that you specified.

Use the following syntax to create a Greenplum Database external table that references an HDFS directory:

CREATE [WRITABLE] EXTERNAL TABLE <table_name>
    ( <column_name> <data_type> [, ...] | LIKE <other_table> )
LOCATION ('pxf://<path-to-hdfs-dir>?PROFILE=hdfs:parquet[&SERVER=<server_name>][&<custom-option>=<value>[...]]')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];

The specific keywords and values used in the Greenplum Database CREATE EXTERNAL TABLE command are described in the table below.

Keyword | Value
<path-to-hdfs-dir> | The path to the directory in the HDFS data store. When the <server_name> configuration includes a pxf.fs.basePath property setting, PXF considers <path-to-hdfs-dir> to be relative to the base path specified. Otherwise, PXF considers it to be an absolute path. <path-to-hdfs-dir> must not specify a relative path nor include the dollar sign ($) character.
PROFILE | The PROFILE keyword must specify hdfs:parquet.
SERVER=<server_name> | The named server configuration that PXF uses to access the data. PXF uses the default server if not specified.
<custom-option> | <custom-option>s are described below.
FORMAT 'CUSTOM' | Use FORMAT 'CUSTOM' with (FORMATTER='pxfwritable_export') (write) or (FORMATTER='pxfwritable_import') (read).
DISTRIBUTED BY | If you want to load data from an existing Greenplum Database table into the writable external table, consider specifying the same distribution policy or <column_name> on both tables. Doing so avoids extra motion of data between segments during the load operation.
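
As an illustration of the SERVER and DISTRIBUTED BY options, the sketch below targets a non-default PXF server configuration and matches the distribution key of a hypothetical source table; the server name, table, columns, and path are placeholders:

    CREATE WRITABLE EXTERNAL TABLE sales_export (region text, amount numeric)
        LOCATION ('pxf://data/exports/sales?PROFILE=hdfs:parquet&SERVER=hdp_prod')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')
    DISTRIBUTED BY (region);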

The PXF hdfs:parquet profile supports the following read option. You specify this option in the CREATE EXTERNAL TABLE LOCATION clause:

Read Option | Value Description
IGNORE_MISSING_PATH | A Boolean value that specifies the action to take when <path-to-hdfs-dir> is missing or invalid. The default value is false; PXF returns an error in this situation. When the value is true, PXF ignores missing path errors and returns an empty fragment.
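
For example, the following readable external table sketch (path and columns are illustrative) tolerates a missing HDFS directory by setting IGNORE_MISSING_PATH:

    CREATE EXTERNAL TABLE parquet_optional_dir (id int, val text)
        LOCATION ('pxf://data/pxf_examples/optional_dir?PROFILE=hdfs:parquet&IGNORE_MISSING_PATH=true')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');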

The PXF hdfs:parquet profile supports encoding- and compression-related write options. You specify these write options in the CREATE WRITABLE EXTERNAL TABLE LOCATION clause. The hdfs:parquet profile supports the following custom write options:

Write Option | Value Description
COMPRESSION_CODEC | The compression codec alias. Supported compression codecs for writing Parquet data include: snappy, gzip, lzo, and uncompressed. If this option is not provided, PXF compresses the data using snappy compression.
ROWGROUP_SIZE | A Parquet file consists of one or more row groups, a logical partitioning of the data into rows. ROWGROUP_SIZE identifies the size (in bytes) of the row group. The default row group size is 8 * 1024 * 1024 bytes.
PAGE_SIZE | A row group consists of column chunks that are divided up into pages. PAGE_SIZE is the size (in bytes) of such a page. The default page size is 1 * 1024 * 1024 bytes.
ENABLE_DICTIONARY | A Boolean value that specifies whether or not to enable dictionary encoding. The default value is true; dictionary encoding is enabled when PXF writes Parquet files.
DICTIONARY_PAGE_SIZE | When dictionary encoding is enabled, there is a single dictionary page per column, per row group. DICTIONARY_PAGE_SIZE is similar to PAGE_SIZE, but for the dictionary. The default dictionary page size is 1 * 1024 * 1024 bytes.
PARQUET_VERSION | The Parquet version; PXF supports the values v1 and v2 for this option. The default Parquet version is v1.
SCHEMA | The absolute path to the Parquet schema file on the Greenplum host or on HDFS.

Note: You must explicitly specify uncompressed if you do not want PXF to compress the data.
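
For example, the following writable external table sketch (path and columns are illustrative) writes gzip-compressed Parquet v2 files with a 32 MB row group size:

    CREATE WRITABLE EXTERNAL TABLE pxf_parquet_gzip (location text, total_sales float8)
        LOCATION ('pxf://data/pxf_examples/pxf_parquet_gzip?PROFILE=hdfs:parquet&COMPRESSION_CODEC=gzip&ROWGROUP_SIZE=33554432&PARQUET_VERSION=v2')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');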

Parquet files that you write to HDFS with PXF have the following naming format: <file>.<compress_extension>.parquet, for example 1547061635-0000004417_0.gz.parquet.

Example

This example utilizes the data schema introduced in Example: Reading Text Data on HDFS and adds a new column, item_quantity_per_order, an integer array whose length equals the value of number_of_orders and that identifies the number of items in each order.

Column Name             | Data Type
location                | text
month                   | text
number_of_orders        | int
item_quantity_per_order | int[]
total_sales             | float8

In this example, you create a Parquet-format writable external table that uses the default PXF server to reference Parquet-format data in HDFS, insert some data into the table, and then create a readable external table to read the data.

  1. Use the hdfs:parquet profile to create a writable external table. For example:

    postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
                 LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
               FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
  2. Write a few records to the pxf_parquet HDFS directory by inserting directly into the pxf_tbl_parquet table. For example:

    postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Frankfurt', 'Mar', 3, '{1,11,111}', 3956.98 );
    postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 2, '{3333,7777}', 96645.37 );
  3. Recall that Greenplum Database does not support directly querying a writable external table. To read the data in pxf_parquet, create a readable external Greenplum Database table referencing this HDFS directory:

    postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
                 LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
               FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
  4. Query the readable external table read_pxf_parquet:

    postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
     location  | month | number_of_orders | item_quantity_per_order | total_sales
    -----------+-------+------------------+-------------------------+-------------
     Frankfurt | Mar   |                3 | {1,11,111}              |     3956.98
     Cleveland | Oct   |                2 | {3333,7777}             |    96645.37
    (2 rows)

Understanding Overflow Conditions When Writing Numeric Data

PXF uses the HiveDecimal class to write numeric Parquet data. HiveDecimal limits both the precision and the scale of a numeric type to a maximum of 38.

When you define a NUMERIC column in an external table without specifying a precision or scale, PXF internally maps the column to a DECIMAL(38, 18).

A precision overflow condition can result when:

  • You define a NUMERIC column in the external table, and the integer digit count of a value exceeds the maximum precision of 38. For example, 1234567890123456789012345678901234567890.12345, which has an integer digit count of 40.
  • You define a NUMERIC(<precision>) column with a <precision> greater than 38. For example, NUMERIC(55).
  • You define a NUMERIC(<precision>, <scale>) column and the integer digit count of a value is greater than <precision> - <scale>. For example, you define a NUMERIC(20,4) column and the value is 12345678901234567.12, which has an integer digit count of 17, greater than 20-4=16.
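
For example, the first condition can be reproduced with an unconstrained NUMERIC column, which PXF maps to DECIMAL(38, 18); the table name and HDFS path below are illustrative:

    CREATE WRITABLE EXTERNAL TABLE decimal_overflow_demo (id int, amount numeric)
        LOCATION ('pxf://data/pxf_examples/decimal_demo?PROFILE=hdfs:parquet')
    FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');

    -- The value below has 40 integer digits, which exceeds the maximum precision of 38,
    -- so PXF handles it according to the overflow setting described below.
    INSERT INTO decimal_overflow_demo VALUES (1, 1234567890123456789012345678901234567890.12345);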

PXF can take one of three actions when it detects an overflow while writing numeric data to a Parquet file: round the value (the default), return an error, or ignore the overflow. The pxf.parquet.write.decimal.overflow property in the pxf-site.xml server configuration governs PXF’s action in this circumstance; valid values for this property follow:

Value  | PXF Action
round  | When PXF encounters an overflow, it attempts to round the value before writing and logs a warning. PXF reports an error if rounding fails. This may potentially leave an incomplete data set in the external system. round is the default.
error  | PXF reports an error when it encounters an overflow, and the transaction fails.
ignore | PXF writes a NULL value. (This was PXF's behavior prior to version 6.6.0.)

PXF always logs a warning when it detects an overflow, regardless of the pxf.parquet.write.decimal.overflow property setting.
