
In cases like this I usually create a Hive table over the TSV file, then analyze the data with HQL queries.

First, create an external table over the file:

CREATE EXTERNAL TABLE `database_name.table_name` (
 id bigint COMMENT 'name each column exactly the same as the TSV file header',
 class string,
 type string,
 access string,
 category string
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.OpenCSVSerde'
WITH SERDEPROPERTIES (
 "separatorChar" = "\t",
 "quoteChar" = "'",
 "escapeChar" = "\\"
)
STORED AS INPUTFORMAT
    'org.apache.hadoop.mapred.TextInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION
    'file_location(s3, hdfs, etc)'
TBLPROPERTIES ('serialization.null.format' = '');
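Two caveats worth noting. OpenCSVSerde treats every column as a string regardless of the declared type, so the id column above will actually come back as string and needs a CAST in queries. Also, if the TSV file still contains its header as the first data row, Hive can be told to skip it. A sketch of both, using the hypothetical table and column names from above:

-- Skip the header row of the TSV file
ALTER TABLE `database_name.table_name`
SET TBLPROPERTIES ('skip.header.line.count' = '1');

-- Cast the string-typed id column back to bigint when querying
SELECT CAST(id AS bigint) AS id, category
FROM database_name.table_name
LIMIT 10;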

Second, you can now analyze the data with queries against this table, like this!

SELECT * FROM table_name WHERE column_name = '' LIMIT 100;
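From here the usual HQL toolbox applies. For example, a quick look at how rows are distributed per category (column and table names are the hypothetical ones from the CREATE TABLE above):

-- Count rows per category, most frequent first
SELECT category, COUNT(*) AS cnt
FROM database_name.table_name
GROUP BY category
ORDER BY cnt DESC
LIMIT 10;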