[Big Data Infrastructure] Getting Started with the StreamSets Basic Tutorial (Origin)

리그캣's Dev Playground

리그캣 2019. 2. 22. 18:44

Configuring the Origin

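A StreamSets pipeline is made up of three kinds of stages: origin, processor, and destination.

The origin is the data source. In an ETL workflow, the source data has to be brought in before anything else can be done with it.

The Basic tutorial uses a local .csv file as its origin.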
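To make the three-part structure concrete, here is a minimal sketch in plain Python. It is not StreamSets code: the function names, the `payment_type` field, and the sample rows are all made up for illustration. It only shows how records flow from an origin, through a processor, to a destination.

```python
from typing import Dict, Iterable, Iterator, List

Record = Dict[str, str]


def origin(rows: Iterable[Record]) -> Iterator[Record]:
    """Origin: brings records into the pipeline."""
    for row in rows:
        yield row


def processor(records: Iterable[Record]) -> Iterator[Record]:
    """Processor: transforms each record (here, upper-cases one field)."""
    for record in records:
        yield {**record, "payment_type": record["payment_type"].upper()}


def destination(records: Iterable[Record]) -> List[Record]:
    """Destination: writes records out (here, just collects them in a list)."""
    return list(records)


# Hypothetical sample rows, not taken from the tutorial file.
sample_rows = [{"payment_type": "crd"}, {"payment_type": "csh"}]
result = destination(processor(origin(sample_rows)))
print(result)
```

Each stage only consumes what the previous one yields, which is the same idea as connecting stages on the pipeline canvas.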

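An origin represents the data coming into the pipeline. When you configure an origin, you define how to connect to the origin system, the type of data to process, and other properties specific to that origin.

Data Collector provides a wide range of origins. Here, the Directory origin is used to process the sample CSV file that was downloaded earlier.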

To add the stage to the canvas, click Select Origin > Directory in the Pipeline Creation Help Bar.
Alternatively, click the Directory origin in the Stage Library panel.
In the Properties panel, click the Files tab and configure the following properties:

Directory Property	Value
Files Directory	Directory where you saved the sample file. Enter an absolute path. We recommend: /<base directory>/tutorial/origin.
File Name Pattern	The Directory origin processes only the files in the directory that match the file name pattern. The tutorial sample file name is nyc_taxi_data.csv. Since the file is the only file in the directory, you can use something generic, like the asterisk wild card (*) or *.csv. If you had other .csv files in the directory that you didn't want to process, you might be more specific, like this: nyc_taxi*.csv. Or if you want to process files with prefixes for other cities, you might use *taxi*.csv.
Read Order	This determines the read order when the directory includes multiple files. You can read based on the last-modified timestamp or file name. Because it's simpler, let's use Last Modified Timestamp.
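The effect of File Name Pattern and Read Order can be approximated in plain Python. The sketch below is only an illustration and not how Data Collector is implemented: `select_files` is a hypothetical helper, Python's `fnmatch` stands in for the origin's pattern matching, and the extra file names are made up.

```python
import fnmatch
import os
import tempfile
from typing import List


def select_files(directory: str, pattern: str, read_order: str = "timestamp") -> List[str]:
    """Return matching file names in the order they would be read."""
    names = [n for n in os.listdir(directory) if fnmatch.fnmatchcase(n, pattern)]
    if read_order == "timestamp":
        # Oldest file first; the name breaks ties between equal timestamps.
        names.sort(key=lambda n: (os.path.getmtime(os.path.join(directory, n)), n))
    else:
        # Plain lexicographic order by file name.
        names.sort()
    return names


with tempfile.TemporaryDirectory() as tmp:
    # Create three empty files with increasing last-modified timestamps.
    for i, name in enumerate(["sf_taxi_data.csv", "nyc_taxi_data.csv", "notes.txt"]):
        path = os.path.join(tmp, name)
        open(path, "w").close()
        os.utime(path, (1000 + i, 1000 + i))

    all_csv = select_files(tmp, "*.csv")
    nyc_only = select_files(tmp, "nyc_taxi*.csv")
    any_city = select_files(tmp, "*taxi*.csv", read_order="name")

print(all_csv, nyc_only, any_city)
```

With only the tutorial file in the directory, every one of these patterns selects the same single file, which is why a generic pattern is enough here.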

Fill in the properties as described above.

With this configuration, the data is read from /home/sdc~.

Next, click the Data Format tab and configure the following properties:

Delimited Property	Description
Data Format	The data in the sample file is delimited, so select Delimited.
Delimiter Format Type	Since the sample file is a standard CSV file, use the default: Default CSV (ignores empty lines).
Header Line	The sample file includes a header, so select With Header Line.
Root Field Type	This property determines how the Data Collector processes delimited data. Use the default List-Map. This allows you to use standard functions to process delimited data. With the List root field type, you need to use delimited data functions.
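As a rough analogy for these settings, the sketch below parses a small delimited sample with Python's `csv` module. Everything here is an assumption for illustration: the column names and rows are made up rather than taken from nyc_taxi_data.csv, and the two record shapes only approximate what the List-Map and List root field types produce.

```python
import csv
from typing import Dict, List

# Hypothetical sample with a header line and one empty line.
SAMPLE = "vendor_id,payment_type,fare_amount\n\nCMT,CRD,12.5\nVTS,CSH,7.0\n"


def non_empty_lines(text: str) -> List[str]:
    """Drop empty lines, similar in spirit to 'Default CSV (ignores empty lines)'."""
    return [line for line in text.splitlines() if line.strip()]


def read_list_map(text: str) -> List[Dict[str, str]]:
    """List-Map analogy: each record maps header names to values, in column order."""
    return [dict(row) for row in csv.DictReader(non_empty_lines(text))]


def read_list(text: str) -> List[List[Dict[str, str]]]:
    """List analogy: each record is an indexed list of header/value pairs."""
    reader = csv.reader(non_empty_lines(text))
    header = next(reader)  # 'With Header Line': the first row names the columns
    return [[{"header": h, "value": v} for h, v in zip(header, row)] for row in reader]


list_map_records = read_list_map(SAMPLE)
list_records = read_list(SAMPLE)

# With the List-Map shape a field is addressed by name...
print(list_map_records[0]["fare_amount"])
# ...while the List shape needs a position plus a 'value' lookup.
print(list_records[0][2]["value"])
```

The name-based access in the first shape is what makes the default List-Map convenient for the standard field functions mentioned in the table.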

Following the table above gives you the finished configuration.

And that's it! The origin is now set up.

Next time, we'll look at preview.
