Deep Learning in Udacity: Assignment 1 (2016. 3. 9, A.I. Lab, 전명중)
notMNIST: There are 10 classes, the letters A-J rendered in a variety of fonts, with about 500,000 images for training data and about 19,000 for test data. Fig 1. Examples of the letter “A”. URL: http://yaroslavvb.com/upload/notMNIST/
Download to the Local Machine: download the dataset file from the URL to the local machine, then check the downloaded file's size to verify it is the expected file. The data consists of characters rendered in a variety of fonts as 28 x 28 images.
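A minimal sketch of this step, assuming the assignment's maybe_download convention (fetch the file with urllib if it is not already present, then compare the on-disk size against an expected byte count; the exact byte counts are not given on the slide and must be supplied by the caller):

    # Sketch of the download-and-verify step described above.
    import os
    from urllib.request import urlretrieve

    url = 'http://yaroslavvb.com/upload/notMNIST/'

    def maybe_download(filename, expected_bytes):
        """Download filename from the URL if it is not already present,
        then verify that the file size matches the expected number of bytes."""
        if not os.path.exists(filename):
            filename, _ = urlretrieve(url + filename, filename)
        statinfo = os.stat(filename)
        if statinfo.st_size == expected_bytes:
            print('Found and verified', filename)
        else:
            raise Exception('Unexpected size for ' + filename +
                            '. Can you get to it with a browser?')
        return filename

    # Example usage (byte counts taken from the course page, not the slide):
    # train_filename = maybe_download('notMNIST_large.tar.gz', expected_bytes=<size of large set>)
    # test_filename = maybe_download('notMNIST_small.tar.gz', expected_bytes=<size of small set>)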
Uncompress .tar.gz, labelled A through J: 'notMNIST_large.tar.gz' → 'notMNIST_large.tar' → 'notMNIST_large'.
os.path.splitext: splits a given path into the extension and the remainder.
os.path.isdir: checks whether a path is a directory.
os.path.join: joins path components using the OS-specific separator.
os.listdir(root): returns the files and directories under root as a list.
Code along the following lines:
    for d in sorted(os.listdir('notMNIST_large')):
        if os.path.isdir(os.path.join('notMNIST_large', d)):
            os.path.join('notMNIST_large', d)
Result: [notMNIST_large/A, notMNIST_large/B, notMNIST_large/C, ..., notMNIST_large/J]
Python os.path module: http://devanix.tistory.com/298
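A sketch of the extraction step implied above, using the standard tarfile module; the helper name maybe_extract and the force flag follow the assignment's convention and are assumptions here:

    # Strip the .tar.gz extension, extract the archive if the target directory
    # does not already exist, then list the ten per-letter folders (A..J).
    import os
    import sys
    import tarfile

    def maybe_extract(filename, force=False):
        root = os.path.splitext(os.path.splitext(filename)[0])[0]  # strip .tar.gz
        if os.path.isdir(root) and not force:
            print('%s already present - skipping extraction of %s.' % (root, filename))
        else:
            print('Extracting data for %s. This may take a while.' % root)
            sys.stdout.flush()
            tar = tarfile.open(filename)
            tar.extractall()
            tar.close()
        data_folders = [os.path.join(root, d)
                        for d in sorted(os.listdir(root))
                        if os.path.isdir(os.path.join(root, d))]
        return data_folders

    # train_folders = maybe_extract('notMNIST_large.tar.gz')
    # -> ['notMNIST_large/A', 'notMNIST_large/B', ..., 'notMNIST_large/J']
    # test_folders = maybe_extract('notMNIST_small.tar.gz')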
Changing the data to a manageable format (case of the 'A' images): normalize the images in each folder and stack them into a 3D array. E.g., for the notMNIST_large/A folder, each 28 x 28 image (img1.png, ...) is converted with astype(float), normalized (feature scaling), and inserted into a (52,909, 28, 28) array; then the next image is loaded and the same steps are repeated. When the array is complete it is dumped to notMNIST_large/A.pickle, and likewise for notMNIST_large/B, notMNIST_large/C, and so on.
train_folders: [notMNIST_large/A, notMNIST_large/B, notMNIST_large/C, ..., notMNIST_large/J]
test_folders: [notMNIST_small/A, notMNIST_small/B, notMNIST_small/C, ..., notMNIST_small/J]
Changing the data to a manageable format (maybe_pickle): the maybe_pickle function creates the output pickle file for each class and saves the 3D array once it is returned. It appends .pickle to the existing folder name (e.g., notMNIST_large/A.pickle) and adds that name to a list called dataset_names. (1) With folder = 'notMNIST_large/A' and min_num_images_per_class = 45000, go to the load_letter function!
Input folders: ['notMNIST_large/A', 'notMNIST_large/B', 'notMNIST_large/C', ...
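A sketch of maybe_pickle along the lines described above, closely following the assignment's structure (the force flag is an assumption; load_letter is sketched after the next slides):

    # For each class folder, build the 3D array with load_letter and dump it
    # to <folder>.pickle, collecting the pickle file names in dataset_names.
    import os
    import pickle

    def maybe_pickle(data_folders, min_num_images_per_class, force=False):
        dataset_names = []
        for folder in data_folders:
            set_filename = folder + '.pickle'          # e.g. notMNIST_large/A.pickle
            dataset_names.append(set_filename)
            if os.path.exists(set_filename) and not force:
                print('%s already present - skipping pickling.' % set_filename)
            else:
                print('Pickling %s.' % set_filename)
                dataset = load_letter(folder, min_num_images_per_class)
                with open(set_filename, 'wb') as f:    # 'wb': write in binary mode
                    pickle.dump(dataset, f, pickle.HIGHEST_PROTOCOL)
        return dataset_names

    # train_datasets = maybe_pickle(train_folders, 45000)
    # test_datasets = maybe_pickle(test_folders, 1800)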
Changing the data to a manageable format (load_letter): with folder = 'notMNIST_large/A' and min_num_images_per_class = 45000, (1) list the image files in the folder, image_files = ['image1.png', 'image2.png', 'image3.png', ...], and allocate a 3D array of shape (52909, 28, 28) to hold the 52,909 images of 28 x 28 pixels.
Changing the data to a manageable format (load_letter): (2) read 'notMNIST_large/A/image1.png' into a 28 x 28 matrix of raw pixel values (e.g. 0, 12, 11, 82, 116, ...) and normalize it (feature scaling) as (pixel - pixel_depth/2) / pixel_depth with pixel_depth = 255, so the values fall roughly in [-0.5, 0.5] (e.g. 0 → -0.5, 12 → -0.45294118, 116 → -0.04509804).
Changing the data to a manageable format (load_letter): (3) insert the normalized image_data into the prepared dataset array (shape (52909, 28, 28)) at the current index: dataset[0, :, :] for img1.png, then idx 1, 2, ... for the following images. Load the next image and repeat the same steps; this is the end of the for loop.
Changing the data to a manageable format (load_letter): (4) with folder = 'notMNIST_large/A' and min_num_images_per_class = 45000, after the loop an exception is raised if fewer images were loaded than required; notMNIST_large/A contains 52,909 images, so the check passes.
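Putting steps (1)-(4) together, a sketch of load_letter; the assignment reads images with scipy.ndimage.imread, and imageio.imread is used here as a drop-in substitute since ndimage.imread was removed from recent SciPy releases:

    # Load all images of one letter class into a normalized 3D array.
    import os
    import numpy as np
    import imageio

    image_size = 28      # images are 28 x 28 pixels
    pixel_depth = 255.0  # number of levels per pixel

    def load_letter(folder, min_num_images):
        image_files = os.listdir(folder)                       # (1) list the image files
        dataset = np.ndarray(shape=(len(image_files), image_size, image_size),
                             dtype=np.float32)
        num_images = 0
        for image in image_files:
            image_file = os.path.join(folder, image)
            try:
                # (2) read raw pixels and normalize to roughly [-0.5, 0.5]
                image_data = (imageio.imread(image_file).astype(float) -
                              pixel_depth / 2) / pixel_depth
                if image_data.shape != (image_size, image_size):
                    raise Exception('Unexpected image shape: %s' % str(image_data.shape))
                dataset[num_images, :, :] = image_data         # (3) insert into the array
                num_images += 1
            except (IOError, ValueError) as e:
                print('Could not read:', image_file, '- skipping.', e)
        dataset = dataset[0:num_images, :, :]
        if num_images < min_num_images:                        # (4) sanity check
            raise Exception('Fewer images than expected: %d < %d' %
                            (num_images, min_num_images))
        print(folder, 'tensor shape:', dataset.shape)
        return dataset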
Changing the data to a manageable format (maybe_pickle): come back from load_letter! (2) The returned 3D array, e.g. (52914, 28, 28), is saved to the class's .pickle file.
# The pickle module works on any Python object (classes, functions, methods, ...); because it stores objects in serialized form, it is widely used not only for file handling but also for network communication.
# HIGHEST_PROTOCOL: the newest pickle protocol (a binary format, as opposed to the older human-readable text format).
* 'wb': opened for writing in binary mode.
Making the Training, Validation, and Test Sets.
For the validation and training sets: from each notMNIST_large/<letter>.pickle (e.g. A.pickle with 52,909 images), 1,000 images go to the validation set and 20,000 go to the training set (the remaining 31,909 are unused), giving a training set of 200,000 (10 classes x 20,000) and a validation set of 10,000 (10 classes x 1,000).
For the test set: from each notMNIST_small/<letter>.pickle (e.g. A.pickle with 1,872 images), 1,000 images go to the test set.
The remaining pickle files (B, C, ..., J) are split into the sets in the same way, and the label (y value) is generated along with each image: A = label(0), B = label(1), C = label(2), ..., J = label(9).
Making the Training, Validation, and Test Sets: train_datasets (A.pickle, B.pickle, ...) are merged into arrays of size train_size and valid_size, producing train_dataset / train_labels (200,000 = 10 classes x 20,000), valid_dataset / valid_labels (10,000 = 10 classes x 1,000), and test_dataset / test_labels (10,000), with labels ranging from 0 to 9. (Printed result: the shapes of the resulting datasets and labels.)
Making the Training, Validation, and Test Sets (merge arguments): pickle_files = ['notMNIST_large/A.pickle', 'notMNIST_large/B.pickle', ...], train_size = 200,000, valid_size = 10,000, num_classes = 10. Allocated outputs: valid_dataset = 3D array (10000, 28, 28), valid_labels = (10000,); train_dataset = 3D array (200000, 28, 28), train_labels = (200000,). vsize_per_class = 10000 // 10 = 1000 and tsize_per_class = 200000 // 10 = 20000 # the number of images of each class that go into each set.
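The empty output arrays above can be allocated with a small helper; this make_arrays sketch follows the assignment's convention, and its name and signature are assumptions rather than something shown on the slide:

    # Given a number of rows and the image size, return an empty image tensor
    # and an empty label vector (or None, None when the requested size is 0).
    import numpy as np

    image_size = 28

    def make_arrays(nb_rows, img_size):
        if nb_rows:
            dataset = np.ndarray((nb_rows, img_size, img_size), dtype=np.float32)
            labels = np.ndarray(nb_rows, dtype=np.int32)
        else:
            dataset, labels = None, None
        return dataset, labels

    # valid_dataset, valid_labels = make_arrays(10000, image_size)   # (10000, 28, 28), (10000,)
    # train_dataset, train_labels = make_arrays(200000, image_size)  # (200000, 28, 28), (200000,)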
Making the Training, Validation, and Test Sets (merge loop, validation part). First iteration: label = 0, pickle_file = 'notMNIST_large/A.pickle'; vsize_per_class = 1000, tsize_per_class = 20000, so end_v = 1000, end_t = 20000, end_l = 21000. The letter set loaded from the pickle has shape (52909, 28, 28); only the first 1,000 images are extracted and inserted into the prepared valid_dataset, and valid_labels[0:1000] is set to label(0). The window then advances to start_v = 1000, end_v = 2000 for the next class.
# enumerate attaches an automatically incrementing index, starting from 0, to each item.
# with ... as: for file handling, the interpreter closes the file automatically.
* 'rb': opened for reading in binary mode.
Making the Training, Validation, and Test Sets (merge loop, training part). Still with label = 0 and pickle_file = 'notMNIST_large/A.pickle' (letter_set of shape (52909, 28, 28)), start_t = 0, end_t = 20,000, end_l = 21,000. Because indices 0-999 of letter_set were already used for valid_dataset, the 20,000 images at indices 1,000-20,999 are copied into train_dataset, and train_labels[0:20000] is set to label(0). The window then advances to start_t = 20,000, end_t = 40,000 for the next class, until the training set holds 200,000 images (20,000 per class).
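Combining the two slides above, a sketch of the merge loop, reusing the make_arrays helper and image_size from the earlier sketch (the structure follows the assignment's merge_datasets; the original code additionally shuffles each letter_set before slicing, which is omitted here to match the slide's description):

    # Slice each class pickle into a validation part and a training part,
    # writing the matching label into the label arrays.
    import pickle

    def merge_datasets(pickle_files, train_size, valid_size=0):
        num_classes = len(pickle_files)
        valid_dataset, valid_labels = make_arrays(valid_size, image_size)
        train_dataset, train_labels = make_arrays(train_size, image_size)
        vsize_per_class = valid_size // num_classes     # e.g. 10000 // 10 = 1000
        tsize_per_class = train_size // num_classes     # e.g. 200000 // 10 = 20000

        start_v, start_t = 0, 0
        end_v, end_t = vsize_per_class, tsize_per_class
        end_l = vsize_per_class + tsize_per_class       # e.g. 21000
        for label, pickle_file in enumerate(pickle_files):   # label: 0 for A, 1 for B, ...
            with open(pickle_file, 'rb') as f:          # 'rb': read in binary mode
                letter_set = pickle.load(f)             # e.g. shape (52909, 28, 28)
                if valid_dataset is not None:
                    valid_dataset[start_v:end_v, :, :] = letter_set[:vsize_per_class, :, :]
                    valid_labels[start_v:end_v] = label
                    start_v += vsize_per_class
                    end_v += vsize_per_class
                # images vsize_per_class .. end_l-1 go to the training set
                train_dataset[start_t:end_t, :, :] = letter_set[vsize_per_class:end_l, :, :]
                train_labels[start_t:end_t] = label
                start_t += tsize_per_class
                end_t += tsize_per_class
        return valid_dataset, valid_labels, train_dataset, train_labels

    # valid_dataset, valid_labels, train_dataset, train_labels = \
    #     merge_datasets(train_datasets, 200000, 10000)
    # _, _, test_dataset, test_labels = merge_datasets(test_datasets, 10000)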
Data shuffle: the data and labels need to be well shuffled for the training and test sets. E.g., for an array a of shape (10,), a random permutation of a's indices is generated and then used to reindex both the dataset and the labels in the same order, so they stay aligned.
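A sketch of this shuffle step; the helper name randomize and the use of np.random.permutation follow the assignment's convention:

    # Generate one random permutation of the row indices and apply it to both
    # the dataset and the labels so that image-label pairs remain matched.
    import numpy as np

    def randomize(dataset, labels):
        permutation = np.random.permutation(labels.shape[0])
        shuffled_dataset = dataset[permutation, :, :]
        shuffled_labels = labels[permutation]
        return shuffled_dataset, shuffled_labels

    # train_dataset, train_labels = randomize(train_dataset, train_labels)
    # valid_dataset, valid_labels = randomize(valid_dataset, valid_labels)
    # test_dataset, test_labels = randomize(test_dataset, test_labels)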
Save Data: save everything as just one file, stored as a dictionary. # The resulting file is about 690.8 MB.
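A sketch of this final save step; the file name notMNIST.pickle and the dictionary keys follow the assignment's convention and are assumptions here, and the six arrays are assumed to come from the merge and shuffle steps above:

    # Write all six arrays into a single pickle file as a dictionary.
    import os
    import pickle

    def save_all(pickle_file, train_dataset, train_labels,
                 valid_dataset, valid_labels, test_dataset, test_labels):
        save = {
            'train_dataset': train_dataset,
            'train_labels': train_labels,
            'valid_dataset': valid_dataset,
            'valid_labels': valid_labels,
            'test_dataset': test_dataset,
            'test_labels': test_labels,
        }
        with open(pickle_file, 'wb') as f:
            pickle.dump(save, f, pickle.HIGHEST_PROTOCOL)
        return os.stat(pickle_file).st_size

    # size = save_all('notMNIST.pickle', train_dataset, train_labels,
    #                 valid_dataset, valid_labels, test_dataset, test_labels)
    # print('Pickle size:', size)   # about 690.8 MB on disk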