Some pretty scrappy notes that got me through my time learning Pig & HDFS

Set up sshfs on the local machine
sshfs -p 22 hsohn@4.26.4.XX:/ebs/user/hsohn /Users/hsohn/Documents/remoteHome/dev/ -o auto_cache,reconnect,defer_permissions,negative_vncache,volname=dev
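To unmount it later (a sketch assuming a standard macOS/osxfuse sshfs mount; same mount point as above):
     umount /Users/hsohn/Documents/remoteHome/dev/
     # if that reports the volume is busy:
     diskutil unmount force /Users/hsohn/Documents/remoteHome/dev/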
Hadoop
For dev access, do
ssh bi@4.26.4.XX
For prod access, do
ssh bi@4.26.4.XX
Run a file in pig
     pig -f query_20150917_test.pig
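For reference, a rough sketch of what such a script might contain (the path, schema, and filter below are made-up placeholders, not the real query):
     -- load tab-separated logs (hypothetical path and schema)
     logs = LOAD 'hdfs://hadoop-prod-nn1/user/bi/jhoughton/test/input' USING PigStorage('\t')
            AS (ts:chararray, user_id:chararray, event:chararray);
     -- keep only click events
     clicks = FILTER logs BY event == 'click';
     -- output directory must not already exist
     STORE clicks INTO 'hdfs://hadoop-prod-nn1/user/bi/jhoughton/test/pig_out' USING PigStorage('\t');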
Run locally
     pig -x local script.pig
Retrieving Results from HDFS
     hdfs dfs -text [path / directory]*.gz
Redirect to a file
hdfs dfs -text [path / directory]*.gz > [directory / file name]
i.e. hdfs dfs -text hdfs://hadoop-prod-nn1/user/bi/jhoughton/test/part-r-00000.gz | head -n 100  > ./output.tsv
hdfs dfs -text hdfs://hadoop-prod-nn1/user/bi/jhoughton/test/*.gz | head -n 100  > ./output.tsv
hdfs dfs -text hdfs://hadoop-prod-nn1/user/bi/jhoughton/report/temporary/report_target2/*.gz > ./results/output.tsv
hdfs dfs -text hdfs://hadoop-prod-nn1/user/bi/jhoughton/report/temporary/report_target2/*.gz | head -n 1000 > ./results/output.tsv
sample data
     tracking_logs =  hdfs://hadoop-prod-nn1/user/hive/warehouse/rsyslog_optsoa_tracking/d=2015-10-07/b=2100/dummy_server/part-m-00000.gz
     targeting_logs = hdfs://hadoop-prod-nn1/user/etl/ds/optsoa_rsyslog/rsyslog_OptSOATargeting2/d=2015-10-07/2100/52.22.75.7/rsyslog_OptSOATargeting2.log.201510071444252381.gz
     persona_data = hdfs://10.10.8.5:8020/user/hive/warehouse/rsyslog_OptSOAPersona/d=2015-10-07/part-m-00262.gz.parquet
See the head of a .gz dataset
     gzip -cd persona/tracking_logs.gz | head | less
(running head directly on the .gz just prints compressed bytes)
invoke grunt
(just type “pig”)
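A quick sketch of poking around inside grunt (alias and paths here are made up):
     grunt> fs -ls /user/bi/jhoughton/test
     grunt> logs = LOAD '/user/bi/jhoughton/test/input' USING PigStorage('\t');
     grunt> DUMP logs;
     grunt> quit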
DS hadoop cheatsheet
browse hdfs
     hdfs dfs -ls [directory / path]
     hdfs dfs -ls hdfs://hadoop-prod-nn1/user/bi/jhoughton/report/temporary/report_target2/
Copy a file from HDFS to the local fs
     hdfs dfs -get /user/hadoop/file
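-get also works on a whole output directory, e.g. to pull all the part files down at once (destination path here is just an example):
     hdfs dfs -get hdfs://hadoop-prod-nn1/user/bi/jhoughton/test/ ./test_local/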
copy a file from the server to local
     scp bi@4.26.4.55:/home/bi/jhoughton/results/output2.tsv ./Downloads/
     scp bi@4.26.4.55:/home/bi/jhoughton/results/output2.tsv /Users/john.houghton/Downloads/
     scp [server path] [local path]
copy file from local to server
     scp ~/Downloads/foo.txt bi@4.26.4.57:~ # to the remote home directory
     scp ~/Downloads/foo.txt bi@4.26.4.57:/jhoughton # to a specific directory
Get directory size
     hdfs dfs -du -s hdfs://10.10.8.5:8020/user/etl/ds/optsoa_rsyslog/rsyslog_OptSOATargeting2/d=2015-10-07/
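Add -h for human-readable sizes (assuming the cluster's Hadoop version supports it; 2.x does):
     hdfs dfs -du -s -h hdfs://10.10.8.5:8020/user/etl/ds/optsoa_rsyslog/rsyslog_OptSOATargeting2/d=2015-10-07/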
Copy TO hdfs from local
(to overwrite, first rm, i.e.  hdfs dfs -rm hdfs://hadoop-prod-nn1/user/bi/jhoughton/report/temporary/lookup.txt )
 hdfs dfs -copyFromLocal /home/bi/jhoughton/pig/lookup.txt hdfs://hadoop-prod-nn1/user/bi/jhoughton/report/temporary/
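On Hadoop 2.x-era clusters, put -f should overwrite in one step instead of the rm-then-copy dance (worth double-checking the flag exists on the cluster first):
     hdfs dfs -put -f /home/bi/jhoughton/pig/lookup.txt hdfs://hadoop-prod-nn1/user/bi/jhoughton/report/temporary/lookup.txt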
Kill Job
     mapred job -list (get list of active jobs to find the job_id)
     mapred job -kill [job_id]
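Jobs can also be killed from inside grunt (assuming you have the job_id handy):
     grunt> kill [job_id]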