How did you teach yourself data science?
A few points before answering. First, I had a strong quantitative background before studying data science: I am an Electrical Engineering grad with signal processing and wireless communication as my major area, and I won multiple prizes at the National Mathematical Olympiads of my country, which gave me a higher mathematical aptitude than most people here. Second, I am very curious. I love to learn new things and ask questions, and I like to challenge myself both physically and intellectually. Third, because of my strong mathematical aptitude and algorithmic thinking, learning new programming languages comes easily to me. I learnt R within 2 months and Python within 3 months.

Now let's talk about how I started my journey with data science. First, I started with MOOCs: one machine learning course from Coursera, one linear algebra course from MIT, one deep learning course from Stanford, a few data analysis courses from DataCamp, and so on. Second, I played with a few projects: I completed Kaggle and DrivenData competitions, started exploring GitHub a lot, went through many research papers, and so on. These helped me learn to model real-life problems and solve them from a data-centric perspective. Third, I started looking for jobs in the fields of machine learning, data analysis, and data science. Finally, I am a data scientist now, working on some amazing problems that might one day solve many problems of my country. Thanks!
In Data Analysis, why and when would you choose one over the other: Excel, Tableau, Python or R?
Thanks for the question! I think some further explanation is due, namely the domain of each tool. On one hand, statistical programming tools like R and Python (to some extent, specifically the pandas + NumPy combo) are useful for cleaning data, performing ETL jobs, and generally enriching your data. While you can do that in Excel and Tableau, it is a more complicated chore. On the other hand, Tableau is visualization software, which means you can extract information right away from any given dataset. Excel has this ability as well, but as general-purpose spreadsheet software it is not mainly focused on visualization (although you can build operational, tactical, and strategic dashboards in it) nor on data analysis (it has some tools for it, but that is not the main focus). Like the proverbial multitool, Excel can do a lot, but without much depth, at least straight out of the box. That being said, my personal workflow includes all of the above to perform any given analysis. Let me give you a brief example. Once I get a dataset, I clean it and look at it from within Python or R (even SQL if it happens to be a database) to get a hold of the structure and outliers of the data, and the possible answers you can retrieve from it. In R and in Python you can perform some integrity tests, contingency tables, frequency tables, categorical analysis, and distribution analysis. Here I'll make my first charts: nothing fancy, just some simple histograms and distribution charts (namely bar charts). Then I'll go to Tableau or any viz software to see if there are any meaningful visual correlations within the data and if there is anything outstanding in the set. Finally, I'll export the dataset (full or aggregated, depending on the number of records) to Excel and then create simple graphics and stories. Why Excel? Because in a corporate environment there's a 99% chance the stakeholder will have it, so you don't have to create a PDF or a Shiny page to show your results.
Copy everything into a PPT deck and you're good to go. Again, this is my preferred workflow. Sometimes it is unfeasible, sometimes it is ideal. But you must know the strengths and shortcomings of your tools; otherwise you're bound to underperform and (possibly) not answer the business question correctly.
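The first steps of that workflow (load, inspect structure, frequency tables, an integrity check, then export an aggregate for Excel) can be sketched in pandas. The dataset and column names here are hypothetical stand-ins for a freshly received file:

```python
import pandas as pd

# Hypothetical dataset standing in for a freshly received file
df = pd.DataFrame({
    "region": ["north", "south", "north", "east", "south", "north"],
    "sales": [120.0, 95.5, 130.2, 87.0, None, 110.3],
})

df.info()                             # structure: dtypes, non-null counts
print(df.describe())                  # distribution summary of numeric columns
print(df["region"].value_counts())    # frequency table for a categorical column

# Simple integrity check: how many rows fail before we aggregate?
missing = df[df["sales"].isna()]
print(len(missing), "row(s) with missing sales")

# Export an aggregated version for the Excel/Tableau steps of the workflow
summary = (df.dropna()
             .groupby("region", as_index=False)["sales"].sum())
summary.to_csv("sales_by_region.csv", index=False)
```

From here the CSV can be opened directly in Excel or Tableau, as the answer describes.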
Yes, it is a lot faster than R. That's why Python is replacing R in the field of data science. Know more about why Python is better than R. R vs Python is one of the most common but important questions asked by lots of data science students. Today I am going to tell you the major differences between R and Python. We know that R and Python are both open-source programming languages. Both languages have a large community and are under continuous development, which is why they keep adding new libraries and tools to their catalogs. The major purpose of R is statistical analysis; Python, on the other hand, provides a more general approach to data science. Both are state-of-the-art programming languages for data science. Python is one of the simplest programming languages in terms of its syntax. That's why any beginner can learn Python without extra effort. On the other hand, R was built by statisticians and is a little harder to learn. There are also some reasons why we should not use both R and Python at the same time, which I discuss below.

R: R is one of the older programming languages, developed by academics and statisticians. R came into existence in 1995. R now provides one of the richest ecosystems for data analysis. The R language is full of libraries, and a couple of package repositories are available for it; in fact, CRAN hosts around 12,000 packages. The rich variety of libraries makes it a first choice for statistical analysis and analytical work.

Key features of R:
- Consists of packages for almost any statistical application one can think of; CRAN currently hosts more than 10,000 packages.
- Comes equipped with excellent visualization libraries like ggplot2.
- Capable of standalone analyses.

Python: On the other hand, Python can do the same tasks as R. The major strengths of Python are data wrangling, engineering, web scraping, and so on.
Python also has the tools that help in implementing machine learning at large scale. Python is one of the simplest languages to maintain, and it is more robust than R. Nowadays Python has cutting-edge APIs that are quite helpful in machine learning and AI. Most data scientists use only five Python libraries: NumPy, Pandas, SciPy, Scikit-learn, and Seaborn. It is quite handy to use Python over R.

Key features of the Python programming language:
- Object-oriented language
- General purpose
- Has a lot of extensions and incredible community support
- Simple and easy to understand and learn
- Packages like pandas, NumPy, and scikit-learn make Python an excellent choice for machine learning activities.

R or Python usage: Python was developed by Guido van Rossum in 1991. Python is among the most popular programming languages in the world. It has powerful libraries for math, statistics, artificial intelligence, and machine learning. But Python is still less useful for econometrics and communication, and also for business analytics. On the other hand, R was developed by academics and scientists. It is specially designed for machine learning and data science. R has powerful communication libraries that are quite helpful in data science. In addition, R is equipped with many packages that are used to perform data mining and time-series analysis.

Why not use both? Lots of people think that they can use both programming languages at the same time, but we should avoid mixing them. The majority of people use only one of these languages, though they always want access to the capabilities of the other. If you use both languages at the same time, you may face some problems. For example, if you use R and you want object-oriented functionality, R supports it but awkwardly; on the other hand, Python is not as suitable for statistical distributions.
So they should not use both languages at the same time, because there is a mismatch between their functions. But there are some ways to use these two languages with one another; we will talk about them in our next blog. Let's have a look at the comparison between R and Python.

R is more functional, Python is more object-oriented. R provides a variety of functions to the data scientist, e.g. lm(), predict(), and so on; most of the work in R is done by functions. Python, on the other hand, uses classes to perform tasks.

R has more data analysis built in; Python relies on packages. R provides built-in data analysis, e.g. summary statistics via the built-in summary() function, whereas in Python we have to import the statsmodels package to get the same. In addition, R has a built-in data frame constructor, data.frame(), while in Python we have to import pandas for DataFrames. Python, in turn, supports linear regression and random forests through its scikit-learn package, and as mentioned above it also offers APIs for machine learning and AI, while R has the greater diversity of statistical packages.

R has more statistical support in general. R was created as a statistical language, and it shows. statsmodels and other Python packages provide decent coverage for statistical methods, but the R ecosystem is far larger.

It is usually more straightforward to do non-statistical tasks in Python. With well-placed libraries like Beautiful Soup and requests, web scraping in Python is much easier than in R. This also applies to tasks we discuss less often, such as saving to a database, deploying a web server, or running a complex service. There are many parallels between the data analysis workflows in both.
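To make the built-in vs. imported contrast above concrete, here is a minimal sketch. The R lines are shown as comments for comparison; numpy.polyfit is used as a dependency-light stand-in for the scikit-learn/statsmodels calls mentioned in the text, and the data is invented:

```python
import numpy as np
import pandas as pd

# R (built in):
#   df <- data.frame(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 8.1))
#   summary(df)
#   fit <- lm(y ~ x, data = df); predict(fit)
# Python (via imported packages):
df = pd.DataFrame({"x": [1, 2, 3, 4], "y": [2.1, 3.9, 6.2, 8.1]})
print(df.describe())  # closest counterpart of R's summary(df)

# Least-squares line fit, playing the role of R's lm()
slope, intercept = np.polyfit(df["x"], df["y"], deg=1)
predictions = slope * df["x"] + intercept  # counterpart of predict(fit)
print(slope, intercept)
```

The point is not that Python cannot do this, but that each step requires an explicit import where R has it out of the box.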
R and Python show clear points of mutual inspiration (pandas DataFrames were inspired by R's data frames, and R's rvest package was inspired by Beautiful Soup), and the two ecosystems keep getting stronger. It may be noted that the syntax and approach for many common tasks are similar in both languages.
How can I read the most commonly used file formats in data science using Python?
The most common file formats used by data scientists are:

1. CSV (comma-separated values). Code to read CSV files in Python: import pandas as pd; df = pd.read_csv(filename)
2. Excel. Code to read Excel files in Python: import pandas as pd; df = pd.read_excel(filename, sheet_name=sheetname)
3. JSON. Code to read JSON files in Python: import pandas as pd; df = pd.read_json(filename)
4. Text files. Code to read text files in Python: file = open(filename, "r"); lines = file.readlines()

These are the most common file formats used by data scientists; I usually encounter CSV files. There are other file formats such as ZIP, HTML, DOCX, and PDF.
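As a runnable version of the snippets above (sample files are generated first so each read succeeds; pd.read_excel is omitted here only because it needs an extra engine such as openpyxl installed):

```python
import pandas as pd

# Generate small sample files so each reader below has something to load
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_csv("sample.csv", index=False)
pd.DataFrame({"a": [1, 2], "b": [3, 4]}).to_json("sample.json")
with open("sample.txt", "w") as f:
    f.write("first line\nsecond line\n")

# 1. CSV
df_csv = pd.read_csv("sample.csv")

# 3. JSON
df_json = pd.read_json("sample.json")

# 4. Plain text
with open("sample.txt", "r") as f:
    lines = f.readlines()

print(df_csv.shape, df_json.shape, len(lines))
```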
What are some nice alternatives to Excel for data wrangling and data analysis?
Actually, I still use Excel for data wrangling, especially when I have small quantities of data or when I'm importing it from a PDF table where there is a need for lots of manual cleaning. On my Linux computer I don't have Excel and use LibreOffice instead, which is basically the same but free. I don't like VBA and I'm not willing to get too deep into it. There is XLWings, which provides Python interaction with Excel. I tested Google Sheets instead of Excel, coupled with Colaboratory, Google's Jupyter notebook freely available in the cloud. I often work with CSV source files and directly browse them in a Jupyter Notebook with Pandas, and correct the errors with Python or manually with an editor. If the data has to be reused for other purposes, I may transfer it into a SQL database like SQLite or PostgreSQL. I mainly work on metallurgy-related products currently and have not had to deal with very big datasets. For datasets where Excel or CSV files are not enough, Microsoft proposes nice tools in Visual Studio and Azure Data Studio coupled with SQL Server. Another tool that we cannot forget is Tableau. I would definitely recommend it for data analysis and visual communication rather than Excel; it is a very powerful tool for extremely quick creation of nice graphics from complex datasets. Although I played with the free version, I have to say that I chose not to buy a licence, because Python is powerful and easy enough for me. But this tool is definitely better than Excel.
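The CSV-to-SQLite transfer described above can be sketched with pandas and the standard-library sqlite3 module. The column names and the bad-reading rule below are invented for illustration:

```python
import sqlite3

import pandas as pd

# Hypothetical metallurgy-style measurements browsed in a notebook
df = pd.DataFrame({
    "alloy": ["steel", "steel", "bronze"],
    "hardness": [120.0, -1.0, 95.0],   # -1 marks a bad reading
})

# Correct the error with Python instead of hand-editing the CSV
df.loc[df["hardness"] < 0, "hardness"] = None

# Persist to SQLite so the data can be reused for other purposes
with sqlite3.connect("materials.db") as conn:
    df.to_sql("hardness_tests", conn, if_exists="replace", index=False)
    count = pd.read_sql("SELECT COUNT(*) AS n FROM hardness_tests", conn)
print(count["n"][0])
```

SQLite keeps the workflow file-based and dependency-free, which fits the small-dataset use case described in the answer.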
How is MS in Data Analytics from George Mason University?
The MS in Data Analytics Engineering at George Mason University is a multi-disciplinary degree that, depending on your concentration, assumes certain core competencies (in computer science, probability, and statistics, at various levels depending on the concentration). Like any graduate degree, prior preparation is key. Which concentration should you pick? Well, you need to read the catalog carefully regarding the prereqs, and balance your desires with your background. Some prerequisites are just recommended; others are enforced by PatriotWeb, and with the latter it's very tough to get into those classes unless you have done that prereq. See: Data Analytics Engineering MS (Volgenau website) and Data Analytics Engineering MS (catalog). Both the students and faculty heads are very proactive about job opportunities, internships, and networking. There is a student-led organization called the Society of Data Analytics Engineers, which is now a Registered Student Organization with faculty advisors. It's been a really cool bonus to the program; I've met great people doing the work of data science in industry in the DC/Virginia region.

############# FOUNDATION REQUIREMENTS ##################

AIT 58 (required for MS). I recently took AIT 58, one of the required introductory classes, and it was great. Our professor had a lot of contacts in the D.C. area (business leaders, government contractors, etc.), so we were able to hear first-hand what is really going on in the analytics community right now. Also, our project took us through the business side of analytics, cloud computing, and the creative aspect of data science. We had to build a pretty cool final project too, which required some skills in building a prototype for an app that did analytics. Note: this idea of building a prototype that does analytics on a dataset (RShiny, PowerBI, Tableau, even Excel VBA) shows up in SYST 542 as well.

STAT 554 Applied Statistics (either STAT 515 or 554 is required for MS).
This class teaches normality assumptions, p-values, hypothesis testing, confidence intervals, ANOVA (one- and two-way), single and multiple regression, and non-parametric models, with an emphasis on Type I and Type II error regions. Also, the homework for STAT 554 is a combination of handwritten work and the SAS language. ## NOTE: I think GMU originally made STAT 554 or STAT 515 the requirement, but I believe the rule has since changed. Check with an advisor to see if only statistics concentrations are allowed to take STAT 554. ##

OR 531 Analytics & Decision Analysis (required for MS). This class was sort of the prequel to OR 64. It got you used to solving optimization problems in Excel Solver, and gave a big overview of the mathematical background of OR problems in general. It would help to do some YouTube video practice with Excel and linear programming.

CS 54 Data Management & Mining (either CS 584 or CS 54 required). This is a great class with tough homework assignments that are very conceptual (data modeling) and use SQL, MongoDB, and data mining software. Very good content and challenging material; I recommend you take this toward the beginning of your degree program, assuming some preparation beforehand. ## NOTE: know and understand the Ubuntu Linux distro command line. ##

DAEN 69 Capstone Project (required for MS). A team-oriented semester conducting group projects on very interesting datasets. Follows agile methodology.

######### PREDICTIVE ANALYTICS CONCENTRATION #############

OR 64 Practical Optimization (required for PRAN concentration). Or what I call the Python beast. This was legitimately a tough class: Python web scraping, sqlite3, the Gurobi optimizer.

OR 568 Applied Predictive Analytics (required for PRAN concentration). This is about modeling in R. The content is mostly based on the professor's own lecture material and Introduction to Statistical Learning by Tibshirani, Hastie, et al. Some other teachers use Applied Predictive Modeling by Max Kuhn.
SYST 573 Decision & Risk Analysis (required for PRAN concentration). This class is based primarily on three widely regarded decision theory books: Strategic Decision Making with Spreadsheets by Kirkwood, Value-Focused Thinking by Ralph Keeney, and the supplementary Making Hard Decisions by Robert T. Clemen. The content is enormously practical. Note: this class is also required for the M.S. Operations Research - Decision Analysis concentration.

SYST 542 Decision Support Systems (required for PRAN concentration). This class covers the basics of DSS, probability and Bayes nets, and case studies of DSS. There are weekly readings, homework, and a final project where your team comes up with a DSS. The concepts apply to a broad range of engineering topics, data analytics definitely included.

SYST 58 Complex Systems Engineering Mgmt (elective for concentration). It was actually very useful, since I don't have much of a systems engineering background; it was all about systems, particularly thinking about stakeholders and requirements management in mega-systems.

## NOTES ABOUT CYBER ANALYTICS CONCENTRATION - PLEASE READ ##
One thing that I noticed, even in my Predictive Analytics concentration, was the use of a Virtual Machine (VM) with Ubuntu Linux. Learn Linux! I have yet to meet someone who has taken this concentration, but due to my interest in cyber security I posted some things here that might potentially be helpful (besides the Linux material below in the appendix): CS259D Data Mining for Cyber Security; Cybersecurity | Coursera; Kali Linux - Wikipedia; Linux Sysadmin Basics - Course Introduction; What is CTF? An introduction to security Capture The Flag competitions; MasonLUG.

## NOTES ABOUT BUSINESS ANALYTICS CONCENTRATION - PLEASE READ ##
Do this sometime, along with the rest of Phase 0 and 1 below.
Excel to MySQL: Analytic Techniques for Business | Coursera

## NOTES ABOUT DATA MINING CONCENTRATION - PLEASE READ ##
This is for programmers, or people who can program without asking their friends for help every 2 seconds; usually former CS undergrads.

######## BEFORE YOU COME TO GMU ####################
Do Phase 0 and Phase 1 before you begin the masters program (literally watch and take notes on everything in these two phases). While in the masters program, glance for intuition (read, study, know concepts, etc.) at Phases 2, 3, and 4. Section 5 is general engineering material that could be useful for further reflection and study.

Phase 0 - Basics
Programming in Python 3 - zyBooks
Probability Lessons by ActuarialPath - YouTube
Essence of linear algebra - YouTube
CSC 226-1 Video Lectures - Free Podcast by North Carolina State University on Apple Podcasts (note: I cannot possibly give enough praise for this whole lecture series)
Lecture 3.1 Linear Algebra Review | Matrices And Vectors | Machine Learning | Andrew Ng
MA321 Data Analysis with Microsoft Excel - YouTube
R statistics - YouTube
Linux Sysadmin Basics - Course Introduction

Phase 1 - Git, Statistics Basics
So watch this video on the concepts of Git first.
Learn Git Branching (this is awesome: interactive challenges to learn Git!
Just got this from a friend.)
How to Use Version Control in Git & GitHub | Udacity
Intro to Descriptive Statistics | Udacity
Inferential Statistics: Learn Statistical Analysis | Udacity

Phase 2 - Python cont.: Pandas, Data Analysis & Web Scraping, noSQL
Intro to Data Science Online Course | Udacity
Python for Everybody - University of Michigan | Coursera
MongoDB Tutorial for Beginners - 1 - Installing Mongo
MySQL Video Tutorial - REMitchell

Phase 3 - Data Mining & Machine Learning
A Tour of Machine Learning Algorithms
Data Mining with Weka - YouTube
In-depth introduction to machine learning in 15 hours of expert videos
Machine Learning by Andrew Ng (Coursera online course) - YouTube
Tutorials from the Auton Lab (click on the PDF for each lesson)
Machine Learning book slides (Tom Mitchell)

Phase 3.5 - Artificial Intelligence
Markov Logic Networks (Pedro Domingos)
Probabilistic Graphical Models
Berkeley AI Materials

Phase 4 - Big Data & Distributed Systems & Cloud
Apache Spark Tutorial - YouTube
CS246 | Home
Distributed File Systems | Stanford University
Data Engineering on Google Cloud Platform | Coursera
Data Science at Scale | Coursera

Personal Observations (hard-earned lessons)
At least from my experience, there is a helpful kind of mentality that helps in understanding where some of the professors are coming from. What I mean is: graduate school is not a vocational school, so some of the classes (or all, depending on your experience) are both practical and academic. This video here, which to most of you will be a random video about making Go language scripts more pretty, is actually a good example of this academic or conceptual-based mentality, since it involves bringing mathematics into the conversation about the code. (Note that you will use R and Python for almost all your assignments.)
Learn data cleaning and CSV file manipulation in Python and R, both on your own machine and in the cloud with Spark. Learn the cloud (AWS or GCP): how it works, the pricing, setting up a cluster, nodes, submitting jobs, etc. Learn what kinds of services are out there for specific requirements. For an easy example: on Google Cloud Platform you can use CloudSQL (for small-medium datasets), but BigQuery can do SQL queries on PETABYTES in seconds. Did you know about this? Did you know that Google BigTable is basically the noSQL version of BigQuery? This kind of knowledge is what helps on a capstone project. I'm sure AWS has similar services (Redshift). Get to know this kind of lingo.

########### CERTIFICATION #################
Certified Analytics Professional (CAP) - Associate CAP
Data Engineer Certification | Google Cloud Platform

Appendix: extra stuff
Convex Optimization:
CVX 11 - Convex Optimization (Stanford OpenEdX) - YouTube
Machine Learning:
Machine Learning (Tom Mitchell lectures)
Hacking:
How To Be A Hacker
11 Linux Hacks
How do data analysts turn millions of rows of raw data into a readable format?
Cleaning data, which is more than just making it readable, takes a lot of intuition and resourcefulness. I wish there were a prescriptive list I could give here, but consider these questions. Start by asking: what does your team want? Just because some data is available does not make it useful. In truth, one of the important success factors for a data architect is to be ruthless in the triage of what data is necessary. Once you know what you want, focus on those fields only; it will save you a lot of heartburn. As you gain more experience, it will be your job to figure out what exactly the team wants. This is iterative with the next step. To execute on this strategy, you will also have to understand: what tools are available? Is Excel sufficient (don't jeer at it; it's a lifesaver in data science)? Do you need Python scripts? Do you know how to write Python scripts? Perhaps the need is more for something like SAS? The more sophisticated your skill set and tools, the more sophisticated your strategy can be. In most datasets there are bound to be rows that do not conform to the pattern. Then your task is to figure out: how do you make data rows consistent and conforming? This takes a lot of intuition, resourcefulness, and judgement. The solutions you come up with in this step will affect your mapping strategy and the patterns you use. You will learn to live with some acceptable loss, and will realize that there could be multiple patterns in the data. Hope this helps.
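As one concrete illustration of making rows consistent and conforming with Python, here is a pandas sketch; the columns, formats, and cleaning rules are all invented:

```python
import pandas as pd

# Raw rows where the same fields arrive in inconsistent formats
raw = pd.DataFrame({
    "product": ["widget", "widget ", "GADGET", "widget"],
    "price": ["19.99", "$24.50", "1,099.00", "n/a"],
})

# Coerce each field to one canonical pattern; rows that cannot be parsed
# become NaN -- the "acceptable loss" the answer mentions
clean = raw.assign(
    product=raw["product"].str.strip().str.lower(),
    price=pd.to_numeric(
        raw["price"].str.replace(r"[$,]", "", regex=True), errors="coerce"),
)
conforming = clean.dropna()
print(len(conforming), "of", len(raw), "rows conform")
```

Each real dataset needs its own version of those rules, which is where the intuition and judgement come in.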
Why does everyone want to learn Python?
It's not just the popularity of the language; there are a lot of reasons why people love Python. Beginner friendly: Python is one of the most beginner-friendly languages, and hence is preferred by a lot of beginners due to its simple syntax. Not only is it simple, but it has all the powerful features a programming language needs to have. A ton of modules and libraries: for almost anything you need to do in Python, there is already a module you can simply import, so you don't need to write any code from scratch. Python has modules and libraries for almost everything you would need. Rapid prototyping: Python is the favourite language of startups because it can be used to build prototypes very rapidly, meaning developers can get something built in a very short period of time. The time saved in the development process makes it cost-effective as well. I could go on, but will stop here as the most relevant points are covered.