This advice is intended for students (undergraduate and graduate) and anyone else who is considering entering the field of data science.
Below are tips for future data scientists:
- Take advantage of external data sources: tweets about your company or competitors, or data from your vendors (e.g. customizable newsletter eBlast statistics available through vendor dashboards, or by submitting a ticket).
- Nuclear physicists, mechanical engineers, and bioinformatics experts can make great data scientists.
- State your problem correctly, and use sound metrics to measure the yield (relative to a baseline) delivered by data science initiatives.
- Use the right key performance indicators (key metrics) and the right data from the start, in any project. Changes caused by bad foundations are very expensive. This requires careful analysis of your data to create useful databases.
- Fast delivery is better than extreme precision. All datasets are dirty anyway. Find the right compromise between perfection and fast return.
- With big data, strong (extreme) signals will usually be noise.
- Big data is less valuable than useful data.
- Use Big data from third-party vendors for competitive intelligence.
- You can build inexpensive, large, scalable and robust tools fairly quickly, without using old-fashioned statistical science. Think model-free techniques.
- Big data is easier and cheaper than you think. Get the right tools!
- Correlation is not causation.
- You do not have to store all your data permanently. Use intelligent compression techniques, and keep only statistical summaries for older data. Do not forget to adjust your metrics when your data changes, so that trends stay consistent.
- A lot can be done without databases, especially for large data.
- Always include EDA and DOE (exploratory data analysis / design of experiments) from the beginning in all data science projects. Always create a data dictionary. And follow the traditional life cycle of any data science project.
- Data + models + gut feelings + intuition is the perfect blend. Do not remove any of these ingredients from your decision-making process.
- Data can be used for many purposes:
2. To find actionable patterns (stock trading, fraud detection)
3. For resale to your business customers
4. To optimize decisions and processes (operations research)
5. For investigation and discovery (IRS, litigation, fraud detection, root cause analysis)
6. For machine-to-machine communication (automated bidding, automated driving)
7. For forecasting (sales forecasts, growth and financial forecasts, weather forecasts)
- When do you need real-time processing? When fraud detection is critical, or when processing sensitive transactional data (credit card fraud detection, 911 calls). Otherwise, delayed analytics (with a latency of a few seconds to 24 hours) is good enough.
- Make sure your sensitive data is protected. Make sure that your algorithms cannot be tampered with by criminal hackers or by others who spy on your business, steal anything they can (legally or illegally), and compromise your algorithms, which results in severe revenue loss.
- Blend several models together to detect many types of patterns. Average these models.
- Use multiple sources for the same data: your internal source and data from one or two vendors. Understand the discrepancies between these sources to get a better idea of what the real numbers should be. Large differences sometimes appear when a metric definition is changed by one of the vendors or internally, or when the data itself has changed (some fields are no longer tracked). A classic example is web traffic data: use internal logs, Google Analytics and another provider (e.g. Accenture) to track this data.
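
To make the last tip concrete, here is a minimal Python sketch of comparing one metric across several sources. The source names, dates, counts, the `compare_sources` helper and the 10% tolerance are all hypothetical, chosen only for illustration. The idea is to take the median across sources as a rough reference value for each day and flag any source that strays too far from it.

```python
from statistics import median

def compare_sources(daily_counts, tolerance=0.10):
    """For each day, return (median across sources, sources deviating
    from that median by more than `tolerance`, as a relative gap)."""
    days = set()
    for series in daily_counts.values():
        days.update(series)

    report = {}
    for day in sorted(days):
        # Only use the sources that actually reported this day.
        values = {src: s[day] for src, s in daily_counts.items() if day in s}
        mid = median(values.values())
        outliers = sorted(
            src for src, v in values.items()
            if mid and abs(v - mid) / mid > tolerance
        )
        report[day] = (mid, outliers)
    return report

# Hypothetical daily page-view counts from three sources.
daily_counts = {
    "internal_logs": {"2024-01-01": 1000, "2024-01-02": 1040},
    "vendor_a": {"2024-01-01": 980, "2024-01-02": 1010},
    "vendor_b": {"2024-01-01": 1020, "2024-01-02": 640},
}
report = compare_sources(daily_counts)
for day, (mid, outliers) in report.items():
    print(day, mid, outliers)
```

A flagged source is not necessarily the wrong one. It is a prompt to check whether a metric definition or a tracked field changed on that day.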
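
The tip about combining several models and averaging them can be sketched in a few lines as well. The three scoring rules below are hypothetical stand-ins for real models, each sensitive to a different pattern; the field names and thresholds are assumptions made up for this example. A plain unweighted average is only one possible way to blend.

```python
def blend(models, x):
    """Average the scores that several models give to one input."""
    scores = [model(x) for model in models]
    return sum(scores) / len(scores)

# Hypothetical fraud scorers, each looking for a different pattern.
def large_amount(tx):
    return 1.0 if tx["amount"] > 1000 else 0.0

def night_time(tx):
    return 1.0 if tx["hour"] < 6 else 0.0

def new_account(tx):
    return 1.0 if tx["account_age_days"] < 30 else 0.0

tx = {"amount": 2500, "hour": 3, "account_age_days": 400}
score = blend([large_amount, night_time, new_account], tx)
print(round(score, 3))  # two of the three scorers fire on this transaction
```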
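
Finally, the tip about keeping only statistical summaries for older data might look like the sketch below. It assumes records are `(ISO date string, value)` pairs and summarizes by month. The `compact` helper, the cutoff date and the choice of summary statistics are illustrative assumptions, not a prescribed method.

```python
from collections import defaultdict
from statistics import mean

def compact(records, cutoff):
    """Keep records dated on or after `cutoff` in full; replace older
    records with one statistical summary per month."""
    # ISO dates ("YYYY-MM-DD") sort correctly as plain strings.
    recent = [r for r in records if r[0] >= cutoff]

    buckets = defaultdict(list)
    for day, value in records:
        if day < cutoff:
            buckets[day[:7]].append(value)  # group by "YYYY-MM"

    summaries = {
        month: {
            "count": len(values),
            "mean": mean(values),
            "min": min(values),
            "max": max(values),
        }
        for month, values in buckets.items()
    }
    return recent, summaries

# Hypothetical daily measurements.
records = [
    ("2023-01-05", 10),
    ("2023-01-20", 14),
    ("2023-02-11", 8),
    ("2024-03-01", 12),
]
recent, summaries = compact(records, cutoff="2024-01-01")
print(recent)
print(summaries)
```

If a metric definition changes later, the stored summaries cannot be recomputed from raw data, so it helps to decide up front which statistics are worth keeping.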
Source: datasciencecentral