There is a plethora of interesting public data out there. The DuckDB community regularly uses the NYC Taxi Data to demonstrate and test features, as it’s a reasonably large set of data (billions of records) and it’s data the public understands. We’re very lucky to have this dataset, but like many data sources, the data is in need of cleaning.

You can see here that some taxi trips were taken seriously far in the future. Based on the fare_amount for the following 5 person trip in 2098, I’d say we can safely conclude that inflation will be on a downward or lateral trend over the next 60 years.

│ tpep_pickup_datetime │ VendorID │ passenger_count │ fare_amount │

Interestingly, all trips with dates in the future are posted from a single vendor (see data dictionary). Others have documented additional issues with dirty data.

Of course, I could clean these up, but using these records as-is makes me frequently question my SQL skills. I’d rather use generated data where analysts can focus on how ducking awesome DuckDB is instead of how unclean the data is.

As a bonus, using generated data allows us to create data that’s better aligned with real-world use cases for the average analyst, as Anna Geller requests in a recent tweet.

As a user, I would appreciate some randomly generated datasets where folks can analyze real world things like costs and revenue rather than petal lengths - Anna Geller Janu

Using Python Faker

Faker is a Python package for generating fake data, with a large number of providers for generating different types of data, such as people, credit cards, dates/times, cars, phone numbers, etc. Many of the included and community providers are even localized for different regions. Keep in mind that the data we generate won’t be perfect unless we tune the out-of-the-box code.