How can I effectively load data on Stack Overflow questions using Pandas read_clipboard?

Question

I notice a lot of Pandas questions on Stack Overflow only include a few rows of their data as text, without the accompanying code to generate/reproduce it. I am aware of the existence of read_clipboard, but I am unable to figure out how to effectively call this function to read data in many situations, such as when there are white spaces in the header names, or Python objects such as lists in the columns.

How can I use pd.read_clipboard more effectively to read data pasted in unconventional formats that don't lend themselves to easy reading using the default arguments? Are there situations where read_clipboard comes up short?

Made sure main was the right place to post this by asking [a question on meta](https://meta.stackoverflow.com/questions/403870/should-my-canonical-about-read-clipboard-usage-in-pandas-be-on-meta-or-main/403875?noredirect=1#comment815449_403875). — cs95, Dec 20 '20 at 10:47

score 9 · Accepted Answer · edited Jul 31 '21 at 18:39

`read_clipboard`: Beginner's Guide

read_clipboard is truly a saving grace for anyone starting out answering questions in the Pandas tag. Unfortunately, pandas veterans also know that the data provided in questions isn't always easy to load into a terminal session, because of various complications in how the data is formatted when posted.

Thankfully, read_clipboard has arguments that make handling most of these cases possible (and easy). Here are some common use cases and their corresponding arguments.

Common Use Cases

read_clipboard uses read_csv under the hood with a whitespace separator (sep=r'\s+' by default), so many of the techniques for parsing CSV data apply here, such as

parsing columns with spaces in the data
- use sep with a regex argument. First, ensure there are at least two spaces between columns and at most one consecutive whitespace character inside the column's data itself. Then you can use sep=r'\s{2,}', which means "separate columns wherever there are at least two consecutive whitespace characters" (note: pass engine='python', since multi-character and regex separators are only supported by the Python parsing engine):
```
 df = pd.read_clipboard(..., sep=r'\s{2,}', engine='python')
```
  Also see How do you handle column names having spaces in them when using pd.read_clipboard?.
reading a series instead of DataFrame
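- use squeeze=True; you will likely also need header=None if the first row is also data. Note that the squeeze keyword was deprecated in pandas 1.4 and removed in 2.0; on newer versions, call .squeeze("columns") on the result instead.
```
 s = pd.read_clipboard(..., header=None, squeeze=True)       # pandas < 2.0
 s = pd.read_clipboard(..., header=None).squeeze("columns")  # works on newer versions too
```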
  Also see Could there be an easier way to use pandas read_clipboard to read a Series?.
loading data with custom header names
- use names=[...] in conjunction with header=None and skiprows=[0] to ignore existing headers.
```
 df = pd.read_clipboard(..., header=None, names=['a', 'b', 'c'], skiprows=[0])
```
loading data without any headers
- use header=None
set one or more columns as the index
- use index_col=[...] with the appropriate label or index
parsing dates
- use parse_dates with the appropriate columns. If parsing datetimes (i.e., columns where the date and the time are separated by a space), you will likely also need to use sep=r'\s{2,}' while ensuring your columns are separated by at least two spaces.
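Several of the cases above can be tried without touching the clipboard, since read_csv accepts the same parsing arguments. The following is a minimal sketch, assuming a made-up sample (the column names and values are hypothetical) in which columns are separated by two or more spaces and the values themselves contain single spaces. `io.StringIO` stands in for the clipboard here; with real copied text you would call `pd.read_clipboard` with the same keyword arguments.

```python
import io

import pandas as pd

# Hypothetical text, standing in for rows copied from a question.
# Columns are separated by 2+ spaces; values contain at most one space in a row.
text = (
    "first name  signup time          city\n"
    "Ann Lee     2021-01-05 10:30:00  New York\n"
    "Bob Ray     2021-02-11 08:15:00  San Jose\n"
)

df = pd.read_csv(
    io.StringIO(text),            # stand-in for the clipboard contents
    sep=r"\s{2,}",                # split only on runs of two or more whitespace characters
    engine="python",              # regex separators need the Python parsing engine
    parse_dates=["signup time"],  # parse the "date time" column as datetimes
    index_col="first name",       # use a column (by label) as the index
)

print(df)
print(df.dtypes)
```

Because the separator requires at least two whitespace characters, "signup time" and "New York" survive as single fields instead of being split in two.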

See this answer by me for a more comprehensive list on read_csv arguments for other cases not covered here.
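The Series and custom-header cases from the list above can be sketched the same way. This is only an illustration under stated assumptions: the sample text is invented, `io.StringIO` with read_csv stands in for the clipboard, and `.squeeze("columns")` is used in place of the `squeeze=True` keyword because that keyword is not available in pandas 2.0 and later.

```python
import io

import pandas as pd

# Hypothetical printed Series: index label, then value, with no header row.
series_text = "a    1\nb    2\nc    3\n"

s = pd.read_csv(
    io.StringIO(series_text),
    sep=r"\s+",     # same default separator that read_clipboard uses
    header=None,    # the first row is data, not a header
    index_col=0,    # the first field holds the index labels
).squeeze("columns")  # a one-column DataFrame becomes a Series

# Hypothetical frame whose posted header (x, y, z) we want to replace.
frame_text = "x y z\n1 2 3\n4 5 6\n"

df = pd.read_csv(
    io.StringIO(frame_text),
    sep=r"\s+",
    header=None,            # do not treat any row as the header
    names=["a", "b", "c"],  # supply our own column names
    skiprows=[0],           # skip the header row that was posted
)

print(s)
print(df)
```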

Caveats

read_clipboard is a Swiss Army knife. However, it

cannot read data in prettytable/tabulate formats (IOW, borders make it harder)
- See Reading in a pretty-printed/formatted dataframe using pd.read_clipboard? for solutions to tackle this.
cannot correctly parse MultiIndexes unless all elements in the index are specified.
- See Copying MultiIndex dataframes with pd.read_clipboard? for solutions to tackle this.
cannot ignore/handle ellipses in data
- my suggested method is to manually remove ellipses before printing
cannot parse columns of lists (or other objects) as anything other than string. The columns will need to be converted separately, as shown in How do you read in a dataframe with lists using pd.read_clipboard?.
cannot read text from images (so please don't use images as a means to share your data with folks!)
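For the list-column caveat, one possible approach (not necessarily the one used in the linked question) is to read the column as strings and then convert it with `ast.literal_eval` from the standard library. The sketch below assumes an invented two-column sample and again uses `io.StringIO` with read_csv in place of the clipboard.

```python
import ast
import io

import pandas as pd

# Hypothetical sample: the "tags" column holds Python list literals.
# Two or more spaces separate the columns; lists contain single spaces only.
text = (
    "id  tags\n"
    "1   ['a', 'b']\n"
    "2   ['c']\n"
    "3   []\n"
)

df = pd.read_csv(io.StringIO(text), sep=r"\s{2,}", engine="python")

# After parsing, each cell in "tags" is still a plain string such as "['a', 'b']".
print(type(df.loc[0, "tags"]))

# literal_eval safely evaluates Python literals (lists, dicts, numbers, strings).
df["tags"] = df["tags"].apply(ast.literal_eval)

print(df)
```

`ast.literal_eval` only accepts literals, so it will raise on cells that are not valid Python literals; those need to be cleaned up first.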

score 2 · Answer 2 · edited Jul 31 '21 at 18:37

The one weakness of this function is that it doesn't capture the clipboard contents if the copy (Ctrl + C) is performed from a PDF file. Testing it this way results in an empty read.

Copying from a regular text editor, however, works fine. Here is an example using randomly typed text:

>>> pd.read_clipboard()
Empty DataFrame
Columns: [sfsesfsdsxcvfsdf]
Index: []

answered Dec 20 '20 at 11:23 by etch_45 · edited Jul 31 '21 at 18:37 by Peter Mortensen

Hopefully no one is going around pasting pdfs of their data for folks to use while answering questions here, but this is a general concern. Well, this, and being unable to parse text from images, which are all too commonly posted on this site. :-( If you're up to make a tesseract extension that does this, you'll be doing a pretty big service. – cs95 Dec 20 '20 at 11:24
That's something to look into. I have never worked with tesseract but always up to learn something new. Thanks @cs95! – etch_45 Dec 20 '20 at 11:27
