Notebooks may be exported to a range of static formats, including HTML (for example, for blog posts), reStructuredText, LaTeX, PDF, and slide shows, via the nbconvert command. Furthermore, any .ipynb notebook document available from a public URL can be shared via the Jupyter Notebook Viewer. This service loads the notebook document from the.
HDInsight Spark clusters provide kernels that you can use with the Jupyter Notebook on Apache Spark for testing your applications. A kernel is a program that runs and interprets your code. The three kernels are:
- PySpark - for applications written in Python2.
- PySpark3 - for applications written in Python3.
- Spark - for applications written in Scala.
In this article, you learn how to use these kernels and the benefits of using them.
Prerequisites
An Apache Spark cluster in HDInsight. For instructions, see Create Apache Spark clusters in Azure HDInsight.
Create a Jupyter Notebook on Spark HDInsight
From the Azure portal, select your Spark cluster. See List and show clusters for the instructions. The Overview view opens.
From the Overview view, in the Cluster dashboards box, select Jupyter Notebook. If prompted, enter the admin credentials for the cluster.
Note
You may also reach the Jupyter Notebook on Spark cluster by opening the following URL in your browser. Replace CLUSTERNAME with the name of your cluster:
https://CLUSTERNAME.azurehdinsight.net/jupyter
Select New, and then select either PySpark, PySpark3, or Spark to create a notebook. Use the Spark kernel for Scala applications, the PySpark kernel for Python2 applications, and the PySpark3 kernel for Python3 applications.
A notebook opens with the kernel you selected.
Benefits of using the kernels
Here are a few benefits of using the new kernels with Jupyter Notebook on Spark HDInsight clusters.
Preset contexts. With PySpark, PySpark3, or the Spark kernels, you don't need to set the Spark or Hive contexts explicitly before you start working with your applications. These contexts are available by default. These contexts are:
sc - for Spark context
sqlContext - for Hive context
So, you don't have to run any statements to create the contexts yourself. Instead, you can directly use the preset contexts in your application.
Cell magics. The PySpark kernel provides some predefined 'magics', which are special commands that you can call with %% (for example, %%MAGIC <args>). The magic command must be the first word in a code cell and allows for multiple lines of content. Adding anything before the magic, even comments, causes an error. For more information on magics, see here. The following table lists the different magics available through the kernels.
Magic | Example | Description |
---|---|---|
help | %%help | Generates a table of all the available magics with example and description. |
info | %%info | Outputs session information for the current Livy endpoint. |
configure | %%configure -f, then on the next line {"executorMemory": "1000M", "executorCores": 4} | Configures the parameters for creating a session. The force flag (-f) is mandatory if a session has already been created, which ensures that the session is dropped and recreated. Look at Livy's POST /sessions Request Body for a list of valid parameters. Parameters must be passed in as a JSON string and must be on the next line after the magic, as described in the example column. |
sql | %%sql -o <variable name>, then on the next line SHOW TABLES | Executes a Hive query against the sqlContext. If the -o parameter is passed, the result of the query is persisted in the %%local Python context as a Pandas dataframe. |
local | %%local, then on the next line a=1 | All the code in later lines is executed locally. Code must be valid Python2 code no matter which kernel you're using. So, even if you selected PySpark3 or Spark kernels while creating the notebook, if you use the %%local magic in a cell, that cell must only have valid Python2 code. |
logs | %%logs | Outputs the logs for the current Livy session. |
delete | %%delete -f -s <session number> | Deletes a specific session of the current Livy endpoint. You can't delete the session that is started for the kernel itself. |
cleanup | %%cleanup -f | Deletes all the sessions for the current Livy endpoint, including this notebook's session. The force flag -f is mandatory. |
Note
In addition to the magics added by the PySpark kernel, you can also use the built-in IPython magics, including %%sh. You can use the %%sh magic to run scripts and blocks of code on the cluster headnode.
Auto visualization. The PySpark kernel automatically visualizes the output of Hive and SQL queries. You can choose between several different types of visualizations, including Table, Pie, Line, Area, and Bar.
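Because the body of %%configure must be valid JSON, it can help to generate it rather than type it by hand. The following Python sketch builds the cell text with the standard json module. The helper name configure_cell is hypothetical and not part of any kernel API, and the two parameters are the ones used in the table's example.

```python
import json


def configure_cell(params: dict, force: bool = True) -> str:
    """Build the text of a %%configure cell (hypothetical helper)."""
    # The magic goes on the first line; -f forces the session to be recreated.
    first_line = "%%configure -f" if force else "%%configure"
    # json.dumps emits double-quoted keys and strings, as JSON requires.
    return first_line + "\n" + json.dumps(params)


cell = configure_cell({"executorMemory": "1000M", "executorCores": 4})
print(cell)
```

The printed text can be pasted into a notebook cell as is. Note that the JSON body uses double quotes; single-quoted strings are not valid JSON.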
Parameters supported with the %%sql magic
The %%sql magic supports different parameters that you can use to control the kind of output that you receive when you run queries. The following table lists the parameters.
Parameter | Example | Description |
---|---|---|
-o | -o <VARIABLE NAME> | Use this parameter to persist the result of the query, in the %%local Python context, as a Pandas dataframe. The name of the dataframe variable is the variable name you specify. |
-q | -q | Use this parameter to turn off visualizations for the cell. If you don't want to autovisualize the content of a cell and just want to capture it as a dataframe, then use -q -o <VARIABLE>. If you want to turn off visualizations without capturing the results (for example, for running a SQL query, like a CREATE TABLE statement), use -q without specifying a -o argument. |
-m | -m <METHOD> | Where METHOD is either take or sample (default is take). If the method is take, the kernel picks elements from the top of the result data set specified by MAXROWS (described later in this table). If the method is sample, the kernel randomly samples elements of the data set according to the -r parameter, described next in this table. |
-r | -r <FRACTION> | Here FRACTION is a floating-point number between 0.0 and 1.0. If the sample method for the SQL query is sample, then the kernel randomly samples the specified fraction of the elements of the result set for you. For example, if you run a SQL query with the arguments -m sample -r 0.01, then 1% of the result rows are randomly sampled. |
-n | -n <MAXROWS> | MAXROWS is an integer value. The kernel limits the number of output rows to MAXROWS. If MAXROWS is a negative number such as -1, then the number of rows in the result set isn't limited. |
Example:
%%sql -q -m sample -r 0.1 -n 500 -o query2
SELECT * FROM hivesampletable
The statement above does the following actions:
- Selects all records from hivesampletable.
- Because we use -q, it turns off autovisualization.
- Because we use -m sample -r 0.1 -n 500, it randomly samples 10% of the rows in the hivesampletable and limits the size of the result set to 500 rows.
- Finally, because we used -o query2, it also saves the output into a dataframe called query2.
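The parameters described in the list above can also be assembled programmatically. The following Python sketch composes the text of a %%sql cell from the options in the parameter table. The function name sql_cell and its argument names are hypothetical, and the flag order it emits is an arbitrary choice for illustration.

```python
from typing import Optional


def sql_cell(query: str, output: Optional[str] = None, quiet: bool = False,
             method: Optional[str] = None, fraction: Optional[float] = None,
             max_rows: Optional[int] = None) -> str:
    """Assemble the text of a %%sql cell (hypothetical helper)."""
    if method not in (None, "take", "sample"):
        raise ValueError("method must be 'take' or 'sample'")
    if fraction is not None and not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must be between 0.0 and 1.0")
    parts = ["%%sql"]
    if quiet:
        parts.append("-q")                 # turn off autovisualization
    if method is not None:
        parts += ["-m", method]            # take (default) or sample
    if fraction is not None:
        parts += ["-r", str(fraction)]     # fraction of rows to sample
    if max_rows is not None:
        parts += ["-n", str(max_rows)]     # cap on the number of output rows
    if output is not None:
        parts += ["-o", output]            # name of the local Pandas dataframe
    # The magic and its flags go on the first line, the query on the next.
    return " ".join(parts) + "\n" + query


cell = sql_cell("SELECT * FROM hivesampletable", output="query2", quiet=True,
                method="sample", fraction=0.1, max_rows=500)
print(cell)
```

Calling the helper with only a query and quiet=True yields a cell that runs the statement without visualizing or capturing the result, which matches the CREATE TABLE use case mentioned for -q.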
Considerations while using the new kernels
Whichever kernel you use, leaving the notebooks running consumes the cluster resources. With these kernels, because the contexts are preset, simply exiting the notebooks doesn't kill the context. And so the cluster resources continue to be in use. A good practice is to use the Close and Halt option from the notebook's File menu when you're finished using the notebook. The closure kills the context and then exits the notebook.
Where are the notebooks stored?
If your cluster uses Azure Storage as the default storage account, Jupyter Notebooks are saved to the storage account under the /HdiNotebooks folder. Notebooks, text files, and folders that you create from within Jupyter are accessible from the storage account. For example, if you use Jupyter to create a folder myfolder and a notebook myfolder/mynotebook.ipynb, you can access that notebook at /HdiNotebooks/myfolder/mynotebook.ipynb within the storage account. The reverse is also true: if you upload a notebook directly to your storage account at /HdiNotebooks/mynotebook1.ipynb, the notebook is visible from Jupyter as well. Notebooks remain in the storage account even after the cluster is deleted.
Note
HDInsight clusters with Azure Data Lake Storage as the default storage do not store notebooks in associated storage.
The way notebooks are saved to the storage account is compatible with Apache Hadoop HDFS. If you SSH into the cluster you can use the file management commands:
Command | Description |
---|---|
hdfs dfs -ls /HdiNotebooks | # List everything at the root directory – everything in this directory is visible to Jupyter from the home page |
hdfs dfs -copyToLocal /HdiNotebooks | # Download the contents of the HdiNotebooks folder |
hdfs dfs -copyFromLocal example.ipynb /HdiNotebooks | # Upload a notebook example.ipynb to the root folder so it's visible from Jupyter |
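If you prefer to script these commands, the sketch below builds the same three command lines in Python. The function names are hypothetical; the commands themselves are the ones from the table. The sketch only constructs and prints the argument lists. To execute one on the cluster, you could pass the list to subprocess.run, assuming the hdfs client is on the PATH of the SSH session.

```python
import shlex

NOTEBOOK_ROOT = "/HdiNotebooks"


def list_notebooks() -> list:
    """Arguments that list everything under the notebook root."""
    return ["hdfs", "dfs", "-ls", NOTEBOOK_ROOT]


def download_notebooks() -> list:
    """Arguments that download the contents of the notebook folder."""
    return ["hdfs", "dfs", "-copyToLocal", NOTEBOOK_ROOT]


def upload_notebook(local_file: str) -> list:
    """Arguments that upload a local notebook to the notebook root."""
    return ["hdfs", "dfs", "-copyFromLocal", local_file, NOTEBOOK_ROOT]


for argv in (list_notebooks(), download_notebooks(), upload_notebook("example.ipynb")):
    # shlex.join quotes arguments safely for display in a shell.
    print(shlex.join(argv))
```

Note that the option prefix is a plain ASCII hyphen. A typographic dash copied from a web page is not recognized as an option by the command line.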
Whether the cluster uses Azure Storage or Azure Data Lake Storage as the default storage account, the notebooks are also saved on the cluster headnode at /var/lib/jupyter.
Supported browser
Jupyter Notebooks on Spark HDInsight clusters are supported only on Google Chrome.
Suggestions
The new kernels are still evolving and will mature over time, so the APIs could change as these kernels mature. We would appreciate any feedback that you have while using these new kernels. The feedback is useful in shaping the final release of these kernels. You can leave your comments or feedback under the Feedback section at the bottom of this article.
Next steps
Learn how to run your Jupyter notebooks directly in your workspace in Azure Machine Learning studio. While you can launch Jupyter or JupyterLab, you can also edit and run your notebooks without leaving the workspace.
For information on how to create and manage files, including notebooks, see Create and manage files in your workspace.
Important
Features marked as (preview) are provided without a service level agreement, and they aren't recommended for production workloads. Certain features might not be supported or might have constrained capabilities. For more information, see Supplemental Terms of Use for Microsoft Azure Previews.
Prerequisites
- An Azure subscription. If you don't have an Azure subscription, create a free account before you begin.
- A Machine Learning workspace. See Create an Azure Machine Learning workspace.
Edit a notebook
To edit a notebook, open any notebook located in the User files section of your workspace. Click on the cell you wish to edit. If you don't have any notebooks in this section, see Create and manage files in your workspace.
You can edit the notebook without connecting to a compute instance. When you want to run the cells in the notebook, select or create a compute instance. If you select a stopped compute instance, it will automatically start when you run the first cell.
When a compute instance is running, you can also use code completion, powered by IntelliSense, in any Python notebook.
You can also launch Jupyter or JupyterLab from the notebook toolbar. Azure Machine Learning does not provide updates or bug fixes for Jupyter or JupyterLab, because they are open-source products outside the boundary of Microsoft Support.
Focus mode
Use focus mode to expand your current view so you can focus on your active tabs. Focus mode hides the Notebooks file explorer.
In the terminal window toolbar, select Focus mode to turn on focus mode. Depending on your window width, the tool may be located under the ... menu item in your toolbar.
While in focus mode, return to the standard view by selecting Standard view.
Code completion (IntelliSense)
IntelliSense is a code-completion aid that includes many features: List Members, Parameter Info, Quick Info, and Complete Word. With only a few keystrokes, you can:
- Learn more about the code you're using
- Keep track of the parameters you're typing
- Add calls to properties and methods
Insert code snippets (preview)
Use Ctrl+Space to trigger IntelliSense code snippets. Scroll through the suggestions or start typing to find the code you want to insert. Once you insert code, tab through the arguments to customize the code for your own use.
These same snippets are available when you open your notebook in VS Code. For a complete list of available snippets, see Azure Machine Learning VS Code Snippets.
You can browse and search the list of snippets by using the notebook toolbar to open the snippet panel.
From the snippets panel, you can also submit a request to add new snippets.
Clean your notebook (preview)
Over the course of creating a notebook, you typically end up with cells you used for data exploration or debugging. The gather feature will help you produce a clean notebook without these extraneous cells.
- Run all of your notebook cells.
- Select the cell containing the code you wish the new notebook to run. For example, the code that submits an experiment, or perhaps the code that registers a model.
- Select the Gather icon that appears on the cell toolbar.
- Enter the name for your new 'gathered' notebook.
The new notebook contains only code cells, with all cells required to produce the same results as the cell you selected for gathering.
Save and checkpoint a notebook
Azure Machine Learning creates a checkpoint file when you create an ipynb file.
In the notebook toolbar, select the menu and then File > Save and checkpoint to manually save the notebook. This action also adds a checkpoint file associated with the notebook.
Every notebook is autosaved every 30 seconds. AutoSave updates only the initial ipynb file, not the checkpoint file.
Select Checkpoints in the notebook menu to create a named checkpoint and to revert the notebook to a saved checkpoint.
Export a notebook
In the notebook toolbar, select the menu and then Export As to export the notebook as any of the supported types:
- Notebook
- Python
- HTML
- LaTeX
The exported file is saved on your computer.
Run a notebook or Python script
To run a notebook or a Python script, you first connect to a running compute instance.
If you don't have a compute instance, use these steps to create one:
- In the notebook or script toolbar, to the right of the Compute dropdown, select + New Compute. Depending on your screen size, this may be located under a ... menu.
- Name the Compute and choose a Virtual Machine Size.
- Select Create.
- The compute instance is connected to the file automatically. You can now run the notebook cells or the Python script using the tool to the left of the compute instance.
If you have a stopped compute instance, select Start compute to the right of the Compute dropdown. Depending on your screen size, this may be located under a ... menu.
Only you can see and use the compute instances you create. Your User files are stored separately from the VM and are shared among all compute instances in the workspace.
View logs and output
Use notebook widgets to view the progress of the run and logs. A widget is asynchronous and provides updates until training finishes. Azure Machine Learning widgets are also supported in Jupyter and JupyterLab.
Explore variables in the notebook
On the notebook toolbar, use the Variable explorer tool to show the name, type, length, and sample values for all variables that have been created in your notebook.
Select the tool to show the variable explorer window.
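To make the four columns concrete, here is a rough Python sketch of how such a summary could be computed from a namespace dictionary. It is only an illustration and not the studio's implementation; the function name describe_variables and the 40-character truncation of sample values are assumptions.

```python
def describe_variables(namespace: dict) -> list:
    """Return (name, type, length, sample) rows for the variables in a namespace."""
    rows = []
    for name, value in namespace.items():
        # Skip private names such as the ones IPython adds for its own use.
        if name.startswith("_"):
            continue
        # Not every object has a length (for example, integers do not).
        length = len(value) if hasattr(value, "__len__") else None
        # Truncate the representation so large values stay readable.
        rows.append((name, type(value).__name__, length, repr(value)[:40]))
    return rows


rows = describe_variables({"count": 3, "names": ["a", "b"], "_hidden": 0})
for row in rows:
    print(row)
```

In a live notebook, a similar call with globals() as the argument would also list imported modules and functions, which a real variable explorer would likely filter out.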
Navigate with a TOC
On the notebook toolbar, use the Table of contents tool to display or hide the table of contents. Start a markdown cell with a heading to add it to the table of contents. Click on an entry in the table to scroll to that cell in the notebook.
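The rule that a markdown cell must start with a heading can be illustrated with a short Python sketch that derives table-of-contents entries from a notebook's JSON. This is not how the studio builds its table of contents; it is a sketch that assumes the nbformat version 4 layout (a top-level cells list whose items have cell_type and source fields), and the function name toc_entries is hypothetical.

```python
import json
import re

# One to six '#' characters, whitespace, then the heading text.
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def toc_entries(notebook: dict) -> list:
    """Return (level, title, cell_index) for markdown cells that start with a heading."""
    entries = []
    for index, cell in enumerate(notebook.get("cells", [])):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", "")
        # nbformat stores source either as one string or as a list of lines.
        text = "".join(source) if isinstance(source, list) else source
        first_line = text.lstrip().split("\n", 1)[0]
        match = HEADING.match(first_line)
        if match:
            entries.append((len(match.group(1)), match.group(2), index))
    return entries


notebook = json.loads("""
{"cells": [
  {"cell_type": "markdown", "source": ["# Load data\\n", "Some notes."]},
  {"cell_type": "code", "source": ["x = 1"]},
  {"cell_type": "markdown", "source": "## Train model"},
  {"cell_type": "markdown", "source": "No heading here"}
]}
""")
for level, title, index in toc_entries(notebook):
    print("  " * (level - 1) + f"{title} (cell {index})")
```

The markdown cell without a leading heading is skipped, which mirrors the behavior described above: only cells that begin with a heading appear in the table of contents.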
Change the notebook environment
The notebook toolbar allows you to change the environment on which your notebook runs.
These actions will not change the notebook state or the values of any variables in the notebook:
Action | Result |
---|---|
Stop the kernel | Stops any running cell. Running a cell will automatically restart the kernel. |
Navigate to another workspace section | Running cells are stopped. |
These actions will reset the notebook state and will reset all variables in the notebook.
Action | Result |
---|---|
Change the kernel | Notebook uses new kernel |
Switch compute | Notebook automatically uses the new compute. |
Reset compute | Starts again when you try to run a cell |
Stop compute | No cells will run |
Open notebook in Jupyter or JupyterLab | Notebook opened in a new tab. |
Add new kernels
Use the terminal to create and add new kernels to your compute instance. The notebook will automatically find all Jupyter kernels installed on the connected compute instance.
Use the kernel dropdown on the right to change to any of the installed kernels.
Status indicators
An indicator next to the Compute dropdown shows its status. The status is also shown in the dropdown itself.
Color | Compute status |
---|---|
Green | Compute running |
Red | Compute failed |
Black | Compute stopped |
Light blue | Compute creating, starting, restarting, setting up |
Gray | Compute deleting, stopping |
An indicator next to the Kernel dropdown shows its status.
Color | Kernel status |
---|---|
Green | Kernel connected, idle, busy |
Gray | Kernel not connected |
Find compute details
Find details about your compute instances on the Compute page in studio.
Useful keyboard shortcuts
Similar to Jupyter Notebooks, Azure Machine Learning Studio notebooks have a modal user interface. The keyboard does different things depending on which mode the notebook cell is in. Azure Machine Learning Studio notebooks support the following two modes for a given code cell: command mode and edit mode.
Command mode shortcuts
A cell is in command mode when there is no text cursor prompting you to type. When a cell is in command mode, you can edit the notebook as a whole but not type into individual cells. Enter command mode by pressing ESC or using the mouse to select outside of a cell's editor area. The left border of the active cell is blue and solid, and its Run button is blue.
Shortcut | Description |
---|---|
Enter | Enter edit mode |
Shift + Enter | Run cell, select below |
Control/Command + Enter | Run cell |
Alt + Enter | Run cell, insert code cell below |
Control/Command + Alt + Enter | Run cell, insert markdown cell below |
Alt + R | Run all |
Y | Convert cell to code |
M | Convert cell to markdown |
Up/K | Select cell above |
Down/J | Select cell below |
A | Insert code cell above |
B | Insert code cell below |
Control/Command + Shift + A | Insert markdown cell above |
Control/Command + Shift + B | Insert markdown cell below |
X | Cut selected cell |
C | Copy selected cell |
Shift + V | Paste selected cell above |
V | Paste selected cell below |
D D | Delete selected cell |
O | Toggle output |
Shift + O | Toggle output scrolling |
I I | Interrupt kernel |
0 0 | Restart kernel |
Shift + Space | Scroll up |
Space | Scroll down |
Tab | Change focus to next focusable item (when tab trap disabled) |
Control/Command + S | Save notebook |
1 | Change to h1 |
2 | Change to h2 |
3 | Change to h3 |
4 | Change to h4 |
5 | Change to h5 |
6 | Change to h6 |
Edit mode shortcuts
Edit mode is indicated by a text cursor prompting you to type in the editor area. When a cell is in edit mode, you can type into the cell. Enter edit mode by pressing Enter or using the mouse to select a cell's editor area. The left border of the active cell is green and hatched, and its Run button is green. You also see the cursor prompt in the cell in edit mode.
Using the following keystroke shortcuts, you can more easily navigate and run code in Azure Machine Learning notebooks when in Edit mode.
Shortcut | Description |
---|---|
Escape | Enter command mode |
Control/Command + Space | Activate IntelliSense |
Shift + Enter | Run cell, select below |
Control/Command + Enter | Run cell |
Alt + Enter | Run cell, insert code cell below |
Control/Command + Alt + Enter | Run cell, insert markdown cell below |
Alt + R | Run all cells |
Up | Move cursor up or previous cell |
Down | Move cursor down or next cell |
Control/Command + S | Save notebook |
Control/Command + Up | Go to cell start |
Control/Command + Down | Go to cell end |
Tab | Code completion or indent (if tab trap enabled) |
Control/Command + M | Enable/disable tab trap |
Control/Command + ] | Indent |
Control/Command + [ | Dedent |
Control/Command + A | Select all |
Control/Command + Z | Undo |
Control/Command + Shift + Z | Redo |
Control/Command + Y | Redo |
Control/Command + Home | Go to cell start |
Control/Command + End | Go to cell end |
Control/Command + Left | Go one word left |
Control/Command + Right | Go one word right |
Control/Command + Backspace | Delete word before |
Control/Command + Delete | Delete word after |
Control/Command + / | Toggle comment on cell |
Troubleshooting
- If you can't connect to a notebook, ensure that web socket communication is not disabled. For compute instance Jupyter functionality to work, web socket communication must be enabled. Ensure your network allows websocket connections to *.instances.azureml.net and *.instances.azureml.ms.
- When a compute instance is deployed in a private link workspace, it can only be accessed from within the virtual network. If you are using custom DNS or a hosts file, add an entry for <instance-name>.<region>.instances.azureml.ms with the private IP address of the workspace private endpoint. For more information, see the custom DNS article.
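The two checks above can be sketched in Python. The first function tests whether a host name falls under the two wildcard domains that must allow websocket connections, and the second formats a hosts-file line for a compute instance. The function names, the instance name my-instance, the region westus2, and the IP address 10.0.0.5 are all hypothetical placeholders.

```python
import ipaddress
from fnmatch import fnmatch

# Wildcard domains named in the troubleshooting guidance above.
ALLOWED_PATTERNS = ("*.instances.azureml.net", "*.instances.azureml.ms")


def is_allowed_host(host: str) -> bool:
    """Return True if the host matches one of the required wildcard domains."""
    return any(fnmatch(host.lower(), pattern) for pattern in ALLOWED_PATTERNS)


def hosts_entry(private_ip: str, instance_name: str, region: str) -> str:
    """Format a hosts-file line mapping the instance FQDN to a private IP."""
    # ip_address raises ValueError if the string is not a valid IP address.
    ipaddress.ip_address(private_ip)
    return f"{private_ip} {instance_name}.{region}.instances.azureml.ms"


entry = hosts_entry("10.0.0.5", "my-instance", "westus2")
print(entry)
print(is_allowed_host("my-instance.westus2.instances.azureml.ms"))
```

The pattern check only tells you which host names need to be reachable; whether websocket traffic is actually permitted depends on your proxy or firewall configuration.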